
Tampereen teknillinen korkeakoulu, Julkaisuja 296
Tampere University of Technology, Publications 296

Mika Kuulusa

DSP Processor Core-Based Wireless System Design

Tampere 2000

Mika Kuulusa

DSP Processor Core-Based Wireless System Design

Dr. Tech. Thesis, 156 pages


18th August 2000

Contact Information:
Mika Kuulusa
Tampere University of Technology
Digital and Computer Systems Laboratory
P.O. Box 553
33101 TAMPERE
Tel: 03 365 3872 (work), 040 727 5512 (mobile)
Fax: 03 365 3095 (work)
E-mail: mika.kuulusa@tut.

ABSTRACT

This thesis considers the design of wireless communications systems which are implemented as highly integrated embedded systems comprised of a mixture of hardware components and software. An introductory part presents digital communications systems, classification of processors, programmable digital signal processing (DSP) processors, and the development and implementation of a flexible DSP processor architecture. This introduction is followed by a total of seven publications comprising the research work.

In this thesis the following topics have been considered. Most of the presented research work is based on a customizable fixed-point DSP processor which has been implemented as a highly optimized hard core for use in typical DSP applications. The studied topics cover a plethora of aspects starting from the initial development of the processor architecture. Several real-time DSP applications, such as MPEG audio decoding and GSM speech coding, have been developed and their performance on this particular processor has been evaluated. The processor core itself as a bare hardware circuit is not usable without various software tools, function libraries, a C compiler, and a real-time operating system. The set of development tools was gradually refined and several architectural enhancements were implemented during further development of the initial processor core. Furthermore, the modified Harvard memory architecture with one program memory bank was replaced with a parallel program memory architecture. With this architecture the processor accesses several instructions in parallel to compensate for a potentially slow read access time, a characteristic which is typical of, for example, flash memory devices.

The development flow for heterogeneous hardware/software systems is also studied. As a case study, a configurable hardware block performing two trigonometric transforms was embedded into a wireless LAN system described as a dataflow graph. Furthermore, implementation aspects of an emerging communications system were studied.
A high-level feasibility study of a W-CDMA radio transceiver for a mobile terminal was carried out to serve as a justification for partitioning various baseband functions into application-specific hardware units and DSP software to be executed on a programmable DSP processor.

PREFACE

The research work described in this thesis was carried out during the years 1996–2000 in the Digital and Computer Systems Laboratory at the Tampere University of Technology, Tampere, Finland.

I would like to express my warmest gratitude to my thesis advisor, Prof. Jari Nurmi, for his skillful guidance and support during the course of the research work. I gratefully acknowledge the research support received from Prof. Jarkko Niittylahti and Prof. Jukka Saarinen, the head of the laboratory. In particular I am indebted to my background mentor, Prof. Jarmo Takala, whose encouragement and open-hearted support have had a significant role in making this thesis a reality. I would also like to thank Teemu Parkkinen, M.Sc., for our constructive teamwork. Moreover, I express sincere thanks to my dear colleagues for their valuable assistance and for making the atmosphere at the laboratory so inspiring and innovative. I would also like to thank Prof. Jorma Skyttä and Jarno Knuutila, Dr.Tech., for their constructive feedback and comments on the manuscript.

During the past years I have had the utmost pleasure of working in collaboration with VLSI Solution and Nokia Research Center, both in Tampere, Finland. I have had the privilege to work with the talented silicon architects at VLSI Solution. I would like to express my sincere gratitude to Prof. Jari Nurmi and Tapani Ritoniemi, M.Sc., for providing me with this exceptional opportunity. In addition I would like to thank Janne Takala, M.Sc., Pasi Ojala, M.Sc., Juha Rostrom, M.Sc., and Henrik Herranen for their enthusiastic support. Furthermore, it has been a great pleasure to work with the people at Nokia Research Center. In particular, the numerous technical sessions and workshops have been both exciting and fruitful.

The research work was financially supported by the National Technology Agency (TEKES), Tampere Graduate School in Information Science and Engineering (TISE), and Tampere University of Technology.
Moreover, I gratefully acknowledge the research grants received from the Ulla Tuominen Foundation, the Jenny and Antti Wihuri Foundation, the Foundation of Finnish Electronics Engineers, the Foundation of Advancement of Technology, the Foundation of Advancement of Telecommunications, and the Finnish Cultural Foundation.



Most of all I wish to express my deepest gratitude to my parents Vesa and Paula Kuulusa, my brother Juha, and my sister Nina for their love, encouragement, and compassion during all these years. Without their full support it would not have been possible to accomplish this long-spanning project.

Tampere, August 2000

Mika Kuulusa

TABLE OF CONTENTS

Abstract
Preface
Table of Contents
List of Publications
List of Figures
List of Tables
List of Abbreviations

Part I  Introduction

1. Introduction to Thesis
   1.1 Objectives of Research
   1.2 Outline of Thesis

2. Wireless Communications System Design
   2.1 Digital Signal Processing
   2.2 Wireless Communications Systems
   2.3 Wireless System Design
       2.3.1 System Design Flow
       2.3.2 Processor Core-Based Design

3. Programmable Processor Architectures
   3.1 Instruction-Set Architectures
       3.1.1 Memory Organization
       3.1.2 Operand Location
       3.1.3 Memory Addressing
       3.1.4 Number Systems
   3.2 Enhancing Processor Performance
       3.2.1 Pipelining
       3.2.2 Instruction-Level Parallelism
       3.2.3 Data-Level Parallelism
       3.2.4 Task-Level Parallelism

4. Programmable DSP Processors
   4.1 Historical Perspective
   4.2 Fundamentals
   4.3 Conventional DSP Processors
   4.4 VLIW DSP Processors

5. Customizable Fixed-Point DSP Processor Core
   5.1 Background
   5.2 Architecture
       5.2.1 Program Control Unit
       5.2.2 Datapath
       5.2.3 Data Address Generator
   5.3 Implementation
       5.3.1 Processor Hardware
       5.3.2 Software Tools

6. Summary of Publications
   6.1 Customizable Fixed-Point DSP Processor Core
   6.2 Specification of Wireless Communications Systems
   6.3 Author's Contribution to Published Work

7. Conclusions
   7.1 Main Results
   7.2 Future Trends

Bibliography

Part II  Publications

LIST OF PUBLICATIONS

This thesis is divided into two parts. Part I is an introduction to the scope of the research work covered by the thesis. Part II contains reprints of the related publications. In the text these publications are referred to as [P1], [P2], ..., [P7].

[P1] M. Kuulusa and J. Nurmi, "A parameterized and extensible DSP core architecture," in Proc. Int. Symposium on IC Technology, Systems & Applications, Singapore, Sep. 10–12, 1997, pp. 414–417.

[P2] M. Kuulusa, J. Nurmi, J. Takala, P. Ojala, and H. Herranen, "Flexible DSP core for embedded systems," IEEE Design & Test of Computers, vol. 14, no. 4, pp. 60–68, Oct./Dec. 1997.

[P3] M. Kuulusa, T. Parkkinen, and J. Niittylahti, "MPEG-1 layer II audio decoder implementation for a parameterized DSP core," in Proc. Int. Conference on Signal Processing Applications and Technology, Orlando, FL, U.S.A., Nov. 1–4, 1999 (CD-ROM).

[P4] M. Kuulusa, J. Nurmi, and J. Niittylahti, "A parallel program memory architecture for a DSP," in Proc. Int. Symposium on Integrated Circuits, Devices & Systems, Singapore, Sep. 10–12, 1999, pp. 475–479.

[P5] J. Takala, M. Kuulusa, P. Ojala, and J. Nurmi, "Enhanced DSP core for embedded applications," in Proc. Int. Workshop on Signal Processing Systems: Design and Implementation, Taipei, Taiwan, Oct. 20–22, 1999, pp. 271–280.

[P6] M. Kuulusa, J. Takala, and J. Saarinen, "Run-time configurable hardware model in a dataflow simulation," in Proc. IEEE Asia-Pacific Conference on Circuits and Systems, Chiangmai, Thailand, Nov. 24–27, 1998, pp. 763–766.

[P7] M. Kuulusa and J. Nurmi, "Baseband implementation aspects for W-CDMA mobile terminals," in Proc. Baiona Workshop on Emerging Technologies in Telecommunications, Baiona, Spain, Sep. 6–8, 1999, pp. 292–296.

LIST OF FIGURES

1. Block diagram of a simplified, generalized DSP system
2. Functional block diagram of a wireless communications system
3. Functional block diagram of a W-CDMA transceiver for mobile terminals
4. System-level design process of embedded systems
5. Example of an integrated DECT communications platform
6. Classification of processor memory architectures
7. Common data memory addressing modes
8. Illustration of instruction issue mechanisms in processors
9. Illustration of two SIMD instructions
10. Block diagram of an integrated cellular baseband processor
11. Example of an assembly source code implementing a 64-tap FIR filter
12. Simplified block diagrams of two conventional DSP processors
13. Simplified block diagrams of two VLIW DSP processors
14. Base architecture of the customizable fixed-point DSP processor
15. Pipeline structure of the customizable fixed-point DSP processor
16. Functional block diagram of the Program Control Unit
17. Illustration of the Instruction Address Generation operation
18. Functional block diagram of the hardware looping unit
19. Functional block diagram of two Datapaths
20. Functional block diagram of two Data Address Generators
21. Circuit layouts of a 16x16-bit two's complement array multiplier
22. Circuit schematic of an RTL model of a Datapath
23. Circuit layout of the VS-DSP2 processor core
24. Graphical user interface of the instruction-set simulator
25. Comparison of three DSP processor core versions

LIST OF TABLES

1. Summary of conventional DSP processor features
2. Summary of VLIW DSP processor features

LIST OF ABBREVIATIONS

AALU     Address Arithmetic-Logic Unit
A/D      Analog-to-Digital
ADC      Analog-to-Digital Converter
ADPCM    Adaptive Differential Pulse-Code Modulation
AGC      Automatic Gain Control
AFC      Automatic Frequency Control
ALU      Arithmetic-Logic Unit
ANSI     American National Standards Institute
ASIC     Application-Specific Integrated Circuit
ASIP     Application-Specific Instruction-Set Processor
ATM      Asynchronous Transfer Mode
CDMA     Code Division Multiple Access
CISC     Complex Instruction-Set Computer
CMOS     Complementary Metal Oxide Semiconductor
CMP      Chip-Multiprocessor
CPU      Central Processing Unit
DAB      Digital Audio Broadcasting
DAC      Digital-to-Analog Converter
DECT     Digital Enhanced Cordless Telecommunications
DMA      Direct Memory Access
DRAM     Dynamic Random Access Memory
DSL      Digital Subscriber Line
DSP      Digital Signal Processing
DVB-T    Terrestrial Digital Video Broadcasting
EDA      Electronic Design Automation
EEPROM   Electrically Erasable Programmable Read-Only Memory
FED      Frequency Error Detector
FFT      Fast Fourier Transform
FHT      Fast Hartley Transform
FIFO     First In, First Out
FIR      Finite Impulse Response
FPGA     Field-Programmable Gate Array
FSM      Finite-State Machine
GPS      Global Positioning System
GSM      Global System for Mobile Communications
HDL      Hardware Description Language
HLL      High-Level Language
IAG      Instruction Address Generator
IC       Integrated Circuit
IDCT     Inverse Discrete Cosine Transform
IEC      International Electrotechnical Commission
IEEE     Institute of Electrical and Electronics Engineers
ILP      Instruction-Level Parallelism
IP       Intellectual Property
IPC      Instructions per Clock Cycle
IR       Instruction Register
ISO      International Organization for Standardization
ISA      Instruction-Set Architecture
ISS      Instruction-Set Simulator
ITU      International Telecommunication Union
LAN      Local Area Network
LMS      Least Mean Square
MAC      Multiply-Accumulate
MCU      Microcontroller Unit
MIMD     Multiple Instruction Stream, Multiple Data Stream
MIPS     Million Instructions Per Second
MPEG     Moving Picture Experts Group
OFDM     Orthogonal Frequency Division Multiplex
PC       Program Counter (or Personal Computer)
PCU      Program Control Unit
RAM      Random Access Memory
RISC     Reduced Instruction-Set Computer
ROM      Read-Only Memory
RTL      Register Transfer-Level
RTOS     Real-Time Operating System
RTT      Radio Transmission Technology
SIMD     Single Instruction Stream, Multiple Data Stream
SIR      Symbol-to-Interference Ratio
SMT      Simultaneous Multithreading
SNR      Signal-to-Noise Ratio
SOC      System-on-a-Chip
SRAM     Static Random Access Memory
TLP      Task-Level Parallelism
UART     Universal Asynchronous Receiver/Transmitter
UMTS     Universal Mobile Telecommunications System
USB      Universal Serial Bus
VHDL     VHSIC Hardware Description Language
VHSIC    Very High Speed Integrated Circuit
VLES     Variable-Length Execution Set
VLIW     Very Long Instruction Word
VLSI     Very Large-Scale Integration
W-CDMA   Wideband Code-Division Multiple Access
WLAN     Wireless Local Area Network

Part I INTRODUCTION

1. INTRODUCTION TO THESIS

The field of DSP is currently the most attractive, fastest growing segment of the semiconductor industry. As microprocessor chips propelled the PC era, streamlined DSP processors likewise now constitute the driving force behind the broadband communications era in the form of advanced wireless and wireline systems. Mobile phones and other wireless terminals are the ultimate mass-production devices for consumer markets. In order to illustrate the magnitude of the volume, it has been estimated that approximately 275 million mobile phones were manufactured worldwide in 1999 [Nok99]. In addition to conventional voice services, the public will soon have wireless access to real-time video and data services at any time, anywhere. This access will mainly be enabled by sophisticated communications engines based on the latest technologies integrated into a system on a chip. It is evident that this kind of chip will be a high-performance multiprocessor system which incorporates three to four programmable processor cores, considerable amounts of on-chip memory, optimized hardware accelerators, and various interfaces for connecting the chip to the off-chip world. Central components in these chips are programmable DSP processor cores which, in contrast to application-specific integrated circuits, provide greater flexibility and faster time to market.

1.1 Objectives of Research

The objective of the research presented in this thesis was to develop a new architecture for a programmable DSP processor. The main emphasis was on creating a flexible processor core that provides a straightforward means for optimizing the hardware operation and its functions specifically for a given application field. In order to achieve such freedom, one of the key concepts is the definition of central functional parameters in a DSP processor architecture. By using a distinct set of core parameters, the customization of the instruction-set architecture of the processor could be greatly facilitated. In addition, such a processor requires extension mechanisms that would permit the addition of application-specific functionality to the processor hardware. The realization of this kind of parameterized and extensible architecture was to be closely linked with the processor hardware design that was to be carried out with optimized transistor-level circuit layouts. Furthermore, the hardware implementation should achieve a number of important non-functional properties that, for programmable DSP processor cores, include small die area, low power consumption, and high performance. The viability of a chosen processor architecture was to be evaluated through careful analysis of real-time DSP applications.

In addition, it was imperative to establish a profound view of wireless communications systems, which is the principal segment of the electronics industry where programmable DSP processor cores are the key enabling technology. The main idea was to study a wide range of issues involving the specification, modeling, simulation, design, and implementation of emerging communications systems, such as next-generation wireless mobile cellular and local area networks.

1.2 Outline of Thesis

This thesis comprises two parts: the introductory Part I, followed by Part II consisting of seven publications containing the main research results. The organization of Part I is as follows. In Chapter 2 wireless communications system design is discussed. The chapter presents a concise view of digital signal processing, wireless systems, and processor core-based system design. Chapter 3 describes fundamental issues associated with programmable processor architectures. In Chapter 4 programmable DSP processors are studied in detail. This chapter gives a brief history of DSP processors and presents the architectural features that are unique to DSP processors. Moreover, two main classes of DSP processors are distinguished and their features are examined in detail. A customizable fixed-point DSP processor is presented in Chapter 5. The architecture of this DSP processor core is described and the implementation of processor hardware and software development tools is reviewed. In Chapter 6 a summary of the publications is given and the Author's contribution to the publications is clarified. Finally, Chapter 7 gives the conclusions and the thesis concludes with a discussion on future trends in wireless system design and DSP processors.

2. WIRELESS COMMUNICATIONS SYSTEM DESIGN

This chapter provides an overview of the application area covered by this thesis. The fields of wireless communications systems and digital signal processing are very broad. Thus, instead of trying to cover these extensive fields in great detail, this chapter prepares the reader with the fundamental concepts behind DSP systems, their primary application area, and the plethora of issues associated with the design of processor core-based wireless communications systems.

2.1 Digital Signal Processing


Real-world signals are analog by nature. However, digital computers operate on data represented by binary numbers that are composed of a restricted number of bits. In digital signal processing (DSP), analog signals are represented by sequences of finite-precision numbers, and processing is implemented using digital computations [Opp89]. Thus, as opposed to a continuous-time, continuous-amplitude analog signal, a digital signal is characterized as discrete-time and discrete-amplitude.

Compared to analog systems, performing signal manipulation with DSP systems has numerous advantages: the systems provide predictable accuracy, they are not affected by component aging and operating environment, and they permit advanced operations which may be impractical or even impossible to realize with analog components. For example, complex adaptive filtering, data compression, and error correction algorithms can only be implemented using DSP techniques [Ife93]. DSP systems also provide greater flexibility since they are often realized as programmable systems that allow the system to perform a variety of functions without modifying the digital hardware itself. Furthermore, the tremendous advances in semiconductor technologies permit efficient hardware implementations that are characterized by high reliability, smaller size, lower cost, low power consumption, and high performance.

A block diagram of a DSP system is depicted in Fig. 1. As shown in the figure, a DSP system receives input, processes it, and generates output according to a given algorithm or algorithms. The analog and digital domains interact by using analog-to-digital (A/D) and digital-to-analog (D/A) converters. A/D conversion is the process of converting an analog


[Figure: input signal → input filter → A/D converter → digital processor → D/A converter → output filter → output signal, with analog waveforms at the ends and binary sample streams between the converters.]

Figure 1. Block diagram of a simplified, generalized digital signal processing system. The waveforms and digits illustrate signal representation in the system. The A/D converter block includes a sample-and-hold circuit [Bat88]. A/D: analog-to-digital, D/A: digital-to-analog.

signal, i.e. a voltage or current, into a sequence of discrete-time, quantized binary numbers, or samples [vdP94]. Thus, the A/D conversion process and the conversion rate are referred to as sampling and sampling rate (alternatively sampling frequency), respectively. In order to avoid aliasing of frequency spectra in A/D conversion, the input signal bandwidth must be limited to at most half the sampling frequency with an analog filter preceding the converter [Opp89]. D/A conversion is the opposite process in which binary numbers are translated into an analog signal. In D/A conversion, analog filtering is required to reject the repeated spectra around the integer multiples of the sampling frequency because signal reproduction in only a certain frequency band is of interest. Sampling introduces some error in digital signals. This error is due to quantization noise and thermal noise generated by analog components [vdP94].

The main component of a DSP system, shown in Fig. 1, is the digital processor. In practice, this part can be based on a microprocessor, programmable DSP processor, application-specific hardware, or a mixture of these. The digital processor implements one or several DSP algorithms. The basic DSP operations are convolution, correlation, filtering, transformations, and modulation [Ife93]. Using the basic operations, more complex DSP algorithms can be constructed for a variety of applications, such as speech and video coding. Real-time systems are constrained by strict requirements concerning the repetition period of an algorithm or a function [Kop97]. Thus, a real-time DSP system is a DSP system which processes and produces signals in real-time.
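As a concrete illustration of the sampling, quantization, and processing stages just described, the sketch below samples a sinusoid, quantizes it to 8-bit two's-complement codes (discrete time and discrete amplitude), and smooths the result with a short FIR filter, i.e. the basic convolution operation. This is not from the thesis, which implements such kernels in optimized DSP assembly; the sampling rate, word length, and filter coefficients here are illustrative assumptions.

```python
import math

def quantize(x, bits):
    """Uniformly quantize x in [-1.0, 1.0) to a two's-complement integer code."""
    levels = 1 << (bits - 1)                     # 128 codes per polarity for 8 bits
    code = round(x * levels)
    return max(-levels, min(levels - 1, code))   # clip at positive full scale

def fir_filter(coeffs, samples):
    """Direct-form FIR filter: y[n] = sum_k h[k] * x[n - k], zero initial state."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k in range(len(coeffs)):
            if n - k >= 0:
                acc += coeffs[k] * samples[n - k]   # one multiply-accumulate per tap
        out.append(acc)
    return out

# Discrete time: sample a 1 kHz sine at 8 kHz. Discrete amplitude: 8-bit codes.
fs, f0 = 8000, 1000
x = [quantize(math.sin(2 * math.pi * f0 * n / fs), 8) for n in range(8)]
# x == [0, 91, 127, 91, 0, -91, -128, -91]  (positive full scale clips to +127)

# Digital processor stage: a 4-tap moving-average FIR smooths the samples.
y = fir_filter([0.25] * 4, x)
```

The inner multiply-accumulate loop is exactly the operation that the dedicated MAC hardware of a DSP processor executes in a single cycle per tap.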

2.2 Wireless Communications Systems

Currently, there is a progressive shift from conventional analog systems to fully digital systems which provide mobility, better quality of service, interactivity, and high data-rates for accessing real-time audio, real-time video, and data. These attributes are and will be


[Figure: transmitter path: speech/audio/video/data → source encoding → channel encoding → digital modulator → D/A conversion → RF back-end → physical channel; receiver path: RF front-end → A/D conversion → digital demodulator → channel decoding → source decoding → speech/audio/video/data.]

Figure 2. Functional block diagram of a wireless communications system. Adapted from [Pro95]. RF: radio frequency, D/A: digital-to-analog, A/D: analog-to-digital.

receiving concrete realization in a number of emerging technologies and standards, such as Digital Audio Broadcasting (DAB), Terrestrial Digital Video Broadcasting (DVB-T), Universal Mobile Telecommunications System (UMTS), Global Positioning System (GPS), and various Wireless Local Area Network (WLAN) and wireline Digital Subscriber Line (DSL) schemes. The application area that has particularly benefited from the advantages of DSP is wireless communications systems.

The main functions of a wireless transmitter-receiver pair are illustrated in Fig. 2. The source data is a sampled analog signal or other digital information which is converted into a sequence of binary digits. Due to the limited data bandwidth of a wireless system, source encoding is used for data compression. In order to have protection against errors, channel encoding introduces some redundancy in the information in some predetermined manner so that the receiver can exploit this information to detect and correct errors. Moreover, in order to combat bursty errors, the channel encoding often involves interleaving which, in effect, spreads an error burst more evenly over a block of data. A digital modulator serves as an interface to the communication channel. It converts channel bits into a sequence of channel symbols which are eventually forwarded through D/A conversion to a radio frequency back-end that performs final upconversion of the analog transmission signal to the designated frequency band.

In the receiver, the decoding functions are carried out in the opposite order. However, due to signal propagation through a wireless physical channel, a received signal is degraded since it is composed of a sum of multipath components [Ahl98]. The reception is particularly challenging in mobile receivers in which receiver movement results in a rapidly changing radio channel [Par92]. In addition to the changing radio channel, an analog front-end introduces non-idealities in the received signal that must be compensated adaptively in the receiver. For these reasons, the complexity of a digital receiver is often much higher than that of a transmitter section.
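The burst-spreading effect of interleaving mentioned above can be sketched with a toy block interleaver. This is a hypothetical illustration with invented dimensions; the interleaving schemes in GSM and other standards are considerably more elaborate.

```python
def interleave(bits, rows, cols):
    """Block interleaver: write a rows x cols array row by row, read column by column."""
    assert len(bits) == rows * cols
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]

def deinterleave(bits, rows, cols):
    """Inverse permutation: restore the original row-by-row order."""
    assert len(bits) == rows * cols
    out = [None] * (rows * cols)
    for i, b in enumerate(bits):
        c, r = divmod(i, rows)        # transmitted order runs down the columns
        out[r * cols + c] = b
    return out

block = list(range(12))               # indices stand in for one block of channel bits
tx = interleave(block, 3, 4)          # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
# A burst corrupting tx[0:3] hits original bits 0, 4, and 8, so after
# deinterleaving the errors land in three different rows instead of clustering.
rx = deinterleave(tx, 3, 4)           # recovers the original order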

[Figure: receiver path from the A/D converter through pulse-shaping filtering into a complex Rake finger bank with multipath combining and channel estimation, producing channel bits after symbol scaling; supporting blocks include narrowband/wideband power measurement with AGC gain control, multipath delay estimation producing a multipath profile, FED/AFC frequency control, SIR estimation, and code generators. Transmitter path from channel bits and special chip sequences through symbol mapping, spreading/scrambling with code generators, multiplexing, pulse-shaping filtering, and quadrature modulation to the D/A converter.]

Figure 3. Functional block diagram of a W-CDMA transceiver for mobile terminals [P7]. A/D: analog-to-digital, D/A: digital-to-analog, AGC: automatic gain control, AFC: automatic frequency control, FED: frequency error detector, SIR: symbol-to-interference ratio, Mux: multiplexer/demultiplexer.

Interesting observations can be made by examining common DSP algorithms needed in a digital transceiver. Whereas source encoding is a complex and computation-intensive operation, source decoding is often quite simple to realize. For example, in GSM full-rate speech coding [P2] and H.263 video coding [Knu97] the encoding requires at least five times more processor clock cycles. In channel coding the situation is quite the opposite: the encoding is a relatively simple task, but the decoding is far more demanding. As an example, convolutional codes are commonly employed as the channel coding in communications systems. Convolutional encoding can be easily performed with simple hardware operations, but the Viterbi decoding process requires special functionality implemented as dedicated hardware or as an application-specific instruction in a DSP processor [Vit67, Fet91, Hen96].

Furthermore, demodulator and modulator sections primarily utilize basic DSP operations for functions such as symbol detection and demodulation, equalization, channel filtering, and frequency synthesis [Lee94]. Although the operations are relatively simple, the processing is often performed at a high sampling frequency, which implies that high-performance DSP hardware may be necessary. For example, the W-CDMA receiver architecture shown in Fig. 3 contains a multipath estimator unit that requires a peak processing rate of 4 billion multiply-accumulate operations per second [P7],[Oja98]. Moreover, [P3] presents a realization of an MPEG audio source decoder for a programmable DSP processor.
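The encoder side of this asymmetry can be made concrete with a small sketch. The rate-1/2, constraint-length-3 convolutional encoder below uses the textbook generator polynomials 7 and 5 (octal), chosen here purely for illustration since the text does not name a particular code: encoding needs only a two-bit shift register and a few XORs per input bit, whereas the corresponding Viterbi decoder must track metrics over all shift-register states.

```python
def conv_encode(bits):
    """Rate-1/2 convolutional encoder, constraint length 3, generators (7, 5) octal.
    Each input bit yields two output bits, each an XOR over the shift-register taps."""
    s1 = s2 = 0                       # two-delay shift register, initially cleared
    out = []
    for b in bits:
        out.append(b ^ s1 ^ s2)       # generator 1 + D + D^2  (7 octal)
        out.append(b ^ s2)            # generator 1 + D^2      (5 octal)
        s1, s2 = b, s1                # shift the register
    return out

coded = conv_encode([1, 0, 1, 1])     # -> [1, 1, 1, 0, 0, 0, 0, 1]
```

The per-bit work is a shift and two XOR trees, which is why convolutional encoding maps so cheaply onto hardware, while decoding motivates the dedicated Viterbi support mentioned above.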


2.3 Wireless System Design

Wireless communications systems, such as mobile phones and other wireless terminals, are ultra high-volume consumer market products which are implemented as highly integrated systems. These systems are portable, battery-powered embedded systems that are strongly influenced by constraints on system cost, size, and power consumption [Teu98]. Moreover, the development of such an embedded system should favorably be characterized by attributes such as fast design turn-around, design flexibility, and reliability. Currently, system implementations are based on advanced communications platforms which employ the latest semiconductor technologies and components integrated into a system-level application-specific integrated circuit that is more commonly referred to as a system-on-a-chip (SOC) [Cha99]. This kind of chip is a high-performance multiprocessor system which incorporates various types of hardware cores: programmable processors, application-specific integrated circuit (ASIC) blocks, on-chip memories, peripherals, analog components, and various interface circuits.

2.3.1 System Design Flow

Embedded system design for wireless terminals is strongly influenced by system-level considerations. At system level, primary influences include the wireless operating environment, receiver mobility, applications, and constraints on system cost, size, power consumption, flexibility, and design time [Knu99]. In [Cam96], an embedded system is defined as a real-time system performing a dedicated, fixed function or functions where the correctness of the design is crucial. Specification and design of these systems consists of describing a system's desired functionality and mapping that functionality for implementation by a set of system components [Gaj95]. As illustrated in Fig. 4, there are five main design tasks in embedded system design: specification capture, design-space exploration, specification refinement, hardware and software design, and physical design.

During specification capture the primary objectives are to specify and identify the necessary system functionality and to eventually generate an executable system model. Using simulations, this model is used to verify correct operation of the desired system functionality. In addition to standard programming languages, such as C, widely adopted tools for modeling DSP algorithms are graphical block diagram-based dataflow simulation environments [Buc91, Joe94, Bar91] and text-based technical computing environments [Mol88, Cha87]. These tools are often accompanied by extensive pre-designed model libraries and they provide functions for data analysis and visualization. Using these tools, the behavior of an entire system can be modeled and simulated. For example, it is possible to describe a digital transmitter-receiver chain and test it by using a realistic model of

10

2. WirelessCommunicationsSystemDesign

SpecificationCapture
ModelCreation DescriptionGeneration

FunctionalSpecification
ExecutableModel FunctionalSimulation

Design-SpaceExploration
Transformation Allocation Partitioning Estimation

SpecificationRefinement
Memories Interfaces Arbitration Generation

ValidationVerification Simulationand Cosimulation

System-LevelDescription
MCU DSP ASIC Memory Peripherals

HardwareandSoftwareDesign
SoftwareSynthesis High-LevelSynthesis LogicSynthesis

RT-LevelDescription
C/C++Code RTLCode Memory-MappedAddressSpace

PhysicalDesign
CodeCompilation/AssemblyCoding Placement,Routing,Timing

Task: Product:

PhysicalDescription
(tomanufacturingandtesting)

Figure4. System-leveldesignprocessofembeddedsystems. Adaptedfrom[Gaj95].

the radio transmission channel. In addition, most dataow simulation environments allow heterogenous simulations with implementation-level hardware descriptions [P6]. In design-space exploration the modeled functionality is transformed and partitioned into a number of target architectures, or platforms, that contain different sets of allocated system components, such as programmable processors, ASICs, and memory. Using estimation, the objective is to nd a feasible architecture that meets the criteria for real-time operation, performance, cost, and power consumption. A software function is estimated in terms of program code size and worst-case run-time for a function, i.e. the number of processor clock cycles. For a given processor, software power consumption can be approximated if a reliable metric, such as mW/MHz, is provided for active and idle modes by the processor vendor. In contrast, an ASIC-based function is estimated with respect to the number of logic gates or transistors, die area, and power consumption. For CMOS technologies, power consumption of digital hardware circuits depends primarily on the internal activity factor, operating voltage, and operating frequency [Cha95]. However, power consumption in ASIC cores is highly dependent on the internal ne-structure and thus it is relatively hard to estimate. In practice, comparing an implementation of a function realized as an ASIC core and a program executed on a programmable processor can be very difcult and laborious if very accurate estimates are needed.
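The mW/MHz-style software power estimate described above reduces to a short calculation. The sketch below models it in Python; the active/idle figures and the 60% load are made-up illustration values, not data for any particular processor.

```python
def estimate_power_mw(active_mw_per_mhz, idle_mw_per_mhz, clock_mhz, load):
    """Estimate average software power from vendor mW/MHz metrics.

    load is the fraction of the cycle budget spent in active mode;
    the rest of the time the processor is assumed to sit in idle mode.
    """
    active = active_mw_per_mhz * clock_mhz * load
    idle = idle_mw_per_mhz * clock_mhz * (1.0 - load)
    return active + idle

# Hypothetical core: 0.5 mW/MHz active, 0.1 mW/MHz idle, 100 MHz clock,
# with the real-time tasks consuming 60% of the cycle budget.
print(estimate_power_mw(0.5, 0.1, 100.0, 0.6))  # 34.0 mW
```

Such a first-order estimate is only as reliable as the vendor metric behind it; as noted above, ASIC-side power is considerably harder to approximate this way.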


After design-space exploration a suitable target architecture has been formed. In specification refinement a more detailed description of the system architecture is created by specifying bus structures and arbitration, system timing, and interfaces between cores and off-chip elements. This system-level description contains some implementation details, but the functionality is mainly composed of behavioral models. In hardware/software co-simulation, verification is carried out by combining hardware description language (HDL) and instruction-set simulators to permit co-simulation of a complete system. Due to the use of HDL simulators, simulation speed can become a bottleneck in the verification of complex systems. Recently, simulation environments employing C/C++ language-based models have been reported to accelerate co-simulation by a factor of three [Sem00]. In addition to co-simulation, C/C++ models may soon provide a path to implementation with hardware synthesis [Gho99, DM99].

Hardware and software design is a concurrent task that involves description of both hardware and software components by separate design teams. This task is carried out as hardware/software co-design where the correct interaction of implementations is verified using co-simulation. For software, target components are programmable processors, such as embedded RISC and DSP processors [Hen90, Lap96]. Software is tested, profiled, and debugged by executing program code in processor models that emulate the operation of a real processor. With respect to the simulation accuracy and speed, various processor models can be utilized [Cha96]. Currently, typical processor models are instruction-set simulators that allow cycle-accurate simulation of an entire processor architecture at a speed of 0.1-0.3 million instruction cycles per second [P2]. Furthermore, when a physical prototype of a processor is available, it is possible to perform software emulation in real-time using an evaluation board, such as the one reported in [P5]. Hardware design is based on modeling the desired functions at register-transfer level (RTL) by using standard languages, such as VHDL and Verilog [IEE87, Tho91]. With the aid of logic synthesis tools, these RTL descriptions are transformed into gate-level netlists that essentially capture the fine-structure of an ASIC. As opposed to ASICs and programmable processors, an increasingly popular approach to improve flexibility and performance is application-specific instruction-set processors (ASIPs). These tailored processors execute specialized functions with a customized set of resources and relatively small program kernels [Nur94, Lie94, Goo95].

In physical design a transistor-level chip layout is generated. System components are placed and wired using automatic tools according to a chip floorplan. In order to create the physical layout of a synthesized ASIC core, placement and routing of standard library cells is required [Smi97]. For programmable processors, executable program code is compiled from high-level language and assembly source codes.


Figure 5. Example of an integrated DECT communications platform. System is based on three programmable processors: an embedded RISC processor, a DSP processor and an ASIP for ADPCM vocoding and echo cancellation [Cha99]. EMC: external memory controller, EBM: external bus master, IF: interface.

2.3.2 Processor Core-Based Design

Earlier single-chip systems have preferred implementations based mainly on ASIC cores which, due to a tailored architecture, have a potential for smaller power consumption, smaller die area, and especially better performance. However, the rapid advances in CMOS technologies have enabled development of large, complex systems on a chip by exploiting reusable programmable processor cores which are now characterized by low power consumption due to voltage scaling, high-performance hardware circuitry, and a diminishing die area when compared to the size of the on-chip memories. For a system developer, these pre-designed, pre-verified cores provide an attractive means for importing advanced technology into a system. Most importantly, processor core use shortens the time to market for new system designs and allows straightforward product differentiation through programmability. As an example, Fig. 5 depicts an integration platform for Digital Enhanced Cordless Telecommunications (DECT) applications [Cha99, ETS92]. The system is based on three buses and contains a total of three programmable processors, various memory blocks, and a variety of digital interfaces and data converters.

Typically embedded processor cores are delivered either in a soft or hard form. Soft cores are processor cores delivered as synthesizable RTL HDL code and optimized synthesis scripts, and thus they can quickly be retargeted to a new semiconductor technology provided that a standard-cell library is available. Hard cores, in turn, are designed for a certain semiconductor technology and delivered as fixed transistor-level layouts, typically in the GDSII format. As opposed to soft cores, hard cores generally perform better in terms of die area and power consumption. However, when core portability is of primary concern, a soft core should be preferred.

Another issue is the business model used by the processor core vendor. A licensable core is handed over to a system developer as a complete design [Lap96]. Thus the core licensee may have the potential to change the design if the core is soft. The most widely-used licensable processor cores are ARM, MIPS, PineDSPCore, and OakDSPCore [ARM95, Sch98, Be93, Ova94]. Hard cores are often foundry-captive cores because the core vendor has considerable intellectual property in an optimized transistor-level design. Therefore, in a chip floorplan, a foundry-captive core is introduced as a black box. For example, designs incorporating a DSP processor from the TMS320C54x family are explicitly manufactured by the core vendor [Lee97, Tex95].

According to system partitioning, different software functions should be mapped to appropriate processor types when possible. A coarse mapping to microcontroller units (MCU) and digital signal processing (DSP) processors can be performed by examining the properties of the system tasks. Whereas control-dominated software functions are better suited to MCUs, DSP processors are an ideal target for most computation-intensive signal processing tasks. The processing capacity of an embedded processor is specified by its internal clock frequency that effectively specifies the number of clock cycles per second that can be utilized for program execution. For functions under strict real-time constraints, the processor load should be profiled to guarantee correct behavior during active operation. Generally, this requires estimation of the worst-case run-times for real-time system tasks.
The estimation should also take into account the overhead resulting, for example, from interrupt processing, bus sharing, and memory access latencies. In this context, a metric called cycle budget is used to refer to the maximum number of clock cycles per second for a given processor. Often the term million instructions per second (MIPS) is used as a synonym for cycle budget. This loose metric is generally computed by multiplying the processor clock frequency by its instruction issue width or the number of multiply-accumulate units. Consequently, a given MIPS value assumes single-cycle, fully parallel execution of instructions at all times; thus the value generally specifies a theoretical peak performance. Therefore, more reliable metrics for processor performance are application benchmarks, such as general computing applications and certain algorithm kernels. To conclude, the increasing demand for implementation flexibility implies that functionality should be pushed towards software as much as possible while still fulfilling a given set of constraints, especially for performance and power consumption.
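As a small worked example of the cycle-budget and peak-MIPS arithmetic described above (all processor figures below are hypothetical):

```python
def peak_mips(clock_mhz, issue_width):
    """Theoretical peak MIPS: clock frequency times instruction issue width.

    Assumes single-cycle, fully parallel issue on every cycle, so this
    is an upper bound rather than a sustained figure.
    """
    return clock_mhz * issue_width

def cycle_budget(clock_hz, overhead_fraction=0.0):
    """Clock cycles per second left for application code after subtracting
    an estimated overhead (interrupts, bus sharing, memory latencies)."""
    return int(clock_hz * (1.0 - overhead_fraction))

# A hypothetical 100 MHz dual-issue DSP:
print(peak_mips(100, 2))                # 200 "MIPS", peak only
print(cycle_budget(100_000_000, 0.10))  # 90000000 usable cycles per second
```

The gap between the two functions mirrors the caveat in the text: the first is a marketing-style peak number, while the second is the quantity that worst-case run-time estimates must actually fit into.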

3. PROGRAMMABLE PROCESSOR ARCHITECTURES

This chapter covers various classifications which can be used to differentiate programmable processors. The chapter presents a comprehensive description of the primary characteristics found in modern instruction-set architectures and discusses a number of techniques which are applied to programmable processors to enhance their instruction throughput and computational performance.

3.1 Instruction-Set Architectures

An instruction-set architecture (ISA) can be viewed as a set of resources provided by a processor architecture that are available to the programmer or a compiler designer [Heu97]. These resources are defined in terms of memory organization, size and type of register sets, and the way instructions access their operands both from registers and memory.

In the early phases of processor evolution, designers began to develop instruction sets so that the processor directly supported many complex constructs found in high-level languages. This approach led to very complex instruction sets. Often execution of an instruction was a long sequence of operations carried out sequentially in a processor that had very restricted hardware resources. An execution sequence was essentially stored as a set of microcodes that correspond to low-level control programs. In retrospect, these types of processors are today referred to as complex instruction-set computer (CISC) machines. CISC-type processors are typically characterized by long and variable-length instruction words, a wide range of addressing modes, one arithmetic-logic unit, and a single main memory that is used to store both program code and data.

Due to the very complex control flow, the performance of CISC machines was very difficult to improve. It was shown that by decomposing one complex instruction into a number of simple instructions and by allowing parallel execution of these instructions, the performance could be improved significantly. Moreover, data memory accesses use distinct register loads and stores, and data operations have only register operands. These are the fundamental concepts of the reduced instruction-set computer (RISC) design philosophy. Other key characteristics of a RISC machine are: fixed-length 32-bit instruction word, large general-purpose register files, simplified addressing modes, pipelining, and program code generation with sophisticated software compilers [Bet97, Heu97].


Figure 6. Processor memory architectures: a) von Neumann architecture, b) basic Harvard architecture, and c) modified Harvard architecture. [Lee88].

3.1.1 Memory Organization

All programmable processor architectures require memory for two main purposes: to store data values and instruction words constituting executable programs. In this context, different memory organizations are categorized into three types of architectures: von Neumann, basic Harvard, and modified Harvard. The configuration of these memory architectures is illustrated in Fig. 6.

In the past, a single memory was employed for both data and programs. This architecture is known as von Neumann architecture. However, the memory architecture poses a bottleneck in memory accesses since an instruction fetch requires a separate access and thus always blocks a potential data memory access. Consequently, the evident bottleneck was circumvented with the Harvard architecture that holds separate memories for both program and data. In the basic Harvard architecture, program and data memory accesses can be made simultaneously and thus program execution does not hinder data memory transfers. This architecture is currently found in virtually all high-performance microprocessors in the form of separate cache memories for instructions and data [Hen90].

However, the modified Harvard architecture is the dominant memory architecture employed in DSP processors. The memory architecture incorporates two data memories to permit simultaneous fetch of two operands. In addition, a number of variations have been reported in DSP processor systems. For example, using a special DSP instruction in a single instruction repeat loop, a third operand can be fetched from the program memory, thus effectively fetching a total of three input operands at a time [Tex97a]. Memory architectures supporting four parallel data memory transfers have been reported in [Suc98]. Moreover, some recent DSP processor architectures incorporate a supplementary program memory which contains wide microcodes to realize highly parallel instructions without enlarging the width of the native instruction word [Kie98, Suc98].


3.1.2 Operand Location

With respect to locations of source and destination operands, processors can be divided into two classes: load-store and memory-register architectures [Hen90, Goo97].

Load-store architecture (alternatively register-register architecture) performs data operations using processor registers as source and destination operands, and data memory transfers are carried out with separate register load and store instructions. This architecture is one of the key concepts in the RISC processor architectures, but it is also common in DSP processors as the source operand loads during DSP operations are often executed in parallel with arithmetic operations.

In memory-register architecture (alternatively memory-memory), input operands are fetched from the memory, a data operation is executed, and then the result is written back either to a memory location or a destination register. In contrast with the load-store architecture, the processor pipeline has to contain an additional stage for reading source operands. Moreover, another stage is needed for memory write access if a data memory location can act as a destination operand. Memory-register architecture can cause a resource conflict in a pipelined processor. Such a conflict occurs if a location in a memory bank should be written when, at the same time, the same memory bank should be accessed to read an operand. The conflict can be circumvented using pipeline interlocking in which the write operation is carried out normally but the execution of the operand fetch is delayed.

3.1.3 Memory Addressing

To access an operand residing in data memory, the processor must first generate an address which is then issued to the memory subsystem. The generated address is referred to as an effective address [Heu97]. In programs, effective addresses can be obtained in various ways. The addressing modes found in most processors are the following: immediate, direct, indirect, register direct, register indirect, indexed, and PC-relative addressing. Common addressing modes are illustrated in Fig. 7.

In immediate addressing the instruction contains a constant value that will be an operand when the instruction is executed. Thus a data memory access may not be required at all since the operand is embedded into the instruction word. Due to the restricted length of the instruction word, the constant values may sometimes be selected only from a restricted number range. Moreover, the instruction word may hold a constant memory address which refers to the operand or to another memory location that contains the actual operand. These two modes are called direct addressing and indirect addressing, respectively. However, the most commonly found modes in processors are register direct addressing and register indirect addressing that employ a register that either contains the operand or its effective address. In indexed addressing (alternatively offset or displacement addressing) the effective address is formed by adding a small constant to the value stored in a register. In PC-relative addressing the explicit register utilized in the address calculation is the program counter (PC). PC-relative addressing is particularly well-suited for relocatable program modules in which the program and data sections can be placed in any memory location and accessed with valid effective addresses.

Figure 7. Common addressing modes: a) immediate, b) direct, c) indirect, d) register direct, e) register indirect, f) indexed, and g) PC-relative addressing. A grey block represents an instruction word. [Heu97].
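The addressing modes above can be mimicked with high-level analogies. The following Python sketch models memory as an array and registers as a small table; it is purely illustrative and does not reflect any actual instruction encoding:

```python
# Toy model of effective-address generation for the modes of Fig. 7.
# mem models data memory, regs a register file, pc the program counter.
mem = [0] * 32
regs = {"r0": 7, "r1": 20}   # r1 holds an address, r0 a plain value
pc = 100

mem[20] = 55                 # operand reached through address 20
mem[21] = 20                 # a memory cell that itself holds an address
mem[23] = 99                 # operand for indexed access (base 20 + offset 3)

immediate = 42                       # a) operand embedded in the instruction
direct = mem[20]                     # b) instruction holds the operand address
indirect = mem[mem[21]]              # c) instruction holds address of the address
register_direct = regs["r0"]         # d) operand sits in a register
register_indirect = mem[regs["r1"]]  # e) register holds the effective address
indexed = mem[regs["r1"] + 3]        # f) base register plus a small constant
pc_relative_addr = pc + (-4)         # g) effective address relative to the PC
```

Each assignment corresponds to one panel of Fig. 7; on a real processor the "table lookups" are register reads and the "array accesses" are data memory cycles.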

3.1.4 Number Systems

In digital computers a numeric value is represented with a data word composed of a specified number of binary digits, or bits. Therefore, due to the finite word length, all computer arithmetic is implemented as operations with a finite accuracy. Generally, the number systems found in programmable processors can be divided into two classes: fixed-point and floating-point numbers [Hwa79].

In fixed-point numbers the binary point (alternatively radix point) is in a specific position of a data word. Although there are several ways to represent signed binary numbers, only the two's complement format is considered in this context. This format is clearly the dominant one of the fixed-point number representations because the arithmetic operations are simple to realize in hardware. Two commonly used numbers are integer and fractional numbers. The difference between these two is that whereas integer numbers have the binary point at the extreme right, fractional numbers normally have the binary point right of the most
significant bit, i.e. the sign bit. Assuming two's complement format and a data word x of length N, a fractional number is bounded to −1 ≤ x < 1 and a signed integer number to −2^(N−1) ≤ x < 2^(N−1). In the technical literature fractional numbers are often referred to as Q15 and Q31 for 16-bit and 32-bit data words, respectively. An interesting observation from the hardware design point of view is that in practice the standard integer and fractional arithmetic operations can be implemented with the same hardware units with only minor adjustments.

Floating-point numbers are composed of a mantissa (alternatively significand) and an exponent in a single data word [Lap96]. The exponent is always an integer that defines the conceptual location of the binary point with respect to the value stored in the mantissa. The mantissa contains a signed value which is scaled by a factor specified by the exponent. In this context, an exponent base of 2 is assumed. Thus, a numerical value x of a floating-point number with a signed mantissa m and exponent e is computed with the expression x = m · 2^e. In 1985, a common framework for binary floating-point arithmetic was specified in ANSI/IEEE standard 754 [ANS85]. The standard not only specifies floating-point number formats for 32-bit and 64-bit data words but also defines a comprehensive set of rules for how operations, rounding, and exception conditions are to be performed.

The hardware required for native floating-point arithmetic is extensive. Moreover, a floating-point format typically has a data word that has at least 32 bits, which consequently results in larger data memory consumption. For these reasons most low-cost DSP processors do not implement the 754 standard for the sake of reduced hardware cost. Instead, most fixed-point DSP processors provide support for proprietary floating-point arithmetic by incorporating additional hardware and special instructions for normalization and derive-exponent operations [Lap96].

Block floating-point numbers are an important alternative for a fixed-point processor in gaining some of the increased dynamic range without the hardware overhead associated with floating-point arithmetic [Wil63]. In this scheme a single exponent is utilized for an array of fixed-point values. This format lends itself particularly well to block-based signal processing that is found in applications such as digital filtering [Sri88, Kal96] and fast transforms [Eri92, Bid95].
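The Q15 format and the block-exponent idea can be made concrete with a small software model. The following sketch (a simulation only; a real DSP performs these steps in hardware, and the saturation rule shown is one common convention) multiplies two Q15 fractions and derives a shared block exponent:

```python
Q15_MIN, Q15_MAX = -0x8000, 0x7FFF   # 16-bit two's complement range

def q15_mul(a, b):
    """Multiply two Q15 fractions: (a*b) >> 15, saturated to 16 bits.

    In Q15, 0x4000 is 0.5 and 0x7FFF is ~0.99997; the only overflowing
    product, (-1) * (-1), saturates to the largest positive value.
    """
    p = (a * b) >> 15            # Python's >> floors, like a hardware shift
    return max(Q15_MIN, min(Q15_MAX, p))

def block_exponent(values):
    """Shared exponent for block floating-point: the number of redundant
    sign bits in the largest-magnitude value, i.e. how far the whole
    block can be left-shifted without overflowing 16 bits."""
    m = max(abs(v) for v in values)
    e = 0
    while m < 0x4000 and e < 15:  # shift until the magnitude MSB is used
        m <<= 1
        e += 1
    return e

print(q15_mul(0x4000, 0x4000))   # 0.5 * 0.5 = 0.25 -> 0x2000 (8192)
print(q15_mul(Q15_MIN, Q15_MIN)) # -1 * -1 saturates to 0x7FFF (32767)
print(block_exponent([0x0100, -0x0200, 0x0080]))  # 5 redundant sign bits
```

Scaling every value in the block left by the returned exponent, and remembering that exponent alongside the array, is exactly the normalization that the derive-exponent instructions mentioned above accelerate.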

3.2 Enhancing Processor Performance


3.2.1 Pipelining

In the context of processor operation, pipelining is a hardware implementation technique whereby execution of multiple instructions overlaps in time. The steps, or operations, required to execute an instruction are carried out in discrete steps in the processor pipeline. These steps are referred to as pipeline stages. Operations during the pipeline stages are
separated using pipeline registers. An instruction cycle is defined as the period of time that is used to shift an instruction to the next pipeline stage. This can be one or more processor clock cycles. Pipelining significantly improves instruction throughput since ideally a program is executed in such a manner that one instruction is completed on every clock cycle. Thus increased instruction throughput translates into higher performance. This basic form of pipelined processor which sequentially issues one instruction per clock cycle is called a scalar processor (alternatively single-issue processor).

To the programmer the processor pipeline can be either visible or hidden. A visible pipeline relies on the programmer's knowledge that, for certain instructions, the result may not yet be available for the next instruction. In a hidden pipeline, the processor itself takes care of these situations. However, due to data and control dependencies between instructions and limited processor resources, the performance is often slightly degraded. Still, with careful design of the processor ISA the instruction throughput can be made very close to the ideal operation, i.e. a single clock cycle per instruction. In order to avoid various pipeline hazards, the pipelined operation often requires sophisticated hardware structures for pipeline interlocking and forwarding (alternatively bypassing) of the computed results. Detailed treatment of this broad subject is beyond the scope of this thesis, but excellent coverage can be found in [Hen90].
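The throughput benefit of pipelining can be quantified with simple cycle counting: an ideal k-stage pipeline completes n instructions in n + k − 1 cycles rather than n·k. A minimal sketch of this arithmetic:

```python
def cycles_unpipelined(n_instr, n_stages):
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instr * n_stages

def cycles_pipelined(n_instr, n_stages):
    """Ideal pipeline: n_stages - 1 cycles to fill the pipeline, after
    which one instruction completes on every cycle (no hazards)."""
    return n_instr + n_stages - 1

# 1000 instructions on an ideal 5-stage pipeline:
print(cycles_unpipelined(1000, 5))   # 5000 cycles
print(cycles_pipelined(1000, 5))     # 1004 cycles, ~1 cycle per instruction
```

The ideal count ignores the stalls caused by the data and control dependencies discussed above, which is why real pipelines only approach, but do not reach, one cycle per instruction.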

3.2.2 Instruction-Level Parallelism

Another architectural approach to increasing performance in terms of the number of instructions executed simultaneously is to further increase instruction-level parallelism (ILP). Whereas pipelining of a scalar processor decomposes instruction execution into several stages, the multiple-issue ILP method extends each of the pipeline stages so that several instructions can be simultaneously executed during a pipeline stage. This, however, requires addition of multiple functional units to the processor. Machines employing such ILP are referred to as multiple-issue processors. With respect to the execution of instruction words, multiple-issue processors can be divided into two main classes: superscalar and very long instruction word (VLIW) processors. Instruction issue mechanisms are illustrated in Fig. 8.

Figure 8. Illustration of instruction issue mechanisms in processors: a) scalar non-pipelined, b) scalar pipelined, c) superscalar pipelined, and d) VLIW pipelined. White blocks represent unused issue slots or no-operation fields for superscalar and VLIW processors, respectively.

Superscalar processors fetch multiple instruction words at a time and selectively issue a variable number of instruction words on the same instruction cycle [Joh91]. Fetched instructions are stored in an instruction queue from which the program control selects a group of instructions, or an instruction packet, to be issued. Instruction scheduling refers to the way the instructions are selected from the instruction queue. In static scheduling, the instructions are selected from the beginning of the queue. In contrast, dynamic scheduling allows the instructions to be issued out of order. Thus dynamic scheduling is more commonly called out-of-order instruction execution. A superscalar processor always contains special hardware that selects which of the currently fetched instructions can be grouped together and then issued. The main drawback of superscalar operation is that this hardware can be very expensive in terms of silicon area. The superscalar approach is currently employed mostly in general-purpose processors. Classical examples of superscalar architectures include high-performance RISC microprocessors, such as the PowerPC [Ken97], Alpha [Kes98], HP-PA [Kum97], and Sparc [Gre95] families, and the well-known CISC microprocessors based on the x86 ISA [Alp93, Gol99].

In contrast to the superscalar approach, VLIW processors employ significantly wider instruction words to enforce static instruction issue and scheduling. In effect, a wide instruction word is a compiler-scheduled instruction packet that has instruction fields for all the functional units in the processor. The instruction field either specifies a useful operation or the field contains a no-operation. The main advantage of the VLIW approach is reduced implementation cost. As opposed to superscalar processors, program control hardware can be made minimal because complicated instruction grouping and dispatch mechanisms are not needed. An obvious drawback of the VLIW approach is the lengthy instruction word which, in turn, results in a large program code size. However, this drawback has been circumvented to some extent by using compressed VLIW instructions. Compression translates a normal
VLIW instruction word into a variable-length word by encoding the no-operation fields in some predetermined manner. In the program execution a compressed instruction word is eventually decompressed back to the original VLIW format. DSP processors employing instruction compression have recently been reported in [Ses98, Rat98]. An alternative term for compressed VLIW instruction is variable-length execution set (VLES) [Roz99]. Interestingly, the high-performance x86 microprocessors employ a complicated decoding unit to permit multiple-issue for CISC instructions [Che98]. The decoding unit translates x86 instructions into several RISC-style primitive operations and issues them to the functional units. Recently, a novel approach to carry out this translation in software in combination with an advanced low-power VLIW architecture has been reported in [Kla00].
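One straightforward way to encode such compressed instructions is a presence mask that records which fields hold real operations, so the no-operation fields need not be stored. The following sketch is an invented toy format used only to illustrate the idea, not the encoding of any reported processor:

```python
NOP = 0  # a hypothetical all-zero no-operation field

def compress_vliw(fields):
    """Encode a VLIW word as (presence mask, non-NOP fields only).

    Bit i of the mask is set when slot i carries a real operation,
    so only the useful fields occupy program memory.
    """
    mask, kept = 0, []
    for i, op in enumerate(fields):
        if op != NOP:
            mask |= 1 << i
            kept.append(op)
    return mask, kept

def decompress_vliw(mask, kept, width):
    """Re-expand to the original fixed-width VLIW word at issue time."""
    fields, it = [], iter(kept)
    for i in range(width):
        fields.append(next(it) if mask & (1 << i) else NOP)
    return fields

word = [0x1A, NOP, NOP, 0x2B, NOP, 0x3C]       # 6-slot VLIW, 3 real ops
mask, kept = compress_vliw(word)
assert decompress_vliw(mask, kept, 6) == word  # round-trips losslessly
print(bin(mask), kept)                         # 0b101001 [26, 43, 60]
```

Here a six-slot word with three no-operations shrinks to a short mask plus three fields, mirroring how compressed VLIW formats recover code density while keeping the decompression hardware trivial.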

3.2.3 Data-Level Parallelism

In contrast to pipelining and multiple-issue techniques, data-level parallelism (DLP) can be employed to leverage the amount of work performed by an individual instruction. This approach is generally implemented in the form of single instruction stream, multiple data stream (SIMD) instructions. The basic idea is to simultaneously perform an arithmetic operation on a small array of data values. The wide acceptance of this approach is due to the observation that the data values found in multimedia applications can be represented with much less precision than the native data word width. For example, commonly utilized data types in digital audio and video processing are 16 and 8 bits, respectively [Kur98].

Generally SIMD instructions can be realized either by utilizing the existing arithmetic units at subword precision or by including several duplicates of the arithmetic units. The former alternative is especially well suited to general-purpose microprocessors that employ a wide data word, such as 64 bits [Lee95]. A wide data word can be packed with several lower precision data values and a wide arithmetic unit can be divided or split into smaller subunits that carry out several operations at the same time. For example, a 64-bit ALU can easily be implemented so that it can also perform either two 32-bit, four 16-bit, or eight 8-bit operations. Additionally, SIMD instructions often incorporate extra functionality into the basic operations, such as rounding and saturating arithmetic. Fig. 9 illustrates conceptual operation of SIMD instructions for calculation of a sum of 8-bit absolute differences and a dual sum of four 16×16-bit multiplications. In particular, these SIMD instructions dramatically accelerate digital video compression and decompression, such as motion estimation and IDCT operations [Kur99]. Virtually all modern microprocessors have been enhanced with a number of SIMD instructions, mainly to accelerate processing of digital audio, video, and 3D graphics.
For example, the x86 ISA was first enhanced with multimedia extensions that perform packed integer arithmetic [Bar96, Pel96]. Later, primarily to accelerate 3D-geometry processing, SIMD-style instructions were added that allow two parallel single-precision floating-point operations to be computed [Obe99]. SIMD enhancements for the PowerPC, Sparc, and MIPS RISC processors have been reported in [Ful98, Tre96, Kut99]. It should be noted that DSP processors often realize SIMD instructions by duplication of arithmetic units because the length of the native data word is typically only 16 bits.

Figure 9. Examples of special SIMD instructions realized using split arithmetic units at subword precision: a) a sum of eight absolute differences [Tre96] and b) a dual sum of four multiplication operations [Bar96]. Abs: absolute value operator.
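The sum-of-absolute-differences operation of Fig. 9a can be modeled in software to show what the subword unit computes in a single instruction. The sketch below packs eight 8-bit values into a 64-bit word and processes the byte lanes independently (an illustrative model, not real SIMD intrinsics):

```python
def pack8(bytes8):
    """Pack eight 8-bit values (first value in the top byte) into a 64-bit word."""
    word = 0
    for b in bytes8:
        word = (word << 8) | (b & 0xFF)
    return word

def simd_sad(word_a, word_b):
    """Sum of eight 8-bit absolute differences, as one 'instruction'.

    Models the split arithmetic unit: each byte lane is processed
    independently and the eight results are summed into one register.
    """
    total = 0
    for lane in range(8):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        total += abs(a - b)
    return total

a = pack8([10, 20, 30, 40, 50, 60, 70, 80])
b = pack8([12, 18, 30, 44, 50, 55, 70, 90])
print(simd_sad(a, b))   # 2+2+0+4+0+5+0+10 = 23
```

In motion estimation this operation is evaluated over entire pixel blocks, which is why collapsing eight subtractions, absolute values, and an eight-way sum into one instruction pays off so dramatically.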

3.2.4 Task-Level Parallelism

Traditional uniprocessor computer systems are constructed around a single main central processing unit (CPU). With the aid of an operating system kernel, a processor runs multiple program threads by switching execution between active and idle processes. Thus at any given instant only one program thread is executed. In order to raise task-level parallelism (TLP) in a computer system, two main alternatives have been proposed: simultaneous multithreading (SMT) and chip-multiprocessors (CMP) [Tul95, Olu96]. SMT is primarily intended to enhance the performance of wide-issue superscalar processors. Whereas control and data dependencies in a single-threaded processor typically restrict the level of ILP extracted from a thread, a processor employing SMT is capable of filling unused issue slots with instructions from the other program threads. CMPs, however, use relatively simple single-threaded processor cores while executing multiple threads in parallel across multiple on-chip processor cores. These multiprocessor computer systems divide an application into multiple program threads, each of which is executed in a separate


Figure 10. Block diagram of an integrated cellular baseband processor architecture. The system integrates a RISC microcontroller unit (MCU) and a DSP processor which communicate using a shared memory block and messaging unit (MU) [Gon99]. IF: interface, SP: serial port, UART: universal asynchronous receiver/transmitter, QSPI: queued serial port interface.

processor. Thus, approaching the same paradigm from a different perspective, both the SMT and CMP systems employ a computer organization generally referred to as multiple instruction stream, multiple data stream (MIMD) [Hwa85]. From a purely architectural point of view, the SMT processor's flexibility makes it an attractive choice. However, the scheduling hardware to support SMT is rather complicated and, even more importantly, the impact on the processor implementation cost is significant. For these reasons, CMP is much more promising because it can employ already existing processor cores in combination with the increasing IC capacity [Ham97]. In the past, multiprocessor systems have been utilized solely for supercomputing applications, mainly due to the ultra-high implementation cost. Almost 30 years after the invention of the microprocessor, the advances in IC technology permit integration of several programmable processors and memory on a single silicon die [Bet97]. In the early 1990s the first applications to adopt this approach were embedded DSP systems. For example, multiprocessor platforms realizing video teleconferencing and a wireline modem have been described in [Sch91, Gut92, Reg94]. However, the breakthrough of this technology to the consumer market was not feasible until such platforms could be manufactured in high volume at a reasonable cost. The first commercially successful designs exploiting the CMP approach were digital cellular phones where two programmable processors, a microprocessor and a DSP processor, were integrated on a single silicon die [Gat00, Bru98, Bog96]. Such a system architecture is depicted in Fig. 10.
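The CMP-style division of an application into program threads can be illustrated with a small POSIX-threads sketch. This is not from the thesis; the names and the even work split are illustrative assumptions. Two threads each sum half of an array, as if scheduled onto separate on-chip cores:

```c
#include <pthread.h>

#define N 1000

static int data[N];
static long partial[2];

/* Each thread sums its half of the array, modeling one program thread
 * running on its own processor core of a chip-multiprocessor. */
static void *worker(void *arg)
{
    long id = (long)arg;
    long sum = 0;
    for (int i = (int)(id * (N / 2)); i < (int)((id + 1) * (N / 2)); i++)
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

long parallel_sum(void)
{
    pthread_t t[2];
    for (int i = 0; i < N; i++)
        data[i] = 1;
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    return partial[0] + partial[1];
}
```

On an SMT machine the same two threads would instead share the issue slots of one wide-issue core; the programming model is identical, only the hardware mapping differs.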

4. PROGRAMMABLE DSP PROCESSORS

Programmable DSP processors are streamlined microcomputers designed particularly for real-time number crunching. In addition to the sophisticated techniques described in the previous chapter, DSP processors embody advanced features that push the level of parallelism even further. This is made possible by exploiting the inherent fine-grain parallelism found in the fundamental algorithms, functions, behaviors, and data operations in the field of digital signal processing. In this chapter a detailed overview of DSP processor architectures is given. The chapter concentrates on the processor cores themselves, i.e. peripherals are not considered in this context. Moreover, to make the scope of the chapter slightly narrower, the investigation is limited to fixed-point DSP processor cores that do not have native hardware support for floating-point arithmetic operations.

4.1 Historical Perspective

The first processors that were designed particularly for digital signal processing tasks emerged in the early 1980s [Lee90a]. It is arguable, however, which processor constitutes the first DSP processor. The candidates are AMI S2811, AT&T Bell Laboratories DSP1, and NEC µPD7720 [Nic78, Bod81, Nis81]. The instruction cycle times for the S2811, DSP1, and µPD7720 processors were 300, 800, and 250 ns, respectively. All these processors had a hardware multiplier and some internal memory, thus permitting development of stand-alone embedded system implementations. Although the 12-bit S2811 was announced in 1978, working devices were not available until late 1982 due to problems in fabrication technology. In 1979, the 16/20-bit DSP1 processor became available, but it was only employed for in-house designs at AT&T. The 16-bit µPD7720 was released in 1980 and was one of the most heavily used devices among the early DSP processors. To summarize, depending on how one prioritizes an announcement of a new processor, a functional chip, and public commercial availability, the choice for the first DSP processor can be justified in different ways. Other noteworthy processors to follow were Texas Instruments TMS32010 [Mag82] and Hitachi HSP HD6180 [Hag82], both released in 1982. The TMS32010 processor was the first member of what was to become the most widely used family of DSP processors. The HSP


was the first DSP processor fabricated in a CMOS technology and it also was the first to support a floating-point number format with a 12-bit mantissa and 4-bit exponent. Today, twenty years after the first successful architectures, programmable DSP processors have evolved into highly specialized microcomputers which can efficiently perform massive amounts of computing.

4.2 Fundamentals

The primary function provided by a DSP processor is its ability to provide execution of a multiply-accumulate (MAC) operation in one instruction cycle. Fundamentally, the MAC operation performs a multiplication of two source operands and adds this product to the results that have been calculated earlier. From the program execution point of view, the MAC operation can be decomposed into several parallel operations: multiplication of two operands, accumulation (addition or subtraction) with previously calculated products, fetching of the next two source operands, and post-modification of the data memory addresses. Thus, the MAC operation exhibits a high level of inherent parallelism that is exploited in pipelined DSP processors. Another speciality found in fixed-point DSP processors are the measures which are utilized to combat loss of precision in arithmetic operations constrained by fixed-point numbers with a finite word length. When two fixed-point numbers are multiplied, the word length of the full-precision product is equal to the sum of the number of bits in the operands [Lee88]. Therefore, discarding any of these bits introduces error in the computation, i.e. loss of precision. For this reason fixed-point DSP processors perform multiplications at full precision [Lap96]. In the MAC operation, intermediate results are stored in an accumulator which, in order to prevent undesirable overflow situations, provides additional guard bits for preservation of the accuracy. This permits an accumulator register with n guard bits to perform accumulation of 2^n values with the confidence that an overflow will not occur. Furthermore, the accumulation operation incorporates special saturation arithmetic which, in operation, forces the result to the maximum positive or negative value in imminent overflow situations [Rab72]. At some point it is necessary to reduce the precision of results, typically to fit into the native data word. In truncation the least significant bits of the full-precision result are simply discarded. In effect this rounds signed two's complement numbers down towards minus infinity. A truncated value is always smaller than or equal to the original and thus truncation adds a bias to the results [Cla76]. In order to avoid this bias, many DSP processors provide advanced rounding schemes, such as round-to-nearest and convergent rounding [Lap96]. Furthermore, some algorithms require fixed-point multiplications and ALU operations to be performed at a higher precision than that dictated by the native data word length. For this


          LDC  #63,d0
          LOOP d0,loop_end
          XOR  c,c,c    ; LDX (i0)+1,a0 ; LDY (i2)+1,b0
loop_end: MAC  a0,b0,c  ; LDX (i0)+1,a0 ; LDY (i2)+1,b0
          ADD  c,p,c
Figure 11. Assembly source code which implements a 64-tap FIR filter. Each row corresponds to an instruction word. LDC: load constant, XOR: logical exclusive-or, LOOP: initialize hardware loop, NOP: no-operation, MAC: multiply-accumulate, LDX/LDY: load from X/Y data memory [VS97].

reason it has become imperative that the datapath supports extended-precision operations, such as a 32×32-bit multiplication, which results in a 64-bit full-precision result. In order to support these operations, it is required that 16×16-bit multiplications can be computed for a mixture of signed and unsigned operands, i.e. they can be in both signed two's complement and unsigned binary formats. In DSP algorithms it is quite common that long sequences of similar operations are executed frequently. These sequences are most conveniently programmed as a software loop that, for a known number of iterations, requires both decrementing and testing of the loop count and a conditional branch to the beginning of the loop. Obviously this adds very undesirable overhead to the looping since on each iteration several instruction cycles are spent in the manipulation of the loop count and the branching penalty resulting from the pipelining. For these reasons DSP processors include special functionality in the form of zero-overhead hardware looping. This hardware is an independent functional unit which, by decrementing and testing a loop count register, can force a fetch from a loop start address when necessary. The hardware looping unit operates in parallel with the normal program execution and thus the looping operation adds no overhead once a hardware loop has been initialized. The peak achievable ILP can be extremely high in DSP processors, which is illustrated with a piece of assembly source code shown in Fig. 11. In the example a hardware loop is initialized and a stream of consecutive MAC operations is executed. The loop body is composed of a single instruction word which contains a MAC operation and associated data transfers. For the DSP processor in the example, the loop instruction has one delay slot which is exploited for clearing the accumulator and loading the operands for the first multiplication.
Conceptually, the processor performs a total of eight RISC-type instructions on every instruction cycle: multiplication, accumulation, two data moves, two address modications, decrement-and-test, and branching. Therefore, the apparent number of operations per clock cycle in this loop is quite impressive even when compared with high-end microprocessor architectures.
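The FIR loop of Fig. 11 can be modeled in C to make the fixed-point details explicit. The sketch below is illustrative, not taken from the thesis: each loop iteration corresponds to one instruction word, performing a 16×16-bit multiplication at full 32-bit precision, accumulation into a 40-bit accumulator whose guard bits absorb intermediate overflow, and finally a saturated reduction of the result:

```c
#include <stdint.h>

/* C model of the 64-tap FIR inner loop: full-precision products are
 * accumulated in a 40-bit accumulator (8 guard bits over the 32-bit
 * product width), then saturated when the result is extracted. */
int32_t fir64(const int16_t x[64], const int16_t h[64])
{
    int64_t acc = 0;                  /* XOR c,c,c clears the accumulator */
    for (int i = 0; i < 64; i++)      /* hardware loop, 64 iterations     */
        acc += (int32_t)x[i] * h[i];  /* MAC with post-incremented loads  */

    /* Saturating extraction: clamp the guarded value to 32 bits. */
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}
```

With 8 guard bits the accumulator can absorb 2^8 worst-case products before overflowing; since this loop has only 64 taps, the accumulation itself can never overflow and only the final extraction may saturate.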


In general, DSP processors employ the modified Harvard architecture with two data memories and a separate memory for program code. This memory architecture allows three independent memory accesses to be performed simultaneously and thus both an arithmetic operation using two input operands and an instruction fetch can be performed in a single instruction cycle. Since instruction words are fetched with a separate memory bus, the program execution does not block any data memory accesses. The development of a DSP processor architecture requires careful balancing of several conflicting issues involving processor implementation cost, performance, ease of programming, and power consumption. One of the most important issues in DSP processor design is the format and length of an instruction word. An instruction word explicitly specifies an operation or, more often, a set of operations which eventually is carried out in the stages of a processor pipeline. With respect to the size of the instruction word, three approaches can be taken: the processor can employ a fixed-length, dual-length, or variable-length instruction word. A fixed-length, RISC-style instruction word generally simplifies program address generation and instruction decoding, but the program code density is relatively poor. As opposed to variable-length, a dual-length instruction word based on two alternate instruction formats offers a reasonable trade-off between increased complexity and program code density. With respect to the program execution and general processor structure, DSP processors can be divided into two main categories: conventional DSP processor and VLIW DSP processor architectures [Eyr00]. The main features and differences of these classes are studied in the following subchapters.

4.3 Conventional DSP Processors

In high-volume embedded system products, the dominant DSP processors are characterized by attributes such as relatively high performance, small die area, low power consumption, and instruction-set specialization. These conventional DSP processors are cost-efficient processing engines for signal processing tasks commonly found in battery-powered consumer products, such as mobile phones, digital cameras, and solid-state audio players. In addition, conventional DSP processors are heavily utilized in computer peripherals, automotive electronics, and instrumentation. A conventional DSP processor architecture is based on pipelined scalar program execution. In these processors the distinction between an instruction, an instruction word, and an operation is rather obscure. The instructions are encoded either as fixed-length or dual-length instruction words. As opposed to the one-instruction one-operation RISC philosophy, conventional DSP processors employ complex compound instructions which specify a group


of parallel operations. As an extreme example, the TMS320C54x processor has a total of 22 instructions that perform various multiplication-related operations together with parallel data memory accesses [Tex95]. Moreover, in order to encode instructions effectively, the combinations of memory addressing and operands have deliberately been limited for instructions that contain many parallel operations. Thus, from the point of view of a processor instruction set, some conventional DSP processors have constructs which resemble those found in CISC machines. The processor pipeline structure can be divided into two sections: instruction and execution pipelines. The instruction pipeline contains at least two stages for performing instruction fetch and decode functions. In some processors the instruction pipeline contains additional stages to facilitate instruction address generation or to realize pipelined access to a program memory [Tex97a]. The execution pipeline carries out the execution of the operations specified by an instruction word. This section contains one stage for DSP processors which employ a load-store architecture. However, two to four pipeline stages are needed for DSP processors that, for an arithmetic operation, permit source and destination operands to be accessed directly in the data memory. Primary computational resources in conventional DSP processors are divided into data memory addressing and datapath sections. Data memory addressing is realized with an addressing unit that, for the modified Harvard memory architecture, is composed of an address register file and address arithmetic-logic units (AALUs) which typically support various addressing modes based on the register-indirect scheme. A common address register file configuration has eight address registers and two AALUs. The datapath is composed of an arithmetic register file, a multiplier, an arithmetic-logic unit (ALU), and a selection of functional units.
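The register-indirect addressing with post-modification performed by an AALU can be sketched as follows. The function and parameter names are illustrative assumptions, not from the thesis: the address register supplies the operand address for the current cycle and is then post-modified, here with the circular-buffer wrap-around commonly used for FIR delay lines:

```c
#include <stdint.h>

/* One AALU operation: return the address used this cycle, then
 * post-modify the address register by 'step' inside the circular
 * buffer [base, base + len). */
uint16_t postmod(uint16_t *areg, int16_t step, uint16_t base, uint16_t len)
{
    uint16_t addr = *areg;
    int32_t next = (int32_t)addr + step;   /* post-modification     */
    if (next >= (int32_t)base + len)
        next -= len;                       /* wrap past the end     */
    else if (next < (int32_t)base)
        next += len;                       /* wrap before the start */
    *areg = (uint16_t)next;
    return addr;
}
```

In hardware this update runs in the AALU in parallel with the datapath, which is why the address arithmetic of `(i0)+1` in Fig. 11 costs no extra cycles.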
Among the most commonly found functional units in a conventional DSP processor are a barrel shifter for arbitrary shifting of a data value, a bit-manipulation unit, an add-compare-select-store unit for Viterbi decoding, and an exponent encoder for counting the redundant leading sign bits of a data value. Assuming that the functional units themselves are not pipelined and no wait-states result from memory accesses, the actual execution of a fixed-length instruction word is carried out in a single clock cycle. In DSP processors employing dual-length instruction words, the execution of the wider instruction word generally takes two clock cycles. Fig. 12 shows two examples of conventional DSP processors. The DSP processors are the TMS320C54x and R.E.A.L. with pipeline depths of six and three stages, respectively. The TMS320C54x, shown in Fig. 12a, contains a single MAC unit and it has a relatively deep pipeline which is necessary to realize complex instructions that have memory-register or memory-memory operands. Due to the high level of specialization, the TMS320C54x datapath is very complicated. The specialization has been realized by incorporating versatile interconnections between the functional units and by adding application-specific

Figure 12. Examples of conventional DSP processor architectures showing the processor pipeline and datapath configuration for a) TMS320C54x and b) R.E.A.L. RD16020 processors [Lee97, Tex95, Kie98]. AALU: address arithmetic-logic unit, MUL: multiplier, ALU: arithmetic-logic unit, P: product register, EXP: exponent encoder, BSH: barrel shifter, VIT: Viterbi accelerator, DSU: division support unit.

functionality for the selected DSP algorithm kernels, such as least mean square (LMS) filtering, FIR filtering, and Viterbi decoding. However, the R.E.A.L. DSP processor incorporates a shallow processor pipeline which is mainly a result of the load-store memory architecture. As illustrated by Fig. 12b, the processor datapath incorporates a larger arithmetic register file which is connected to the various functional units. In order to improve MAC performance, the processor contains two multiplier units which receive their input operands from special input registers. Since only two 16-bit data buses are available, FIR filtering is carried out using a special technique to calculate two successive filter outputs at the same time [Ova98]. Furthermore, the processor has a special division support unit that in combination with the barrel shifter can perform true division of data values in an iterative fashion. Various architectural characteristics of conventional DSP processors are listed in Table 1. In general, the traditional processor datapaths have included one MAC unit, but recent processors are almost exclusively dual-MAC architectures, i.e. they incorporate a second MAC unit to increase computational power. This enhancement can be considered as a SIMD-style extension of processor architectures. Due to requirements derived from various DSP applications, recent processors also permit extended-precision data operations and incorporate a barrel shifter and an exponent encoder unit to support floating-point arithmetic. The depth of the pipeline in conventional DSP processors is typically three or four stages.
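The exponent encoder mentioned above has a simple software equivalent. The sketch below is illustrative and not from the thesis: it counts how many leading bits of a two's complement value merely repeat the sign, i.e. how far the value can be normalized by a left shift, which is the primitive needed for block floating-point scaling:

```c
#include <stdint.h>

/* Count the redundant leading sign bits of a 16-bit value, i.e. the
 * largest left shift that leaves the sign of the value unchanged. */
int exponent_encode(int16_t x)
{
    int sign = (x >> 15) & 1;     /* sign bit of the value            */
    int n = 0;
    while (n < 15 && (((x >> (14 - n)) & 1) == sign))
        n++;                      /* bit equals the sign: redundant    */
    return n;
}
```

A dedicated exponent encoder produces this count combinationally in one cycle, whereas the bit-serial loop here would cost up to 15 instruction cycles in software.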

Conventional         MAC    Pipeline  Instruction    Data      #Accum.     #Addr.     Speed
DSP Processor        Units  Stages    Word           Word      Registers   Registers  (MHz)  Ref.
Pine DSP Core        1      3         16             16        2x36        6          80     [Be93]
Oak DSP Core         1      4         16             16        4x36        6          80     [Ova94]
DSP56600             1      3         24             16        3x40^1      24         60     [Mot96]
uPD7701x             1      3         32             16        8x40        18         33     [Lap95]
Z893x                1      2         16/32          16        1x24        6          20     [Zil99]
KISS                 1      3         16             16        4x32^1      16         40     [Wei92]
EPICS                1      3         28-32          12-24     2x40^1,2    12         33     [Wou94]
Gepard               1      3         32             8-64      4x16^1,2    8          22     [P1, AMS98a]
VS-DSP1              1      3         32             8-64      4x32^1,2    8          49     [Nur97, Tak98]
VS-DSP2              1      3         32             8-64      4x40^1,2    8          100    [P5]
CD2455               1      3         16/32          16-24     1x32        8          50     [Lap95, Yag95]
TMS320C5x            1      4         16/32          16        1x32        8          66     [Tex98]
TMS320C54x           1      6         16/32          16        2x40        8          160    [Lee97, Tex97a]
D950-Core            1      3         16/32/48       16        2x40        17         40     [SGS95]
Lode                 2      5         32             16        4x40        16         40     [Ver96]
R.E.A.L. RD16020     2      3         16/32          16-24     4x40^1,2    8          125    [Moe97, Kie98]
DSP16210             2      3         16/32          16        8x40        15         160    [Ali98, Lap95]
TMS320C55x           2      7         8/16/32/40/48  16        4x40        9          160    [Tex00b]
Carmel               2      8         16/32          16        6x40        24         120    [Suc98]
Teak DSP Core        2      4         16/32          16        4x36        8          130    [Oha99]
Palm DSP Core        2      4         16/32          16/20/24  4x36^2      8          200    [Ova99]

^1 An accumulator can be split into two or three registers.
^2 Affected by adjustment of core parameters (value for a 16-bit data word).

Table 1. Summary of conventional DSP processor features. Processor speeds are either from the references or supplied by the processor vendors.

The listed processors include at least one level of hardware looping capability. In virtually all newer processor architectures, the instructions are encoded as 16/32-bit dual-length instruction words to achieve good program code density. The TMS320C55x processor may exhibit exceptionally high density with its variable-length instruction words. In addition, the EPICS, R.E.A.L., and Carmel processors can construct wider instruction words using an internal look-up table for extensions. An interesting aspect is that a 16-bit native data word and 40-bit accumulator registers have remained the preferred parameters even in the more recently reported DSP processor architectures. This implies that most applications can effectively be implemented with 16-bit fixed-point DSP processors which, at the cost of increased instruction cycles, can also employ higher arithmetic precision. DSP processor speed is strongly dependent on the semiconductor manufacturing technology. For conventional DSP processors, operating speeds of 150-200 MHz can be expected for implementations in 0.18 µm CMOS technologies [Eyr00].


4.4 VLIW DSP Processors

In conventional DSP processors various architectural enhancements must undergo careful analysis to find whether the added features are justified in terms of the implementation cost and increased complexity of the processor. While increasing performance and application-specific features, an enhanced processor architecture should remain backwards code compatible, which is often very difficult to realize. In addition, due to non-orthogonal ISAs, conventional DSP processors are a difficult target for software compilation using high-level languages. To address these issues, several DSP processors based on the VLIW design philosophy have emerged quite recently [Far98]. The key concepts behind VLIW DSP processors are characterized by orthogonal instruction sets, code generation with compilers, and very high performance through increased instruction-level parallelism. As opposed to conventional DSP processors, these processors provide increased performance and ease of use at the expense of higher implementation cost and power consumption. Generally, VLIW DSP processors are deployed in computationally demanding communications systems, such as cellular base stations, digital subscriber loop modems, cable modems, digital satellite receivers, and high-definition television sets. In order to simplify instruction decoding and support wide issue, old VLIW machines use a fixed-size instruction word whose length is typically between 64 and 256 bits, thus resulting in poor program code density. VLIW DSP processors, however, employ simple but efficient compression techniques to encode no-operation instructions in the unused VLIW issue slots. In effect, these compressed VLIW instructions are issue packets which are specified at program compile-time. During program execution VLIW DSP processors identify these issue packets and conceptually reconstruct the full-length VLIW instruction words.
From an architectural point of view, it is arguable whether this type of multiple-issue processor should actually be referred to as compiler-scheduled superscalar rather than VLIW. In any case, the program execution in typical VLIW DSP processors is based on pipelined execution of compressed VLIW instruction words. These instruction words are composed of a number of atomic instructions which typically have a fixed length but may also have a dual-length format [Roz99]. When compared with the operation of conventional DSP processors, the instruction pipeline realizes wide program memory fetches, identifies and decodes a set of parallel atomic instructions, and dispatches them to the execution pipeline. Often the execution pipeline consists of several stages for performing operations in a pipelined fashion. In general, VLIW DSP processors use a load-store memory architecture and avoid pipeline interlocking and forwarding by using multi-cycle no-operation instructions because this significantly reduces the complexity of the processor implementation.
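As a concrete illustration of issue-packet identification, the TMS320C62x encodes parallelism with a bit in each 32-bit atomic instruction: when the p-bit of instruction i is set, instruction i+1 of the same fetch packet executes in parallel with it. The sketch below is a simplified, illustrative model of that scheme (the function name and fixed packet size are assumptions); it splits an 8-word fetch packet into execute packets:

```c
#include <stdint.h>

/* Split an 8-instruction fetch packet into execute packets using the
 * p-bit (bit 0 of each instruction word): a set p-bit chains the next
 * instruction into the same execute packet. 'starts' receives the
 * first index of each packet; the return value is the packet count. */
int split_execute_packets(const uint32_t fetch[8], int starts[8])
{
    int n = 0;
    int i = 0;
    while (i < 8) {
        starts[n++] = i;              /* new execute packet begins here  */
        while (i < 8 && (fetch[i] & 1u))
            i++;                      /* p-bit set: next word is parallel */
        i++;                          /* p-bit clear: packet ends         */
    }
    return n;
}
```

Each execute packet found this way corresponds to one reconstructed VLIW instruction word; unused issue slots simply never appear in memory, which is where the code-density saving comes from.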

Figure 13. Examples of VLIW processor architectures showing the processor pipeline and datapath configuration for a) StarCore SC140 and b) TMS320C62x processors [Roz99, Mot99, Ses98]. AALU: address arithmetic-logic unit, BMU: bit-manipulation unit, MAC: multiply-accumulate unit, BSH: barrel shifter, EXP: exponent encoder, MUL: multiplier, ALU: arithmetic-logic unit.

Fig. 13 illustrates the operation and architecture of the StarCore and TMS320C62x VLIW DSP processors. StarCore, depicted in Fig. 13a, resembles conventional DSP processor architectures by dividing resources into memory addressing and datapath sections. The 5-stage, 6-issue processor employs three stages for the program pipeline and two execution pipeline stages for data address generation and the actual execution of load/store and arithmetic operations. The key benefit from a relatively short pipeline is the reduced penalty in instruction cycles associated with branching instructions. In StarCore the datapath has a total of four blocks, each of which contains a multiply-accumulate unit, bit-manipulation unit, and barrel shifter. The TMS320C62x processor, however, incorporates a deep 11-stage pipeline partitioned into six and five stages for the instruction and execution pipeline sections, respectively. The processor uses a unified architecture which is based on two arithmetic register files cross-connected to two identical datapath blocks. A datapath block is comprised of four independent units: a multiplier, ALU/exponent encoder, ALU/barrel shifter, and AALU. These units can be utilized as general-purpose resources for common operations, such as 32-bit addition and subtraction. Although most of the TMS320C62x instructions execute in a single instruction cycle, execution of a multiplication, load/store, and branching consume two, five, and six cycles, respectively. Whereas the StarCore processor incorporates hardware looping, the TMS320C62x does not have such capability.


VLIW                     Issue  Pipeline  16x16  VLIW   Atomic       Packed      Data   Speed
DSP Processor            Width  Stages    MACs   Width  Instr. Word  Data Types  Buses  (MHz)  Ref.
StarCore SC140           6      5         4      128    16/32/48     8/16/32     2x64   300    [Roz99, Mot99]
TMS320C62x               8      11        2      256    32           8/16/32     2x32   250    [Ses98, Tex97b]
TMS320C64x               8      11        4      256    32           8/16/32     2x64   600    [Tex00c]
TriMedia TM-1100         5      7         8      224    44           8/16/32     2x32   133    [Rat96, Phi99]
MPact2^1                 6      35^3      4^4    81     81           9/18/36     11x72  125    [Owe97, Pur98]
TigerSharc ADSP-TS001^1  4      8         8      128    32           8/16/32     2x128  150    [Fri99, Ana99]
ZSP LSI402Z^2            4      5         2      64^5   16           16/32       1x64   200    [LSI99]

^1 Floating-point DSP processor. ^2 Superscalar DSP processor. ^3 Length of 3D graphics rendering pipeline. ^4 18x18 MAC operation, value estimated from data bus width. ^5 Width of instruction cache line.

Table 2. Summary of VLIW DSP processor features. Processor speeds are either from the references or supplied by the processor vendors.

Table 2 lists the main features of a number of VLIW DSP processors. It should be noted that only the first four processors can be classified as fixed-point VLIW DSP processors. Although the other three processors are either floating-point or superscalar DSP processors, they are included for comparison purposes due to their strong fixed-point MAC performance. Typically, VLIW DSP processors can issue six or eight atomic instructions in parallel and the depth of the pipeline is at least five stages. The width of a decompressed VLIW instruction is between 128 and 256 bits. In addition to VLIW compression, the StarCore processor employs an atomic instruction word of variable length to achieve even higher code density. Interestingly, the StarCore architecture also supports extension instructions that can execute various operations in tightly-coupled instruction-set accelerators. Although a 16-bit data precision is adequate for most DSP computations, all the listed processors support packed data types and they have wide data memory buses for realizing the high bandwidth needed to transfer operands to the functional units. As an example, using two 64-bit data buses, the TMS320C64x can simultaneously read a total of eight 16-bit values to perform four 16×16-bit MAC operations in parallel. As stated earlier, processor speeds are dependent on the semiconductor manufacturing technology. For VLIW DSP processors, operating speeds of 250-300 MHz can be expected for 0.18 µm CMOS technologies [Roz99, Tex00c].

5. CUSTOMIZABLE FIXED-POINT DSP PROCESSOR CORE

This chapter presents a fixed-point DSP processor core that has been utilized in the research work covered in most of the publications. The general architecture, main features, and various implementation aspects of both hardware and software are described.

5.1 Background

The DSP processor presented in this chapter has evolved through three generations. The first processor architecture, named Gepard, was presented in [P1, P2] and [Gie97]. This initial processor architecture established the base architecture template that incorporates a customizable DSP processor core with the modified Harvard memory architecture. The second and third cores, referred to here as VS-DSP1 and VS-DSP2, employed a slightly different set of parameters and gradually added various enhancements to the processor operation, primarily by optimizing the structure of the functional units [Nur97], [P5]. The DSP processor was targeted for use as an embedded processor core in highly integrated DSP systems that are integrated into a single silicon die. The processor development aimed at designing a DSP core architecture which combines a flexible processor architecture with an efficient hardware implementation by using optimized transistor-level circuit compilers [Nur94]. The DSP processor has a customizable architecture that has native support for adjustment of a wide range of core parameters and it also allows straightforward extension of the instruction set. These customization capabilities can be exploited to characterize the processor ISA to match the exact needs of a given application [P1]. In the past, similar DSP processor architectures have been reported, for example in [Wou94, Yag95]. These processors, however, were either based on a different implementation approach or they allowed only a limited degree of customization.

5.2 Architecture

In order to support extensive processor customization, the DSP processor employs an architecture that allows changes to a specified set of core parameters. Thus, from an embedded system developer's point of view, this parameterized DSP processor can be viewed


Figure 14. Base architecture of the customizable fixed-point DSP processor. The DSP processor core is composed of three main units: Program Control Unit, Datapath, and Data Address Generators. The processor core is connected to three off-core memories with the associated address (iab, xab, yab) and data (idb, xdb, ydb) buses.

as a family of DSP processors that share a common base architecture rather than a single processor that has fixed functional characteristics and architecture. The base architecture of the DSP processor, depicted in Fig. 14, is composed of three main functional units: the Program Control Unit, the Datapath, and the Data Address Generator. The DSP processor core connects to the separate program memory, two data memories, and off-core peripheral units using three global buses, each with its associated data, address, and control bus. Core parameters available in all three implemented DSP processors are the following: data word width, multiplier operand width, number of accumulator guard bits, number of arithmetic and address registers, data and program address widths, program word width, and the depth of the hardware looping. In this text these parameters are referred to as dataword, multiplierwidth, multiplierguardbits, accumulators, indexregs, dataaddress, programaddress, programword, and loopregs.

In these DSP processor cores, program execution is based on the scalar pipelined instruction issue scheme found in conventional DSP processors. Instructions are encoded into a 32-bit instruction word. The width of the instruction word is a core parameter, programword, but it is not likely to change without very definitive grounds. Although the instruction word is relatively wide, it has at least two main benefits from the instruction-set architecture


and hardware design perspectives. Most importantly, a wide instruction word inherently permits larger fields for operations and operands, which consequently results in a highly orthogonal instruction set. This orthogonality facilitates programming in assembly language and it makes the DSP processor core a more suitable target for code generation from high-level programming languages. If necessary, an instruction word can specify an extension instruction that executes complex parallel operations in the functional units or off-core hardware units. In addition, the hardware needed for instruction decoding becomes relatively simple, thus also enabling fast circuit operation.

The only core parameter affecting the entire processor core is the width of the native data word, dataword. This is clearly the most important parameter since it simultaneously specifies the precision of arithmetic, the maximum range of data memory addresses, and, consequently, the die area of the data memories. As discussed in the previous chapter, a 16-bit data word is well suited to a large majority of DSP applications. A wider data word, however, can be beneficial in certain applications, such as digital audio decoding where a 24-bit data word can be employed to achieve better reproduced audio quality [P3]. It should be noted that, in current single-chip embedded systems, the associated data and program memories constitute the dominant component of the overall die area. As a simple illustration, a block of 1024 x 32-bit SRAM consumes a die area which is comparable to the area of a complete VS-DSP1 processor core.

The X and Y data memories are typically mapped into two separate memory spaces. The size of the data address space is specified as 2^dataaddress, but the actual amount of SRAM integrated on the chip can be less than this. The processor core employs a memory-mapped access scheme in transferring data between on-chip peripheral units and the processor core [Lin96]. In this scheme a block of data memory space is specified as a peripheral memory area that is mapped to various registers in the peripheral units. In addition to these basic addressing capabilities, the VS-DSP2 processor adds support for larger memory spaces and register-to-register data transfers [P5]. Moreover, an external bus interface peripheral can be incorporated to allow accesses to off-chip memory devices.
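The memory-mapped peripheral access scheme described above can be sketched as a simple address decoder that routes part of the data memory space to peripheral registers instead of SRAM. This is an illustrative model only; the window base address and register names are invented, not taken from the actual memory map.

```python
# Minimal sketch of memory-mapped peripheral access: a window of the X data
# memory space is routed to peripheral registers instead of SRAM.
# PERIPH_BASE and the register addresses are hypothetical.
PERIPH_BASE = 0xFF00          # start of the assumed peripheral window
sram = {}                     # sparse model of on-chip X data SRAM
periph_regs = {0xFF00: 0}     # e.g. a timer count register

def x_read(addr):
    if addr >= PERIPH_BASE:
        return periph_regs.get(addr, 0)
    return sram.get(addr, 0)

def x_write(addr, value):
    if addr >= PERIPH_BASE:
        periph_regs[addr] = value   # lands in the peripheral unit
    else:
        sram[addr] = value          # ordinary SRAM access

x_write(0x0010, 123)   # plain data memory write
x_write(0xFF00, 42)    # write that reaches the peripheral register
print(x_read(0x0010), x_read(0xFF00))  # 123 42
```

From the program's point of view both accesses use the same data-move instructions; only the address decides where the transfer ends up.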

5.2.1 Program Control Unit

The Program Control Unit (PCU) supervises the pipelined operation of instruction address issue, instruction word decoding, and execution. The processor pipeline comprises three stages: fetch, decode, and execute. The pipeline structure is depicted in Fig. 15. Whereas the actual execution of arithmetic and data transfer operations is carried out in the Datapath and Data Address Generator units, the fetch and decode stages are realized in the PCU. The processor employs the delayed branching scheme, in which the instruction following a conditional or unconditional branch instruction is executed normally [Hen90].
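The delayed-branch behaviour can be sketched with a toy interpreter: the instruction in the slot after a taken branch still executes before control reaches the branch target. The two-operation ISA below is invented purely for illustration and is not the processor's real instruction set.

```python
# Sketch of delayed branching in a three-stage (fetch/decode/execute)
# pipeline: the delay-slot instruction after a taken branch executes
# normally. Toy ISA ("add", "jmp"), not the real one.
def run(program, steps):
    pc, acc, trace = 0, 0, []
    next_pc = None                      # pending branch target
    for _ in range(steps):
        op, arg = program[pc]
        trace.append(pc)
        if op == "add":
            acc += arg
        elif op == "jmp":
            next_pc = arg               # takes effect after the delay slot
            pc += 1
            continue
        if next_pc is not None:
            pc, next_pc = next_pc, None # delay slot done; branch now
        else:
            pc += 1
    return acc, trace

prog = [("add", 1),     # 0
        ("jmp", 4),     # 1: branch to 4 ...
        ("add", 10),    # 2: ... but this delay-slot add still executes
        ("add", 100),   # 3: skipped
        ("add", 1000)]  # 4: branch target
acc, trace = run(prog, 4)
print(acc)    # 1 + 10 + 1000 = 1011
print(trace)  # [0, 1, 2, 4]
```

Because the slot instruction is executed rather than squashed, no branch bubble is wasted, but the pipeline becomes visible to the programmer, as noted below.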


Figure 15. Pipeline structure of the customizable fixed-point DSP processor. The processor architecture supports extension instructions which perform application-specific arithmetic and addressing operations in the processor core and in additional functional units. AALU: address arithmetic-logic unit.

Thus the processor pipeline is visible to the programmer. In these processors, an instruction cycle corresponds to one processor clock cycle. Pipeline interlocking is not needed since all the instructions effectively execute in one clock cycle. However, the interrupt dispatch mechanism requires a selective cancellation of an instruction being processed in the processor pipeline [P4]. The principal structure of the PCU is depicted in Fig. 16. As a result of instruction decoding, two groups of control signals are generated: execution and flow control. The execution control signals are used to initiate various operations in the main functional units and off-core peripherals. The flow control signals, however, are solely utilized by the Instruction Address Generator (IAG). With the aid of condition status flags, hardware looping control
Figure 16. Functional block diagram of the Program Control Unit. IR: instruction register.


Figure 17. Instruction Address Generator operation. Possible sources for the next instruction address are the incremented program counter (PC), loop start address (LS), subroutine and interrupt return addresses (LR0, LR1), branch target address, or reset/interrupt vector addresses [VS97], [P4]. Mux: multiplexer, MR0/MR1: control register 0/1.

and interrupt control signals, the IAG block produces a stream of instruction fetch addresses to realize linear program sequencing, hardware looping, and branching. The conceptual operation of the IAG block is illustrated in Fig. 17. In addition, the PCU incorporates a set of control registers and a simple finite-state machine, the Interrupt Control Unit (ICU), which detects a pending interrupt and ensures undisrupted execution of the interrupt service routine. The number of nested loops supported by the hardware can be defined using the loopregs parameter. A value of one instantiates a looping unit that does not support nested looping in hardware. A larger parameter value specifies the number of additional shadow registers that

Figure 18. Primary components of a hardware looping unit are two comparators, a decrementer, and a set of registers. LE: loop end address, LS: loop start address, LC: loop count, Mux: multiplexer.


Figure 19. Functional block diagram of the Datapaths used in a) the Gepard processor and b) the two VS-DSP processors [P1], [Nur97]. ALU: arithmetic-logic unit, C: multiplier register, D: multiplicand register, P: product register.

are required to store several loop end and start addresses and loop counts. A functional block diagram of the hardware looping unit is depicted in Fig. 18. In program code a hardware loop can be initialized with a loop instruction, or the register contents can be manipulated directly with data transfers.
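The comparator-based looping mechanism of Fig. 18 can be sketched behaviourally: whenever the fetch address matches the loop end address LE and the loop count LC is still nonzero, the next fetch address is forced back to the loop start LS and LC is decremented. The register names follow Fig. 18; the exact control details are an assumption for illustration.

```python
# Behavioural sketch of the hardware looping unit (zero-overhead loop):
# the fetch-address comparison replaces an explicit branch instruction,
# so the loop body repeats with no branching cycles. Control details
# are inferred from Fig. 18, not taken from the actual circuit.
def fetch_sequence(LS, LE, LC, last):
    pc, seq = 0, []
    while pc <= last:
        seq.append(pc)
        if pc == LE and LC > 0:
            LC -= 1
            pc = LS        # wrap back to the loop start
        else:
            pc += 1
    return seq

# A two-instruction loop body at addresses 1..2, executed three times:
print(fetch_sequence(LS=1, LE=2, LC=2, last=3))
# [0, 1, 2, 1, 2, 1, 2, 3]
```

Nested looping, when loopregs is greater than one, would push these LS/LE/LC values into shadow registers before an inner loop reloads them.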

5.2.2 Datapath

The primary computation engine of the DSP processor core is the Datapath unit. As is typical of the conventional DSP architectures discussed in the previous chapter, the Datapath design follows a traditional structure based on an arithmetic-logic unit (ALU) and a multiply-accumulate or multiplier unit, as depicted in Fig. 19. The figure shows two different structures, which correspond to the original Gepard datapath and to the modified datapath employed in the two VS-DSP processors. Both structures have a pipeline register, and thus an additional instruction cycle is necessary to move the result of a multiplication or MAC operation to the register file. In this subchapter the parameter for the multiplier operand width, multiplierwidth, is assumed to be equal to dataword.

The Gepard datapath, shown in Fig. 19a, incorporates a MAC unit that can perform a dataword x dataword-bit multiplication and a (multiplierguardbits + 2 x dataword)-bit addition, which is stored in the product register P. In a MAC operation the multiplier operand


is always the C register, but the multiplicand can be either the D register or one of the datapath registers. In Gepard, the shifter is used to select certain bit slices from the full-precision product register [VS96]. Using dataword-wide operands, the ALU performs general-purpose arithmetic and logical functions: addition, subtraction, absolute value, left and right shift by one, and basic logical operations. Depending on the value of accumulators, the datapath register file contains 2, 3, or 4 registers. The benefit of this datapath structure is that it allows independent ALU operations in parallel with MAC computations. Unfortunately, it was later discovered that the parallel execution of MAC/ALU operations was of limited practical use in typical DSP algorithms.

Therefore, the later VS-DSP processor cores incorporated a new datapath structure, which is shown in Fig. 19b. The MAC operation is carried out with a (dataword + 1) x (dataword + 1)-bit hardware multiplier and a (multiplierguardbits + 2 x dataword)-bit ALU [Nur97]. The extra bit in the hardware multiplier operands allows mixed operations with signed two's complement and unsigned binary operands. Support for multiplication with fractional numbers is enabled by the simple shifter which, in its basic form, can only perform the necessary logical left shift by one bit. The datapath register file contains a maximum of eight dataword-wide registers, which can be grouped to compose four accumulator registers. Optionally, the register file may include additional guard bits specified by the multiplierguardbits parameter.
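Why the shifter's one-bit left shift is needed for fractional arithmetic can be shown numerically: multiplying two Q15 operands on a two's complement multiplier yields a Q30 product with a redundant sign bit, and shifting left by one realigns it to Q31 in the wide accumulator. The sketch below assumes dataword = 16 with 8 guard bits; it is an arithmetic illustration, not the processor's actual MAC microarchitecture.

```python
# Sketch of fractional (Q15) multiply-accumulate. Two Q15 operands give a
# Q30 product; the logical left shift by one bit realigns it to Q31 so
# that results accumulate in a consistent fixed-point format.
def q15(x):
    # Convert a real value in [-1, 1) to a Q15 integer.
    return int(round(x * (1 << 15)))

def mac_q15(acc, a, b):
    p = q15(a) * q15(b)          # Q30 product (double-width)
    return acc + (p << 1)        # shift left by one -> Q31, then accumulate

acc = 0
for a, b in [(0.5, 0.5), (0.25, -0.5)]:
    acc = mac_q15(acc, a, b)
print(acc / (1 << 31))           # 0.5*0.5 + 0.25*(-0.5) = 0.125
```

The guard bits (here 8 of them) sit above the Q31 result so that long accumulation chains, e.g. FIR filter sums, can temporarily overflow the data word range without losing information.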

5.2.3 Data Address Generator

The Data Address Generator (DAG), two types of which are depicted in Fig. 20, is capable of issuing and post-modifying two independent data memory addresses during an instruction cycle. The DAG incorporates two address arithmetic-logic units (AALUs) coupled with an address register file that holds indexregs registers. Thus data memory addressing employs the register indirect addressing mode as the basis for all data memory accesses. The addressing mode and post-modification operation are determined either directly from an instruction word or they are specified by an address register pair. As a result of the load-store memory architecture, the Gepard and VS-DSP processor cores inherently support the register direct addressing mode. Additionally, the VS-DSP2 processor core realizes register-to-register data transfers. Immediate addressing is only available as a load constant instruction.

Moreover, two DSP-specific addressing modes can be utilized: modulo (alternatively, circular) and bit-reversed addressing. Both of these addressing modes realize a special access pattern to a programmer-specified block of memory. Modulo addressing can be used to effectively realize data structures found in common DSP algorithms, such as FIFO buffers and delay lines [Lee88]. Bit-reversed addressing provides a significant acceleration of data manipulations required in an N-point

Figure 20. Functional block diagram of the Data Address Generator used in a) the Gepard and VS-DSP1 processors and b) the VS-DSP2 processor. AALU: address arithmetic-logic unit, xdb/ydb: X/Y data bus, xab/yab: X/Y address bus, exab/eyab: extended X/Y address bus.

FFT computation, where N is a power of 2. The presence of these two modes is specified by the addrmodes parameter. In the DSP processor architecture the width of the data and program memory addresses is limited to less than or equal to that of the data word width. These widths are typically adjusted with respect to the actual memory requirements of an application. Therefore, by adjusting the dataaddress and addrmodes parameters, some savings in the die area of the AALUs and the address register file can be achieved.
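The two DSP-specific post-modification modes can be sketched in a few lines. Modulo addressing wraps an index register within a programmer-specified buffer, and bit-reversed addressing generates the FFT reordering pattern; the AALU encoding details are not modelled, and the buffer address used below is invented.

```python
# Sketch of the two DSP-specific addressing modes realized in the AALUs.
def modulo_step(addr, step, base, length):
    # Post-increment within a circular buffer [base, base + length).
    return base + (addr - base + step) % length

def bit_reverse(i, bits):
    # Reverse the low 'bits' bits of index i.
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

# Circular delay line of length 4 at a hypothetical address 0x20:
a, trace = 0x20, []
for _ in range(6):
    trace.append(a)
    a = modulo_step(a, 1, 0x20, 4)
print([hex(x) for x in trace])
# ['0x20', '0x21', '0x22', '0x23', '0x20', '0x21']

# Data access order for an 8-point FFT (N = 2**3):
print([bit_reverse(i, 3) for i in range(8)])
# [0, 4, 2, 6, 1, 5, 3, 7]
```

In hardware both patterns are produced by the AALU as a post-modification of the index register, so the wrapped or bit-reversed address costs no extra instruction cycles.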

5.3 Implementation

5.3.1 Processor Hardware

The physical CMOS circuit implementation adopted a methodology which combines standard-cell and full-custom very large-scale integration (VLSI) design approaches [Smi97]. The standard-cell approach is based on an automated implementation path which begins with logic synthesis. Logic synthesis tools convert an HDL description into a circuit netlist which realizes various functions by using a set of standard library cells. A physical circuit layout of this netlist is then constructed with automated cell placement and routing tools. Using the standard-cell approach it was possible to quickly derive the instruction decoding circuitry, since this hardware is merely a block of combinational logic. However, for other hardware units a full-custom approach was justified for a number of reasons. As


Figure 21. Two physical circuit layouts of a 16x16-bit two's complement array multiplier [Pez71]. Multipliers were implemented in 0.35 µm CMOS technology by using a) standard-cell library cells (A = 0.118 mm², t_d = 10.3 ns, P_avg = 18.8 mW) and b) full-custom cells and layout generators (A = 0.061 mm², t_d = 13.1 ns, P_avg = 7.7 mW). Circuits operate from a 3.3 V power supply; average power consumption is for 50 MHz operation [Vih99, Sol00].

opposed to the standard-cell approach, full-custom VLSI design inherently allows more optimal circuit realizations in terms of circuit speed, area, and power consumption [Wes92]. These characteristics are illustrated by Fig. 21, which shows standard-cell and full-custom layouts of a low-power hardware multiplier. Furthermore, it was possible to reuse a number of pre-designed, pre-tested full-custom blocks from existing ASIC designs.

The processor design methodology adopted a top-down approach in which the processor architecture was gradually refined from an informal specification down to a highly optimized transistor-level circuit layout. The hardware development was carried out in an electronic design automation (EDA) framework for design capture and simulation at various levels of abstraction: transistor layout, circuit schematic, and register transfer level (RTL). Fig. 22 depicts a parameterized RTL model of the Gepard Datapath. Later in the design process this model was used to verify correct operation of the hardware circuit implementations: a functional model was substituted with an extracted circuit netlist, realistic load capacitances were incorporated, and the resulting heterogeneous model was then simulated. The full-custom, generator-based circuit design was founded on a set of hand-optimized transistor-level cell layouts. Using custom generator scripts, these cells can be placed in regular arrays and then selectively connected with wiring. Due to their relatively regular structures, it was possible to design optimized layout generators for the multiplier, ALU, AALUs, register files, and other functions. Interestingly, the instruction decoding design exploited a novel method for automatic HDL generation. In this method, the combinational logic in the instruction decoding was produced with a custom software tool which generates a piece of synthesizable

Figure 22. Circuit schematic showing a register transfer-level model of the Datapath used in the Gepard processor [Kuu96].

VHDL source code from an instruction-set description [P1]. The tool also provides the necessary flexibility for straightforward realization of the extension instructions.

The power savings in the VS-DSP2 processor core were realized by the extensive use of gated clocks and latching of control signals. Processor registers, i.e. flip-flops and latches, are clocked only when useful data is available at their inputs. Thus, the functional blocks are active only when there is valid data available for processing. Furthermore, a new Halt instruction can effectively freeze the processor core clock. Potentially this enhancement provides a significant decrease in power consumption, since this low-power sleep mode can be activated during idle periods.
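The idea behind generating the decoder from an instruction-set description can be sketched as a table-to-HDL translation: each entry maps an opcode field value to the control signals it asserts, and a tool emits the corresponding VHDL case statement. The opcodes, signal names, and encodings below are invented for illustration and do not reflect the actual tool or instruction set.

```python
# Sketch of table-driven decoder generation: an instruction-set description
# (opcode -> control signal assignments) is turned into a synthesizable
# VHDL case statement. All table contents are hypothetical.
ISA = {
    "0000": {"alu_op": '"00"', "reg_we": "'1'"},   # e.g. an ADD-like opcode
    "0001": {"alu_op": '"01"', "reg_we": "'1'"},   # e.g. a SUB-like opcode
    "1111": {"alu_op": '"00"', "reg_we": "'0'"},   # e.g. NOP
}

def emit_decoder(table):
    lines = ["case opcode is"]
    for code, sigs in sorted(table.items()):
        lines.append(f'  when "{code}" =>')
        for sig, val in sigs.items():
            lines.append(f"    {sig} <= {val};")
    lines.append("  when others => null;")
    lines.append("end case;")
    return "\n".join(lines)

vhdl = emit_decoder(ISA)
print(vhdl.splitlines()[0])   # case opcode is
```

Because extension instructions are just additional table entries, regenerating the decoder for a customized ISA requires no manual HDL editing, which matches the flexibility described above.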


Figure 23. Circuit layout of the VS-DSP2 processor core designed for a 0.35 µm triple-metal CMOS process. The core contains 64000 transistors and has a 2.2 mm² die area [P5].

Although the Gepard and VS-DSP1 processors have some differences, coarse comparisons can be made by investigating two implementations for a 0.6 µm CMOS technology. Assuming the estimates given in [Ofn97, AMS98a, AMS98b] are valid, the standard-cell Gepard and full-custom VS-DSP1 implementations [Tak98] have virtually the same die area and the same power dissipation, 5 mm² and 6 mW/MHz at 4.5 V, respectively. However, it should be noted that the figure for the Gepard power consumption is for an implementation that did not contain a hardware looping unit and the modulo addressing capability [AMS98a]. With respect to the maximum clock frequency of 49 MHz for the VS-DSP1 processor, Gepard is capable of operating at a modest 22 MHz [AMS98b]. These observations support the fact that the full-custom design methodology results in more optimal circuit implementations.

The VS-DSP2 processor core layout is shown in Fig. 23. A number of enhancements were incorporated, and the layout generators and full-custom cells were modified for a 0.35 µm CMOS technology. This resulted in a 64000-transistor design, which equals approximately 16000 logic gates. Interestingly, the added features did not increase the relative area due to an unused die area in the center of the layout. The VS-DSP2 processor core has a die area of 2.2 mm², which compares quite favorably with the high-performance TMS320C54x processor core that has an area of 4.8 mm² in a comparable CMOS technology [Lee97]. At a 1.8 V operating voltage the VS-DSP2 processor core dissipates 0.65 mW/MHz [Tak00].


Figure 24. Screen view of the X Window version of the instruction-set simulator [P2].

The main drawbacks of a full-custom design are the technology dependence of the cell layouts and the relatively long development time. This, however, is not an issue with synthesized hardware implementations, since a design can smoothly be retargeted to nearly any standard-cell library provided by semiconductor manufacturers. In the past, fully synthesizable DSP processor cores have been reported for standard-cell ASIC [Wou94] and FPGA technologies [Lah97]. The apparent ease of implementation in these DSP processors, however, was strongly offset by poor performance and relatively high power consumption. More recently, synthesizable DSP processors have been announced in [Oha99, Ova99]. For the time being, it appears that logic synthesis tools are capable of producing fast hardware circuits but are still unable to efficiently cope with some low-level physical issues, such as power-aware synthesis of logic circuits. The DSP processor reported in [Wou94] was later followed by a processor realization [Moe97] which actually resembles the Gepard and VS-DSP processors from both the architecture and implementation points of view.

5.3.2 Software Tools

In addition to the hardware circuit design, the development process required a considerable amount of software engineering effort to create software development tools for the processor core. The first set of tools incorporated a symbolic macro assembler, a disassembler, an


object file linker, and an instruction-set simulator (ISS) [P2]. In addition, a profiler tool was later implemented, which is essential for comprehensive analysis of the dynamic behavior of the application code [P3, P4]. A graphical user interface of the ISS is shown in Fig. 24.

The ISS provides a cycle-accurate simulation engine for testing, debugging, and analysis of the application software. The simulator also supports the parameterized architecture and it allows co-simulation using C-language descriptions of the off-core hardware units. The ISS executes simulations in an interpretive fashion and it achieves an execution rate of 0.25 million instructions per second on current state-of-the-art workstations. As opposed to using code interpretation, a compiled simulation approach could be employed to accelerate program simulation [Bar87]. In the compiled simulation approach, the program code for simulation is each time compiled into a single executable program, effectively constructing a high-performance ISS for this particular program code. Thus, the overhead from the interpretation of the code is eliminated. This type of compiled simulation approach for DSP processors has been reported in [Ziv95], showing a simulation speed-up by a factor of 100 to 200. More recently, this approach has also been applied to instruction-set simulation of VLIW processors [Ahn98].

Traditionally, application programming for a DSP processor relied solely on writing the necessary routines in assembly language. In assembly language, program development is very time-consuming, programming is error-prone by nature, and the program code exhibits poor maintainability. For the customizable DSP processor core, a major upgrade to the software tools was introduced with a C-compiler [P5]. The developed C-compiler supports the ANSI C-language standard and also includes a number of features which can be used to guide code generation towards a more optimal result.
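The difference between interpretive and compiled simulation discussed above can be sketched with a toy instruction set: the interpreter re-decodes every instruction on every pass through the execution loop, whereas the compiled approach decodes once up front into directly callable operations. The two-operation ISA is invented for illustration; the reported 100-200x speed-ups come from real compiled simulators, not from this sketch.

```python
# Sketch contrasting interpretive and compiled instruction-set simulation.
program = [("add", 2), ("add", 3), ("mul", 4)]

def interpret(prog, acc=0):
    # Interpretive ISS: decode happens inside the execution loop, every time.
    for op, arg in prog:
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
    return acc

def compile_prog(prog):
    # Compiled approach: decode once into callables before execution.
    ops = {"add": lambda a, x: a + x, "mul": lambda a, x: a * x}
    return [(ops[op], arg) for op, arg in prog]

def run_compiled(compiled, acc=0):
    for fn, arg in compiled:
        acc = fn(acc, arg)   # no decoding overhead per executed instruction
    return acc

print(interpret(program))                    # (0+2+3)*4 = 20
print(run_compiled(compile_prog(program)))   # 20
```

Both paths must produce identical architectural state; the compiled simulator merely pays the decoding cost once per program instead of once per executed instruction.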
It is likely, however, that a majority of the applications will benefit from a mixed approach in which the bulk of the program code is written in C-language and the performance-critical algorithm kernels and low-level peripheral drivers are implemented as optimized assembly language modules. For example, this approach was successfully applied in developing a software implementation of the MPEG layer III audio decoder [Tak00].

Furthermore, the development environment was reinforced with a real-time operating system (RTOS) that provides a pre-emptive multitasking kernel and a wide range of system services for embedded applications. The services include intertask communication, memory management, and task switching. Due to the modular structure of the RTOS, different services can be selected to build a system kernel that contains only the services needed by an application. The RTOS was written completely in assembly language, thus resulting in a small program memory footprint and a minimal overhead in the RTOS operation [P5]. It should be noted, however, that a fully featured system kernel requires a hardware system timer as an off-core peripheral.

6. SUMMARY OF PUBLICATIONS

This chapter summarizes the seven publications included in Part II of this thesis. The publications are divided into two categories describing the customizable fixed-point DSP processor core and the high-level specification of wireless communications systems. Whereas this chapter highlights the primary topics of each publication, the main conclusions are given in the next chapter.

6.1 Customizable Fixed-Point DSP Processor Core

Publications [P1], [P2], [P3], [P4], and [P5] are summarized in this section. The publications describe the evolution of the DSP processor architecture and present the development of an audio decoder application and an analysis of a parallel program memory coupled with the DSP processor core.

Publication [P1]: A parameterized and extensible DSP core architecture. This publication gives the first presentation of the novel DSP processor core architecture. The Gepard processor is the result of research that was carried out as joint collaboration work at Tampere University of Technology, VLSI Solution Oy, and Austria Mikro Systeme International AG (AMS). The early development of the DSP processor core was carried out in 1996 and it has been reported in [Kuu96]. In relation to the other papers, this publication contains the most detailed coverage of the Gepard processor architecture. Block diagrams of all three main functional units are shown, the core parameters employed in this processor version are presented, and their impact on the functional units is studied in detail. As an application example, the customization of the Gepard architecture for a GSM full-rate speech codec is briefly reviewed.

Publication [P2]: Flexible DSP core for embedded systems. Whereas the previous publication was an initial presentation of the DSP processor architecture, this article gives a more comprehensive view of a DSP processor core-based ASIC design flow. The article focuses on the main issues associated with deployment of this licensable DSP processor in embedded system designs. Interestingly, this publication in fact describes a system design


flow which was later introduced in the form of intellectual property (IP) usage, in which a system developer integrates a reusable hardware component as part of a larger entity. A core-based design flow is illustrated with a figure that shows the tasks performed by the processor core vendor and the customer, i.e. the system developer. The concept of an extensible instruction set is presented. The processor ISA is composed of 25 basic instructions, a parameterized number of registers and levels of hardware looping, and a number of extension instructions. The extension instructions can be defined to allow access to off-core peripherals or special functions embedded in the processor datapath. Software development tools supporting the flexible ISA are presented. The application example briefly covered in [P1] is given a more comprehensive treatment. Using four DSP processor configurations, the speech codec application is refined into an optimized implementation. The four cases are carefully evaluated in terms of task run-times, memory usage, and estimated die areas. In addition, comparisons of application speed-up, estimated power consumption, and relative cost of speed-up are given.

Publication [P3]: MPEG-1 layer II audio decoder implementation for a parameterized DSP core. This publication presents the development of a standard digital audio decoder [ISO93] for the VS-DSP1 processor core. Conceptually, real-time audio decoding is realized as embedded software executed in an MPEG audio decoder IC that contains a VS-DSP1 processor core, two 16-bit audio DACs, and miscellaneous peripherals on a single silicon die. An external flash memory device complements the decoder IC by providing a large storage space for digital audio streams. The publication describes a systematic implementation approach to transform a C-language source code with floating-point arithmetic into an efficient implementation in assembly language. In order to provide satisfactory audio quality, certain sections had to exploit extended-precision multiplication operations, a feature that had then become available in the VS-DSP1 processor [Nur97].

Publication [P4]: A parallel program memory architecture for a DSP core. This publication describes an experiment with the VS-DSP1 processor coupled with a memory architecture in which the single program memory block was replaced with several parallel program memory blocks. The rationale for the parallel program memory architecture is that, to some extent, a potentially slow memory read access time can be compensated for by fetching multiple instruction words in parallel. A slow read access time is a characteristic of flash memory devices, which have found increasingly wide-spread use in embedded systems. The program sequencing in the VS-DSP1 processor is presented. The publication presents a general parallel memory architecture [Gos94], and a suitable architecture for pipelined program execution is derived for the DSP architecture. Moreover, program code mapping and implications for the program memory addressing are described. Memory architectures


with 1 to 8 memory blocks were evaluated with a GSM half-rate speech codec and the audio decoder that was implemented in [P3]. For both applications, performance was evaluated with three cases.
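The basic mapping idea behind the parallel program memory of [P4] can be sketched as low-order interleaving: consecutive instruction addresses are spread across N banks, so one slow access delivers N sequential words and the core can consume one instruction per fast clock cycle. The power-of-two bank selection below is an illustrative assumption, not the exact mapping derived in the publication.

```python
# Sketch of low-order interleaving for a parallel program memory:
# address -> (bank, row). With N banks, one slow row access supplies
# N consecutive instruction words for sequential code.
def bank_and_row(addr, n_banks):
    return addr % n_banks, addr // n_banks

N = 4
print([bank_and_row(a, N) for a in range(8)])
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]

# Reading row 0 of all four banks in one slow access yields instructions
# 0..3, which the pipeline then fetches one per processor clock cycle.
```

Branches are the weak spot of such schemes: a jump into the middle of a fetched group discards the remaining prefetched words, which is why the publication evaluates the speech codec and audio decoder over several cases.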

Publication [P5]: Enhanced DSP core for embedded applications. The VS-DSP1 processor, utilized in Publications [P3] and [P4], was followed by the VS-DSP2 processor core, which was improved in several ways. The VS-DSP2 processor was enhanced with several new instructions, extended program and data addressing capability, vectored interrupt support, and a number of low-power features. The design objectives are first formulated and justified. Then the implementation of each of the enhancements and their influence on the processor operation are studied in detail. Moreover, the publication describes two new additions to the software development environment: an optimizing C-compiler and a modular real-time operating system. The embedded system prototyping was also reinforced with a DSP evaluation board that can be employed for application prototyping purposes.

6.2 Specification of Wireless Communications Systems

Publications [P6] and [P7] are summarized in this section. A dataflow simulation of a wireless LAN system is reported and a high-level evaluation of a third-generation W-CDMA radio transceiver is described.

Publication [P6]: Run-time configurable hardware model in a dataflow simulation. This publication describes a system-level simulation of a wireless communication system. As a case study, a wireless LAN system in which compressed image data is transmitted to a number of mobile terminals is modeled and simulated [Mik96, Wal91]. Conceptually, the mobile terminal implementation had a target architecture which integrates a DSP processor, a microcontroller, a hardware accelerator, and a radio frequency front-end. The system was modeled by constructing a dataflow model of the transmitter-receiver chain. The functions were described using C-language models and the entire system was simulated with a commercial simulation environment. Two basic transform operations were needed in the mobile terminal: a complex-valued 16-point fast Fourier transform (FFT) and an 8x8-point inverse discrete cosine transform (IDCT). In the system, a configurable hardware accelerator, described in VHDL code, carried out both of these transforms. The publication describes the main system functions and the time-multiplexed FFT/IDCT scheduling, and reviews the implementation of the synchronous and asynchronous models which are necessary to permit heterogeneous dataflow system simulation with the event-driven HDL model of the hardware accelerator.


Publication [P7]: Baseband implementation aspects for W-CDMA mobile terminals.

This publication presents a functional architecture of a mobile terminal transceiver that can be employed to implement the European candidate proposal for the third-generation mobile cellular standard [ETS98]. After a brief overview of the two operation modes specified in the proposal, the fundamental operations in the receiver and transmitter baseband sections are studied in detail. In this context, the term baseband is used to refer to all the digital signal processing that is needed in the inner receiver [Mey98]. Due to its considerably higher complexity, the emphasis is mainly on the receiver implementation. The presented receiver architecture is based on a conventional Rake receiver which is complemented with a number of relatively simple functional units for tasks such as pulse shaping filtering and various measurements. The publication includes coarse estimates of sample precision, sample rate, and digital signal processing requirements, and presents well-suited hardware structures for the main receiver functions. Moreover, the baseband partitioning into application-specific hardware and DSP software is briefly discussed.

6.3 Author's Contribution to Published Work

In this section the Author's contribution to the published work is clarified publication by publication. The Author is the primary author of six of the seven publications. The co-authors have seen these clarifications and agree with the Author. None of the publications has been used as part of another person's academic thesis or dissertation.

Publication [P1]. The initial DSP processor core architecture was developed by a team consisting of Prof. Jari Nurmi, Janne Takala, M.Sc., Pasi Ojala, M.Sc., Henrik Herranen, Richard Forsyth, M.Sc., and the Author. The Author was involved in the design and simulation of a register transfer-level model of the Gepard processor core, and he also performed functional verifications of the processor operation [Kuu96]. Prof. Olli Vainio gave valuable comments on the work.

Publication [P2]. In this publication the Author was responsible for the detailed presentation of the licensable DSP processor. Together with Prof. Jari Nurmi and Janne Takala, M.Sc., the concept of the DSP processor core-based ASIC design flow was solidified. The software tools were designed by Pasi Ojala, M.Sc., and Henrik Herranen. The Author and Prof. Jari Nurmi performed the trade-off analysis of the GSM speech codec that was programmed by Juha Rostrom, M.Sc. This analysis is a more comprehensive study of the preliminary results presented in [P1].


Publication [P3]. The idea of designing a standard audio decoder for the fixed-point VS-DSP1 processor was proposed by the Author. Teemu Parkkinen, M.Sc., performed this work under the supervision of the Author [Par99]. The Author suggested the implementation approach in which a C-language source code was gradually transformed into an assembly language program. The main contribution was the idea of first modifying the C-language source code to employ 16-bit fixed-point arithmetic. Thereafter assembly language programming became a straightforward task. Prof. Jarkko Niittylahti gave valuable comments on the work.

Publication [P4]. The idea of a parallel program memory was initially suggested by Prof. Jarmo Takala and Prof. Jarkko Niittylahti. With the aid of a VS-DSP1 processor HDL model provided by Janne Takala, M.Sc., the Author constructed a testbench for the parallel memory architecture. The Author performed the analysis of the memory architecture using a GSM speech codec programmed by Juha Rostrom, M.Sc., and the MPEG audio decoder presented in [P3]. Prof. Jarkko Niittylahti gave valuable comments on the work.

Publication [P5]. Architectural design and low-level circuit implementation of the VS-DSP2 processor was devised by Janne Takala, M.Sc. Based on the data provided by him and Pasi Ojala, M.Sc., the Author carried out an extensive evaluation of the enhancements that were implemented in both the VS-DSP2 processor core and the software development tools. Pasi Ojala, M.Sc., developed the real-time operating system and also the C-compiler which was initially referred to in [P2].

Publication [P6]. In this publication the Author designed various dataflow models and a hierarchical block diagram of the wireless LAN system. This case study was suggested by Prof. Jarmo Takala, who also provided an HDL model of the configurable transform hardware. The Author developed a scheme to allow embedding of the run-time configurable hardware model, planned the operation scheduling in the receiver, and performed extensive simulation runs to verify correct system operation. Prof. Jukka Saarinen gave valuable comments on the work.

Publication [P7]. In order to have a solid foundation for later research, an extensive study of CDMA receivers was performed by the Author. The Author resolved the functions needed in a W-CDMA transceiver and drafted conceptual architectures for both the receiver and transmitter sections. Later, performance estimations in terms of MAC operations per second were calculated and reported in [Kuu99].

7. CONCLUSIONS

The research reported in this thesis has been to a great extent applied technical research rather than basic research. The published results address a wide range of issues associated with the specification, design, and implementation of a commercially viable DSP processor architecture. Furthermore, the research work covers the specification of wireless communication systems, an application area which clearly benefits the most from the raw computational power, low power consumption, and instruction-set specialization provided by modern DSP processors. In this chapter the main results are summarized, and the thesis is concluded with a discussion of future trends in wireless system design and DSP processors.

7.1 Main Results

In this thesis, the development of a flexible DSP processor core architecture has been presented. The processor evolution encompasses three generations, all sharing the base architecture template initially presented in [P1]. In this publication the main functional units and core parameters of the Gepard processor were described. Using a GSM full-rate speech codec algorithm, it was demonstrated that it is possible to improve the processor performance by adjusting the core parameters and the features of the processor datapath.

A generic ASIC design flow for usage of the DSP processor core was shaped in [P2]. Based on the licensable processor core approach, the steps in the system development were divided into tasks carried out by the core vendor and the DSP system developer. In the publication, the GSM full-rate speech codec application was given a more detailed analysis. The trade-off analysis covered four cases, beginning with a basic core and ending with an optimized core that has a hardware looping unit, saturation mode, and add-with-carry capability. As opposed to the basic core, the optimized core reduced the instruction cycle count by 43 % and consequently the estimated power consumption by 37 %. Interestingly, the total die area remained virtually the same, 17 mm², because the area increase in the core was compensated for by the reduced program memory size.

Implementation of an MPEG audio decoder for the VS-DSP1 processor was presented in [P3]. The decoder software was based on a systematic approach in which a floating-point C-language source code was first converted to a version that accurately mimics 16-bit fixed-point arithmetic operations. After this modification the converted C-language source code served as a bit-accurate representation of the algorithm behavior in the DSP processor. The implementation also illustrates the use of extended-precision 16×32-bit MAC operations which were needed for certain parts of the decoding algorithm. The program code required 2.3 kwords and the data memory usage was 12.4 kwords, of which 74 % was employed for various fixed-valued data values. An extensive analysis performed on the dynamic behavior of the application code revealed that a 25 MHz processor clock frequency was sufficient for 192 kbit/s, 44.1 kHz stereo audio streams.

In [P4] a parallel program memory architecture was described. The proposed parallel architecture was analyzed with a GSM speech codec and the audio decoder that was presented in [P3]. The main problems encountered were the instruction cycle penalties associated with branching and hardware looping. The results show that the GSM speech codec was, in fact, quite ineffective with the memory architecture. However, due to its highly sequential program code, the MPEG audio decoder was able to gain a linear speed-up. From the practical point of view, memory architectures with two or four parallel memory banks seemed to be reasonable.

In addition to improvements to the DSP processor core itself, [P5] presented several topics emphasizing the importance of the development environment. During the course of development, it had become clear that a bare DSP processor core is quite far from a reusable, licensable IP component. The key area of concern for a DSP system developer is the infrastructure provided by a DSP processor core vendor. Before committing to a certain processor architecture, potential system developers need to be convinced that they have access to all the support necessary to accomplish the development work. This infrastructure covers a wide range of issues: software and hardware development tools, operating systems, high-level EDA tools, software and algorithm libraries, and extensive technical support.
An established DSP processor core vendor has to put all the appropriate infrastructure in place, so that system developers can immediately benefit from it. Furthermore, the research covers two different approaches to the high-level specification of wireless communications systems. Currently, simulation environments based on the dataflow paradigm have an increasingly important role in the specification of complex signal processing systems. As presented in [P6], these tools can be exploited to rapidly design an executable system specification using a library of functional models. Later, this specification was reused for co-verification purposes, where two functional models were realized with an implementation-level description of a multi-functional hardware unit. Although the resulting system model was rather complicated, the simulation environment provided excellent means for formulating system-level concepts, such as operation scheduling. Moreover, the system simulation with the hardware unit increased the simulation time by at least two orders of magnitude, thus distinctly demonstrating the trade-off between simulation accuracy and speed.


[Figure 25: bar charts comparing the three cores. Recoverable values: Area (mm²): Gepard 5.0, VS-DSP1 5.3, VS-DSP2 2.2; Power (mW/MHz @ 3.3 V): Gepard 2.7, VS-DSP1 3.2, VS-DSP2 2.2; Speed (MHz): Gepard 22, VS-DSP1 49, VS-DSP2 100.]

Figure 25. Comparison of three DSP processor core versions. For Gepard, the area estimate is based on a gate-level netlist, and the power consumption is for a processor that does not contain hardware looping and modulo addressing. [AMS98b, Ofn97, Tak98], [P5].

Publication [P7] presented a high-level feasibility study of the system functions and various implementation aspects associated with a W-CDMA radio transceiver. The emphasis was on the receiver baseband implementation which, as opposed to the transmitter, possesses considerably higher complexity. In the publication, first impressions are given of the conceptual partitioning into functions realized as software executed by a high-performance DSP processor or as dedicated hardware units. As concluded, a W-CDMA transceiver will mainly be hardware-based for functions performed at the sample and chip rates. However, a high-performance DSP processor (or processors) can provide the flexibility and computational power needed for the operations at the symbol rates.

To conclude, the research work has satisfied the objectives of the research. A customizable DSP processor architecture was developed and successfully implemented as three core versions. The Gepard processor had a die area, maximum operating speed, and power consumption of 5 mm², 22 MHz, and 2.7 mW/MHz at 3.3 V, respectively [AMS98b, Ofn97]. The corresponding figures were 5.3 mm², 49 MHz, and 6 mW/MHz at 4.5 V for the VS-DSP1 processor [Tak98] and 2.2 mm², 100 MHz, and 0.65 mW/MHz at 1.8 V for the VS-DSP2 processor [P5]. Compared to the VS-DSP1, the VS-DSP2 implementation demonstrates a 100 % increase in performance, while the power consumption was reduced by a factor of 9. These characteristics were mainly achieved by the shift from a 0.6 µm to a 0.35 µm CMOS process. Furthermore, the VS-DSP2 processor incorporated other valuable functionality, such as the low-power idle mode and new instructions [P5]. In Fig. 25 the Gepard, VS-DSP1, and VS-DSP2 processor cores are compared with respect to core area, power consumption at 3.3 V, and maximum operating speed. It should be noted that this Gepard processor was a soft core, whereas the VS-DSP processors were implemented as hard cores.


The customizable DSP processor architecture has proven its commercial viability in a number of DSP-based applications, such as MPEG audio decoding and GPS navigation [Tak00, VS00, VS99]. In the future, the VS-DSP processors will be further improved. One of the main considerations is to improve program code density by replacing the relatively wide 32-bit instruction word with a dual-length instruction word. Lastly, a soft core version of the VS-DSP2 processor is currently under development.

7.2 Future Trends

Wireless communications system design will be an increasingly complex task. As the number of transistors integrated on a single chip rapidly escalates, platform integrators are faced with new problems associated with system complexity, hardware/software co-simulation speed, interconnect-dominated delays, and testability. In addition, emerging wireless products, such as third-generation mobile phones, will require significantly more hardware and processing power which, in turn, leads to higher implementation cost and power consumption. The potential scalability of VLIW DSP processors may also prove advantageous for DSP algorithms that can effectively benefit from the parallel datapath resources. However, it seems that the next step in raising computational performance will be heavily based on task-level parallelism. Increased parallelism is enabled by integrating multiple DSP processor cores into an on-chip multiprocessor. The problems associated with this approach are linked to, among others, the system partitioning, scheduling, inter-core communication, and the programming model, which may be quite peculiar.

As a brief market overview, there seem to be two key players in the conventional DSP processor arena at the moment. The DSP Group, with its PineDSPCore-based family of cores, has licensed its cores to more than 25 major system design and ASIC companies. At the other extreme, Texas Instruments' TMS320C54x, or LEAD2, has acquired a solid position in wireless products. The company has claimed that over 60 % of mobile phones are based on this processor [Tex00a], which implies that the C54x core could be considered an embedded DSP counterpart to the x86-based microprocessors. Backwards compatibility in DSP processor families is an important issue because system developers have a considerable amount of intellectual property invested in optimized software.
In contrast to general-purpose microcontrollers, exact binary compatibility may not be necessary: if an assembly language source code can simply be reassembled, the software can easily be retargeted to a new processor. This approach was taken in the presented customizable DSP processor concept.


Emerging hardware technologies and architectures may also prove their effectiveness in the near future. For example, reconfigurable hardware has the potential to provide energy-efficient, run-time reusable computation engines for DSP applications [Rab97, Zha00]. However, reconfigurable hardware needs proper EDA tools for developing such systems in order to be a viable solution. The speed of the context switches between various configurations is also an open question. In addition, there are indications that application-specific instruction-set processors (ASIPs) will have a more important role in future designs [Gat00, Kuu99]. It is imaginable that a properly designed VLIW ASIP might be an effective component if the application area is narrow and clearly specified. Interestingly, the presented customizable DSP processor could be exploited for such purposes as well. Embedded DRAM (eDRAM) will also be an interesting option. Compared with conventional SRAM, six to eight times the bit density is available for the same area using eDRAM [Iye99]. On the downside, the use of a mixed logic/DRAM process slows down logic circuits, which may not be affordable in most systems.

Admittedly, despite the many technological aspects and intricacies discussed in this thesis, their significance will ultimately become transparent in the finished product. Users of commercial electronics will continue to disregard how many transistors have been integrated, or which of the advanced CMOS technologies has been utilized, or even how many programmable processors their new purchase contains. They will simply consider those state-of-the-art devices as handy gadgets. It has been said: "Any sufficiently advanced technology is indistinguishable from magic."
Nevertheless, we all profit from current research into areas like embedded DSP processor cores, as the technology convergence resulting from the evolution of the system-on-a-chip methodology gives rise to all the conceivable benefits, ranging from reduced power requirements to smaller product size and weight and, more importantly, lower product cost. This essentially summarizes what will make the development, design, and implementation of future systems such an exciting task.

BIBLIOGRAPHY

[Ahl98] L. Ahlin and J. Zander, Principles of Wireless Communications, Studentlitteratur, Lund, Sweden, 1998.

[Ahn98] J.-W. Ahn, S.-M. Moon, and W. Sung, "An efficient compiled simulation system for VLIW code verification," in Proc. 31st Annual Simulation Symposium, Boston, MA, U.S.A., Apr. 5-9 1998, pp. 91–95.

[Ali98] M. Alidina, G. Burns, C. Holmquist, E. Morgan, D. Rhodes, S. Simanapalli, and M. Thierbach, "DSP16000: a high performance, low power dual-MAC DSP core for communications applications," in Proc. IEEE Custom Integrated Circuits Conference, Santa Clara, CA, U.S.A., May 11-14 1998, pp. 119–122.

[Alp93] D. Alpert and D. Avnon, "Architecture of the Pentium microprocessor," IEEE Micro Magazine, vol. 13, no. 3, pp. 11–21, June 1993.

[AMS98a] Austria Mikro Systeme International, AG, Embedded Software Programmable DSP Core GEP 02, Preliminary datasheet, Mar. 25 1998.

[AMS98b] Austria Mikro Systeme International, AG, Embedded Software Programmable DSP Core GEP 03, Datasheet, Mar. 25 1998.

[Ana99] Analog Devices, Inc., ADSP-TS001 Preliminary Data Sheet, Dec. 1999.

[ANS85] ANSI/IEEE Std 754-1985, "IEEE standard for binary floating-point arithmetic," Standard, The Institute of Electrical and Electronics Engineers, Inc., New York, NY, U.S.A., Aug. 1985.

[ARM95] Advanced RISC Machines, Inc., ARM7TDMI, Datasheet, ARM DDI 0029E, Aug. 1995.

[Bar87] Z. Barzilai, J. L. Carter, B. K. Rosen, and J. D. Rutledge, "HSS - a high-speed simulator," IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, vol. CAD-6, no. 4, pp. 601–617, July 1987.

[Bar91] B. Barrera and E. A. Lee, "Multirate signal processing in Comdisco's SPW," in Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, Apr. 14-17 1991, vol. 2, pp. 1113–1116.


[Bar96] H. Barad, B. Eitan, K. Gottlieb, M. Gutman, N. Hoffman, O. Lempel, A. Peleg, and U. Weiser, "Intel's multimedia architecture extension," in Proc. Convention of Electrical and Electronics Engineers in Israel, Jerusalem, Israel, Nov. 5-6 1996, pp. 148–151.

[Bat88] A. Bateman and W. Yates, Digital Signal Processing System Design, Pitman Publishing, London, United Kingdom, 1988.

[Be93] Y. Beery, S. Berger, and B.-S. Ovadia, "An application-specific DSP for portable applications," in VLSI Signal Processing, IV, L. D. J. Eggermont, P. Dewilde, E. Deprettere, and J. van Meerbergen, Eds., pp. 48–56, IEEE Press, New York, NY, U.S.A., 1993.

[Bet97] M. R. Betker, J. S. Fernando, and S. P. Whalen, "The history of the microprocessor," Bell Labs Technical Journal, pp. 29–56, Autumn 1997.

[Bid95] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE Journal of Solid-State Circuits, vol. 30, no. 3, pp. 300–305, Mar. 1995.

[Bod81] J. R. Boddie, G. T. Daryanani, I. I. Eldumiati, R. N. Gadenz, and J. S. Thompson, "Digital signal processor: Architecture and performance," Bell System Technical Journal, vol. 60, no. 7, pp. 1449–1462, Sep. 1981.

[Bog96] A. J. P. Bogers, M. V. Arends, R. H. J. De Haas, R. A. M. Beltman, R. Woudsma, and D. Wettstein, "The ABC chip: Single chip DECT baseband controller based on EPICS DSP core," in Proc. Int. Conference on Signal Processing Applications and Technology, Boston, MA, U.S.A., Oct. 7-10 1996.

[Bru98] D. M. Bruck, H. Yosub, Y. Itkin, Y. Gold, E. Baruch, M. Rafaeli, G. Hazan, S. Shperber, M. Yosen, L. Faibish, B. Branson, T. Baggett, and K. Porter, "The DSP56652 dual core processor," in Proc. Int. Conference on Signal Processing Applications and Technology, Toronto, Canada, Sep. 13-16 1998.

[Buc91] J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, "Multirate signal processing in Ptolemy," in Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, Apr. 14-17 1991, vol. 2, pp. 1245–1248.

[Cam96] R. Camposano and J. Wilberg, "Embedded system design," Design Automation for Embedded Systems, vol. 1, no. 1, pp. 5–50, Jan. 1996.

[Cha87] B. W. Char, K. G. Geddes, G. M. Gonnet, and S. M. Watt, MAPLE Reference Manual, Watcom Publications, Waterloo, Canada, 1987.


[Cha95] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer Academic Publishers, Menlo Park, CA, U.S.A., Apr. 1995.

[Cha96] W.-T. Chang, A. Kalavade, and E. A. Lee, Effective Heterogeneous Design and Co-Simulation, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1996.

[Cha99] H. Chang, L. Cooke, M. Hunt, G. Martin, A. McNelly, and L. Todd, Surviving the SOC Revolution: A Guide to Platform-Based Design, Kluwer Academic Publishers, Menlo Park, CA, U.S.A., 1999.

[Che98] S.-K. Cheng, R.-M. Shiu, and J. J.-J. Shann, "Decoding unit with high issue rate for x86 superscalar microprocessors," in Proc. Int. Conference on Parallel and Distributed Systems, Dec. 14-16 1998, pp. 488–495.

[Cla76] T. A. C. M. Claasen, W. F. G. Mecklenbrauker, and J. B. H. Peek, "Effects of quantization and overflow in recursive digital filters," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 24, no. 6, pp. 517–529, Dec. 1976.

[DM99] G. De Micheli, "Hardware synthesis from C/C++ models," in Proc. Design, Automation and Test Europe Conference, Munich, Germany, Mar. 9-12 1999, pp. 382–383.

[Eri92] A. C. Erickson and B. S. Fagin, "Calculating the FHT in hardware," IEEE Trans. on Signal Processing, vol. 40, no. 6, pp. 1341–1353, June 1992.

[ETS92] ETSI 300 175, "Radio Equipment and Systems (RES); Digital European Cordless Telecommunications (DECT); Common Interface; Parts 1 to 9," International Standard, European Telecommunications Standards Institute, Sophia Antipolis, France, Oct. 1992.

[ETS98] ETSI Tdoc SMG2 260/98, "The ETSI UMTS terrestrial radio access (UTRA) ITU-R RTT candidate submission," Preliminary standard, European Telecommunications Standards Institute, Sophia Antipolis, France, May/June 1998.

[Eyr00] J. Eyre and J. Bier, "The evolution of DSP processors," IEEE Signal Processing Magazine, vol. 17, no. 2, pp. 43–51, Mar. 2000.

[Far98] P. Faraboschi, G. Desoli, and J. A. Fischer, "The latest word in digital and media processing," IEEE Micro Magazine, vol. 15, no. 2, pp. 59–85, Mar. 1998.

[Fet91] G. Fettweis and H. Meyr, "High-speed parallel Viterbi decoding: Algorithm and VLSI-architecture," IEEE Communications Magazine, vol. 29, no. 5, pp. 46–55, May 1991.


[Fri99] J. Fridman and W. C. Anderson, "A new parallel DSP with short-vector memory architecture," in Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, U.S.A., Mar. 15-19 1999, vol. 4, pp. 2139–2142.

[Ful98] S. Fuller, Motorola's AltiVec Technology, White paper, Motorola, Inc., Aug. 20 1998.

[Gaj95] D. D. Gajski and F. Vahid, "Specification and design of embedded hardware-software systems," IEEE Design & Test of Computers Magazine, vol. 12, no. 1, pp. 53–67, Spring 1995.

[Gat00] A. Gatherer, T. Stetzler, M. McMahan, and E. Auslander, "DSP-based architectures for mobile communications: Past, present and future," IEEE Communications Magazine, vol. 38, no. 1, pp. 84–90, Jan. 2000.

[Gho99] A. Ghosh, J. Kunkel, and S. Liao, "Hardware synthesis from C/C++," in Proc. Design, Automation and Test Europe Conference, Munich, Germany, Mar. 9-12 1999, pp. 387–389.

[Gie97] A. Gierlinger, R. Forsyth, and E. Ofner, "GEPARD: A parameterizable DSP core for ASICs," in Proc. Int. Conference on Signal Processing Applications and Technology, San Diego, CA, U.S.A., Sep. 14-17 1997, pp. 203–207.

[Gol99] M. Golden, S. Hesley, A. Scherer, M. Crowley, S. C. Johnson, S. Meier, D. Meyer, J. D. Moench, S. Oberman, H. Partovi, F. Weber, S. White, T. Wood, and J. Yong, "A seventh-generation x86 microprocessor," IEEE Journal of Solid-State Circuits, vol. 34, no. 11, pp. 1466–1477, Nov. 1999.

[Gon99] D. R. Gonzales, "Micro-RISC architecture for the wireless market," IEEE Micro Magazine, vol. 19, no. 4, pp. 30–37, July-Aug. 1999.

[Goo95] G. Goossens, D. Lanneer, M. Pauwels, F. Depuydt, K. Schoofs, A. Kifli, P. Petroni, F. Catthoor, M. Cornero, and H. De Man, "Integration of medium-throughput signal processing algorithms on flexible instruction-set architectures," Journal of VLSI Signal Processing, vol. 9, no. 1/2, pp. 49–65, Jan. 1995.

[Goo97] G. Goossens, J. Van Praet, D. Lanneer, W. Geurts, A. Kifli, C. Liem, and P. G. Paulin, "Embedded software in real-time signal processing systems: Design technologies," Proceedings of the IEEE, vol. 85, no. 3, pp. 436–454, Mar. 1997.

[Gos94] M. Gossel, B. Rebel, and R. Creutzburg, Memory Architecture & Parallel Access, Elsevier Science, Amsterdam, the Netherlands, 1994.


[Gre95] D. Greenley, J. Bauman, D. Chang, D. Chen, R. Eltejaein, P. Ferolito, P. Fu, R. Garner, D. Greenhill, H. Grewal, K. Holdbrook, B. Kim, L. Kohn, H. Kwan, M. Levitt, G. Maturana, D. Mrazek, C. Narasimhaiah, K. Normoyle, N. Parveen, P. Patel, A. Prabhu, M. Tremblay, M. Wong, L. Yang, K. Yarlagadda, R. Yu, R. Yung, and G. Zyner, "UltraSPARC: The next generation superscalar 64-bit SPARC," in IEEE Compcon '95, Digest of Papers, San Francisco, CA, U.S.A., Mar. 5-9 1995, pp. 319–326.

[Gut92] G. Guttag, R. J. Gove, and J. R. Van Aken, "A single-chip multiprocessor for multimedia: The MVP," IEEE Computer Graphics & Applications, pp. 53–64, Nov. 1992.

[Hag82] Y. Hagiwara, Y. Kita, T. Miyamoto, Y. Toba, H. Hara, and T. Akazawa, "A single chip digital signal processor and its application to real-time speech analysis," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-16, no. 1, pp. 339–346, Feb. 1982.

[Ham97] L. Hammond, B. A. Nayfeh, and K. Olukotun, "A single-chip multiprocessor," IEEE Computer Magazine, vol. 30, no. 9, pp. 79–85, Sep. 1997.

[Hen90] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kauffman Publishers, San Mateo, CA, U.S.A., 1990.

[Hen96] H. Hendrix, "Viterbi decoding in the TMS320C54x family," Application note SPRA071, Texas Instruments, Inc., Dallas, TX, U.S.A., June 1996.

[Heu97] V. P. Heuring and H. F. Jordan, Computer Systems Design and Architecture, Kluwer Academic Publishers, Menlo Park, CA, U.S.A., Apr. 1997.

[Hwa79] K. Hwang, Computer Arithmetic: Principles, Architecture and Design, John Wiley & Sons, Ltd., New York, U.S.A., 1979.

[Hwa85] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill Book Co., Singapore, 1985.

[IEE87] IEEE Std 1076-1987, IEEE Standard VHDL Language Reference Manual, Standard, The Institute of Electrical and Electronics Engineers, Inc., New York, NY, U.S.A., Mar. 31 1987.

[Ife93] E. C. Ifeachor and B. W. Jervis, Digital Signal Processing: A Practical Approach, Addison Wesley Longman, Inc., Menlo Park, CA, U.S.A., 1993.

[ISO93] ISO/IEC 11172-3, "Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3: Audio," International standard, International Organization for Standardization, Geneva, Switzerland, Mar. 1993.

[Iye99] S. S. Iyer and H. L. Kalter, "Embedded DRAM technology: Opportunities and challenges," IEEE Spectrum, vol. 36, no. 4, pp. 56–64, Apr. 1999.

[Joe94] O. J. Joeressen and H. Meyr, "Hardware in the loop simulation with COSSAP: Closing the verification gap," in Proc. Int. Conference on Signal Processing Applications and Technology, Dallas, TX, U.S.A., Oct. 18-21 1994.

[Joh91] W. M. Johnson, Superscalar Processor Design, Prentice Hall, Englewood Cliffs, NJ, U.S.A., 1991.

[Kal96] K. Kalliojarvi and J. Astola, "Roundoff errors in block-floating-point systems," IEEE Trans. on Signal Processing, vol. 44, no. 4, pp. 783–790, Apr. 1996.

[Ken97] A. R. Kennedy, M. Alexander, E. Fiene, J. Lyon, B. Kuttanna, R. Patel, M. Pham, M. Putrino, C. Croxton, S. Litch, and B. Burgess, "A G3 PowerPC superscalar low-power microprocessor," in Proc. IEEE Compcon, San Jose, CA, U.S.A., Feb. 23-26 1997, pp. 315–324.

[Kes98] R. E. Kessler, E. J. McLellan, and D. A. Webb, "The Alpha 21264 microprocessor architecture," in Proc. Int. Conference on Computer Design, Oct. 5-7 1998, pp. 90–95.

[Kie98] P. Kievits, E. Lambers, C. Moerman, and R. Woudsma, "R.E.A.L. DSP technology for telecom baseband processing," in Proc. Int. Conference on Signal Processing Applications and Technology, Toronto, Canada, Sep. 13-16 1998.

[Kla00] A. Klaiber, The Technology behind Crusoe Processors, White paper, Transmeta Corp., Jan. 2000.

[Knu97] J. Knuutila and T. Leskinen, "System requirements of wireless terminals for future multimedia applications," in Proc. European Multimedia, Microprocessor Systems and Electronic Commerce Conference, Florence, Italy, Nov. 1997, pp. 658–665.

[Knu99] J. Knuutila, On the Development of Multimedia Capabilities for Wireless Terminals, Dr.Tech. Thesis, Tampere University of Technology, Tampere, Finland, May 1999.

[Kop97] H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications, Kluwer Academic Publishers, Norwell, MA, U.S.A., 1997.

[Kum97] A. Kumar, "The HP PA-8000 RISC CPU," IEEE Micro Magazine, vol. 17, no. 2, pp. 27–32, Mar./Apr. 1997.


[Kur98] I. Kuroda and T. Nishitani, "Multimedia processors," Proceedings of the IEEE, vol. 86, no. 6, pp. 1203–1221, June 1998.

[Kur99] I. Kuroda, "RISC, video and media DSPs," in Digital Signal Processing for Multimedia Systems, K. K. Parhi and T. Nishitani, Eds., pp. 245–272, Marcel Dekker, Inc., New York, NY, U.S.A., 1999.

[Kut99] K. Kutaragi, M. Suzuoki, T. Hiroi, H. Magoshi, S. Okamoto, M. Oka, A. Ohba, Y. Yamamoto, M. Furuhashi, M. Tanaka, T. Yutaka, T. Okada, M. Nagamatsu, Y. Urakawa, M. Funyu, A. Kunimatsu, H. Goto, K. Hashimoto, N. Ide, H. Murakami, Y. Ohtaguro, and A. Aono, "A microprocessor with a 128b CPU, 10 floating-point MACs, 4 floating-point dividers, and an MPEG2 decoder," in IEEE Int. Solid-State Circuits Conference, Digest of Tech. Papers, San Francisco, CA, U.S.A., Feb. 15-17 1999, pp. 256–257.

[Kuu96] M. Kuulusa, Modelling and Simulation of a Parameterized DSP Core, M.Sc. Thesis, Tampere University of Technology, Tampere, Finland, 1996.

[Kuu99] M. Kuulusa and J. Nurmi, "SCREAM Q4 report: W-CDMA baseband performance estimations," Technical report, Tampere University of Technology, Tampere, Finland, Oct. 1999.

[Lah97] J. Lahtinen and L. Lipasti, "Development of a 16 bit DSP core processor using FPGA prototyping," in Proc. Int. Conference on Signal Processing Applications and Technology, San Diego, CA, U.S.A., Sep. 14-17 1997.

[Lap95] P. D. Lapsley, J. C. Bier, A. Shoham, and E. A. Lee, Buyer's Guide to DSP Processors, Berkeley Design Technology, Inc., Fremont, CA, U.S.A., 1995.

[Lap96] P. D. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals: Architectures and Features, Berkeley Design Technology, Inc., Fremont, CA, U.S.A., 1996.

[Lee88] E. A. Lee, "Programmable DSP architectures: Part I," IEEE ASSP Magazine, vol. 5, no. 4, pp. 4–19, Oct. 1988.

[Lee90a] E. A. Lee, "Programmable DSPs: A brief overview," IEEE Micro Magazine, vol. 10, no. 5, pp. 14–16, Oct. 1990.

[Lee90b] J. C. Lee, E. Cheval, and J. Gergen, "The Motorola 16-bit DSP ASIC core," in Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Albuquerque, New Mexico, Apr. 3-6 1990, vol. II, pp. 973–976.

[Lee94] E. A. Lee and D. G. Messerschmitt, Digital Communication, Kluwer Academic Publishers, Menlo Park, CA, U.S.A., 1994.

68

Bibliography

[Lee95]

R. B. Lee, Accelerating multimedia with enhanced microprocessors, IEEE Micro Magazine, vol. 15, no. 2, pp. 2232, Apr. 1995. W. Lee, P. E. Landman, B. Barton, S. Abiko, H. Takahashi, H. Mizuno, S. Muramatsu, K. Tashiro, M. Fusumada, L. Pham, F. Boutaud, E. Ego, G. Gallo, H. Tran, C. Lemonds, A. Shih, R. H. Eklund, and I. C. Chen, A 1-V programmable DSP for wireless communications, IEEE Journal of Solid-State Circuits, pp. 17661776, Nov. 1997.

[Lee97]

[Lie94]

C. Liem, T. May, and P. Paulin, Instruction-set matching and selection for DSP andASIPcodegeneration, in Proc.EuropeanDesignandTestConference,Paris, France, Feb. 28-Mar. 3 1994, pp. 3137.

[Lin96]

B. Lin, S. Vercauteren, and H. De Man, Embedded architecture co-synthesis and system integration, in Proc. Int. Workshop on Harware/Software Codesign, Pittsburgh, PA, U.S.A., Mar. 18-20 1996, pp. 29.

[LSI99]

LSI Logic Corp., ZSP Digital Signal Processor Architecture, Technical manual, Sep. 1999.

[Mag82] S. Magar, E. Claudel, and A. Leigh, A microcomputer with digital signal processing capability, in IEEE Int. Solid-State Circuits Conference, Digest of Tech. Papers, Feb. 1982, pp. 3233, 284285. [Mey98] H. Meyr, M. Moeneclaey, and S.A. Fechtel, Digital Communications Receivers: Synchronization,ChannelEstimation,andSignalProcessing, JohnWiley&Sons, Inc., New York, NY, U.S.A., 1998. [Mik96] J. Mikkonen and J. Kruys, The Magic WAND: A wireless ATM access system, in Proc. ACTS Mobile Summit, Granada, Spain, Nov. 1996, pp. 535542. [Moe97] K. Moerman, P. Kievits, E. Lambers, and R. Woudsma, R.E.A.L. DSP:

RecongurableembeddedDSParchitectureforlow-power/low-costapplications, in Proc. Int. Conference on Signal Processing Applications and Technology, San Diego, CA, U.S.A., Sep. 14-17 1997. [Mol88] C. Moler, MATLAB - A mathematical visualization laboratory, in Proc. IEEE Compcon, San Francisco, CA, U.S.A., Feb. 29-Mar. 3 1988, pp. 480481. [Mot96] Motorola,Inc., DSP5660016-bitDigitalSignalProcessorFamilyManual, Users manual, DSP56600FM/AD, 1996. [Mot99] Motorola, Inc., Lucent Technologies, Inc., SC140 DSP Core, Preliminary

reference manual, MNSC140CORE/D, Dec. 1999.

Bibliography

69

[Nic78]

W. E. Nicholson, R. W. Blasco, and K. R. Reddy, The S2811 signal processing peripheral, in Proc. WESCON, 1978, vol. 25/3, pp. 112.

[Nis81]

T. Nishitani, R. Maruta, Y. Kawakami, and H. Goto, Digital signal processor: Architecture and performance, IEEE Journal of Solid-State Circuits, vol. SC-16, no. 4, pp. 372376, Aug. 1981.

[Nok99] Nokia Corp., Nokias Financial Statements 1999, Annual report, 1999. [Nur94] J. Nurmi, Application Specic Digital Signal Processors: Architecture and TransferableLayoutDesign, Dr.Tech.Thesis, TampereUniversityofTechnology, Tampere, Finland, Dec. 1994. [Nur97] J. Nurmi and J. Takala, A new generation of parameterized and extensible DSP cores, in Proc. IEEE Workshop on Signal Processing Systems, M. K. Ibrahim, P.Pirsch,andJ.McCanny,Eds.,pp.320 329.IEEEPress,NewYork,NY,U.S.A., Nov. 3-5 1997. [Obe99] S. Oberman, G. Favor, and F. Weber, AMD 3DNow! technology: Architecture andimplementations, IEEEMicroMagazine,vol.19,no.2,pp.37 48,Mar./Apr. 1999. [Ofn97] E. Ofner, R. Forsyth, and A. Gierlinger, GEPARD, ein parammetrisierber DSP Kern fur ASICs, in Proc. DSP Deutschland, Munich, Germany, Sep. 1997, pp. 176180, in German. [Oha99] I. Ohana and B.-S. Ovadia, TeakDSPCore - New licensable DSP core using standard ASIC methodology, in Proc. Int. Conference on Signal Processing Applications and Technology, Orlando, FL, U.S.A., Nov. 14 1999. [Oja98] T. Ojanpera and R. Prasad, Wideband CDMA for Third Generation Mobile Communications, Artech House, Boston, MA, U.S.A., 1998. K.Olukotun,B.A.Nayfeh,L.Hammond,K.Wilson,andK.Chang, Thecasefor a single chip multiprocessor, in Proc. Int. Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, U.S.A., Oct. 1-4 1996, pp. 211. [Opp89] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, U.S.A., 1989. [Ova94] B.-S. Ovadia and Y. Beery, Statistical analysis as a quantitative basis for DSP architecture design, in VLSI Signal Processing, VII, J. Rabaey, P.M. Chau, and J. Eldon, Eds., pp. 93102. IEEE Press, New York, NY, U.S.A., 1994.

[Olu96]

70

Bibliography

[Ova98] B.-S. Ovadia, W. Gideon, and E. Briman, Multiple and parallel execution units in digital signal processors, in Proc. Int. Conference on Signal Processing Applications and Technology, Toronto, Canada, Sep. 13-16 1998, pp. 14911497. [Ova99] B.-S. Ovadia and G. Wertheizer, PalmDSPCore - Dual MAC and parallel

modulararchitecture, in Proc.Int.ConferenceonSignalProcessingApplications and Technology, Orlando, FL, U.S.A., Nov. 14 1999. [Owe97] R. E. Owen and S. Purcell, An enhanced DSP architecture for the seven

multimedia functions: the Mpact 2 media processor, Proc. IEEE Workshop on Signal Processing Systems, pp. 7685, Nov. 3-5 1997. [Par92] D. Parsons, The Mobile Radio Propagation Channel, Pentech Press Publishers, London, United Kingdom, 1992. [Par99] T. Parkkinen, Digitaalisen audiodekooderin toteutus, M.Sc. Thesis, Tampere University of Technology, Tampere, Finland, 1999, in Finnish. A. Peleg and U. Weiser, MMX technology extension for the Intel architecture, IEEE Micro Magazine, vol. 16, no. 4, pp. 4250, Aug. 1996. S. D. Pezaris, A 40-ns 17-bit by 17-bit array multiplier, Computers, vol. 20, pp. 442447, Apr. 1971. IEEE Trans. on

[Pel96]

[Pez71]

[Phi99]

Philips Electronics North America Corp., TriMedia TM-110 Data Book, July 1999. J.G.Proakis, DigitalCommunications, McGraw-HillBookCo.,Singapore,1995. S. Purcell, The impact of Mpact 2, IEEE Micro Magazine, vol. 15, no. 2, pp. 102107, Mar. 1998.

[Pro95] [Pur98]

[Rab72] L. R. Rabiner, Terminology in digital signal processing, IEEE Trans. on Audio and Electroacoustics, vol. 20, no. 1-5, pp. 322337, Dec. 1972. [Rab97] J. M. Rabaey, Recongurable processing: The solution to low-power

programmable DSP, in Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, Apr. 2124 1997, pp. 275278. [Rat96] S. Rathnam and G. Slavenburg, An architectural overview of the programmable multimedia processor, TM-1, in IEEE Compcon 96, Digest of Papers, Santa Clara, CA, U.S.A., Feb. 2528 1996, pp. 319326. [Rat98] S. Rathnam and G. Slavenburg, Processing the new world of interactive media The Trimedia VLIW CPU architecture, IEEE Signal Processing Magazine, vol. 15, no. 2, pp. 108117, Mar. 1998.

Bibliography

71

[Reg94] D. Regenold, A single-chip multiprocessor DSP solution for communications applications, in Proc. IEEE Int. ASIC Conference and Exhibit, Rochester, NY, U.S.A., Sep. 19-23 1994, pp. 437440. [Roz99] Z. Rozenshein, M. Tarrab, Y. Adelman, A. Mordoh, Y. Salant, U. Dayan, O. Norman, K. L. Kloker, Y. Ronen, J. Gergen, B. Lindsley, P. DArcy, and M.Betker, StarCore100-Ascalable,compilable,high-performancearchitecture forDSPapplications, in Proc.Int.ConferenceonSignalProcessingApplications and Technology, Orlando, FL, U.S.A., Nov. 14 1999. [Sch91] U. Schmidt and K. Caesar, Datawave: A single-chip multiprocessor for video applications, IEEE Micro Magazine, vol. 11, no. 3, pp. 2294, June 1991. [Sch98] M.Schlett, Trendsinembedded-microprocessordesign, IEEEMicroMagazine, vol. 31, no. 8, pp. 4449, Aug. 1998. [Sem00] L. Semeria in C/C++, and A. Ghosh, Methodology for hardware/software co-verication

in Proc. Asia and South Pacic Design Automation Conference,

Yokohama, Japan, Jan. 25-28 2000, pp. 405408. [Ses98] N. Seshan, High VelociTI processing, IEEE Signal Processing Magazine, vol. 15, no. 2, pp. 86101, 117, Mar. 1998. [SGS95] SGS-Thomson Microelectronics, Inc., D950-CORE, Preliminary specication, Jan. 1995. [Smi97] M. J. S. Smith, Application-Specic Integrated Circuits, Addison Wesley

Longman, Inc., Reading, MA, U.S.A., 1997. [Sol00] T. Solla and O. Vainio, Reusable full custom layout generators in ASIC design ow, Unpublished paper, 2000. [Sri88] S. Sridharan and G. Dickman, Block oating point implementation of digital lters using the DSP56000, Microprocessors and Microsystems, vol. 12, no. 6, pp. 299308, July/Aug. 1988. [Suc98] R. Sucher, N. Niggebaum, G. Fettweiss, and A. Rom, CARMEL - A new

high performance DSP core using CLIW, in Proc. Int. Conference on Signal Processing Applications and Technology, Toronto, Canada, Sep. 13-16 1998. [Tak98] J. Takala, Design and Implementation of a Parameterized DSP Core, M.Sc. Thesis, Tampere University of Technology, Tampere, Finland, 1998.

72

Bibliography

[Tak00]

J. Takala, J. Rostrom,

T. Vaaraniemi, H. Herranen, and P. Ojala, A low-power

MPEG audio layer III decoder IC with an integrated digital-to-analog converter, in IEEE Conference on Consumer Electronics, Digest of Technical Papers, Los Angeles, CA, U.S.A., June 13-15 2000, pp. 260261. [Teu98] C. M. Teuscher, Low Power Receiver Design for Portable RF Applications: Design and Implementation of an Adaptive Multiuser Detector for an Indoor, Wideband CDMA Application, Ph.D. Thesis, University of California, Berkeley, CA, U.S.A., Jul. 1998. [Tex95] Texas Instruments, Inc., TMS320C54x Users Guide, SPRU131B, Oct. 1995.

[Tex97a] Texas Instruments, Inc., TMS320C54x - Low-Power Enhanced Architecture Device, Workshop notes, Feb. 1997. [Tex97b] Texas Instruments, Inc., TMS320C6201, TMS320C6201B Digital Signal Processors, Datasheet, SPRS051D, Jan. 1997. [Tex98] Texas Instruments, Inc., TMS320C5x Users Guide, SPRU056D, June 1998.

[Tex00a] Texas Instruments, Inc., TI Breaks Industrys DSP High Performance and Low Power Records with New Cores, Press release, Feb. 22 2000. [Tex00b] Texas Instruments, Inc., TMS320C55x DSP CPU Reference Guide, Preliminary draft, Feb. 2000. [Tex00c] TexasInstruments,Inc., TMS320C64xTechnicalOverview, SPRU395,Feb.2000. [Tho91] D. E. Thomas and P. R. Moorby, The Verilog Hardware Description Language, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1991. [Tre96] M. Tremblay, J. M. OConnor, V. Narayanan, and L. He, VIS speeds media processing, IEEE Micro Magazine, vol. 16, no. 4, pp. 1020, Aug. 1996. D. M. Tullsen, S. J. Eggers, and H. M. Levy, Simultaneous multithreading: Maximizing on-chip parallelism, in Proc. Annual Int. Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 22-24 1995, pp. 392403. [vdP94] R. van de Plassche, Integrated Analog-to-Digital and Digital-to-Analog Converters, Kluwer Academic Publishers, Norwell, MA, U.S.A., 1994. [Ver96] I.Verbauwhede,M.Touriguian,K.Gupta,J.Muwa,K.Yick,andG.Fettweis, A low power DSP engine for wireless communications, in VLSI Signal Processing, IX, W. Burleson, K. Konstantinides, and T. Meng, Eds., pp. 471480. IEEE Press, New York, NY, U.S.A., 1996.

[Tul95]

Bibliography

73

[Vih99]

K. Vihavainen, P. Perala, and O. Vainio, Estimation of energy consumption usinglogicsynthesisandsimulation, Technicalreport,6-1999,SignalProcessing Laboratory, Tampere University of Technology, Tampere, Finland, 1999.

[Vit67]

A. J. Viterbi,

Error bounds for convolutional coding and an asymptotically

optimum decoding algorithm, IEEE Trans. on Information Theory, vol. 13, pp. 260269, Apr. 1967. [VS96] VLSI Solution Oy and Austria Mikro Systeme International AG, Gepard Architecture and Instruction Set Specication, Revision 1.3, Feb. 1996. VLSI Solution Oy, VS-DSP Specication Document, Revision 0.8, Nov. 1997. VLSI Solution Oy, GPS Receiver Chipset, Datasheet, Version 1.1, Mar. 1999. VLSI Solution Oy, VS1001 - MPEG Audio Codec, Datasheet, Version 2.11, May 2000. [Wal91] K. Wallace, The JPEG image compression standard, Communications of the ACM, pp. 3045, Apr. 1991. [Wei92] D. Weinsziehr, H. Ebert, G. Mahlich, J. Preissner, H. Sahm, J.M. Schuck, H. Bauer, K. Hellwig, and D. Lorenz, KISS-16V2: A one-chip ASIC DSP solution for GSM, IEEE Journal of Solid-State Circuits, vol. 27, no. 7, pp. 10571066, July 1992. [Wes92] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Circuit Design, Addison Wesley Longman, Inc., Reading, MA, U.S.A., 1992. [Wil63] J. H. Wilkinson, Rounding Errors in Algebraic Processes, Prentice-Hall,

[VS97] [VS99] [VS00]

Englewood Cliffs, NJ, U.S.A., 1963. [Wou94] R. Woudsma, R. A. M. Beltman, G. Postuma, A. C. Turley, W. Brouwer, U. Sauvagerd, B. Strassenburg, D. Wettstein, and R. K. Bertschmann, EPICS, a exible approach to embedded DSP cores, in Proc. Int. Conference on Signal Processing Applications and Technology, Dallas, TX, U.S.A., Oct. 18-21 1994, vol. I, pp. 506511. [Yag95] H.YagiandR.E.Owen, ArchitecturalconsiderationsinacongurableDSPcore forconsumerelectronics, in VLSISignalProcessing,VIII,T.NishitaniandK.K. Parhi, Eds., pp. 7081. IEEE Press, New York, NY, U.S.A., 1995. [Zha00] H. Zhang, V. Prabhu, V. George, M. Wan, M. Benes, A. Abnous, and J. Rabaey, A 1V heterogeneous recongurable processor IC for baseband

74

Bibliography

wireless applications, in IEEE Int. Solid-State Circuits Conference, Digest of Tech. Papers, San Francisco, CA, U.S.A., Feb. 6-10 2000. [Zil99] Zilog, Inc., Z89223/273/323/373 16-bit Digital Signal Processors with A/D Converter, Product specication, DS000202-DSP0599, 1999. V. Zivojnovic, S. Tijang, and Meyr H., Compiled simulation of programmable DSP architectures, in VLSI Signal Processing, V, T. Nishitani and K. K. Parhi, Eds., pp. 187196. IEEE Press, New York, NY, U.S.A., 1995.

[Ziv95]

Part II PUBLICATIONS

PUBLICATION 1

M. Kuulusa and J. Nurmi, "A parameterized and extensible DSP core architecture," in Proc. Int. Symposium on IC Technology, Systems & Applications, Singapore, Sep. 10-12 1997, pp. 414-417. Copyright (c) 1997 Nanyang Technological University, Singapore. Reprinted, with permission, from the proceedings of ISIC'97.

A PARAMETERIZED AND EXTENSIBLE DSP CORE ARCHITECTURE


Mika Kuulusa 1 and Jari Nurmi 2

1 Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland
2 VLSI Solution Oy, Hermiankatu 6-8 C, FIN-33720 Tampere, Finland

Abstract: In order to create a highly integrated, single-chip signal processing system, a DSP core can be used to provide the basic DSP functions for the target application. In this paper, a flexible DSP core architecture is presented. The resources of this DSP core are fine-tuned with various parameters and extension instructions that execute application-specific operations in the arithmetic units of the DSP core and in additional functional units off-core. At first, the flexible core-based application development is discussed briefly. The DSP core architecture, its parameters, and the three main functional blocks are described, and, finally, the benefits of this versatile DSP core are illustrated with a speech coding application example.

1. INTRODUCTION

The emergence of powerful digital signal processing (DSP) cores has revolutionized conventional application-specific integrated circuit (ASIC) design. An ASIC is no longer assembled exclusively from in-house designs, but is often composed of a selected DSP core and a set of hand-picked peripherals. A DSP core acts as the primary engine of a signal processing system, providing all the fundamental arithmetic operations, data memory addressing, and program control for the DSP application at hand. The attractive point in utilizing DSP cores is their programmability combined with the benefits of custom circuits [1]. Memories, peripherals, and custom logic are embedded together with a DSP core to achieve a highly integrated, low-cost solution. DSP cores are provided as synthesizable HDL, layout, or both [2]. In general, DSP cores have a fixed architecture, which may cause excess performance, extra cost, or less efficient use of resources in some applications. Popular choices for DSP cores include Texas Instruments TMS320C2xx/TMS320C54x, SGS-Thomson D950-CORE, and Analog Devices ADSP21cspxx [3]. Vendors provide these cores as standard library components for their silicon technologies. Freedom of choice in the fabrication of the chip is preserved when DSP cores are licensed. Cores of this category are Clarkspur Design CD2450 and DSP Group Pine/OakDSPCore. Still, common to all of these cores is their non-existent or very limited parameterization of the DSP core itself. Other DSP cores of interest are Philips EPICS [4] and REAL [5], which have more sophisticated core parameters but are by no means offered to customers in a broad fashion.

2. FLEXIBLE CORE-BASED APPLICATION DEVELOPMENT

The drawbacks of traditional DSP cores can be addressed by a flexible core that has parameters to tailor the actual implementation to match the application as well as possible. The optimal values for the parameters can be discovered with software development tools supporting the implementation parameters. The application developer does not even need to know the hardware very thoroughly, since the dependence of the implementation features upon the parameters can be expressed in a very straightforward manner. The features of importance include physical geometry (the size of the core and the attached memories), the maximum applicable clock rate, and relative power consumption. These are easily added to the understanding of a DSP software developer, besides the traditional measures such as the number of code lines, data memory allocation, and the number of cycles to execute.

In our core architecture, the parameterization is rather extensive. Word lengths in different blocks of the core can be configured separately, and different levels of features can be included by changing the type of various units all over the architecture. All the numerous flavors of the implementation are supported by a single tool set, consisting of an assembler, a linker, an archiver, and a cycle-based instruction-level simulator. The tools have been programmed in generic ANSI-C to support multiple platforms including PC/Windows 95, Sun/Solaris, and HP/HP-UX. There also exists a graphical user interface for the UNIX X Windows version.

In addition to the parameterization, there are extension mechanisms built into the architecture. User-specific instructions and the corresponding hardware can be added to the basic core. These hardware-software trade-offs to achieve the specified performance, memory sizes, and circuit area can be made by the DSP engineer within the software development tools, before committing to the application-specific hardware design. The actual implementation with the selected parameter values and extensions is carried out separately. The blocks are built with full-custom module generators within Mentor Graphics GDT. The generators have inherent and purpose-built features for facilitating changes between different fabrication processes [6]. The extension hardware has to be implemented separately, and the instruction decoding synthesized.

3. THE DSP CORE ARCHITECTURE

The top-level block diagram of the DSP core architecture is depicted in Fig. 1. The DSP core uses a modified Harvard architecture comprising two data buses, XBUS and YBUS, and an instruction bus, IBUS. There are three main functional units: the Program Control Unit, the datapath, and the Data Address Generator. Furthermore, one or more functional units may be incorporated off-core for application-specific purposes. Even though data and program memories are necessities in a DSP system, they are not a part of the core. Both single-port and dual-port RAM/ROM memories are supported.

Fig. 1. Top-level block diagram of the DSP core architecture (address buses omitted for clarity).

The basic core has a total of 25 assembly-language instructions. The basic 32-bit instruction word readily allows adding up to 2 extension instructions that execute operations in additional functional units and in the custom arithmetic units of the datapath. The core also supports external interrupts and optional hardware looping units that perform zero-overhead loops. The core has a load/store architecture and uses indirect addressing in data memory accesses. Two data memory addresses can be referenced and updated on each instruction cycle. There are two general-purpose accumulators, three multiplier registers, eight index registers, and two control registers available in all core versions.

All the main functional units are extensively parameterized. The DSP core parameters, their ranges, and default values are listed in Table 1. The core version applying the default values is called the basic core. Considering the silicon area, the most radical effects can be obtained with the dataword parameter, which affects the silicon area consumed by all the main functional units and the attached memories. Inevitably, the dataaddress and programaddress parameters are dictated by the memory sizes needed. The parameters are described in the following paragraphs.

Table 1. The DSP core parameters.

Parameter            Range        Default
dataword             8 - 64       16
dataaddress          8 - 23       16
programword          32           32
programaddress       8 - 19       16
loopregs             0 - 8        0 (no looping hardware)
multiplierwidth      8 - 64       dataword
multiplierguardbits  0 - 16       8
mactype              0            0 (basic unit)
shiftertype          0            0 (basic unit)
alutype              0            0 (basic unit)
accumulators         2, 3, or 4   4
enablecd             0 or 1       0 (not enabled)
indexregs            8 or 16      8
modifieronly         0 or 1       0 (not enabled)
addrmodes            0 - 3        0 (+/-m only)

3.1. The datapath

The datapath, shown in Fig. 2, executes all arithmetic operations of the DSP core. The operational units perform two's complement arithmetic, although fractional arithmetic can also be supported.
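The parameter space of Table 1 lends itself to a small configuration record with range checks. The sketch below is purely illustrative (the real tool set is written in ANSI-C, and the class and method names here are invented); the parameter names, ranges, and defaults are those of Table 1.

```python
from dataclasses import dataclass

# Allowed values taken from Table 1 (mactype/shiftertype/alutype omitted;
# only the basic units, value 0, are defined so far).
RANGES = {
    "dataword": range(8, 65),
    "dataaddress": range(8, 24),
    "programword": (32,),
    "programaddress": range(8, 20),
    "loopregs": range(0, 9),
    "multiplierwidth": range(8, 65),
    "multiplierguardbits": range(0, 17),
    "accumulators": (2, 3, 4),
    "enablecd": (0, 1),
    "indexregs": (8, 16),
    "modifieronly": (0, 1),
    "addrmodes": range(0, 4),
}

@dataclass
class CoreConfig:
    dataword: int = 16
    dataaddress: int = 16
    programword: int = 32
    programaddress: int = 16
    loopregs: int = 0             # 0 = no looping hardware
    multiplierwidth: int = None   # Table 1 default: dataword
    multiplierguardbits: int = 8
    accumulators: int = 4
    enablecd: int = 0
    indexregs: int = 8
    modifieronly: int = 0
    addrmodes: int = 0            # 0 = linear +/-m post-modification only

    def __post_init__(self):
        if self.multiplierwidth is None:
            self.multiplierwidth = self.dataword
        for name, allowed in RANGES.items():
            if getattr(self, name) not in allowed:
                raise ValueError(f"{name}={getattr(self, name)} out of range")

    def preg_bits(self):
        """Product register width, 2*dataword + multiplierguardbits (Sec. 3.1)."""
        return 2 * self.dataword + self.multiplierguardbits
```

With the defaults this yields the basic core described in the text: a 16-bit datapath and a 40-bit product register.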

Fig. 2. The block diagram of the datapath.

The multiply-accumulate (MAC) unit and the arithmetic-logic unit (ALU) execute operations in parallel. Multiplier operands are the CREG and either the DREG or one of the accumulators. The result of a multiplier instruction is stored in the PREG, which is 2*dataword+multiplierguardbits bits wide. The shifter is used to extract specific bit-slices of the PREG into the accumulator file. There are four pre-defined bit-slices available in the basic shifter. The ALU executes basic addition, subtraction, and bit-logic instructions. The ALU instruction operands may be accumulators, or the special operands NULL and ONES. Optionally, the CREG and DREG can be used as ALU operands with the cdenable parameter. As custom operational units become available in the future, new units are selected with the mactype, shiftertype, and alutype parameters. The MAC unit has two modes reserved for future extensions, and the shifter supports up to four bit-slices to be defined. A new ALU type could implement a barrel shifter or a Viterbi accelerator, for example.

3.2. The Data Address Generator Unit (DAG)

The Data Address Generator Unit, containing the index register file and two address ALUs, is shown in Fig. 3. The index registers may be used individually or in pairs for more advanced data addressing modes (e.g. circular buffers). On each instruction cycle, valid data addresses can be generated for both data buses, and these addresses are updated (i.e. post-modified) after the data memory reference, if required.
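The multiplierguardbits parameter of Section 3.1 determines how many worst-case products the PREG can absorb before overflowing. The following sketch (Python, assuming the default parameters: dataword = 16 and 8 guard bits, so a 40-bit PREG) demonstrates this:

```python
# Default parameters (Table 1): 16-bit data, 8 guard bits -> 40-bit PREG.
DATAWORD = 16
GUARD = 8
PREG_BITS = 2 * DATAWORD + GUARD
PREG_MIN = -(2 ** (PREG_BITS - 1))
PREG_MAX = 2 ** (PREG_BITS - 1) - 1

def mac(acc, a, b):
    """One multiply-accumulate step into the product register."""
    acc += a * b
    if not (PREG_MIN <= acc <= PREG_MAX):
        raise OverflowError("PREG overflow")
    return acc

# Worst-case accumulation: each product is (-2^15) * (-2^15) = 2^30.
acc = 0
for _ in range(256):          # e.g. the inner loop of a 256-tap FIR filter
    acc = mac(acc, -2 ** 15, -2 ** 15)

# 256 * 2^30 = 2^38 still fits in 40 signed bits; without the 8 guard
# bits (a 32-bit PREG), the sum would have overflowed on the second step.
```

The 256-tap filter length is only an example; in general, g guard bits allow up to 2^g worst-case products to be accumulated safely.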

3.3. The Program Control Unit (PCU)

The Program Control Unit (PCU) is depicted in Fig. 4. The PCU generates all the control signals for the datapath, the DAG, the additional functional units, and the attached RAM/ROM memories. The DSP core has a three-stage pipeline (fetch, decode, and execute), and it supports external interrupts and reset.

Fig. 4. The block diagram of the Program Control Unit.

A program counter (PC) and two link registers (LR0 and LR1) are present in all core versions. The link registers are used for saving the return addresses of interrupts and subroutine calls. If one or more looping hardware units are selected with the loopregs parameter, the PCU contains a set of looping hardware control registers: loop start (LS), loop end (LE), and loop count (LC). M-language models for the control logic were generated from special instruction set mapping files with automated command scripts written in PERL [7]. These command scripts can easily be modified to generate the synthesizable VHDL code needed later on.

3.4. Additional Functional Units

Additional functional units can be attached to the core to suit application-specific needs. Since the PCU controls the additional off-core functional units directly, these units become an integral part of the core. A variety of commercial IP blocks (timers, high-speed serial ports, I/O interfaces, etc.) are readily available from integrated-circuit vendors for use in ASIC development. The computational efficiency of the DSP system can be improved with custom functional units tailored for the application. For example, an iterative divider unit or an advanced bit-manipulation unit might dramatically boost application performance in some cases. Moreover, general-purpose microcontroller cores, RISC cores [8], or multiple DSP cores may be embedded on the same silicon chip if parallel processing can be exploited by the application. This kind of approach would probably require separate data and program memories for each core instance.
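The looping-hardware registers of Section 3.3 (LS, LE, LC) suggest the usual zero-overhead loop semantics: when the fetch address reaches the loop-end address and the count has not expired, the PC wraps back to the loop start without executing any branch instruction. A minimal sketch of that behaviour (Python; the register names come from the text, but the exact update rule is an assumption, not a documented specification of this core):

```python
def run_loop(ls, le, lc, trace_limit=1000):
    """Simulate PC sequencing with one hardware loop (LS, LE, LC registers)."""
    pc, trace = ls, []
    while lc > 0 and len(trace) < trace_limit:
        trace.append(pc)          # address fetched this cycle
        if pc == le:              # end of loop body reached
            lc -= 1               # one iteration completed
            pc = ls if lc > 0 else le + 1
        else:
            pc += 1
    return trace

# A 3-instruction loop body at addresses 10..12, executed 4 times:
trace = run_loop(ls=10, le=12, lc=4)
```

The loop body executes back-to-back with no cycles spent on decrement-and-branch instructions, which is exactly the saving measured with the LOOP instruction in the application example of Section 4.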

Fig. 3. The block diagram of the Data Address Generator.

The DAG has either 8 or 16 index registers, determined by the indexregs parameter. With the modifieronly parameter, it is possible to force the even-numbered index registers to be used exclusively for the selection of the data addressing mode. The addrmodes parameter selects the level of the post-modification modes implemented in the address ALUs. Supported post-modification modes include linear and modulo addressing, as well as the bit-reversed addressing mode essential for the Fast Fourier Transform (FFT) algorithm. This parameter affects the complexity of the two address ALUs, so its effect on the silicon area of the DAG is obvious.
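The three post-modification families can be illustrated as pure functions of the current address. This is a Python sketch following the standard DSP conventions for modulo and bit-reversed addressing, not a register-level specification of this particular core:

```python
def post_linear(addr, m):
    """Linear post-modification: addr <- addr + m."""
    return addr + m

def post_modulo(addr, m, base, length):
    """Modulo post-modification for a circular buffer of 'length' words."""
    return base + (addr - base + m) % length

def bit_reverse(index, bits):
    """Bit-reversed index, as used for reordering FFT input/output data."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (index & 1)
        index >>= 1
    return out

# 8-point FFT (3 address bits): index 1 maps to 4, 3 maps to 6, and so on.
order = [bit_reverse(i, 3) for i in range(8)]
```

Performing these updates in the dedicated address ALUs, in parallel with the datapath, is what allows two data memory references per instruction cycle without spending datapath instructions on address arithmetic.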


4. AN APPLICATION EXAMPLE

The application development and optimization for our DSP core can be seen as hardware-software partitioning of the application algorithm. The parameters are tuned to yield the desired combination of small silicon area, performance, and low power consumption. Several real application algorithms have been coded for the DSP core to demonstrate its capabilities, e.g. the GSM full-rate recommendation by ETSI [9], and the G.722 (Sub-Band ADPCM) [10] and G.728 (Low-Delay CELP) [11] standards by ITU-T. For example, the GSM speech codec was first coded with the basic instruction set of the core, yielding about 320,000 instruction cycles and 4,000 code lines for the complete codec including encoder, decoder, voice activity detection, and discontinuous transmission control. The algorithm was analysed by profiling it in the simulator, and changes based on the indications were implemented one by one. By incorporating the hardware loop mechanism (and thus a LOOP instruction), the numbers were about 292,000 and 3,908. This increased the core size by 7%, but on the other hand shrank the program memory slightly and decreased the required clock rate by 9%. By adding two extension instructions for saturating add and subtract, the figures were about 202,000 cycles and 3,837 lines. This was a very small change to the hardware (far less than 1% of the original area), but shrank the memory requirements again, and the total reduction in the clock rate was 37%. A further change was tried by extending the instruction set with carry-inclusive arithmetic, which gave about 184,000 cycles from 3,807 code lines. The added area was again minor (less than 1%), but the program memory shrank and the cumulative clock rate decrease was 43%. The final version was able to run at only a 19 MHz clock frequency, while the first one required at least 32 MHz.
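The quoted clock-rate reductions follow directly from the cycle counts: for a fixed real-time budget (the speech frame period), the required clock frequency scales linearly with the cycle count. A quick check (Python) reproduces the 9%, 37%, and 43% figures:

```python
BASELINE = 320_000  # cycles for the codec with the basic instruction set

variants = {
    "hardware LOOP": 292_000,
    "saturating add/sub": 202_000,
    "carry-inclusive arithmetic": 184_000,
}

def clock_reduction(cycles, base=BASELINE):
    """Clock-rate reduction in percent for the same real-time deadline,
    using integer arithmetic rounded to the nearest percent."""
    return (100 * (base - cycles) + base // 2) // base

reductions = {name: clock_reduction(c) for name, c in variants.items()}
# reductions == {"hardware LOOP": 9, "saturating add/sub": 37,
#                "carry-inclusive arithmetic": 43}
```

The same scaling links the cycle counts to the quoted clock frequencies: 184,000/320,000 of 32 MHz is about 18.4 MHz, consistent with the 19 MHz figure for the final version.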
The corresponding savings in power consumption were achieved by consuming 10% more core area, which was partially compensated in the memory area. In other applications, the set of useful extensions can be explored in a similar manner. Here the data word length was fixed by the algorithm, but when trade-offs can also be made there (e.g. a filtering algorithm where the word length and the filter length can be traded off), the impact on the chip size is more dramatic. In typical applications the data memories dominate the ASIC area, and memory optimization by the software tools is of paramount importance. The basic core with default parameters is estimated to occupy 3.5 mm2 in 0.6 um CMOS and less than 2 mm2 in 0.35 um CMOS. The use of module generators alleviates the change of the design to other technologies as well. The use of compact memory generators finalizes the optimality of the design.

5. CONCLUSIONS

This versatile DSP core portrays a widely parameterized architecture that allows straightforward extension of the instruction set. With the presented DSP core architecture, it is possible to find an optimum DSP core-based solution for the target application by fine-tuning the numerous parameterized features of the core. The flexible software tools are used for rapid evaluation of different system configurations. As demonstrated with the speech coding example, the application engineer experiments with various core versions, memory configurations, and additional functional units. Eventually, the DSP core-based implementation which meets the specifications with minimal cost is realized.

6. ACKNOWLEDGMENTS

The DSP core development project was a joint effort carried out at Tampere University of Technology and VLSI Solution Oy. The project has been co-funded by VLSI Solution Oy and the Technology Development Center (TEKES). The authors wish to thank Mr. Janne Takala for his comments on this paper.

REFERENCES
[1] P. D. Lapsley, J. Bier, A. Shoham, Buyer's Guide to DSP Processors. Berkeley Design Technology Inc., 1995, pp. 18-23.
[2] M. Levy, Streamlined Custom Processors: Where Stock Performance Won't Cut It. EDN Magazine, Oct. 1995, pp. 49-50.
[3] M. Levy, EDN's 1997 DSP-Architecture Directory. EDN Europe, May 8th 1997, pp. 42-107.
[4] R. Woudsma, EPICS, a Flexible Approach to Embedded DSP Cores. Proceedings of the 5th Int'l Conference on Signal Processing Applications and Technology, Oct. 1995.
[5] P. Clarke, Philips Sets Consumer Plan. Electronic Engineering Times, issue 854, June 26, 1995.
[6] J. Nurmi, Portability Methods in Parametrized DSP Module Generators. VLSI Signal Processing, VI, IEEE Press, L. D. J. Eggermont, P. Dewilde, E. Deprettere, and J. van Meerbergen, Eds., 1993, pp. 260-268.
[7] M. Kuulusa, Modelling and Simulation of a Parameterized DSP Core. M.Sc. Thesis, Oct. 1996, pp. 32-33.
[8] B. Caulk, Optimize DSP Design With An Extensible Core. Electronic Design, Jan. 2, 1996, pp. 82-84.
[9] Recommendation GSM 06.10, Full Rate Speech Transcoding. ETSI, Sophia Antipolis, France, 1992.
[10] Standard G.722, 7 kHz Audio-Coding within 64 kbit/s. ITU-T, Geneva, Switzerland, 1993.
[11] Standard G.728, Coding of Speech at 16 kbit/s Using Low-Delay Code Excited Linear Prediction. ITU-T, Geneva, Switzerland, 1992.


PUBLICATION 2

M. Kuulusa, J. Nurmi, J. Takala, P. Ojala, and H. Herranen, "Flexible DSP core for embedded systems," IEEE Design & Test of Computers, vol. 14, no. 4, pp. 60-68, Oct./Dec. 1997. Copyright (c) 1997 IEEE. Reprinted, with permission, from the IEEE Design & Test of Computers magazine.


FLEXIBLE DSP CORE

A Flexible DSP Core for Embedded Systems

MIKA KUULUSA Tampere University of Technology JARI NURMI JANNE TAKALA PASI OJALA HENRIK HERRANEN VLSI Solution Oy

0740-7475/97/$10.00 © 1997 IEEE
IEEE DESIGN & TEST OF COMPUTERS

Cores currently available for ASIC design allow little customization. The authors have developed a parameterized and extensible DSP core that offers system engineers a great deal of flexibility in finding the optimum cost-performance ratio for an application.

TODAY'S CUTTING-EDGE TECHNOLOGY for high-volume application-specific integrated circuit (ASIC) design relies on the use of programmable digital signal processing cores. Combining these dedicated, high-performance DSP engines with data and program memories and a selected set of peripherals yields a highly integrated system on a chip. Rapidly evolving silicon technologies and improving design tools are the key enablers of this approach, allowing system engineers to pack impressive amounts of functionality into a system within a reasonable development time. According to an embedded-processor market survey,1 more than two thirds of high-volume embedded systems will be based on specialized DSP cores by the end of this decade.

This approach offers many advantages. Unlike conventional methods, DSP-core-based ASIC design combines the benefits of a standard off-the-shelf DSP and optimized custom hardware. As a direct result of higher integration, it reduces unit cost, an increasingly important issue in sensitive market areas such as telecommunications and personal computing. Equally important are the improved reliability and impact on time to market of the core approach. The shrinking life span of DSP-based products forces very tight schedules that leave little time for redesign. As software content in modern signal processing applications increases, system complexity typically becomes very high. To realize target applications on schedule, system engineers must exploit the benefits of DSP cores (programmability, software libraries, and development tools) as fully as possible.

Because software functions alone are not sufficient for many applications, a wide range of peripherals is available for core-based systems. In addition to essential RAM and ROM, core-based designs can include special types of memories such as FIFO, multiport, and Flash. Also available are high-speed serial buses (UART, I2C, USB), dedicated bus controllers (PCI, SCSI), A/D and D/A converters, and other special I/Os to interface the system with the off-chip world.2 Other examples of the most common peripheral circuitry are timers, DMA controllers, and miscellaneous analog circuitry. Much larger and more complex functional entities, also called cores, are available for use as embedded on-chip peripherals. The broad diversity of these cores ranges from RISC microprocessors to dedicated discrete cosine transform (DCT) engines.3 Moreover, system engineers can improve system performance significantly by designing a block of custom hardware outside the DSP core to implement special functionality for the application at hand.

A number of issues affect the selection of a DSP core for an ASIC design. From the fabrication point of view, the alternatives are foundry-captive and licensable cores.4 Most foundry-captive cores are derived from popular off-the-shelf counterparts and provided by major IC vendors as design-library components for use in their standard-cell libraries. These cores provide very high performance and extensive software development tools and libraries, but are offered only to selected high-volume customers. Therefore, a licensable soft or hard core may be a more profitable choice for many applications. Soft cores, which customers receive as synthesizable HDL code, offer better portability than hard cores, which are physical transistor-level layouts for a particular silicon technology. However, hard cores have many attractive properties. These carefully optimized physical layouts commonly offer improved performance and more compact design5 than a layout generated from a synthesizable HDL description.6

A comprehensive set of software development tools is essential to a successful implementation. The DSP core vendor usually supplies core-specific software tools, such as an assembler, a linker, and an instruction set simulator. System engineers use an instruction set simulator coupled with an HDL simulator or a lower-level simulation model (VHDL, Verilog) to verify the core's operation with the surrounding functional units.

DSP cores currently available for ASIC design typically offer limited possibilities for customizing the core itself. In a joint project of Tampere University of Technology and VLSI Solution Oy (Inc.) in Tampere, Finland, we used a new approach to design a parameterized, extensible DSP core. This new breed of licensable core gives system engineers a great deal of flexibility to find the optimum cost-performance ratio for a given application. In addition to data word width and the number of registers, our core allows engineers to specify a wide range of other core parameters. It also features an extensible instruction set that supports execution of special operations in the data path or in off-core custom hardware. With the extension instructions and additional circuitry, engineers can fine-tune the instruction set for specific needs of modern signal processing applications.

[Figure 1. Core-based ASIC design flow. Dashed-line sections are not yet implemented. The flow proceeds from specification through assigning initial parameters, assembly and C coding, design space exploration, selecting peripherals and specifying extensions, HDL simulation and ISS/HDL cosimulation, core and peripheral layout generation, DRC, place and route, simulation, and test vector generation, prototype fabrication, and verification and validation to the final application-specific IC. Each step is marked as a customer task, a core vendor or customer task, or a core vendor task.]

Flexible-core approach

Our main objective was to create an extensively parameterized core that features convenient extension mechanisms yet enables straightforward architectural implementation. The core implementation strategy uses CAD/EDA tools supporting transistor-level layout generators.7 Carefully designed generator scripts and optimized full-custom cells result in a dense final layout that gives exceptional application performance.

Figure 1 depicts our core-based design flow, as divided into tasks accomplished by the core vendor or by the customer. An application engineer begins the development process by setting initial core parameters likely to meet the application specification. The next step is to program the application in assembly language. During program development, the core parameters can be adjusted if necessary. A parameterized instruction set simulator (ISS) makes this design space exploration possible. After identifying appropriate extensions and parameters, the application engineer selects predesigned peripherals and designs custom extensions using standard-cell techniques. Then the engineer carefully simulates the system with an HDL simulator incorporated with a functional HDL description of the core.

The core provider generates the core layout together with the selected memories and peripherals. After routing, rule checking, and extensive simulation, the vendor or customer fabricates a prototype. Finally, the customer performs system-level verification and validation of the prototype.

OCTOBER–DECEMBER 1997


FLEXIBLE DSP CORE

The key to taking full advantage of the flexible core architecture is a set of supporting software tools. The symbolic macroassembler and the ISS make it possible to develop, test, and benchmark applications even when the actual hardware is not yet available. Engineers can find an optimum core composition by experimenting with core parameters and extension instructions to achieve an acceptable performance-cost ratio and minimize power consumption and memory requirements. They can use detailed statistical data provided by the ISS to evaluate application performance. Interfaces provided by the software tools enable rapid evaluation of extended core operations and additional functional units.

DSP core overview

[Figure 2. Block diagram of the DSP core system architecture (address buses are omitted for clarity). The DSP core units (program control unit, data path, and data address generator) drive control signals to an additional functional unit and connect through the X bus, Y bus, and I bus to the memories and peripheral units: program memory, X data memory, and Y data memory.]

Figure 2 shows a block diagram of a DSP-core-based system. The DSP core consists of three main parameterized functional units: program control unit (PCU), data path, and data address generator. It has a modified Harvard architecture with three buses. All three main units can access the X and Y data memories through the dedicated X and Y buses. The PCU uses a separate bus, the I bus, for fetching instruction words. Although data and program memories are mandatory components of all DSP-core-based systems, they are not considered part of the core itself.

The data path performs two's-complement arithmetic with a variable data word width. The core uses a load-store architecture; that is, operands are loaded into registers before they are processed by the data path. In addition to basic subtraction and addition, the arithmetic logic unit (ALU) supports fundamental bit logic operations. The multiply-accumulate unit (MAC) can multiply two operands of parameterized word width and sum the result with the previous value in the product register (P reg). The data path can include up to four general-purpose accumulators. The data path can use the shifter to transfer a subset of the full-precision result in the P reg to one of the accumulators. The core vendor can modify the data path's functional units to support special operations required by an application.

The instruction set supports parallel operation execution, which allows high throughput for most medium-complexity signal processing applications. At most, a single instruction can execute an arithmetic-logic operation, a multiply-accumulate operation, and two independent data moves simultaneously. The core features a three-stage pipeline (fetch, decode, and execute) and executes instructions effectively in one clock cycle. Instructions causing program flow discontinuity, such as branches, have one delay slot. Two interrupts are available: an external interrupt and a reset. If more interrupts are required, the system can include an external interrupt handler unit to arbitrate interrupt priorities. Interrupts can be nested.

A unique feature of the core is its expandability. The basic 32-bit instruction word readily supports extension instructions to access additional functional units or take advantage of a special MAC, ALU, or shifter operation. The additional functions become an integral part of the core because they are under its direct control.

The core's execution and expansion properties together with its long instruction word allow straightforward design of the core. When implemented in 0.6-µm double-metal CMOS technology, the core delivers a maximum of 200 million operations per second at a 50-MHz clock frequency with a 16-bit data word length.

Parameterized functional units. All the central characteristics of the DSP core hardware are parameterized. Table 1 lists the core parameters, their ranges, and their default values. The most important parameter is dataword. It has a major impact on the final chip's performance and cost because the die size of the core and attached memories correlates strongly with the data word width. Other important parameters affect the number of registers and hardware looping units and the complexity of the data address generator unit. The core vendor can implement application-specific versions of the MAC, shifter, and ALU units if requested by a customer.


Data path. The data path executes all the core's data processing operations. Figure 3 shows the data path accommodating a MAC unit, a shifter, and an ALU. Various parameters alter the data path's functionality. For example, dataword specifies the length of the data word used by the ALU and the accumulator file. Accumulators selects the number of accumulators. The P reg containing the product of a multiply or MAC instruction has a parameterized word length of (2 × multiplierwidth) + multiplierguardbits. In the multiplication operation, the multiplier is always the C reg, but the multiplicand is either the D reg or one of the accumulators. By making multiplication precision independent of data word width, one can achieve savings in MAC unit area. Mactype, shiftertype, and alutype select special types of functional units. These new units offer advanced functions operated with extension instructions. As we demonstrate later, we have implemented an ALU featuring new extensions such as saturation arithmetic.

Table 1. DSP core parameters.

Parameter            Range       Default               Description
dataword             8-64 bits   16 bits               Data word length
dataaddress          8-23 bits   16 bits               Internal word length of the address calculation units
programword          32          32 bits               Instruction word width
programaddress       8-19 bits   16 bits               Program address word length
multiplierwidth      8-64 bits   dataword              Word length of multiplier operands
multiplierguardbits  0-16 bits   8 bits                Number of guard bits in the P reg
mactype              0           0 (basic unit)        Type of MAC unit
shiftertype          0           0 (basic unit)        Type of shifter unit
alutype              0           0 (basic unit)        Type of ALU
indexregs            8 or 16     8                     Number of index registers
accumulators         2, 3, or 4  4                     Number of accumulators
enablecd             0 or 1      0 (not enabled)       Use of C reg and D reg as ALU operands
modifieronly         0 or 1      0 (not enabled)       Only odd-numbered index registers can be modifiers
loopregs             0-8         0 (no loop hardware)  Number of hardware looping units
addrmodes            0-3         0 (±m only)           Supported data-addressing modes

[Figure 3. Block diagram of the core's data path: the C reg and D reg feed the MAC unit, whose result is held in the P reg; the accumulator file (A0-A3), ALU, and shifter connect to the X and Y buses.]

Data address generator. The data address generator provides data addresses and postmodifies the index registers if necessary. Consisting of an index register file and two identical address calculation units, it generates two independent data addresses on each cycle. The indexregs and addrmodes parameters select the total number of index registers and available data-addressing modes. Dataaddress specifies the internal word length used by the address calculation units for both data memories.

Program control unit. The PCU, which consists of the execution control, the instruction address generator, and the interrupt control unit, controls operation of all core units and additional off-core functional units. It usually obtains the program memory address from either the program counter, the immediate address of a branch instruction, or the optional looping hardware.



An exception is that when the program is returning from a subroutine or an interrupt, link registers LR0 and LR1 provide the program memory address. All PCU registers connect to the X and Y buses. Programaddress determines the size of the PCU registers and the width of the instruction address generator. Loopregs defines the number of hardware looping units. Each hardware looping unit introduces a loop start register, a loop end register, and a loop counter register. The PCU initializes zero-overhead loops with the loop instruction or by writing directly to the loop registers. Nested hardware loops are possible when multiple hardware looping units are present.

Because core parameters do not affect the execution flow structure, the execution sequence is identical in all core versions. But as a result of various parameters and extension instructions, instruction decoding varies in different versions of the core. The programword parameter specifies the instruction word width, which affects the fetch register and the decode logic.

Extensible instruction set. The assembly language instruction set supports both DSP-specific and general-purpose applications. Table 2 lists the minimum instruction set containing 25 instructions, which can be extended to support application-specific features of the data path and additional functional units. A single bit in the instruction format indicates the fetched instruction is to be decoded as an extension instruction. Thus, increasing the word length of the basic instruction set is unnecessary.

Table 2. Instruction set overview.

Arithmetic logic:
  ABS   Absolute value
  ADD   Add two operands
  SUB   Subtract two operands
  CMPZ  Compare operand to zero
  AND   Bitwise logical AND
  OR    Bitwise logical OR
  NOT*  Bitwise logical NOT
  XOR   Bitwise logical XOR
  LSL*  Logical shift left
  LSR   Logical shift right
Multiplier:
  MUL   Multiply
  MAC   Multiply-accumulate
  MNOP  Multiplier no operation
Control:
  LOOP  Start a hardware loop
  J*    Jump to absolute address
  JN    Jump to absolute address if register negative
  JR    Return from subroutine (LR0)
  RETI  Return from interrupt (LR1)
  MV    Move P reg parts to accumulator
  NOP*  No operation
Moves:
  LDC   Load constant to a register
  LDX   Load register from X memory
  LDY   Load register from Y memory
  STX   Store register to X memory
  STY   Store register to Y memory

* Instruction is an assembly language macro.

The basic core includes 18 registers. Four accumulators and special operands Null and Ones are available with the arithmetic-logic instructions. Additionally, the ALU can use the C reg and D reg multiplier registers if the enablecd parameter is set. Accumulators A2 and A3 and index registers I8 to I15 may or may not be available in parameterized core versions. However, this does not affect the instruction set or the supported addressing modes.

The core performs data memory accesses using indirect addressing. It can use the index registers independently or as index register pairs consisting of a base address and a modifier. Three postmodification types are available: linear, modulo, and bit-reversed. All core versions support the basic linear addressing mode. The parameter modifieronly forces odd-numbered index registers to be used only as modifier registers.

Software tools

Our parameterized software development tools consist of a symbolic macro assembler, a disassembler, a linker, an archiver, and an instruction set simulator. With portability in mind, we programmed the tools in standard ANSI C. As a result, the tools are available for multiple platforms, ranging from Unix workstations (HP-UX, Solaris) to PCs.

The ISS has two user interfaces: command-line-oriented and graphical. Figure 4 shows the graphical interface. Both versions provide a cycle-based simulation of the core and attached memories. The ISS executes at a rate of 70,000 to 90,000 instruction words per second on a 200-MHz Pentium Pro. Users can view the pipeline state and memory and core register contents at any time. The ISS allows scheduling of any number of interrupt and reset events and maintains a cycle counter and an operation counter during simulation runs. It also supports profiling, which facilitates program optimization by providing the application programmer accurate runtime data. The programmer can use profiling statistics to observe instruction utilization and to find parts of the code that execute most often and thus benefit the most from optimization.

If manual code refinement is not sufficient, the programmer can specify an extension instruction to accelerate program execution to an acceptable level.

[Figure 4. X Window version of the instruction set simulator.]

Three files control the configuration of the ISS. The memory description file defines the kind of memory blocks attached to the core in the simulator. In addition to normal RAM and ROM, the programmer can specify special memory blocks, such as dual-port memories. Memory-mapped file I/O supplies input data for the application, and the resulting output data is saved to a file. This makes it possible to quickly verify the results of several simulation runs. The hardware description file specifies each core parameter. This file is also used by the HDL models of the core (VHDL, M) and the assembler. The extension description file defines extension instructions. The assembler encodes these instructions on the fly by examining the extension description file, but the additional functionality must be programmed into the ISS.

To evaluate a core configuration, we program the application using the available resources of the core and then simulate the application with the ISS. The simplest way to experiment with various core parameters is to change the parameters of the memory and hardware description files. The simulator automatically configures the execution units' functionality using the parameters in these files. To introduce new extension instructions, we must create a new version of the simulator. We describe the bit-accurate behavior of these instructions by compiling special modules written in C language. By linking all the common simulator modules with these compiled extension modules, we generate a new simulator. Now, with the new extension instructions available, we can evaluate application performance of the modified configuration.

Application performance optimization

To some extent, system engineers can trade core implementation performance for area efficiency, using the flexible software tools to find the application's minimum hardware requirements. For instance, choosing a data word width of 12 bits instead of 16 will reduce the data path and data memory area roughly 25%. Excluding a hardware looping unit eliminates three registers and the end-of-loop evaluation logic, which occupy 50% of the instruction address generator. Apart from area efficiency, hardware reductions also decrease power consumption. In addition, data path extensions and inclusion of custom off-core hardware affect area-performance figures.

If an application's performance is insufficient, even simple and inexpensive hardware extensions may bring it to an acceptable level. On the other hand, adding complex extensions for demanding applications is a straightforward task with this architecture. The engineer can iteratively explore the core's configuration until it meets requirements. The process of setting parameters and developing application-specific core extensions can be considered a hardware-software partitioning task.

Consider the following example of an application requiring saturation arithmetic. The basic core configuration does not support this kind of arithmetic, so we must use an assembly language macro for the saturation operation (Figure 5, next page). When the number of saturation operations is high enough, it is tempting to use an ALU extended with the saturation features. By defining new instructions ADDS and SUBS, we can execute the saturating addition and subtraction in a single instruction cycle.


    ...
    SUB  a2,a1,a0       // actual subtraction a0 = a2 - a1
    XOR  a2,a1,a1       // if operands have the same sign,
                        // over/underflow could not have occurred
    NOT  a1,a1          // a1 gets negative if MSBs are equal
    JN   a1,sub_is_ok   // branch if signs were equal
    XOR  a2,a0,a1       // if result and a2 have the same sign,
                        // over/underflow could not have occurred either
    NOT  a1,a1          // if MSBs of a2 and the result match,
    JN   a1,sub_is_ok   // over/underflow did not occur
    LDC  #0x8000,a1     // load sign mask to a1
    AND  a1,a2,a2       // get a2's sign into a2
    CMPZ a2,a0          // if a2 is positive then a0 = 0xffff, else a0 = 0
    ADD  a0,a1,a0       // a0 = a0 + 0x8000
  sub_is_ok:
    ...                 // saturated result is in a0

Figure 5. Assembly language code for saturating subtraction.

Performance requirements alone will easily force the design space explorer to switch to saturation hardware in saturation-intensive applications. But other core users need not suffer from this hardware overhead and its possible effect on minimum cycle time.

Another example of hardware-software trade-offs is looping, which we can implement in software or in zero-overhead looping hardware. Software looping requires a modification of a loop counter register and a conditional jump on every iteration. Moreover, a register used explicitly as the loop counter cannot be assigned to any other use during the loop. In a hardware implementation, however, the looping unit initializes its registers once and carries out loop counter register modification, end-of-loop evaluation, and conditional jumps automatically thereafter. When the core is performing a large number of relatively short iterations, the difference in performance is significant.

Application example

We can clearly demonstrate the importance of trade-off analysis with an application example. We decided to implement the primary functions specified in the GSM 06.10 speech-transcoding standard8 by programming assembly code for the flexible DSP core and adding application-specific extensions as necessary. The software implements the fundamental signal processing algorithms required in portable cellular telephones for GSM (Global System for Mobile Communication) networks. The algorithms implement the GSM full-rate speech encoder/decoder and the supplementary routines for logarithmic compression (A-law/µ-law), voice activity detection (VAD), and discontinuous transmission (DTX). Due to the nature of the standard, we were forced to select the 16-bit data word width. We evaluated four core configurations:

- case 1: the basic core
- case 2: a core with a hardware looping unit
- case 3: a core with a hardware looping unit and saturation mode
- case 4: a core with a hardware looping unit, saturation mode, and add-with-carry

The goal of our trade-off analysis was to find a core configuration that meets two requirements: It must execute all the GSM speech-coding routines in less than 10 ms, and it must run at the lowest multiple of the 13-MHz system clock specified by the standard. We assumed that implementations in 0.35-µm and 0.6-µm CMOS technologies with 3.3-V and 5-V operating voltages will achieve a 50-MHz clock rate.

Table 3 shows the results of our trade-off analysis for the four cases. We based the silicon area estimates on existing DSP core block implementations and VLSI Solution's commercially available RAM/ROM generators for a double-metal 0.6-µm CMOS process.

The results in Table 3 show that case 3 is acceptable for 26-MHz operation and provides additional headroom for other system tasks. We could improve case 2 by careful optimization of the program code to fulfill the specifications. Similarly, we could optimize case 4 to meet the ultimate 13-MHz boundary with some additional extension instructions. We could even use case 1 with a higher, 39-MHz clock rate, but that would increase power consumption drastically. The figures show that we can cut power consumption approximately 30% to 35% by using more advanced arithmetic than that included in the basic core architecture.

To evaluate the instruction set in a particular application, we must weigh performance against the implementation's area (cost). Figure 6 compares area costs of the four core configurations. The speed-up, power consumption, and cost bars are normalized with respect to the basic core (case 1). Underlying the normalized power consumption estimates is the assumption that power consumption is a linear function of the clock rate and the number of active logic gates in the core. We assumed the number of gates to be proportional to the estimated core area. Interestingly, Table 3 shows that even if the core's area is increasing, the total area decreases slightly. This indicates that the saturation and carry-inclusive arithmetic extensions not only improve performance and reduce power consumption but also decrease overall implementation cost. The reasons for this behavior are that the extra features require very little additional silicon area and that their more compact code fits into a smaller program ROM.

Table 3. GSM application performance. All routines must execute in less than 10 ms.

                                      Case 1      Case 2      Case 3      Case 4
Worst-case runtime by section (cycles)
  GSM 06.10 full-rate transcoder
    Encoder                          193,487     172,208     126,643     109,755
    Decoder                           68,654      63,009      18,967      18,967
  G.711 (A-law/µ-law)                 26,000      25,298      25,298      25,298
  GSM 06.31 DTX handler               10,000       9,788       9,788       9,788
  GSM 06.32 VAD                       21,700      21,380      21,380      20,480
  Total cycle count                  319,841     291,683     202,076     184,258
  Normalized cycle count (speed-up)     1.00        1.10        1.58        1.74
  Lowest feasible clock frequency   32.0 MHz    29.2 MHz    20.3 MHz    18.5 MHz
  % of cycle budget in use @ 50 MHz      63%         58%         40%         36%
Memory usage (words × bits)
  Program ROM                       4,002×32    3,908×32    3,837×32    3,807×32
  X RAM                             1,182×16    1,182×16    1,182×16    1,182×16
  Y RAM                               616×16      616×16      616×16      616×16
  Y ROM                               441×16      441×16      441×16      441×16
Estimated area in 0.6-µm CMOS (mm²)
  Core                                  3.50        3.80        3.82        3.85
  Program ROM                           4.25        4.15        4.10        4.05
  Data memory                           9.20        9.20        9.20        9.20
  Total area                           16.95       17.15       17.12       17.10
  Normalized total area (cost)         1.000       1.012       1.010       1.009

[Figure 6. Normalized comparisons of the four evaluated cases (cases 1-4). Application speed-up: 1.00, 1.10, 1.58, 1.74. Estimated power consumption (core area × number of cycles): 1.00, 0.98, 0.69, 0.63. Cost of speed-up (total area × number of cycles): 1.00, 0.92, 0.64, 0.58.]

THE DSP-CORE ARCHITECTURE described here extends beyond the current state of the art in parameterization and extensibility levels. Not only can system engineers choose peripherals and the basic data word width. They can also configure more advanced parameters such as addressing modes, hardware looping, and various address and data word widths within the core to suit application requirements. With the extension instructions, they can fine-tune existing operations, add new core operations, or use custom logic much like a coprocessor controlled directly by the core control unit. We know of no other existing DSP cores that accommodate such a flexible set of extension mechanisms.

As the speech coding example shows, the architecture is sufficient for executing signal processing algorithms of at least medium complexity. However, it is not sufficient for the most complex signal processing algorithms, since its parallelism cannot always be used efficiently. Some DSP programmers may consider the jump condition set too limited, and the implementation of extended-precision arithmetic is not straightforward. The commercial partner of this project addressed these limitations in a second-generation core called VS-DSP.9 While this core retains the original's level of parameterization and extensibility, its more orthogonal register set and larger selection of branch conditions make programming easier.

The software development tools supporting the parameter space and extension attachment are essential to fine-tuning the core architecture for a specific application. One can regard the parameterized core as a broad DSP-core family, rather than a single core. The implemented software tools adjust successfully to the family's varying features. We also revised the tools to support the VS-DSP instruction set and architecture, proving the flexibility of the software tools.

The elastic DSP core and supporting software tools enable exploration of the application design space. Developers can find the most appropriate division between hardware-supported and software-coded operations for a particular application by experimenting in software before proceeding to a hardware implementation. In addition to optimizing performance, they can balance the use of data and program memory and hardware logic to reach the most economical realization of the application algorithm.


Also, an extension of the DSP core's functional units can replace part of the surrounding logic circuitry of a more conventional ASIC implementation. The extension instructions become an integral part of the DSP core. Thus, an application software developer can effortlessly comprehend how the additional hardware synchronizes and interfaces with the core's execution flow.

Acknowledgments
The DSP-core development was a joint project of VLSI Solution and Tampere University of Technology, both in Tampere, Finland. VLSI Solution and the Technology Development Center TEKES funded the project. We thank Juha Roström of VLSI Solution for providing us information about the speech-coding algorithm implementation.

References
1. P.G. Paulin et al., "Trends in Embedded Systems Technology," Hardware/Software Co-Design, Kluwer Academic, Norwell, Mass., 1996, pp. 311-337.
2. A.J.P. Bogers et al., "The ABC Chip: Single Chip DECT Baseband Controller Based on EPICS DSP Core," Proc. Int'l Conf. Signal Processing Applications and Technology, 1996, pp. 299-302.
3. C. Liem et al., "System-on-a-Chip Cosimulation and Compilation," IEEE Design & Test of Computers, Vol. 14, No. 2, Apr.-June 1997, pp. 16-25.
4. P.D. Lapsley, J.C. Bier, and A. Shoham, Buyer's Guide to DSP Processors, Berkeley Design Technology Inc., Berkeley, Calif., 1995.
5. H. Yagi and R.E. Owen, "Architectural Considerations in a Configurable DSP Core for Consumer Electronics," VLSI Signal Processing VIII, IEEE Press, Piscataway, N.J., 1995, pp. 70-81.
6. R. Woudsma et al., "EPICS: A Flexible Approach to Embedded DSP Cores," Proc. Int'l Conf. Signal Processing Applications and Technology, 1994, pp. 506-511.
7. J. Nurmi, "Portability Methods in Parameterized DSP Module Generators," VLSI Signal Processing VI, IEEE Press, Piscataway, N.J., 1993, pp. 260-268.
8. Recommendation GSM 06.10, GSM Full Rate Speech Transcoding, European Telecommunications Standards Institute (ETSI), Sophia Antipolis, France, 1992.
9. J. Nurmi and J. Takala, "A New Generation of Parameterized and Extensible DSP Cores," Proc. IEEE Workshop on Signal Processing Systems, Leicester, UK, 1997.

Mika Kuulusa is a research scientist in the Signal Processing Laboratory at Tampere University of Technology, Finland. He is working toward the doctor of technology degree. His current research activities focus on hardware-software codesign of systems based on DSP cores. Other areas of interest include embedded-software compilation, logic synthesis, VLSI implementation, and IC design. Kuulusa received his MSc degree in information technology from Tampere University of Technology.

Jari Nurmi is the vice president of VLSI Solution Oy in Tampere, Finland. His research interests include DSP cores and application-specific DSP architectures and their VLSI implementation. Previously, he worked in Tampere University of Technology's Signal Processing Laboratory as leader of the DSP and Computer Hardware Group. Nurmi received his MSc and licentiate of technology degrees in electrical engineering and his doctor of technology degree in information technology from Tampere University of Technology. He is a member of the IEEE.

Janne Takala is an IC designer at VLSI Solution Oy, where he is involved in developing and implementing DSP core architectures. He is also working toward the MSc degree at Tampere University of Technology.

Pasi Ojala is a software engineer at VLSI Solution Oy. Previously, he worked as a research assistant in the Signal Processing Laboratory of Tampere University of Technology. His research interests range from digital system design and low-level programming to writing application software for end users. Ojala received his MSc degree in information technology from Tampere University of Technology.

Henrik Herranen is a software developer at VLSI Solution Oy. Previously, he worked as a research assistant in the Signal Processing Laboratory of Tampere University of Technology, where he is also working toward his MSc degree.

Address questions or comments about this article to Mika Kuulusa, Signal Processing Laboratory, Tampere University of Technology, PO Box 553 (Hermiankatu 12), 33101 Tampere, Finland; mika.kuulusa@cs.tut.fi.


IEEE DESIGN & TEST OF COMPUTERS

PUBLICATION 3

M. Kuulusa, T. Parkkinen, and J. Niittylahti, "MPEG-1 layer II audio decoder implementation for a parameterized DSP core," in Proc. Int. Conference on Signal Processing Applications and Technology, Orlando, FL, U.S.A., Nov. 1-4, 1999 (CD-ROM). Copyright © 1999 Miller Freeman, Inc. Reprinted, with permission, from the proceedings of ICSPAT'99.

MPEG-1 Layer II Audio Decoder Implementation for a Parameterized DSP Core


Mika Kuulusa, Teemu Parkkinen and Jarkko Niittylahti Signal Processing Laboratory, Tampere University of Technology P.O. Box 553 (Hermiankatu 12), Tampere, Finland

Abstract

A compact, fixed-point DSP core can be utilized to realize an MPEG-1 Layer II audio decoder. The firmware for the decoding algorithm was implemented by transforming a floating-point C-language source code into an efficient assembly language code for the DSP. This paper describes our systematic design approach and reviews the program code behavior in light of detailed statistical profiling information.

1. Introduction

MPEG digital audio coding is the audio compression standard utilized in many modern applications, such as digital audio broadcasting (DAB) and digital versatile disc (DVD) players. Since consumer products typically require only the decoding of the compressed audio stream, a successful implementation of the audio decoder becomes imperative. Often the most suitable way to realize the audio decoder is to use a programmable DSP jointly with an optimized assembly language module to perform all the necessary decoding functions. Although an audio decoder implementation utilizing floating-point arithmetic typically results in better quality of the reproduced audio, the cost of floating-point DSPs is clearly prohibitive. Therefore, fixed-point DSPs are utilized to achieve a more cost-effective solution. In our approach, we have taken a flexible fixed-point DSP core as the target platform for our MPEG-1 Layer II audio decoder.

The paper begins with a brief overview of the MPEG audio coding standards. A system architecture incorporating an MPEG audio decoder chip is described, and the development of the audio decoder firmware is presented. The run-time characteristics of the audio decoder implementation are studied in detail. Finally, the conclusions are drawn.

2. MPEG Audio Coding

2.1 Overview

MPEG audio compression algorithms are international standards for digital compression of high-fidelity audio. The MPEG audio-coding standard offers audio reproduction which is equivalent to CD quality (16-bit PCM). MPEG-1 audio covers 32, 44.1, and 48 kHz sampling rates for bitrates ranging from 32 to 448 kbit/s [1]. MPEG-1 audio supports four modes: mono, stereo, joint-stereo, and dual-channel. The standard defines three Layers which fundamentally differ in their compression ratios with respect to the quality of the reproduced audio. For transparent quality, Layer I, Layer II, and Layer III require 384, 192, and 128 kbit/s bitrates, respectively. The MPEG-2 standard introduces several new features, such as an extension for multichannel audio and support for lower sampling frequencies [2]. A more comprehensive description of MPEG audio compression can be found in [3],[4].
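As a quick sanity check of the bitrates quoted above, the compression that the Layer II "transparent" bitrate implies relative to 16-bit stereo CD audio is plain arithmetic (a sketch added for illustration; the function name is ours, not from the paper):

```python
# Compression ratio of MPEG-1 Layer II at 192 kbit/s relative to
# 16-bit stereo PCM at 44.1 kHz (plain arithmetic, illustrative only).

def compression_ratio(fs_hz: int, bits: int, channels: int, coded_bitrate_bps: int) -> float:
    """Uncompressed PCM bitrate divided by the coded MPEG bitrate."""
    return fs_hz * bits * channels / coded_bitrate_bps

# 44.1 kHz 16-bit stereo PCM is 1 411 200 bit/s, so 192 kbit/s
# corresponds to a compression ratio of roughly 7.4:1.
print(round(compression_ratio(44100, 16, 2, 192_000), 2))  # -> 7.35
```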

[Figure 1 omitted: diagram of the MPEG-1 Layer II audio frame, showing a 32-bit Header, an optional 0-16-bit CRC, and the Bit Allocation, SCFSI, Scalefactors, Subband Samples, and Ancillary Data fields of the compressed audio.]

Figure 1. MPEG-1 Layer II audio frame structure.
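The total frame length discussed in Section 2.2 follows from the standard Layer II slot formula, sketched below (hedged: based on ISO/IEC 11172-3; the optional padding slot is ignored, so this gives the unpadded frame length):

```python
# Layer II frame length in bytes from bitrate and sampling rate.
# A Layer II frame carries 1152 samples and a slot is one byte,
# giving 1152 / 8 = 144 slots per frame per (bitrate / fs).
# Sketch only; the padding slot defined by the standard is ignored.

def layer2_frame_bytes(bitrate_bps: int, fs_hz: int) -> int:
    return 144 * bitrate_bps // fs_hz

# A 44.1 kHz / 192 kbit/s stream yields frames of about 620 bytes,
# matching the figure quoted in the text.
print(layer2_frame_bytes(192_000, 44100))  # -> 626
```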


2.2 Frame Structure

MPEG-1 Layer II audio is based on a frame structure that is depicted in Figure 1 [4]. A single frame corresponds to 1152 PCM audio samples. The frame begins with a header that carries a 12-bit synchronization word and a 20-bit system information field. The system information specifies the details of the audio data contained in a frame. An optional 16-bit CRC field is used for error detection. The CRC field is followed by the compressed audio which is divided into fields for subband bit allocation, scalefactor format selection information, scalefactors, and the actual subband samples. The total size of a frame depends on the sampling frequency and bitrate. For example, the frame size is about 620 bytes for a 44.1 kHz/192 kbit/s stream. The frames are autonomous, i.e., each frame contains all information necessary for decoding.

2.3 Audio Decoding

The flow chart for MPEG-1 audio decoding is shown in Figure 2 [1]. The decoding algorithm begins by reading the frame header. The bit allocation and scalefactors for the coded subband samples are then decoded. The coded subband samples are requantized and passed to a synthesis subband filter which uses 32 subband samples to reconstruct 32 PCM samples. In addition to various array manipulations, the main operations in the synthesis subband filter involve matrixing and windowing operations. The matrixing operation applies an inverse discrete cosine transform (IDCT) to map the frequency domain representation back into

[Figure 2 omitted: flow chart with the steps Decoding of Bit Allocation, Decoding of Scalefactors, Requantization of Samples, and Synthesis Subband Filtering (shifting, matrixing, building a 512-value vector, windowing by 512 coefficients, calculating 32 samples), ending with the output of 32 reconstructed PCM samples.]

Figure 2. MPEG-1 audio decoder flow chart for Layer I and II bit streams.

the time domain. The windowing operation performs the necessary filtering within a window of 512 samples.

3. Audio Player System Architecture

MPEG audio coding can be employed in a portable audio player which utilizes a large non-volatile memory for audio storage. Our design objective was to integrate a fixed-point DSP core together with program/data memories and a set of peripherals on a single chip. This allows development of a portable audio player device which is based on an MPEG audio decoder system chip, shown in Figure 3. The dedicated peripherals include two 16-bit digital-to-analog converters, a Universal Serial Bus (USB) interface, and some additional hardware for the user interfaces. The MPEG audio decoder chip is complemented with a large external Flash

[Figure 3 omitted: block diagram of the MPEG audio decoder chip, containing the DSP core, program memory, X and Y data memories, keyboard interface, display controller, USB interface, miscellaneous peripherals, and two 16-bit DACs, connected to an external 96 MB Flash memory.]

Figure 3. Block diagram of the MPEG audio decoder chip with an external Flash memory.

memory. A 96 MB Flash memory can hold approximately 70 minutes of 192 kbit/s audio streams. Moreover, the external memory contains the program code and various data arrays which are downloaded into the on-chip memories during the system initialization. The DSP core is based on a modified Harvard architecture with a separate program memory and two data memories [5]. The execution pipeline contains three stages which effectively fetch, decode, and execute 32-bit instruction words. The DSP core has a flexible architecture that allows a number of parameters to be changed to suit the specific needs of the application at hand [6]. The adjustable data word length allows us to increase the dynamic range of the calculations should the audio decoder fail to reach satisfactory audio quality. However, a 16-bit data word width was selected for our initial implementation. In this configuration, the processor datapath contains eight 16-bit registers which can also be used as four 40-bit accumulators. Eight data address registers can be employed to realize indirect data memory accesses together with various post-modification operations. Moreover, the processor supports zero-overhead program looping.
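The 40-bit accumulators mentioned above leave 8 guard bits over a full 32-bit product, which is what makes long MAC loops safe at 16-bit precision. A rough model of such a MAC step (illustrative only; the names and wrap-around behavior are assumptions, not the VS_DSP instruction set):

```python
# Rough model of the datapath's fixed-point MAC: 16-bit operands are
# multiplied and summed into a 40-bit accumulator, so a 32-bit product
# leaves 8 guard bits of headroom. Sketch only, not the real hardware.

MASK40 = (1 << 40) - 1

def to_signed(value: int, bits: int) -> int:
    """Reinterpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac40(acc: int, x: int, y: int) -> int:
    """acc <- acc + x*y, with 16-bit operands and a 40-bit accumulator."""
    product = to_signed(x, 16) * to_signed(y, 16)
    return to_signed((acc + product) & MASK40, 40)

# Eight guard bits allow at least 2**8 = 256 worst-case products to be
# accumulated before overflow, so e.g. a 32-tap MAC loop is always safe:
acc = 0
for _ in range(32):
    acc = mac40(acc, 0x7FFF, 0x7FFF)
print(acc)  # -> 34357641248, well within the signed 40-bit range
```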

4. Audio Decoder Implementation

Several C-language audio decoders were extensively studied to facilitate the implementation in assembly language. Based on various experiments, a floating-point C source code was selected for further refinement. A systematic approach was taken to transform this floating-point version into an efficient assembly language program. First, the floating-point C-language decoder was modified to employ 16-bit arithmetic operations and data values instead of single-precision floating-point. The fixed-valued data arrays employed in the matrixing and windowing operations were scaled and truncated to fixed-point representations which provided satisfactory audio quality. However, certain operations had to be carried out with 32×16-bit multiply-accumulate (MAC) operations. These operations were performed by using an assembly macro that executes the multiplication with four instructions. An alternative way to realize these MACs would be to extend the length of the data word. However, this was not found necessary since the criteria for decoder performance and audio quality were fulfilled, and thus the additional cost was not justified. The 32-point IDCT operations in

the synthesis subband filter were effectively realized with Lee's fast algorithm [7]. The fast algorithm reduces the original 2048 multiply-accumulate operations to 80 multiplications and 209 additions. The optimized fixed-point C-language program served as a bit-exact functional representation for our implementation that was hand-coded in assembly language for the target DSP. Since all the calculations were performed with 16-bit arithmetic operations, the modified C-language program allowed a straightforward conversion from C language to DSP assembly code. The assembly language implementation of the MPEG-1 Layer II audio decoder has a program size of 2272 words. Data memory usage is 12325 words, of which 74% is used to accommodate various fixed-valued data tables needed in the audio frame decoding and the synthesis subband filtering.

5. Audio Decoder Performance

The audio decoder implementation was evaluated with different types of audio material that were encoded into audio streams at bitrates ranging from 64 kbit/s to 320 kbit/s. The firmware was simulated with a cycle-accurate instruction-set simulator of the DSP core. Table 1 shows the results for 44.1 kHz stereo streams. For these streams, it was assumed that a total of 39 frames has to be decoded in one second. In order to get a worst-case estimate, several 5-second streams were decoded and the longest run-time for one frame was multiplied by 39. The variation in the clock cycles per frame is not very large, typically less than 3% of the worst-case run-times. Depending on the bitrate and sampling frequency of the audio, real-time decoding can be achieved at relatively low processor clock frequencies. For 44.1 kHz stereo audio at bitrates less than 192 kbit/s, a clock

frequency of 25 MHz is sufficient, providing additional capacity for other system tasks.

Table 1: Worst-case decoder run-times for one second of 44.1 kHz stereo audio.

    Bitrate (kbit/s)    Decoder Run-Time (clock cycles)
    64                  21 225 000
    128                 23 435 000
    192                 24 575 000
    256                 25 688 000
    320                 26 156 000
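Since the counts in Table 1 are worst-case cycles needed per second of audio, dividing by one second gives the minimum real-time clock frequency directly; this is how the 25 MHz figure arises (plain arithmetic on the table values):

```python
# Minimum clock frequency implied by Table 1: the counts are worst-case
# cycles per second of audio, so cycles / 1e6 is the required MHz.

worst_case_cycles = {
    64: 21_225_000,
    128: 23_435_000,
    192: 24_575_000,
    256: 25_688_000,
    320: 26_156_000,
}

for bitrate_kbps, cycles in worst_case_cycles.items():
    print(f"{bitrate_kbps} kbit/s: {cycles / 1e6:.2f} MHz")

# At 192 kbit/s about 24.6 MHz is needed, so a 25 MHz clock leaves a
# small margin; 256 and 320 kbit/s streams would exceed 25 MHz.
```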

Table 2 shows the percentages and the number of clock cycles for the main functions in the audio decoder.

Table 2: Processing requirements for the main audio decoder functions.

    MPEG-1 Layer II Decoder Function    %       Clock Cycles*
    Decoding of Bit Allocation          1.4     350 000
    Decoding of Scalefactors
    Requantization of Samples           19.9    4 975 000
    Matrixing                           42.9    10 725 000
    Windowing                           35.0    8 750 000
    Input/Output                        0.7     175 000
    Other                               0.1     25 000
    All                                 100     25 000 000

    * Decoding time of 25 000 000 clock cycles assumed.
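The shares quoted in the surrounding text can be re-derived from the cycle counts of Table 2 (plain arithmetic on the table values):

```python
# Cross-check of Table 2: matrixing and windowing together dominate,
# and requantization takes about one fifth of the decoding time.

total = 25_000_000
cycles = {
    "matrixing": 10_725_000,
    "windowing": 8_750_000,
    "requantization": 4_975_000,
}

synthesis = (cycles["matrixing"] + cycles["windowing"]) / total
print(f"synthesis subband filtering: {synthesis:.1%}")             # -> 77.9%
print(f"requantization: {cycles['requantization'] / total:.1%}")   # -> 19.9%
```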

As expected, the matrixing and windowing operations in the synthesis subband filtering dominate, consuming roughly 78% of the decoding time. The requantization of samples takes 20% and the remaining 2% of the clock cycles is spent in input/output operations and functions that are performed only once per frame. By investigating Table 2, the decoder

functions that benefit most from further optimization are clearly the matrixing and windowing operations. For example, the windowing operation has a program kernel consisting of blocks of five instructions. This kernel contributes roughly 30% of the total run-time of the decoder. If a 24-bit data word were used, it would be possible to realize the kernel with only two instructions. Effectively, this modification would cut 24%, or about 6 million clock cycles, from the decoding time. On the other hand, the quality of the reproduced audio could be improved by employing block floating-point arithmetic in the synthesis subband filter [8]. However, block floating-point arithmetic would increase the number of clock cycles needed in the audio decoding.

6. Conclusions

An MPEG-1 Layer II audio decoder was successfully realized for a fixed-point DSP core with a 16-bit data word. A 25 MHz processor clock frequency was found sufficient to accomplish decoding of 44.1 kHz stereo audio streams at a 192 kbit/s bitrate. The audio decoder implementation provides audio reproduction with good perceptual quality. The developed firmware can be utilized in an integrated MPEG audio decoder chip which offers a cost-effective audio decoding solution for a wide range of consumer electronics applications.

7. Acknowledgments

The research work has been co-funded by the National Technology Agency of Finland and several companies from the Finnish industry. The support received from VLSI Solution Oy is gratefully acknowledged.

References

[1] ISO/IEC 11172-3, Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s - Part 3: Audio, Standard, International Organization for Standardization, Geneva, Switzerland, Mar. 1993.
[2] ISO/IEC 13818-3, Information Technology - Generic Coding of Moving Pictures and Associated Audio: Audio, Standard, International Organization for Standardization, Geneva, Switzerland, Nov. 1994.
[3] P. Noll, "MPEG Digital Audio Coding," IEEE Signal Processing Magazine, vol. 14, no. 5, pp. 59-81, Sep. 1997.
[4] D. Pan, "A Tutorial on MPEG/Audio Compression," IEEE Multimedia, vol. 2, no. 2, pp. 60-74, Summer 1995.
[5] VLSI Solution Oy, VS_DSP Specification Document, rev. 0.8, Nov. 1997.
[6] J. Nurmi and J. Takala, "A New Generation of Parameterized and Extensible DSP Cores," in Proc. IEEE Workshop on Signal Processing Systems, Leicester, United Kingdom, Nov. 1997, pp. 320-329.
[7] K. Konstantinides, "Fast Subband Filtering in MPEG Audio Coding," IEEE Signal Processing Letters, vol. 1, no. 2, pp. 26-28, Feb. 1994.
[8] R. Ralev and P. Bauer, "Implementation Options for Block Floating Point Digital Filters," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Munich, Germany, Apr. 1997, pp. 2197-2200.

PUBLICATION 4

M. Kuulusa, J. Nurmi, and J. Niittylahti, "A parallel program memory architecture for a DSP," in Proc. Int. Symposium on Integrated Circuits, Devices & Systems, Singapore, Sep. 10-12, 1999, pp. 475-479. Copyright © 1999 Nanyang Technological University, Singapore. Reprinted, with permission, from the proceedings of ISIC'99.

A PARALLEL PROGRAM MEMORY ARCHITECTURE FOR A DSP


Mika Kuulusa, Jari Nurmi and Jarkko Niittylahti
Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553 (Hermiankatu 12), FIN-33101 Tampere, Finland. Tel. +358 3 3652111, Fax +358 3 3653095, E-mail: mika.kuulusa@cs.tut.fi

Abstract: This paper describes an approach where a DSP core is coupled with a parallel program memory architecture to allow rapid program execution from a number of slow memory banks. The slow read access time to the memory banks may be due to lowered supply voltages, or it can be a property of the memory technology itself. Thus, the approach has two benefits: 1) it allows program execution directly from a non-volatile memory to reduce the system cost, and 2) lower supply voltages can be employed in low-power applications. The suitability of the memory architecture is evaluated with assembly language implementations of an MPEG audio decoder and a GSM speech codec. The results show that the speed-up of a highly sequential program code is directly proportional to the number of memories, whereas in a more complex application only a 2x speed-up is achievable.
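The basic trade-off behind this approach fits in one line: N interleaved banks with read access time t_acc sustain one instruction per cycle (for sequential code) up to a clock of N / t_acc. A small sketch (the function is ours; the 100 MHz / 40 ns example comes from the paper's conclusions):

```python
# Maximum sustainable clock for sequential fetch from N interleaved
# banks with a given read access time (sketch of the paper's trade-off).

def max_clock_mhz(num_banks: int, access_time_ns: float) -> float:
    # One bank delivers 1/t_acc words per second; N banks deliver N/t_acc.
    return num_banks / access_time_ns * 1000.0

print(max_clock_mhz(4, 40.0))  # -> 100.0 (four 40 ns flash banks)
print(max_clock_mhz(2, 40.0))  # -> 50.0
```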

1. INTRODUCTION

Most embedded systems contain a non-volatile memory for permanent storage of the application firmware that is executed by programmable processors. Reprogrammability of this memory has become one of the key requirements because it allows firmware updates later in the design cycle or even in the field. Of the current non-volatile memory technologies, low-cost high-capacity flash memory devices have gained widespread acceptance in DSP-based embedded systems, such as cellular phones. Because flash memory devices are inherently slow, currently providing read access times in the 40-70 ns range, program execution cannot be carried out directly from the flash memory. During the system initialization, the program code is copied entirely, or partly, to an on-chip program memory to enable rapid program execution. Moreover, in low-power applications the access times to on-chip SRAM memories tend to increase significantly when lower supply voltages are employed. If the program code could be executed directly from the non-volatile memory, meaningful cost savings could be realized since the separate fast program memory could be eliminated from the system. In addition, a low-power program memory could be realized if there were some means to compensate for the slow read access time. The effective program memory bandwidth, however, can be increased if the read accesses are performed in parallel, i.e., several instruction words are read simultaneously. In this paper, a parallel program memory architecture for a DSP core is presented. To allow reasonable evaluation of the parallel memory architecture, a behavioral-level hardware model of a commercial DSP core was used in the development. Two applications were used to analyze the suitability of

the memory architecture, and the effect on the program execution with this particular DSP was studied.

2. PROGRAM EXECUTION IN THE DSP

A fixed-point DSP core, designated VS_DSP [1], was chosen as the target processor for a parallel program memory architecture. Program execution is based on a shallow three-stage pipeline comprising instruction fetch, decode, and execute phases. All instructions effectively execute in one clock cycle. The DSP core incorporates three main blocks: a program control unit (PCU), datapath, and data address generator (DAG). A detailed description of the DSP core can be found in [2,3]; reference [4] presents the first-generation DSP core. The operation of the PCU is illustrated in Figure 1. Depending on the current processor state and the decoded instruction, the next instruction fetch address may come from a variety of sources:
- the incremented program counter (PC)
- the target address of a branching instruction
- subroutine or interrupt return address registers
- the loop start address register, and
- interrupt or reset vector addresses.
Typically, the next address is fetched from the incremented PC to carry out sequential execution of the program code. Discontinuity in the sequential program flow is caused by branching/return instructions, or it may result from activity of the looping hardware and the interrupt controller. Target addresses for conditional and unconditional branching instructions are embedded into the 32-bit instruction word. Other possible addresses are either fixed values or they are fetched from dedicated registers.
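The next-address selection described above can be modeled as a simple priority choice among the listed sources. In the sketch below, the two vector addresses (reset 0x0000, interrupt 0x0008) appear in Figure 1; the function name and priority order are illustrative assumptions, not the documented hardware behavior:

```python
# Simplified model of next-fetch-address selection in the PCU:
# exactly one source wins each cycle. Illustrative sketch only.

RESET_VECTOR = 0x0000
INTERRUPT_VECTOR = 0x0008

def next_fetch_address(pc, *, reset=False, interrupt=False,
                       return_address=None, branch_target=None,
                       loop_start=None):
    if reset:
        return RESET_VECTOR
    if interrupt:
        return INTERRUPT_VECTOR
    if return_address is not None:   # subroutine or interrupt return
        return return_address
    if branch_target is not None:    # conditional/unconditional branch
        return branch_target
    if loop_start is not None:       # hardware loop wraps to its start
        return loop_start
    return pc + 1                    # default: sequential execution

print(hex(next_fetch_address(0x0100)))                        # -> 0x101
print(hex(next_fetch_address(0x0100, branch_target=0x0200)))  # -> 0x200
```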

[Figure 1 omitted: block diagram of the program control unit showing the next-address multiplexer fed by the incremented program counter, branching target address, subroutine return address (LR0), interrupt return address (LR1), loop start address (LS), interrupt vector address 0x0008, and reset vector address 0x0000.]

Fig. 1. Possible sources for the next program address in the DSP core.

[Figure 2 omitted: general parallel memory architecture with an address generator driven by the selected access format, N memory banks with read/write control, and a permutation network on the data input/output.]

Fig. 2. General parallel memory architecture.
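The bank organization that Section 3.2 later specifies formally can be sketched as follows: instruction words are interleaved across N banks so that one parallel access returns N consecutive instructions (an illustrative software model, not the hardware implementation):

```python
# Sketch of bank-interleaved program storage: the instruction at
# absolute address row*N + i is stored at location `row` of bank i,
# so one parallel access to a row returns N consecutive instructions.

N = 4
program = list(range(32))  # stand-in "instructions": each equals its address

banks = [[program[row * N + i] for row in range(len(program) // N)]
         for i in range(N)]

def parallel_fetch(row: int) -> list:
    """One parallel read: the same row from every bank."""
    return [banks[i][row] for i in range(N)]

print(parallel_fetch(3))  # -> [12, 13, 14, 15]
```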

In the DSP core, a non-sequential program memory access resulting from a jump instruction (i.e., a branching or return instruction) is performed using the delayed branching method, i.e., the instruction following a jump instruction is always executed. The execution overhead arising from this approach is acceptable since, in typical applications, 80-90% of the delay slots can be fitted with a useful instruction. For example, the delay slot can be utilized to store a subroutine return address or to pass one of the subroutine arguments. When the pipelined execution flow is considered, another problematic issue is the operation during interrupts. As soon as an interrupt is detected, a fetch from the fixed interrupt vector address is issued. Now, the pipeline has two instructions in the decode and execute stages. The PCU selectively picks out the correct interrupt return address from the following options: the address of the instruction (in decode), the jump instruction target address (in execute), or the loop start address. If the first option is chosen, the instruction in the decode stage has to be canceled. However, this instruction is executed normally if either 1) the instruction in execute is a jump to be taken, or 2) the instruction was fetched from the loop end address and a new iteration should be taken.

3. PARALLEL PROGRAM MEMORY ARCHITECTURE

A general block diagram of a parallel memory architecture is depicted in Figure 2. The memory architecture comprises an address generator unit, a permutation network, and a total of N memory banks [5]. Depending on the selected access format, the address generation unit provides a memory address for each of the memory banks. An access format can be understood as a template that is positioned on a two-dimensional representation of the entire memory space. Common access formats are a row, a rectangle, and a

diagonal line, for example. The permutation network is required to rearrange data values so that the input or output can be manipulated in the correct order.

3.1. Parallel Program Memory

A suitable architecture for a parallel program memory can be derived from the general architecture by considering the pipelined operation of the DSP core. Such a memory architecture is depicted in Figure 3. The PCU operation is modified to contain all the necessary functions of the address generator. A pipelined read access to a parallel memory with N memory banks is specified to last a total of N clock cycles. Therefore, it is possible to issue N individual addresses to the memory banks so that only a non-sequential memory access will result in a memory access penalty. In processor clock cycles, this penalty is N-1 clock cycles. The permutation network can be replaced with an N-to-1 multiplexer which is controlled by the stream of absolute program memory addresses that are sequenced through N-1 shift registers. Moreover, loop start addresses needed in the initialization of the hardware looping are acquired from one of the shift registers.

3.2. Program Code Mapping

The program code is interleaved to the memory banks [6]. Let us consider a value of N which is a power of two (N = 2^M). Instruction words are interleaved to the memory banks with the following mapping:

    Mi(addr) = P(addr * N + i)    (1)

where Mi(x) are the contents of memory location x in memory bank i, P(y) is the instruction word in the absolute program memory address y, i = [0, N-1], and addr = [0, program_address_space/N - 1]. By using this mapping, a parallel read access to the address K results in an instruction block containing the following N instruction words:

    [ M0(K) M1(K) M2(K) ... MN-1(K) ] = [ P(A) P(A+1) P(A+2) ... P(A+N-1) ]    (2)

where A = K*N. In other words, the result is the N sequential instruction words starting from absolute program memory address A. If all memory accesses could be aligned to N-word boundaries, a single address could be issued. But since the PCU can selectively issue new addresses to the memory banks, the pipelined program memory access is straightforward to implement. Conceptually, the single-cycle instruction fetch stage is stretched to cover N processor clock cycles.

[Figure 3 omitted: the program control unit issues individual instruction addresses to four memory banks through delay registers; the four 32-bit bank outputs feed a multiplexer driving the instruction data bus.]

Fig. 3. Parallel program memory architecture suitable for pipelined memory accesses (N=4).

An interesting option in the presented memory architecture is that there is a straightforward way to support memory architectures where N is not a power of two. The use of such a program memory only requires a few additional steps in the program code assembly, and a minor change to the generation of the sequential program memory addresses in the PCU.

4. IMPLICATIONS ON THE PROGRAM MEMORY ADDRESSING

4.1. Branching/Return Instructions

From the program execution point of view, the objective was to make the branching/return instructions function exactly in the same manner as in the single-memory case. Therefore, the execution of the instruction in the delay slot remains the same, but due to the non-sequential memory access latency, N-1 instructions after the delay slot have to be cancelled.

4.2. Interrupt Operation

Interrupt operation in the parallel memory architecture is mainly constrained by the hardware looping operation because the pipeline may contain instructions from a new loop iteration. To enable correct operation, the interrupt return address has to be determined sequentially by examining the instructions in the pipeline, in a similar way as described in Section 2. This leads to a worst-case interrupt latency of (1 + 2N) clock cycles, whereas the minimum latency is (N+2). Interrupt latency is defined as the time needed from the interrupt detection to the execution of the first instruction of the interrupt service routine. Clearly, the actual interrupt overhead in a certain application depends on the rate at which the interrupts occur. The overhead is not an issue when the interrupt rate is relatively low.

4.3. Hardware Looping

In order to avoid complications in the hardware looping, a loop body, i.e., the program code in the loop, must always contain a number of instructions which is a multiple of N. However, if the number of iterations is a constant value known at program compile time, this restriction can be avoided with loop unrolling. In loop unrolling, a new loop body is constructed by replacing the original code with several copies of the loop body, and adjusting the number of loop iterations appropriately. In this way the resulting loop body is a multiple of N, and the overhead arising from the parallel memory architecture is minimized. Unfortunately, loop unrolling can be employed only to a certain extent, thus in some cases a loop body must be padded with no-operations, resulting in a very undesirable overhead.

5. EXPERIMENTAL RESULTS OF THE MEMORY ARCHITECTURE

Two different applications were used to evaluate the performance of the parallel memory architecture: a GSM half rate speech codec and an MPEG-1 Layer II audio decoder [7][8]. The GSM half rate speech encoder compresses a 13-bit speech signal sampled at 8 kHz into a 5.6 kbit/s information stream. Both the GSM half rate encoder and decoder were run sequentially during the experiments. The MPEG-1 Layer II decoder was used to reconstruct 16-bit audio samples (44.1 kHz, stereo) from a 128 kbit/s compressed audio data stream.

The experiments were carried out by running an extensive program trace from both of the applications using an instruction-set simulator of the DSP core. The simulator allows cycle-accurate simulation of the applications and generates profiling information on the dynamic behavior of the program code. The program traces were analyzed with automated scripts that calculate the number of jump instructions and the number of no-operation instructions required to adjust the hardware looping sections. Loop unrolling was not applied in the applications. The application performance was calculated for memory configurations that have 1 to 8 memory banks. The results from the GSM half rate test are shown in Figure 4. Three curves illustrate the performance in cases with no jump/looping overhead

[Figures 4 and 5 omitted: relative performance of the GSM half rate codec ("GSM Half Rate Codec Performance (avg.)") and of the MPEG-1 Layer II decoder ("MPEG-1 Layer II Decoder Performance (avg.)") as a function of the number of memory banks (1 to 8), each with three curves: ideal, with jump overhead, and with jump and looping overhead.]

Fig. 4. Results of the GSM speech codec test.

Fig. 5. Results of the MPEG audio decoder test.

(ideal), and with jump and jump/looping overhead. In the GSM test, 40 seconds of speech were first encoded and then decoded. Due to the complex control flow of the GSM half rate algorithms, only the memory configurations with 2 and 3 memory banks seem to be viable. The results from the MPEG-1 Layer II test are depicted in Figure 5. Four 5-minute streams of compressed audio served as the test input to the MPEG decoder. The speed-up in the performance follows the ideal curve very closely. This can be explained by the highly sequential structure of the program code. The performance penalty resulting from the non-sequential program behavior cannot be avoided. However, most of the hardware loops can be restructured so that the performance gets closer to the curve that includes only the branching overhead.

6. CONCLUSIONS

The parallel program memory architecture presented in this paper can be used to allow fast program execution directly from a number of slow memories. The implementation overhead in the DSP is reasonable, requiring only minor modifications to the program control unit. In addition, the N parallel memory banks need N-1 registers and an N-to-1 multiplexer to realize the parallel memory accesses. As the two application examples show, the performance of the architecture depends strongly on the control-flow behavior of the program code. Whereas the GSM half rate codec was quite ineffective with the parallel program memory architecture, the MPEG audio decoder was able to execute very efficiently due to simple control structures in the program code. As seen from the results, memory architectures with 2 or 4 memory banks seem to be feasible in practice. For example, a DSP core clock frequency of 100 MHz can be achieved by using four parallel memory banks that have a 40 ns read access time. To summarize, a successful

employment of the presented parallel memory architecture calls for an application that can be implemented with highly sequential program code.

7. ACKNOWLEDGMENTS

The research project has been co-funded by the Technology Development Center (TEKES) and several companies from the Finnish industry. The authors wish to thank Janne Takala, Juha Roström, and Teemu Parkkinen for their valuable contributions to the research. The VS_DSP development environment provided by VLSI Solution Oy is gratefully acknowledged.

REFERENCES
[1] VS_DSP Core, Product Datasheet, Version 1.2, VLSI Solution Oy, Finland, February 1999.
[2] J. Takala, P. Ojala, M. Kuulusa, and J. Nurmi, "A DSP Core for Embedded Systems," Proc. IEEE Workshop on Signal Processing Systems (SiPS'99), to appear in 1999.
[3] J. Nurmi and J. Takala, "A New Generation of Parameterized and Extensible DSP Cores," Proc. IEEE Workshop on Signal Processing Systems (SiPS'97), IEEE Press, 1997, pp. 320-329.
[4] M. Kuulusa, J. Nurmi, J. Takala, P. Ojala, and H. Herranen, "A Flexible DSP Core for Embedded Systems," IEEE Design & Test of Computers, Vol. 14, No. 4, October/December 1997, pp. 60-68.
[5] M. Gössel, B. Rebel, and R. Creutzburg, Memory Architecture & Parallel Access, Elsevier Science, Amsterdam, The Netherlands, 1994.
[6] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, New York, USA, 1984.
[7] Digital cellular telecommunications system; Half rate speech transcoding (GSM 06.20), EN 300 969, European Telecommunications Standards Institute (ETSI), Sophia Antipolis, France, 1999.
[8] ISO/IEC 11172-3, Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3: Audio, International standard, International Organization for Standardization, Geneva, Switzerland, March 1993.

PUBLICATION 5

J. Takala, M. Kuulusa, P. Ojala, and J. Nurmi, "Enhanced DSP core for embedded applications," in Proc. Int. Workshop on Signal Processing Systems: Design and Implementation, Taipei, Taiwan, Oct. 20-22, 1999, pp. 271-280. Copyright © 1999 IEEE. Reprinted, with permission, from the proceedings of SiPS'99.

ENHANCED DSP CORE FOR EMBEDDED APPLICATIONS


J. Takala, M. Kuulusa, P. Ojala, J. Nurmi

VLSI Solution Oy Hermiankatu 6-8 C FIN-33720 Tampere Finland

Tampere University of Technology P.O. Box 553 FIN-33101 Tampere Finland

Abstract This paper describes a set of enhancements that were implemented to a 16-bit DSP core. The added features include several new instructions, extended program/data address spaces, vectored interrupts, and improved low-power operation. The embedded system development flow was reinforced with an optimizing C-compiler and a compact real-time operating system.

1. INTRODUCTION
Low-cost embedded-system products typically utilize a general-purpose microprocessor to accomplish a variety of system functions. Even though the performance of current microprocessors is rapidly increasing, computation-intensive tasks often need to be carried out with a digital signal processor (DSP) to enable real-time execution of the applications. Thus, many systems contain two processors. This duality complicates the software development because two different sets of software tools are needed. Moreover, there are inherent synchronization issues in a dual-processor system.

Several embedded microprocessors have been coupled with a high-performance datapath for DSP operations [1],[2], but it seems that this approach has not found very wide acceptance because the resulting programming model is quite complicated. According to our observations, a typical embedded DSP application utilizes roughly 90% and 10% of clock cycles for DSP and control functions, respectively. For these computation-intensive DSP tasks, a DSP core designed for efficient execution of mixed signal processing/control code becomes an attractive choice.

The traditional software development flow for DSPs has been based heavily on assembly language programming. A major increase in productivity can be achieved by using a high-level language to compile the control functions. Moreover, support for a real-time operating system alleviates the development process of complex embedded applications.

This paper gives a comprehensive presentation of the further development of a commercial DSP core. First, the initial version of the DSP core is reviewed briefly. Detailed design objectives are declared and the selected enhancements are described. Then, a C-compiler and real-time operating system developed for the enhanced architecture are reviewed. Embedded system development flow is presented and, finally, the conclusions are drawn.
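The 90/10 cycle split quoted above is essentially an argument from Amdahl's law: overall speedup is bounded by the fraction of cycles the faster unit can absorb. A minimal sketch of that reasoning; the 4x kernel speedup used in the usage note is an assumed figure for illustration, not a measurement from this work:

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction f of the cycles is
   accelerated by a factor s, with the remainder running unchanged. */
double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

With 90% of the cycles in DSP code, a 4x faster kernel yields roughly a 3.1x overall speedup, whereas the same 4x applied only to the 10% control portion yields less than 1.1x; this is why a DSP core that also executes control code efficiently is attractive.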

Figure 1. Block diagram of the DSP core architecture: a Program Control Unit (instruction fetch, instruction decode, hardware looping), a Data Address Generator (XAddr and YAddr ALUs, address registers), and a Datapath (multiplier, ALU, datapath registers), interconnected by the instruction buses (iab, idb) and the data buses (xab/xdb, yab/ydb).

2. DSP CORE ARCHITECTURE


2.1. First-Generation DSP Core
The DSP core, designated VS-DSP, is a licensable processor core targeted for use in embedded DSP applications. The development work and integrated circuit design were carried out at VLSI Solution Oy, an independent IC design house located in Finland. The DSP core architecture is shown in Figure 1. An interested reader is referred to [3] for a more detailed description. The key features of the DSP core are the following:
- modified Harvard architecture with two data memories
- load/store memory architecture
- efficient three-stage execution pipeline
- branching operations with one delay slot
- extensively parameterized architecture, and
- extensible instruction set.

As is typical of many DSPs, the processor performs several operations in parallel. In addition to implicit operations in the program control unit, a single 32-bit instruction word may perform an arithmetic-logic/multiplication operation, two data load/store operations, and two post-modifications to the data addresses.

The DSP core is based on a flexible architecture that supports adjustment of a set of central parameters. When data memory usage and processor performance are the key criteria for optimization, the most important parameter is clearly the length of the data word [4]. This parameter inherently determines the physical size of the two data memories and has a major effect on the critical signal paths in the various functional units. Other interesting parameters include the number of guard bits in the datapath and the number of registers available for data addressing and datapath operations.

The first-generation DSP core was successfully implemented in a 0.6 µm CMOS technology and operated with a maximum clock frequency of 45 MHz. A set of software development tools was designed for the DSP core. In addition to the standard assembly language-based software tools, the software development environment includes a program profiler and an instruction-set simulator to allow debugging and analysis of the application software. Moreover, a number of DSP algorithms were developed to evaluate the DSP core architecture: GSM full rate and half rate speech codecs, low-delay CELP G.728, sub-band ADPCM G.722, and an MPEG-1 Layer II audio decoder.

2.2. Design Objectives


Typically, the majority of signal processing applications can be realized with arithmetic operations that employ 16-bit operands. Although the DSP core has an architecture that is parameterized in several ways, a DSP core configuration with a fixed 16-bit data word was chosen as a basis to facilitate the implementation of a set of enhancements.

While the DSP portion dominates the clock cycles spent on the applications, it constitutes a clear minority in the number of lines of code when compared to system control functions. As the amount of software in embedded systems is rapidly increasing, a major increase in productivity can be achieved with a C-compiler. Other benefits from a C-compiler are improved code reliability, software maintainability, and portability. Although carried out as further development, this enhancement provides another aspect to processor/compiler co-development [5].

Since embedded applications are becoming increasingly complex, a real-time operating system (RTOS) alleviates the system development process by providing multitasking capabilities and various fundamental services for the applications. A pre-emptive multitasking scheme was considered the most appropriate choice for embedded applications.

Moreover, the selected 16-bit data word width results in program and data address spaces of 64k words. The size of the address spaces may not be sufficient for some applications, either because of a large program size or because the application needs to manipulate large amounts of data.

An increasingly important issue in the emerging battery-powered applications is the system power consumption. Since the first-generation DSP core did not have any special low-power features, a number of low-power enhancements were chosen for implementation. A low-power stand-by mode is a mandatory processor feature to allow significant savings in power consumption.
The identified design objectives for the DSP core can be summarized as follows: 1) architectural modifications to support a C-compiler and RTOS 2) extended program and data address spaces, and 3) enhanced low-power characteristics.


3. ENHANCED DSP CORE FEATURES


For a number of reasons, the DSP core architecture was already quite a feasible target for C-code compilation. The processor is based on a straightforward load/store architecture and it provides a sufficient number of registers for datapath operations and data memory addressing. Moreover, embedded applications can utilize a software stack, which is one of the most important features enabling efficient design of a C-compiler [6].

3.1. Register-to-Register Transfers


In the earlier core, data transfers between the DSP core registers had to be performed via the data memories. Because register-to-register data transfers are frequently needed in C-compiled code, support for these transfers was implemented. The new addressing mode allows data transfers between the registers of the three main functional units. An additional benefit from the enhancement is a reduction in the overall power consumption, since the system buses are not employed in register-to-register transfer operations.

3.2. Subroutine Call Instruction


In the earlier core version, a subroutine call had to be carried out with two separate instructions: one to store the subroutine return address into a dedicated register and one to perform the branch to the subroutine target address. Typically, the return address is stored with the instruction in the delay slot of the actual branching instruction. A new instruction, Call, automatically takes care of both of these operations. This frees the associated delay slot for other purposes in subroutine calls. Typically, the benefit resulting from the register-to-register transfer and subroutine call instructions is a 5% reduction in program size.

3.3. Vectored Interrupts


Earlier, interrupt service was performed with a single interrupt request in combination with a register indicating the interrupt source. Thus an interrupt service was carried out with a read of this register followed by a jump to the corresponding interrupt service routine (ISR). By incorporating a separate interrupt controller as a peripheral, the new core supports a total of 32 vectored interrupts. Each of the interrupt sources has three interrupt priority levels, and the sources can be disabled independently or globally. This enhancement results in a very fast interrupt response, with a latency of 7 clock cycles between interrupt detection and the execution of the first instruction of the ISR. Compared to the earlier core, the interrupt latency is reduced by 8 clock cycles because there is no need to resolve the interrupt source separately. If an application has intense interrupt activity, the benefits from this enhancement are obvious.
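The quoted latencies translate directly into ISR entry overhead. A back-of-the-envelope sketch; the interrupt rate used in the usage note is an assumed figure for illustration, not from the paper:

```c
#include <assert.h>

/* Fraction of CPU cycles consumed purely by ISR entry, given the
   interrupt rate, the entry latency in cycles, and the clock rate. */
double isr_entry_overhead(double irq_per_s, int entry_cycles, double f_hz) {
    return irq_per_s * entry_cycles / f_hz;
}
```

At an assumed 100,000 interrupts per second on a 100 MHz core, vectored entry (7 cycles) costs about 0.7% of the cycles, against roughly 1.5% for the earlier polled scheme (7 + 8 cycles).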

3.4. Extended Program/Data Address Space


A straightforward way to extend the size of the memory address space is to realize a paged memory architecture. This architecture allows a major extension of the

address spaces without a radical change to the hardware resources or the data word width. In the paged memory architecture, both the program and data addresses are now divided into memory pages that hold 64k instruction or data words. Thus, a 32-bit paged memory address is generated by concatenating two 16-bit values: a page address and a page offset address. These two addresses correspond to the most and least significant parts of a 32-bit address, respectively.

The paged memory approach slightly changes the branching operation and data memory addressing. Due to the paged memory, branching target addresses are divided into near and far addresses, corresponding to references to the same memory page and to other pages, respectively. Therefore, a call to a far subroutine needs three additional instructions when compared to a subroutine which resides on the same program memory page. Data memory addressing usually employs two 16-bit data address registers to access the two on-chip data memories in parallel. Now, one 32-bit data memory access can be performed by combining two 16-bit data address registers.
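The address concatenation described above can be sketched in a few lines. The function names below are illustrative, not the core's actual mnemonics:

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit paged address is the concatenation of a 16-bit page
   address (high half) and a 16-bit page offset (low half). */
uint32_t paged_address(uint16_t page, uint16_t offset) {
    return ((uint32_t)page << 16) | offset;
}

/* A branch target is "near" when it stays on the current 64k-word
   page, "far" when it refers to another page. */
int is_near(uint32_t from, uint32_t to) {
    return (from >> 16) == (to >> 16);
}
```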

3.5. Low-Power Implementation


Besides low cost and high performance, power consumption has become one of the key issues in processor design [7]. The DSP core employs a fully static CMOS design approach which allows flexible adjustment of the processor clock frequency from several tens of megahertz down to DC. A new instruction for power optimization is Halt, which effectively freezes the processor core clock and, consequently, the execution pipeline. The processor wake-up procedure is handled by the interrupt controller, which activates the processor clock again after an enabled interrupt becomes pending. In practice, the wake-up is immediate since the interrupt will be serviced as quickly as in the active operating mode of the processor core. This enhancement provides a significant decrease in the system power consumption since the low-power mode can be switched on as soon as the processor becomes idle.

Low-power operation was also addressed at the lower levels of the processor design. A full-custom, transistor-level integrated circuit implementation inherently provides lower power consumption when compared with an implementation synthesized from HDL code. This is due to the smaller switched capacitance resulting from the smaller dimensions of the hand-crafted functional cells, accurate control over the clock distribution network, and carefully optimized transistor sizing. Moreover, a number of traditional low-power circuit design techniques were employed, such as input latching and clock gating. The input latching eliminates unnecessary signal transitions in the functional units, thus reducing transient switching in the core. Gated clocks were extensively utilized to further avoid undesired switching in clocked processor elements. As a side effect of the register-to-register transfer instructions, the power consumption is also reduced.
Because the transfers are implemented with local buses inside the processor core itself, the transfers do not employ the system buses that, due to interconnections to several off-core functional units, possess relatively large capacitances.


Figure 2. Chip layout of the enhanced DSP core.

Moreover, the semiconductor manufacturing process was updated to a 0.35 µm triple-metal CMOS technology. In addition to higher circuit performance, the advanced technology enables the use of lower supply voltages in the 1.1-3.3 V range. Clearly, the lower supply voltages have the most radical impact on the power consumption of the DSP core, program/data memories, and other peripherals. The full-custom implementation of the DSP core, shown in Figure 2, contains 64,000 transistors and occupies a die area of 2.2 mm². With a 3.3 V supply voltage, the DSP core is expected to operate at a 100 MHz clock frequency.

The new features did not add any speed penalty to the processor. The new instructions for register-to-register transfers and subroutine calls required a modification in the instruction decoding, and the paged memory architecture added a block of logic. Interestingly, the first-generation core layout had a relatively large unused area in the instruction decoding section. For this reason, it was possible to place most of the new features without increasing the core area. However, the interrupt controller needs to be included as an off-core peripheral.
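The claim that supply voltage has the most radical impact follows from the quadratic voltage term in the dynamic CMOS power equation P = aCV²f [7]. A rough sketch, assuming switched capacitance, activity, and clock frequency are held constant (in practice a lower supply also limits the attainable frequency):

```c
#include <assert.h>

/* Ratio of dynamic power at a new supply voltage to that at the old
   one, all other factors in P = a*C*V^2*f kept equal. */
double dynamic_power_ratio(double v_new, double v_old) {
    return (v_new * v_new) / (v_old * v_old);
}
```

Dropping the supply from 3.3 V to 1.1 V cuts dynamic power to about one ninth.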

4. OPTIMIZING C-COMPILER
VS-DSP C-compiler (VCC) is an optimizing ANSI-C compiler targeted especially for the VS-DSP architecture. The flow of operation in the DSP code generation is shown in Figure 3. The C-compilation can be divided into three logical steps: general C-code optimization, assembly code generation, and assembly code optimization. In addition to syntax analysis, the general optimizer performs the common C-compiler operations, such as constant subexpression evaluation, logical expression


optimization, and jump condition reversal. The code generator allocates the different variables to registers and data memories and generates assembly code for all the integral arithmetic and control structures. Depending on the structure of the program loops, looping can be carried out either as hardware looping or in software. The generated assembly code is then forwarded to the code optimizer.

Figure 3. Code generation with the C-Compiler.

The code optimizer sequentially examines the assembly code, trying to make it more efficient by parallelizing various operations, filling delay slots in the branching instructions, and merging stack management instructions. From the raw instruction word count, the code optimizer can typically eliminate 20-30% of the instruction words.

In C source code, various methods can be employed to guide the C-compiler toward more optimal results. For example, the execution speed of the critical program sections can be increased considerably by forcing certain variables to specific registers in the datapath and data address generator. However, at least the DSP algorithm kernels should be hand-coded in assembly language because those program sections contribute the most to the execution time of the applications.
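As an illustration of what the constant subexpression evaluation step mentioned above does, the toy pass below folds constant operands in a small expression tree. This is a generic sketch of the technique, not VCC's actual implementation:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Expr {
    enum { CONST, VAR, ADD, MUL } kind;
    int value;                /* meaningful when kind == CONST   */
    struct Expr *lhs, *rhs;   /* meaningful for ADD and MUL      */
} Expr;

static Expr *node(int kind, int value, Expr *l, Expr *r) {
    Expr *e = malloc(sizeof *e);
    e->kind = kind; e->value = value; e->lhs = l; e->rhs = r;
    return e;
}
Expr *num(int v)                     { return node(CONST, v, NULL, NULL); }
Expr *var(void)                      { return node(VAR, 0, NULL, NULL); }
Expr *op(int kind, Expr *l, Expr *r) { return node(kind, 0, l, r); }

/* Fold children first; when both operands of an operator are
   constants, collapse the operator into a single constant node. */
Expr *fold(Expr *e) {
    if (e->kind == ADD || e->kind == MUL) {
        e->lhs = fold(e->lhs);
        e->rhs = fold(e->rhs);
        if (e->lhs->kind == CONST && e->rhs->kind == CONST) {
            int v = (e->kind == ADD) ? e->lhs->value + e->rhs->value
                                     : e->lhs->value * e->rhs->value;
            e->kind = CONST;
            e->value = v;
        }
    }
    return e;
}
```

A subexpression containing a variable is left alone while its constant parts still fold, which is exactly what lets the compiler emit fewer instruction words for expressions that mix constants and run-time values.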

5. REAL-TIME OPERATING SYSTEM


A real-time operating system (VS-RTOS) is a compact system kernel providing pre-emptive multitasking and a wide range of fundamental services for embedded applications. The key features of the RTOS are summarized in Table 1. In pre-emptive multitasking, the RTOS determines when to change the running task and which one is the next task to be executed [8],[9]. However, the RTOS limits

the execution time of the tasks to a user-defined quantum of time when time-sliced scheduling is used. A system timer has to be included as an additional peripheral for time-slicing. Typically, the system timer has a resolution of 1 ms and one time-slice corresponds to 20 system timer intervals, i.e., 20 ms. For each of the tasks, an arbitrary number of time-slices can be allocated. Additionally, the RTOS supports software timers.

The correct operation of the RTOS has been demonstrated with several hardware prototypes. It is imperative to have a fully functional prototype since the correct system behavior with multiple interrupts is practically impossible to verify by means of simulations.

Table 1: RTOS Kernel Features.
  Multitasking: Pre-emptive, Time-sliced*
  Intertask Communication: Signals, Messages, Semaphores
  Memory Management: Dynamic, Allocated in Fixed-sized Blocks
  Full Context-Switch: 87 clock cycles (0.87 µs @ 100 MHz)
  ROM Memory Requirement: 1355 words
  RAM Memory Requirement: 39 words
  * optional, requires a system timer for scheduling
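A minimal sketch of the time-sliced policy described above (1 ms ticks, 20 ticks per slice, an arbitrary slice count per task). The data structures are illustrative, not VS-RTOS internals:

```c
#include <assert.h>

/* One time slice is 20 system timer ticks of 1 ms each. */
#define TICKS_PER_SLICE 20

typedef struct { int slices; } task_t;

static task_t tasks[3] = { {1}, {2}, {1} };   /* slices allocated per task */
static int current = 0;
static int ticks_left = 1 * TICKS_PER_SLICE;  /* task 0 owns one slice */

/* Called from the 1 ms system timer interrupt; returns the index of
   the task that should run after this tick. */
int timer_tick(void) {
    if (--ticks_left == 0) {                  /* quantum exhausted */
        current = (current + 1) % 3;          /* round-robin hand-over */
        ticks_left = tasks[current].slices * TICKS_PER_SLICE;
    }
    return current;
}
```

Task 0 runs for one 20 ms slice, task 1 for two, and so on; a pre-emptive kernel would additionally save and restore the task context at each hand-over, which is where the 87-cycle context-switch figure in Table 1 applies.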

6. EMBEDDED SYSTEM DEVELOPMENT


Software development flow for the DSP core is quite straightforward. The application code is programmed in C and assembly languages. The programs can effectively be run in an instruction-set simulator (ISS). The ISS supports the parameterized architecture of the DSP and it also allows system simulation with behavioral models of the off-core peripherals. After a cycle-accurate simulation with the ISS, a profiler can be employed to analyze the dynamic run-time behavior of the application code. The information provided by the profiler enables identification of the program sections that would gain most from further optimization.

Although the ISS is capable of executing the program code at over 100,000 instructions per second, the highest execution speed can be achieved with a hardware emulator. An emulator program, which runs on a host PC, utilizes a DSP evaluation board to enable application prototyping with real-time input and output. The DSP evaluation board is equipped with a DSP prototype chip, external memories, miscellaneous digital/analog interfaces, and an FPGA chip. A detailed summary of the DSP evaluation board features is listed in Table 2. In order to access off-chip memory devices, the DSP core includes an external bus

Table 2: Main features of the DSP evaluation board.
  DSP Prototype Chip
    Processor Core: 16-bit VS-DSP core; four 40-bit accumulators; eight data address registers; hardware looping
    Memories: data memory 2 x 16k RAM; program memory 4k RAM, 4k ROM
    Peripherals: synchronous serial port; two RS232 serial ports; 8-bit parallel port; six 32-bit timers; keyboard interface; interrupt controller; external bus interface
  Other Board Components: 1M x 16-bit flash memory; 64k x 16-bit SRAM; Altera Flex 10K40 FPGA; 2 x 16-bit DAC; 2 x 16-bit ADC; 25-button keyboard

interface (EBI). The EBI has a 24-bit address space and it can insert processor wait states to realize slow external memory accesses. The FPGA chip on the board has a typical capacity of 40k logic gates. A programmable logic device on the DSP evaluation board enables flexible system development and prototyping simultaneously with supplementary functions implemented in an application-specific hardware block.

The DSP evaluation board has already proven its applicability in the development of several system prototypes. For example, the board has recently been utilized to demonstrate the operation of an MPEG audio decoder [10]. The audio decoder performs decoding of a 128 kbit/s Layer III stream (44.1 kHz, stereo) at an 18 MHz clock frequency.
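The number of wait states the EBI must insert follows from the external device's access time and the core clock period. A sketch of that calculation; the 70 ns flash access time below is an assumed example, not a figure from the board's datasheet:

```c
#include <assert.h>

/* Extra cycles, beyond the one nominal access cycle, needed so that
   the total bus cycle covers the device's read access time. */
int wait_states(double t_access_ns, double t_clk_ns) {
    int cycles = 1;
    while (cycles * t_clk_ns < t_access_ns)
        cycles++;
    return cycles - 1;
}
```

At 100 MHz (10 ns cycle) a 70 ns flash would need 6 wait states; at the 18 MHz used in the audio decoder demonstration (about 55.6 ns cycle) it would need only 1.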

7. CONCLUSIONS
This paper presented a number of issues involved in the further development of a commercial DSP core. The selected enhancements addressed several aspects of the processor architecture. A number of new instructions were added to facilitate the execution of those operations that are frequently required in C-compiler generated program code. Improved support for fast interrupt services was realized with an interrupt controller peripheral; this feature is mainly targeted at facilitating the development of the RTOS. Low-power characteristics of the processor core were enhanced in several ways, one of the most important being a low-power stand-by mode. The implementation of the new features did not add any speed penalty to the DSP core. The interrupt controller and the optional system timer were included as off-core peripherals. All the other enhancements were merged into the existing circuit

layout of the first-generation DSP core because unused circuit area was available for these purposes.

References
[1] Hitachi Micro Systems, Inc., SH-DSP Microprocessor Overview, Product Databook, Revision 0.1, Nov. 1996.
[2] D. Walsh, "Piccolo - The ARM Architecture for Signal Processing: an Innovative New Architecture for Unified DSP and Microcontroller Processing," in Proc. Int. Conf. on Signal Processing Applications and Technology, Boston, MA, U.S.A., Oct. 1996, pp. 658-663.
[3] J. Nurmi and J. Takala, "A New Generation of Parameterized and Extensible DSP Cores," in Proc. IEEE Workshop on Signal Processing Systems, Leicester, United Kingdom, Nov. 1997, pp. 320-329.
[4] M. Kuulusa, J. Nurmi, J. Takala, P. Ojala, and H. Herranen, "A Flexible DSP Core for Embedded Systems," IEEE Design & Test of Computers, vol. 14, no. 4, pp. 60-68, Oct./Dec. 1997.
[5] H. Meyr, "On Core and More: A Design Perspective for System-on-Chip," in Proc. IEEE Workshop on Signal Processing Systems, Leicester, United Kingdom, Nov. 1997, pp. 60-63.
[6] B.-S. Ovadia and Y. Beery, "Statistical Analysis as a Quantitative Basis for DSP Architecture Design," in VLSI Signal Processing, VII, J. Rabaey, P. M. Chau, and J. Eldon, Eds., pp. 93-102, IEEE Press, New York, NY, U.S.A., 1994.
[7] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall, Upper Saddle River, NJ, U.S.A., 1996.
[8] J. A. Stankovic, "Misconceptions About Real-Time Computing," Computer, vol. 21, no. 10, pp. 10-19, Oct. 1988.
[9] W. Zhao, K. Ramamritham, and J. Stankovic, "Scheduling Tasks With Resource Requirements in Hard Real-Time Systems," IEEE Trans. on Software Engineering, vol. 13, no. 5, pp. 564-576, May 1987.
[10] ISO/IEC 11172-3, Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3: Audio, Standard, International Organization for Standardization, Geneva, Switzerland, Mar. 1993.


PUBLICATION 6

M. Kuulusa, J. Takala, and J. Saarinen, "Run-time configurable hardware model in a dataflow simulation," in Proc. IEEE Asia-Pacific Conference on Circuits and Systems, Chiangmai, Thailand, Nov. 24-27, 1998, pp. 763-766. Copyright © 1998 IEEE. Reprinted, with permission, from the proceedings of APCCAS'98.

Run-Time Configurable Hardware Model in a Dataflow Simulation


Mika Kuulusa, Student Member, IEEE, Jarmo Takala, Student Member, IEEE, and Jukka Saarinen. Signal Processing Laboratory, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland. Phone +358 3 3652111, Fax +358 3 3653095, e-mail mika.kuulusa@cs.tut.

Abstract This paper describes the modeling of a mobile terminal system containing a run-time configurable transform unit specified in a hardware description language. This transform unit can perform two of the most commonly utilized trigonometric transforms: the fast Fourier transform (FFT) and the inverse discrete cosine transform (IDCT). A wireless ATM network model was implemented to demonstrate how these transforms are scheduled in the terminal. Due to the dynamic reconfiguration, it was necessary to create a number of asynchronous models to successfully embed a synchronous hardware model into a dataflow simulation. Scheduling of the transforms in the terminal system is presented and the dataflow block diagram incorporating the hardware model is studied in detail.

Figure 1. Conceptual block diagram of the wireless system (AP transmitter and MT receiver chain: level detector, AGC, A/D, frame timing and frequency synchronization, frequency compensation, FFT, phase estimation and compensation, symbol decoding, protocol processing, data stream parser, entropy decoding, dequantizer, IDCT, and image framer).

can consume and produce a variable number of data elements. This behavior results in a dynamic schedule that is exclusively determined during the system 1. Introduction simulation. Dataow computing has rapidly gained widespread This paper presents the embedding of a run-time acceptance in specifying complex signal processing congurable hardware model into a dataow simulation. systems, especially in forms of synchronous dataow or First a wireless network architecture is described and data-driven simulators. Graphical simulation scheduling of the transforms in a mobile terminal is environments together with extensive model libraries presented. Various model design aspects are reviewed enable system engineers to rapidly evaluate various and the use of a synchronous hardware model is studied options leading to a high-quality system implementation. in detail. Finally, there are the conclusions. Increased system simulation speed and better possibilities for design space exploration are the key benets of this 2. System Description approach. Communication systems are generally very In a dataow simulation environment a system is convenient and natural to be modeled as dataow because described as a block diagram which consists of a number they process streams of information. A wireless ATM of blocks (models) representing a certain functionality network [5] was chosen as a case study to experiment and signaling nets between these blocks. Blocks implementation of asynchronous models in a dataow exchange data through input and output ports. Although environment. A block diagram of the wireless system is actual implementations may vary, input ports can be illustrated in Figure1. The system has two network comprehended as FIFO queues. A synchronous dataow entities: an access point (AP) and a mobile terminal (SDF) system is based on synchronous blocks which (MT). 
The AP transmits compressed image data [6] to consume, process and produce a xed number of data m ultiple MTs by using a wireless ATM MAC elements (tokens) during each activation [1,2]. The protocol[7]. Air interface employs orthogonal frequency execution order is completely predictable at simulation division multiplexing (OFDM) with 8-PSK modulation compile time thus a static scheduling of block activations on each of the 16 subcarriers arranged around a center can be generated. However, a dynamic dataow (DDF) frequency in the 5 GHz range [8]. system introducing asynchronous blocks may be better The terminal system is implemented with a target suited for some applications [3,4]. Asynchronous blocks architecture that integrates a variety of hardware components: a digital signal processor (DSP), a

0-7803-5146-0/98/$10.00 1998 IEEE.

FP2-8.1

763

Variable Length Time Frame Downlink Period Uplink Period

Received Signal

ScheduledTransform FFT

recover the differentially encoded OFDM symbols by using a complex valued 16-point FFT. Therefore, the signal reception is enabled only by executing an FFT operation to decode OFDM symbols. The principle behind the transform scheduling is to execute an FFT operation always when it is needed and to perform an IDCT operation when the transform unit is idle. The protocol processing operates in the following manner. First, the receiver detects the beginning of a new time frame by identifying the training sequence. Then it starts scanning the user data headers. If the destination address in a header does not match, the next time slots containing the data body are skipped. When a burst with a correct destination address is found, the data burst is decoded and the remaining time slots can be scheduled to perform IDCT operations. Additionally, IDCT operations are permitted in the time slots that are skipped during the header scanning. Because IDCT operations are performed only in the vacant time slots, the system must be capable of buffering frequency coefficients extracted from the image data stream. In case no coefficients are available, the transform unit stays idle until it is again required for symbol decoding.

2.2. Regular Trigonometric Transform Unit

The regular trigonometric transform (RTT) unit is a configurable hardware accelerator which can be utilized to perform either a complex valued 16-point FFT or an 8x8-point IDCT. The structure of the RTT unit is based on a constant geometry architecture that consists of configurable processor elements and a dynamic permutation network [9,10]. Before a certain transform can be executed, it may be necessary to switch the hardware configuration from one transform to the other. This change of hardware operation takes one clock cycle.

The RTT unit operates in a pipelined fashion: the operations are executed by first forwarding the input data, eight values in parallel, to the hardware, and then clocking the unit a certain number of clock cycles to iteratively perform the required transform operation. However, it should be noted that the input data values must be arranged into a transform-specific order before they can be passed to the hardware. In addition, output values resulting from an operation also need some rearranging. The RTT unit executes a complex valued 16-point FFT in 18 clock cycles. The first four cycles are necessary to pass 32 input data values, split into 16 real and imaginary components. In the remaining 14 cycles the hardware performs the FFT operation. The image decompression is realized by executing a two-dimensional (2-D) IDCT on a block of 8x8 frequency coefficients which are provided by an image data stream parser. It is common practice to perform this 2-D

Figure 2. Scheduling of the transform operations in a mobile terminal.

microcontroller (MCU), a hardware accelerator, and a radio frequency front-end. Tasks with hard real-time requirements, i.e., baseband signal processing and decoding of the image data stream, are performed by the DSP. The MCU is used for system tasks with less stringent real-time requirements, such as protocol processing and user interfaces. There are two fundamental transform operations required in an MT: the fast Fourier transform (FFT) for decoding OFDM symbols, and the inverse discrete cosine transform (IDCT) needed in image decompression. In our system, both of these transforms are effectively performed with an application-specific hardware accelerator. During the system operation this transform unit is configured in real-time to execute transforms in a time-multiplexed fashion.

2.1. System Scheduling

The medium access scheme in the modeled wireless system utilizes time division multiple access [7]. The communication is based on a variable length time frame which contains a frame header and periods for downlink and uplink transmission. The structure of the time frame and the transform scheduling is shown in Figure 2. A time frame contains an integer number of time slots. Each time slot contains 18 OFDM symbols that can be special symbols used in a training sequence or 54 octets of information. The frame header consumes a single time slot and contains a special training sequence. During the downlink period the AP transmits information in the form of user data bursts. A user data burst is composed of a header and a data body. The header contains necessary information about the burst and the structure of the current time frame. The data body is used to transport compressed image data. In our simulation model, we assume that no transmission activity exists in the uplink period. A protocol processor controls the scheduling of the transform unit in the system. After the received signal is mixed, filtered, and down-converted it is possible to



IDCT by using a row-column decomposition [11]. In this method, the transform is executed with two 1-D IDCT operations in the following fashion: a 1-D IDCT is applied to each row of the 8x8 matrix, the resulting matrix is then transposed, and the transform is performed once again. Since a single row is transformed in 5 clock cycles, the entire 2-D IDCT consumes a total of 80 cycles.

3. Hardware Model in a Dataflow Simulation

Prior to designing the dataflow block diagram, the air interface was carefully studied. Mathematical models were created for symbol mapping, differential encoding, and signal modulation. Based on these experiments, suitable parameters for the receiver sample rate, sample buffer size, and filter coefficients were discovered. The dataflow software utilized in our case was Cossap from Synopsys Inc. Cossap allows various types of models to be incorporated into a heterogeneous dataflow simulation. Typically, models are described as C-language modules. The most straightforward way of programming these models is to implement them as synchronous models. However, it is possible to create asynchronous models when input and output functions are programmed directly without using the standard interfaces. Typically, application-specific hardware blocks are specified with a hardware description language (HDL), such as Verilog or VHDL. The event-driven simulation of these hardware units differs significantly from the data-driven approach. The event-driven hardware simulation is based on the concept of global time where all blocks are activated when the global time is updated. In the data-driven approach, blocks are activated as soon as all data elements required in an operation are available in their input ports. A special software tool can be utilized to generate a synchronous dataflow model from an HDL description. In our case, special attention must be paid to the use of the generated hardware model.
This is due to the fact that the number of activations (clock cycles) required in an operation depends on the type of the transform. Moreover, the block diagram contains a feedback loop that is not used in the FFT. The feedback loop would normally cause a simulation deadlock after the first IDCT operation. However, it is possible to avoid this deadlock by introducing some redundancy, i.e., dummy data elements, in the input data. This arrangement is described more precisely later in this section. The synchronous hardware model was placed inside a hierarchical model to conceal the underlying complexity. The hierarchical transform model is operated with asynchronous input and output controllers, as shown in Figure 3.
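As a functional check of the row-column decomposition used for the image decompression, the sketch below applies a naive 1-D IDCT to each row, transposes, and repeats. This is only an illustration: the orthonormal DCT-III convention and the function names are assumptions, not details taken from the RTT design.

```python
import math

def idct_1d(X):
    """Naive orthonormal 1-D inverse DCT (DCT-III), N = len(X)."""
    N = len(X)
    out = []
    for n in range(N):
        s = X[0] / math.sqrt(N)
        for k in range(1, N):
            s += math.sqrt(2.0 / N) * X[k] * math.cos(
                math.pi * (2 * n + 1) * k / (2 * N))
        out.append(s)
    return out

def idct_2d(block):
    """8x8 2-D IDCT via row-column decomposition: transform rows,
    transpose, transform again, transpose back. In the hardware each
    1-D pass takes 5 cycles, so 16 passes give the 80-cycle total."""
    rows = [idct_1d(r) for r in block]
    cols = [idct_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

A DC-only coefficient block (X[0][0] = 8, all others zero) decompresses to a constant 8x8 block, which is a quick sanity check of the separable formulation.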

Figure 3. Hierarchical transform model with supporting input and output controllers.

3.1. Input and Output Controllers

An asynchronous input controller is responsible for executing transforms according to a scheduling control provided by the protocol processor. Transform operations are carried out by forwarding input data values and all necessary control signals to the hierarchical transform model. The input controller produces a variable number of output data elements depending on which transform is to be executed. The input controller has three options for transform scheduling: execute 18 consecutive FFTs, execute one 2-D IDCT, or no-operation. The input controller has four input ports: scheduling control, I and Q components of the baseband signal, and frequency coefficients. The scheduling control is used to determine whether the next transform operation is reserved for an FFT or if it is possible to execute a 2-D IDCT. An FFT operation is performed simply by multiplexing 16 data elements from both I and Q input ports into an output port. If there are any frequency coefficients available when an FFT is scheduled, they stay buffered in the input port until an IDCT operation is permitted. In order to decode all OFDM symbols in a time slot, a total of 18 FFT operations are executed. In case a 2-D IDCT or no-operation is scheduled, all baseband signal samples in a time slot must be discarded to enable correct synchronization with the signal reception. The input controller produces no output when a no-operation is scheduled. This occurs only when a 2-D IDCT is possible but there are no frequency coefficients available. In a 2-D IDCT operation, a block of 8x8 frequency coefficients is forwarded to the hierarchical transform model. Transformed data values are finally processed by an output controller. Because the values can result from either transform, a control signal from the input controller specifies which transform has been executed. This enables transformed data elements to be directed to appropriate output ports for further processing.
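The per-slot decision described above can be sketched as a small function. The names (schedule_slot, the FFT/IDCT/NOP tags) and the 64-coefficient threshold for one 8x8 block are illustrative assumptions, not identifiers from the Cossap model.

```python
FFT, IDCT, NOP = "FFT", "IDCT", "NOP"

def schedule_slot(fft_reserved: bool, coeffs_buffered: int) -> str:
    """One scheduling decision per time slot: a slot reserved for symbol
    decoding always runs FFTs (18 per slot); otherwise a 2-D IDCT runs
    if a full 8x8 coefficient block is buffered, else the unit idles."""
    if fft_reserved:
        return FFT
    if coeffs_buffered >= 64:  # one 8x8 block of frequency coefficients
        return IDCT
    return NOP

# A burst with a matching destination address reserves its slots for FFT;
# skipped slots free the transform unit for image decompression.
assert schedule_slot(True, 128) == FFT
assert schedule_slot(False, 64) == IDCT
assert schedule_slot(False, 10) == NOP
```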




Figure 4. Block diagram of the hierarchical transform model (rtt_h).

4. Conclusions

A mobile terminal system incorporating a run-time configurable hardware accelerator was successfully simulated in a dataflow environment. The terminal uses this application-specific hardware to perform two transforms required in symbol decoding and image decompression. Dynamic transform scheduling in the dataflow environment was enabled by implementing a number of asynchronous models to control the synchronous hardware model. The 2-D IDCT operation required a feedback loop in the block diagram which causes a potential simulation deadlock. However, deadlock-free system simulation was accomplished by using a simple interleaving scheme.

3.2. Hierarchical Transform Model

The hierarchical transform model (rtt_h) incorporates both synchronous and asynchronous models as illustrated in Figure 4. In order to enable deadlock-free simulation, the hardware model (rtt_vhdl) must be supported by a number of complementary models: 8x8 matrix transposing (trans), redundancy insertion and removal (ired, rred), input and output reordering (in_ro, out_ro), and dataflow multiplexing and demultiplexing (mux, dmux).

A potential deadlock is caused by the input multiplexer, which will not activate unless there is at least one data element available in all input ports. Therefore, dummy data elements must be interleaved with the valid data to make sure that the multiplexer will operate in a proper manner. This redundancy is removed appropriately before the data elements are forwarded to the input reordering model. According to the scheduled transform, data elements are reordered and passed to the hardware model together with two control signals. The most interesting part in the hierarchical model is the hardware model containing the RTT unit. The synchronous hardware model was generated in such a manner that it executes exactly one clock cycle on each activation. Therefore, in order to execute one clock cycle in the hardware, one data element must be written to each of the input ports. Thus a complete transform is accomplished when a sequence of data elements is passed to the hardware model. Because the hardware model is synchronous, it produces output on each activation even though several activations are required before valid data values are produced. For this reason, the hardware issues a control signal to indicate when the output contains valid data values. The output reordering model uses this control signal to identify transformed data values in the output stream. The values are rearranged and stored in an internal buffer until the transform operation has been finished. Finally, a demultiplexer directs the resulting data values to the output of the hierarchical model or to the feedback loop.

References

[1] E.A. Lee and D.G. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing," IEEE Transactions on Computers, Vol. C-36, No. 1, pp. 24-35, Jan. 1987.
[2] E.A. Lee and D.G. Messerschmitt, "Synchronous Data Flow," Proceedings of the IEEE, Vol. 75, No. 9, pp. 1235-1245, Sep. 1987.
[3] J. Buck et al., "The Token Flow Model," Proc. of the Data Flow Workshop, Hamilton Island, Australia, May 1992.
[4] S. Ha and E.A. Lee, "Compile-Time Scheduling of Dynamic Constructs in Dataflow Program Graphs," IEEE Transactions on Computers, Vol. 46, No. 7, pp. 768-778, July 1997.
[5] J. Mikkonen and J. Kruys, "The Magic WAND: a Wireless ATM Access System," Proc. of ACTS Mobile Summit, Granada, Spain, pp. 535-542, Nov. 1996.
[6] G.K. Wallace, "The JPEG Still Picture Compression Standard," Communications of the ACM, pp. 30-45, April 1991.
[7] G. Marmigere et al., "MASCARA, a MAC Protocol for Wireless ATM," Proc. of ACTS Mobile Summit, Granada, Spain, pp. 647-651, Nov. 1996.
[8] J.P. Aldis, M.P. Althoff, and R. van Nee, "Physical Layer Architecture and Performance in the WAND User Trial System," Proc. of ACTS Mobile Summit, Granada, Spain, pp. 196-203, Nov. 1996.
[9] J. Astola and D. Akopian, "Architecture Oriented Regular Algorithms for Discrete Sine and Cosine Transforms," Proc. IS&T/SPIE Symp. Electronic Imaging, Science and Technology, pp. 9-20, 1996.
[10] D. Akopian, Systematic Approaches to Parallel Architectures for DSP Algorithms, Dr.Tech. dissertation, Acta Polytechnica Scandinavica, El89, The Finnish Academy of Technology, Espoo, Finland, 1997.
[11] G.S. Taylor and G.M. Blair, "Design for the Discrete Cosine Transform in VLSI," IEE Proceedings, Vol. 145, No. 2, pp. 127-133, March 1998.



PUBLICATION 7

M. Kuulusa and J. Nurmi, "Baseband implementation aspects for W-CDMA mobile terminals," in Proc. Baiona Workshop on Emerging Technologies in Telecommunications, Baiona, Spain, Sep. 6-8, 1999, pp. 292-296. Copyright © 1999 Servicio de Publicacions da Universidade de Vigo, Spain. Reprinted, with permission, from the proceedings of BWETT'99.

BASEBAND IMPLEMENTATION ASPECTS FOR W-CDMA MOBILE TERMINALS


Mika Kuulusa and Jari Nurmi Signal Processing Laboratory Tampere University of Technology, P.O. Box 553 (Hermiankatu 12), FIN-33101 Tampere, Finland Fax +358 3 3653095, Tel. +358 3 3652111, E-mail: mika.kuulusa@cs.tut.fi
ABSTRACT

This paper addresses several implementation aspects in the baseband section of a W-CDMA mobile terminal that is based on the UMTS terrestrial radio access (UTRA) radio transmission technology proposal. The objective was to construct suitable transceiver architectures for the next generation multi-mode terminals which support both TDD and FDD modes of operation.

1. INTRODUCTION

The third generation mobile communications will be based on code-division multiple access (CDMA). The future CDMA systems will employ 2 GHz carrier frequencies in combination with a wide transmission bandwidth to provide variable data rates of up to 2 Mbit/s. Both packet and circuit-switched connections will be supported. In addition to conventional speech service, high-speed data rates allow realization of a diverse set of multimedia and data services for the next generation mobile terminals. The UTRA specification, also often more simply referred to as W-CDMA, is the European candidate proposal for the global standard of the W-CDMA air interface [1]. UTRA employs direct-sequence spread-spectrum technology with a chip rate of 4.096 Mchip/s to spread quadrature phase-shift keyed (QPSK) data symbols to a 5 MHz transmission bandwidth. Spectrum-spreading is performed with a combination of complex and dual-channel spreading operations. Downlink (base station to mobile) and uplink (mobile to base station) transmissions are based on a 10 ms frame that contains a total of 16 time slots. Thus a time slot corresponds to 0.625 ms or 2560 chips. Variable data rates can be realized either by allocating several physical code channels for one user or by adjusting the data rate of the physical code channel, i.e., the spreading factor. These are called multi-code and variable spreading factor methods, respectively. First W-CDMA receiver implementations are most likely to be based on conventional Rake receivers.
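The frame numerology quoted above is internally consistent and can be checked with a few lines of arithmetic (the variable names are illustrative):

```python
chip_rate = 4.096e6   # chips per second (UTRA proposal)
frame_s = 10e-3       # 10 ms radio frame
slots_per_frame = 16

slot_s = frame_s / slots_per_frame      # duration of one time slot
chips_per_slot = chip_rate * slot_s     # chips in one time slot

print(slot_s * 1e3, "ms,", chips_per_slot, "chips")
```

This reproduces the 0.625 ms slot and 2560 chips per slot stated in the text.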
In the past, Rake receivers have been utilized in systems for, e.g., wireless LANs [2,3,4,5], cellular [6], and space communications [7]. CDMA systems are interference-limited because several users use the same frequency band for transmissions. Therefore, conventional Rake receivers will be followed by advanced receivers that implement sophisticated interference cancellation techniques, such as successive interference cancellation (SIC) or linear minimum mean-squared error (LMMSE) methods, to remove at least the dominant interferers causing most of the multiple-access interference on the radio channel. In this paper, downlink receiver and uplink transmitter architectures realizing the baseband signal processing functions for W-CDMA mobile terminals are described. Although the implementation aspects presented in this paper are focused on the UTRA proposal, the architectures for the other proposals will be very similar to those described in the following sections.

2. TRANSCEIVER ARCHITECTURE FOR W-CDMA

According to downlink (DL) and uplink (UL) frequency usage, UTRA specifies two modes of operation: time-division duplex (TDD) and frequency-division duplex (FDD). The main differences between these two modes are the following:

- DL/UL frequency allocation: TDD single band, FDD paired band
- DL/UL transmissions: TDD time-multiplexed, FDD continuous
- Placement of the DL pilot symbols: TDD midambles, FDD preambles
- Symbol spreading factors: TDD 1-16, FDD 4-256
- Spreading code generators: TDD OVSF, FDD OVSF/Gold/VL-Kasami
- Symbol rates: TDD 256k-4M symbol/s, FDD 16k-1M symbol/s


Figure 1: Block diagram of a W-CDMA receiver.

In addition to orthogonal variable spreading factor (OVSF) codes, the TDD mode also uses a cell-specific code of length 16 in the spreading. The symbol rates for the TDD mode are instantaneous values since the actual symbol rate depends on the downlink/uplink slot allocation.

2.1. W-CDMA RECEIVER


The block diagram of a W-CDMA receiver is depicted in Figure 1. The radio frequency (RF) front-end is realized as a traditional I/Q downconversion to the baseband. A stream of complex baseband samples with 4-8 bits of precision is produced by two analog-to-digital converters. To obtain sufficient time-domain resolution, the baseband signal is oversampled at 4-8 times the chip rate, i.e., at 16-32 MHz sample rates. Downlink and uplink transmissions are band-limited by employing root-raised cosine (RRC) pulse shaping filtering. In order to maximize the received signal energy, the I/Q baseband samples are first filtered with a receiver counterpart of the RRC filter to collect the full energy of the transmitted pulses. In addition, the receiver filter can be realized so that it compensates for non-idealities of the analog RF processing. Typically, the receiver filter is implemented as an FIR filter with approximately 9-15 taps. Separate FIR filters are required for both the I and Q baseband sample streams. A Rake finger bank typically contains 2-4 Rake fingers that are used to receive several multipath components of the transmitted downlink signal. Conceptually, a Rake finger is composed of a complex despreader and an integrate-and-dump filter. Wideband signal samples are despread with a synchronous complex-valued replica of the spreading code and the despread results are integrated over a symbol period. Thus a Rake finger effectively

reconstructs the narrowband data symbol stream from one multipath. The multipath delay estimation unit is responsible for allocating a certain multipath tap to each of the Rake fingers to enable coherent reception of the spread-spectrum signal. The multipath delay estimator unit also serves as a searcher which periodically looks for the signal strengths of the nearby base stations. The code generators needed in the downlink receiver consist of OVSF and Gold code generators. By using a shift register to store the output of the code generators and several shifters, synchronous codes can be generated for each of the despreaders. The code generators may also use some methods to restrict the phase transitions of the successive complex spreading codes. Complex channel estimation is necessary to adjust the phases of the received QPSK symbols. In UTRA, complex channel estimates are determined with the aid of time-multiplexed pilot symbols or chip sequences, i.e., preambles and midambles. The multipath combiner coherently sums the energies of the multipath components by employing maximal ratio combining (MRC). In MRC, phase-corrected symbols from the Rake fingers are selectively combined into one symbol, maximizing the received signal SNR. Soft-decision symbols are further processed by a channel decoder which employs deinterleaving, rate dematching, and forward error correction decoding operations to determine the transmitted binary data. Moreover, various measurements have to be performed. The received signal power is estimated by calculating both wideband and narrowband signal energies from a stream of samples and symbols. The signal-to-interference ratio (SIR) has to be computed in order to enable closed-loop power control in the downlink so that the transmission power stays at suitable levels with respect to the desired quality of service.
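The maximal ratio combining step can be sketched as follows. The function name and the structure are illustrative, not taken from the paper; the math is the standard MRC rule of weighting each finger's symbol with the conjugate of its channel estimate before summing.

```python
def mrc_combine(finger_symbols, channel_estimates):
    """Maximal ratio combining: each finger output y_k ~ h_k * s + noise
    is rotated back and weighted by conj(h_k); summing the contributions
    aligns the phases and weights strong paths more, maximizing SNR."""
    return sum(y * h.conjugate()
               for y, h in zip(finger_symbols, channel_estimates))

# Two multipaths carrying the same QPSK symbol s:
s = complex(1, 1)
h = [complex(0.8, 0.1), complex(0.2, -0.4)]
z = mrc_combine([hk * s for hk in h], h)
```

In the noiseless case z equals s scaled by the sum of the path powers, so the symbol phase is preserved for the QPSK decision.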



Figure 2: Block diagram of a W-CDMA transmitter.

By using successive data symbols, a frequency error detector (FED) produces an estimate of the frequency error [8,9]. The output of the FED is passed to an automatic frequency control (AFC) algorithm which adjusts the frequency of the local oscillator to that of the base station transmitter. Automatic gain control (AGC) is employed to rapidly adjust the input voltage to the ADCs so that the signal levels stay in an appropriate range for proper reception. It should also be noted that the transmitter uses an estimate of the received signal strength to adjust the transmitter power in the TDD mode.

number representation, the sign is effectively stored in a single bit. Thus a multiplication by ±1 can be realized with a single exclusive-or (XOR) logic operation, resulting in minimal hardware overhead. True complex multiplications are employed only in the multipath combiner, which rotates and weights each of the multipath symbols with the corresponding channel estimate.
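The XOR trick for ±1 multiplication can be illustrated with a small sign-and-magnitude model. The 8-bit layout (sign in bit 7, magnitude in the low bits) and the helper names are assumptions for illustration only; the actual word layout is implementation-specific.

```python
SIGN_BIT = 0x80  # assumed position of the sign bit in an 8-bit word

def encode_sm(value: int) -> int:
    """Pack a small signed integer into 8-bit sign-and-magnitude."""
    return (SIGN_BIT if value < 0 else 0) | abs(value)

def decode_sm(word: int) -> int:
    mag = word & ~SIGN_BIT & 0xFF
    return -mag if word & SIGN_BIT else mag

def mul_by_chip(word: int, chip: int) -> int:
    """Multiply a sign-and-magnitude sample by a spreading chip of +1 or
    -1: a single XOR on the sign bit, no adder needed (the hardware
    saving argued in the text for the despreader)."""
    return word ^ (SIGN_BIT if chip < 0 else 0)
```

Compare this with two's complement, where multiplying by -1 needs a full bit inversion plus an increment.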

3.1. FULL CODE-MATCHED FILTER


Due to rapidly changing mobile radio channels, fast code acquisition is crucial for Rake receiver performance. The most suitable acquisition device for multipath estimation is a full code-matched filter [11]. The structure of the filter is depicted in Figure 3. Conceptually, the full code-matched filter is a correlation device which effectively performs one large parallel correlation of a given length with the I/Q samples stored in two delay lines. A number of complex-valued matching sequences are stored in a register bank so that different matching sequences can be rapidly selected. Although the fundamental structure is quite simple, the code-matched filter has to execute a massive amount of operations. For example, a filter realizing matching to a complex code sequence of 256 chips requires a total of 1024 multiply-accumulate operations. At the chip rate, this corresponds to 4G multiply-accumulate operations per second. However, since the multiplications are performed with parallel XOR operations, the correlation reduces into a sum of 1024 products. Moreover, further optimizations can be

2.2. W-CDMA TRANSMITTER


A W-CDMA transmitter for UTRA is considerably more straightforward when compared to the receiver side. Basically, the transmitter can be constructed with a simple dataflow structure that comprises QPSK symbol mapping, complex/dual-channel spreading, transmitter pulse shaping, and quadrature modulation operations, as shown in Figure 2. Due to paper length limitations, the transmitter will not be studied in detail. However, an interesting treatment of the pulse shaping filtering can be found in [10].

3. HARDWARE IMPLEMENTATION ASPECTS

From the hardware implementation point of view, the number representation of the I/Q samples throughout the receiver requires special attention. Traditionally, two's complement representation has been employed for its simplicity in arithmetic operations. However, in a Rake receiver most of the multiplications are performed with values of ±1. If the samples are in two's complement representation, a multiplication by -1 requires inversion of all bits of the sample and an addition by one. When a large number of these multiplications has to be performed in a Rake receiver, an alternative number representation may be more appropriate when the power consumption is considered. By employing sign-and-magnitude


Figure 3: Structure of the full code-matched filter

294


The final carry-propagate additions of the positive and negative branches are carried out in the dump section which operates at a symbol rate.


3.3. SYMBOL DESKEW BUFFER



Figure 4: Rake finger with a complex despreader and integrate-and-dump filter

Because symbol dumps from the Rake fingers are asynchronous with respect to each other, a symbol buffer is necessary to store time-skewed symbols from different multipaths [6]. In practice, the size of the buffers is determined by the maximum allowable delay spread and the supported spreading factors. Assuming a minimum spreading factor of four and a delay spread of 16 µs, the deskew buffer must have capacity for at least 16 symbols. Moreover, a larger deskew buffer can be used to store the first data part of the TDD downlink burst because the pilot sequence is located in the middle of the burst. After the first data part has been received, the multiplexer routes the pilot to the channel estimator. After the channel estimates have been calculated, the multipath combiner can proceed.
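The buffer sizing above follows from the chip rate: with the assumptions stated in the text, a 16 µs delay spread corresponds to roughly 65.5 chips of skew, or about 16.4 symbol periods at the minimum spreading factor, consistent with the "at least 16 symbols" figure. The variable names below are illustrative.

```python
chip_rate = 4.096e6      # chips/s (UTRA chip rate)
delay_spread_s = 16e-6   # assumed maximum delay spread
min_sf = 4               # minimum spreading factor (chips per symbol)

skew_chips = chip_rate * delay_spread_s   # ~65.5 chips of time skew
skew_symbols = skew_chips / min_sf        # ~16.4 symbol periods

print(skew_chips, skew_symbols)
```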

made if the code sequence does not contain pure complex values. For example, midamble chips alternate between real and imaginary in the TDD mode, thus reducing the multiply-accumulate operations by half. The output of the code-matched filter is further processed by a power estimator that employs an observation window of finite length to create a power estimate for each window position. In order to obtain reliable power estimates, the results are averaged. The multipath estimation and averaging could cover the pilot sequences of 32 time slots. This would allow an update to the Rake finger allocation in 20 ms intervals.
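One output of the code-matched filter can be modeled as a complex correlation against a ±1±j matching sequence; with such chips every "multiplication" is just a sign flip, which is the reduction argued in the text. The function below is an illustrative behavioral sketch, not the hardware structure.

```python
def matched_filter_output(i_samples, q_samples, code):
    """Correlate buffered I/Q samples against a complex matching
    sequence whose chips have +/-1 real and imaginary parts.
    (i + jq) * conj(cr + j*ci) = (i*cr + q*ci) + j(q*cr - i*ci),
    so each chip contributes four sign-flipped additions."""
    icorr = qcorr = 0.0
    for i, q, c in zip(i_samples, q_samples, code):
        icorr += i * c.real + q * c.imag
        qcorr += q * c.real - i * c.imag
    return icorr, qcorr

# A 256-chip complex code implies 4 such "products" per chip, i.e. the
# 1024 operations per filter output mentioned in the text.
code = [complex(1, -1)] * 4
ic, qc = matched_filter_output([1, 2, 3, 4], [1, 1, 1, 1], code)
```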

3.4. CHANNEL ESTIMATION


In UTRA, complex channel estimates are determined with the known pilot symbols or certain chip sequences. The channel estimates are then interpolated to provide valid estimates for the duration of a time slot. It may also be a feasible solution that the channel estimator switches to a decision-directed mode after initial estimation from the known pilot symbols. Moreover, the movement of the mobile receiver causes Doppler shifts to the multipath components. Interestingly, channel estimates can be used to compensate also for these frequency shifts, which are approximately 220 Hz at maximum for a mobile speed of 120 km/h. The optimal channel estimator is an FIR filter essentially performing a moving average on a number of received symbols [13]. However, exponential tail type IIR filters have also been employed in some receivers. The channel estimator itself should be adaptive to the changing conditions in the radio channel. Thus, the number of symbols in the averaging FIR filter and the loop gains in the IIR filters should be made adjustable.
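The moving-average estimator mentioned above can be sketched as follows. The raw per-pilot observation (received pilot divided by the known pilot symbol) and the window parameter are standard choices assumed for illustration; they are not details from the UTRA specification.

```python
def channel_estimate(received_pilots, known_pilots, window=4):
    """Raw estimates h_k = r_k / p_k, smoothed with a length-`window`
    moving average (an FIR filter with equal taps). `window` is the
    adjustable parameter discussed in the text."""
    raw = [r / p for r, p in zip(received_pilots, known_pilots)]
    est = []
    for k in range(len(raw)):
        span = raw[max(0, k - window + 1): k + 1]
        est.append(sum(span) / len(span))
    return est

# A constant channel of 0.5+0.5j is recovered exactly by the average.
h = complex(0.5, 0.5)
pilots = [complex(1, 1), complex(1, -1), complex(-1, 1)]
rx = [h * p for p in pilots]
```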

3.2. RAKE FINGER


A complex despreader together with an integrate-and-dump filter is depicted in Figure 4. A complex correlation is performed with a total of four multiplications and two additions. The despread samples from one symbol period are summed in the accumulator at the chip rate and the results are dumped out at the symbol rate. A Rake finger employing sign-and-magnitude number representation of the samples is shown in Figure 5 [12]. The Rake finger can be conveniently divided into despreader, integration, and dump sections. By using XOR sign-flips, the despreader multiplies the I/Q samples with the spreading codes and employs two separate branches to accumulate the positive and negative sums. The accumulation is performed in carry-save arithmetic at the chip rate.
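Behaviorally, the despread and integrate-and-dump chain can be sketched as below. This is a functional model under stated assumptions (r * conj(code) despreading, dump every `sf` chips); the names are illustrative, and the carry-save detail of the hardware is not modeled.

```python
def rake_finger(i_s, q_s, i_code, q_code, sf):
    """Despread the I/Q chip stream with a complex code replica (four
    real multiplies and two adds per chip, as in Figure 4) and
    integrate over `sf` chips, dumping one symbol per symbol period."""
    symbols = []
    acc_i = acc_q = 0.0
    for n, (i, q, ci, cq) in enumerate(zip(i_s, q_s, i_code, q_code), 1):
        acc_i += i * ci + q * cq   # real part of r * conj(c)
        acc_q += q * ci - i * cq   # imaginary part
        if n % sf == 0:            # symbol boundary: dump and reset
            symbols.append(complex(acc_i, acc_q))
            acc_i = acc_q = 0.0
    return symbols
```

With a real ±1 code and spreading factor 4, a transmitted symbol s comes out as 4s after the dump, i.e. the narrowband symbol is recovered up to the processing gain.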

3.5. MULTIPATH COMBINER


Figure 5: Partial implementation of a Rake finger (boxed section in Figure 4).

The multipath combiner uses the complex estimates from the channel estimation unit to produce phase-corrected symbols. In addition to the phase correction, the symbols are also multiplied with the estimates of the corresponding symbol magnitudes. Thus, the combiner effectively employs maximal-ratio combining by simply summing the phase-corrected and


weighted data symbols from the Rake fingers. The multipath combiner may also contain some decision logic to discard weak multipath components with low SNR.

4. BASEBAND PARTITIONING FOR DSP/ASIC

The core of a W-CDMA mobile terminal will be implemented as a system-on-a-chip (SOC) that contains programmable processors, dedicated hardware accelerators, memories, peripherals, and mixed-signal devices to realize all the required functions. Depending on the terminal capabilities, such as transceiver performance, supported data rates, and multimedia capabilities, different trade-offs can be justified. For example, an advanced multimedia terminal supporting 2 Mbit/s data rates has quite different system requirements as opposed to a low-end 144 kbit/s terminal. The W-CDMA receiver and transmitter architectures can be divided into domains that operate at sample, chip, and symbol rates. Because the sample/chip rates are quite high and a high level of parallelism is needed, the receiver blocks that are most likely to be implemented as dedicated application-specific integrated circuits (ASICs) are the RRC filter, full code-matched filter, code generators, and the Rake fingers. Symbol dumps from the Rake fingers and the averaged multipath tap profiles can be processed at rates that can be handled with a high-performance digital signal processor (DSP). In addition to fast FIR filtering operations, the latest DSPs calculate true complex multiplications effectively with their powerful datapaths comprising two or even four multiply-accumulate units. Moreover, another benefit of employing a programmable DSP is the flexibility of the implementation. When a DSP controls the general operation of the transceiver, the system can easily be made adaptive to variable symbol rates and the changing conditions on the radio channel.
The transmitter implementation, however, will be heavily hardware-oriented since the baseband operations are relatively simple and short data word lengths can be employed.

5. CONCLUSIONS

realized with simple parallel operations. From the receiver architecture, the full code-matched filter and the RRC filter are clearly the toughest parts to be realized. A programmable DSP provides flexible means for transceiver control and other system tasks. Moreover, a DSP can also take care of the symbol rate processing at relatively low data rates.

REFERENCES

[1] Tdoc SMG2 260/98, "The ETSI UMTS Terrestrial Radio Access (UTRA) ITU-R RTT Candidate Submission," European Telecommunications Standards Institute (ETSI), Sophia Antipolis, France, 1998.
[2] S.D. Lingwood, H. Kaufmann, and B. Haller, "ASIC Implementation of a Direct-Sequence Spread-Spectrum RAKE-Receiver," Proc. IEEE Vehicular Technology Conference (VTC), 1994, pp. 1326-1330.
[3] D.T. Magill, "A Fully-Integrated, Digital, Direct Sequence, Spread Spectrum Modem ASIC," Proc. IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), 1992, pp. 42-46.
[4] J.S. Wu, M.L. Liou, H.P. Ma, and T.D. Chiueh, "A 2.6-V, 33-MHz All-Digital QPSK Direct Sequence Spread-Spectrum Transceiver IC," IEEE Journal of Solid-State Circuits, Vol. 32, No. 10, October 1997, pp. 1499-1510.
[5] H.M. Chang and M.H. Sunwoo, "Implementation of a DSSS Modem ASIC Chip for Wireless LAN," Proc. IEEE Workshop on Signal Processing Systems (SiPS), 1998, pp. 243-252.
[6] J.K. Hinderling et al., "CDMA Mobile Station Modem ASIC," IEEE Journal of Solid-State Circuits, Vol. 28, No. 3, March 1993, pp. 253-260.
[7] C. Uhl, J.J. Monot, and M. Margery, "Single ASIC Receiver for Space Applications," Proc. IEEE Vehicular Technology Conference (VTC), 1994, pp. 1331-1335.
[8] H. Meyr, M. Moeneclaey, and S.A. Fechtel, Digital Communication Receivers: Synchronization, Channel Estimation, and Signal Processing. New York: John Wiley & Sons Inc., 1998.
[9] U. Fawer, "A Coherent Spread-Spectrum RAKE-Receiver with Maximum-Likelihood Frequency Estimation," IEEE International Conference on Communications (ICC), 1992, pp. 471-475.
[10] G.L. Do and K. Feher, "Efficient Filter Design for IS-95 CDMA Systems," IEEE Transactions on Consumer Electronics, Vol. 42, Issue 4, Nov. 1996, pp. 1011-1020.
[11] D.T. Magill and G. Edwards, "Digital Matched Filter ASIC," Proc. Military Communications Conference (MILCOM), 1990, pp. 235-238.

The presented W-CDMA transceiver architectures comprise a number of blocks which perform signal processing at the sample, chip, and symbol rates. Due to relatively high sample rates and the level of parallelism, especially in the receiver, the first mobile terminals will be based on dedicated hardware. The baseband blocks gaining most of a hardware implementation are those which can be

[12] S. Sheng and R. Brodersen, Low-Power CMOS Wireless Communications: A Wideband CDMA System Design. Kluwer Academic Publishers, 1998.
[13] S.D. Lingwood, "A 65 MHz Digital Chip Matched Filter for DS-Spread Spectrum Applications," Proc. International Zurich Seminar (IZS), Zurich, Switzerland, March 1994, pp. 1326-1330.

