Elec327b DSP Processors 1

DSP PROCESSORS AND DSP IMPLEMENTATION - 1
Introduction
General and special purpose DSP processors
Computer architectures for signal processing
General purpose fixed point DSP processors
Selecting DSP Processors
Implementation of DSP algorithms
Special purpose DSP processors
Summary and Problems
Professor E C Ifeachor
11 March, 2003.
1. Introduction
DSP processors are used to implement and execute DSP algorithms in real-time (often real-time implies 'as soon
as possible', but within specified time limits).
The main objectives of this section of the DSP course (lectures session and associated laboratory/course work)
are to provide an understanding of
(1) Key issues underlying DSP processors and their hardware/software architectures.
(2) How DSP algorithms are implemented for real-time execution using fixed point DSP processors (digital
filtering will be used as a vehicle for this).
(3) Finite word length effects in fixed point DSP systems (using digital filtering as a vehicle for the
discussions).
2. General and special purpose DSP processors
For convenience, DSP processors can be divided into two broad categories:
(1) General purpose DSP processors: these are basically high speed microprocessors with hardware and
instruction sets optimized for DSP operations. Examples of such processors include fixed-point devices
such as Texas Instruments TMS320C54x and Motorola DSP563x processors, and floating point
processors such as Texas Instruments TMS320C4x and Analog Devices ADSP21xxx SHARC
processors.
(2) Special purpose DSP processors: these include (i) hardware designed for efficient execution of
specific DSP algorithms (sometimes called algorithm-specific hardware), e.g. FFT, and (ii)
hardware designed for specific applications (sometimes called application-specific processors), e.g. for
PCM in telecommunications or audio applications. Examples of special-purpose DSP processors are
Cirrus's processor for digital audio sampling rate converters (CS8420), Mitel's multi-channel telephony
voice echo canceller (MT9300), FFT processor (PDSP16515A) and programmable FIR filter
(VPDSP16256).
3. Computer architectures for signal processing
Standard microprocessors are based on the von Neumann concept, in which operations are performed sequentially.
An increase in processor speed is achieved only by making the individual units of the processor operate faster, but
there is a limit to this (see Figure 1). For real-time operation, DSP processors must have an architecture optimised
for executing DSP operations. Figure 2 depicts a generic hardware architecture for DSP.
Figure 1 A simplified architecture for standard microprocessors
Figure 2 A simplified generic hardware architecture for DSP
The characteristic features of the architecture of Figure 2 include:
Multiple bus structure, with separate memory spaces for data and programs.
Arithmetic units for logical and arithmetic operations, including a hardware multiplier/accumulator.
Why is such an architecture necessary? In DSP most algorithms, e.g. digital filtering and FFT, involve
repetitive arithmetic operations such as multiplication, additions, memory accesses and heavy data flow through
the CPU.
The architecture of standard microprocessors is not suited to this type of activity. An important goal in DSP
hardware design is to optimise both hardware architecture and instruction set to increase speed and make real
time execution possible whilst keeping quantization errors low. In DSP, this is achieved by making extensive use
of the concepts of parallelism. In particular, the following techniques are used:
Harvard architecture
Pipelining
Fast, dedicated hardware multiplier/accumulator
Specialised instructions dedicated to DSP
Replication
On-chip memory/cache.
Extended parallelism: SIMD, VLIW and static superscalar processing.
We will examine some of the above techniques to gain more understanding of the architectural features of DSP
processors.
3.1 Harvard architecture
In a standard microprocessor, the program codes and the data are held in one memory space. Thus, the
fetching of the next instruction while the current one is executing is not allowed, because the fetch and
execution phases each require memory access (see Figure 3).
Figure 3 An illustration of instruction fetch, decode and execute in a non-Harvard architecture with single
memory space (a) instruction fetch from memory; (b) timing diagram
NB: The example illustrates reading of a value op1 at address ADR1 in memory into the accumulator and
then storing it at two other addresses, ADR2 and ADR3. The instructions could be:
LDA ADR1    Load the operand op1 into the accumulator from ADR1
STA ADR2    Store op1 in address ADR2
STA ADR3    Store op1 in address ADR3
Typically, an instruction in a microprocessor involves three distinct steps:
Instruction fetch
Instruction decode
Instruction execute.
The main feature of the Harvard architecture is that the program and data memories lie in two separate
spaces, see Figure 4. This permits a full overlap of instruction fetch and execution.
Figure 4 The basic Harvard architecture with separate data and program spaces;
Figure 5 An illustration of instruction overlap made possible by Harvard architecture.

In a Harvard architecture, since the program codes and data lie in separate memory spaces, the fetching of
the next instruction can overlap the execution of the current instruction. Normally, the program memory
holds the program codes, whilst the data memory stores variables such as the input data samples.
3.2 Pipelining
This is a technique used extensively in DSP to increase speed as it allows two or more operations to
overlap during execution. In pipelining, a task is broken down into a number of distinct sub-tasks which
are then overlapped during execution.
A pipeline is akin to a typical production line in a factory, such as a car or TV assembly plant. As in the
production line, the task is broken down into small, independent sub-tasks called pipe stages which are
connected in series to form a pipe. Execution is sequential.
Figure 6 An illustration of the concepts of pipelining.

Figure 6 gives a timing diagram of a 3-stage pipeline. Typically, each step in the pipeline takes one
machine cycle to complete. Thus, during a given cycle up to three different instructions may be active at
the same time, although each will be at a different stage of completion.
The speedup is given by
$$\text{speedup} = \frac{\text{average instruction time (non-pipelined)}}{\text{average instruction time (pipelined)}} \qquad (1)$$
Example 1
In a non pipeline processor, the instruction fetch, decode and execute take 35 ns, 25 ns, and 40 ns,
respectively. Determine the increase in throughput if the instruction steps were pipelined. Assume a 5 ns
pipeline overhead at each stage, and ignore other delays.
Solution
In an ideal non pipeline processor, the average instruction time is simply the sum of the times for
instruction fetch, decode and execute:
35 + 25 + 40 ns = 100 ns.
However, if we assume a fixed machine cycle, then each instruction would take three machine cycles
to complete: 40 ns x 3 = 120 ns (the execute time, being the longest stage time, determines the cycle time). This
corresponds to a throughput of 8.3 x 10^6 instructions per second.
In the pipelined processor, the clock speed is determined by the speed of the slowest stage plus overheads,
i.e. 40 + 5 = 45 ns. The throughput (when the pipeline is full) is 22.2 x 10^6 instructions per second.
$$\text{speedup} = \frac{\text{average instruction time (non-pipelined)}}{\text{average instruction time (pipelined)}} = \frac{120}{45} = 2.67$$
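The arithmetic of Example 1 can be checked with a short Python sketch. The function name `pipeline_speedup` and its arguments are illustrative choices, not part of the course material; the sketch assumes, as the example does, a fixed machine cycle set by the slowest stage and a constant overhead per pipeline stage.

```python
def pipeline_speedup(stage_times_ns, overhead_ns):
    """Return (non-pipelined time, pipelined cycle time, speedup), times in ns."""
    slowest = max(stage_times_ns)
    # Fixed machine cycle: every stage is allotted the slowest stage's time.
    non_pipelined = slowest * len(stage_times_ns)
    # Pipelined: one instruction completes per cycle once the pipeline is full.
    pipelined = slowest + overhead_ns
    return non_pipelined, pipelined, non_pipelined / pipelined


non_pipe, pipe, speedup = pipeline_speedup([35, 25, 40], overhead_ns=5)
print(non_pipe, pipe, round(speedup, 2))               # 120 45 2.67
# Throughput in millions of instructions per second (1000 / cycle time in ns)
print(round(1e3 / non_pipe, 1), round(1e3 / pipe, 1))  # 8.3 22.2
```

The same function can be applied to other stage times by changing the two arguments.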
Pipelining has a major impact on the system memory because it leads to an increased number of memory
accesses (typically by the number of stages). The use of Harvard architecture where data and instructions
lie in separate memory spaces promotes pipelining.
Drill
Assuming the times in the above example are as follows:
fetch       20 ns
decode      25 ns
execute     15 ns
overhead    1 ns
Determine the increase in throughput if the instructions were pipelined.
Solution
Example 2
Most DSP algorithms are characterised by multiply-and-accumulate operations typified by the following
equation:
$$y(n) = a_0 x(n) + a_1 x(n-1) + a_2 x(n-2) + \dots + a_{N-1} x(n-(N-1))$$
Figure 7 shows a non-pipelined configuration for an arithmetic element for executing the above equation.
Assume a transport delay of 200 ns, 100ns and 100 ns, respectively for the memory, multiplier and the
accumulator.
(1) What is the system throughput?
(2) Reconfigure the system with pipelining to give a speed increase of 2:1. Illustrate the operation of the
new configuration with a timing diagram.
Figure 7 Non-pipelined MAC configuration.

Solution
(1) The coefficients, $a_k$, and the data arrays are stored in memory as shown in Figure 7. In the non-pipelined
mode, the coefficients and data are accessed sequentially and applied to the multiplier.
The products are summed in the accumulator. Successive MACs will be performed once every 400
ns (200 + 100 + 100), that is a throughput of 2.5 x 10^6 operations per second.
(2) The arithmetic operations involved can be broken up into three distinct steps: memory read,
multiply, and accumulate. To improve speed these steps can be overlapped. A speed improvement
of 2:1 can be achieved by inserting pipeline registers between the memory and multiplier and
between the multiplier and accumulator as shown in Figure 8. The timing diagram for the pipeline
configuration is shown in Figure 9. As is evident in the timing diagram, the MAC is performed
once every 200 ns. The limiting factor is the basic transport delay through the slowest element, in
this case the memory. Pipeline overheads have been ignored.
Figure 8 Pipelined MAC configuration. The pipeline registers serve as temporary store for coefficient
and data sample pair. The product register also serves as a temporary store for the product.
Figure 9 Timing diagram for a pipelined MAC unit. When the pipeline is full, a MAC operation is
performed every clock cycle (200 ns).
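As a rough textual stand-in for the timing diagram, the following Python sketch lists which coefficient/data pair occupies each stage of the three-stage MAC pipeline in every clock cycle. The function name `pipeline_schedule` and the stage labels are illustrative assumptions; the 200 ns clock and 400 ns non-pipelined MAC time are taken from Example 2, and pipeline overheads are ignored as in the example.

```python
def pipeline_schedule(n_items, stages=("read", "multiply", "accumulate")):
    """Per clock cycle, map each stage to the item index it holds (None = idle)."""
    n_cycles = n_items + len(stages) - 1   # fill time plus one cycle per item
    schedule = []
    for cycle in range(n_cycles):
        row = {}
        for depth, name in enumerate(stages):
            item = cycle - depth           # item entered the pipe 'depth' cycles ago
            row[name] = item if 0 <= item < n_items else None
        schedule.append(row)
    return schedule


sched = pipeline_schedule(4)
for cycle, row in enumerate(sched):
    print(cycle, row)
# Total time for four MACs: pipelined (200 ns clock) vs non-pipelined (400 ns each)
print(len(sched) * 200, 4 * 400)           # 1200 1600
```

For a long run of MACs the fill time becomes negligible and the ratio approaches the 2:1 improvement stated in the solution.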
DSP algorithms are often repetitive but highly structured, making them well suited to multilevel
pipelining. Pipelining ensures a steady flow of instructions to the CPU, and in general leads to a
significant increase in system throughput. However, on occasions pipelining may cause problems (e.g.
an unwanted instruction execution, especially near branch instructions).
3.3 Multiplier/Accumulator
The basic numerical operations in DSP are multiplication and addition. Multiplication in software is time
consuming. Additions are even worse if floating point arithmetic is used.
To make real-time DSP possible, a fast dedicated hardware MAC, using either fixed point or floating point
arithmetic is mandatory. Characteristics of a typical fixed point MAC include:
16 x 16 bit 2's complement inputs
16 x 16 bit multiplier with 32-bit product in 25 ns
32/40 bit accumulator
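The word sizes listed above can be illustrated with a small Python sketch. It models only the number ranges involved, not the timing or any particular device; the helper names `fits_signed` and `mac` are hypothetical. It shows that a 16 x 16 bit product fits in 32 bits, while a running sum of such products can outgrow 32 bits and still fit in a 40-bit accumulator.

```python
def fits_signed(value, bits):
    """True if value is representable as a two's complement number of 'bits' bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def mac(acc, a, b):
    """Return acc + a*b for 16-bit two's complement inputs a and b."""
    assert fits_signed(a, 16) and fits_signed(b, 16)
    product = a * b
    assert fits_signed(product, 32)   # a 16 x 16 bit product always fits in 32 bits
    return acc + product


acc = 0
for _ in range(4):
    acc = mac(acc, 32767, 32767)      # four near-full-scale products
print(fits_signed(acc, 32), fits_signed(acc, 40))  # False True
```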

3.4 Special instructions
These are instructions optimised for DSP and lead to compact codes and increased speed of execution of
operations that are repeated. For example, digital filtering requires data shifts or delays to make room for
new data, followed by multiplication of the data samples by the filter coefficients, and then accumulation
of products. Recall that FIR filters are characterised by the following equation:
$$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)$$
where N is the filter length.
In the TMS320C50, for example, the FIR equation can be efficiently implemented using the instruction
pair:
RPT   NM1
MACD  HNM1, XNM1
The first instruction, RPT NM1, loads the filter length minus 1 (N-1) into the repeat instruction counter,
and causes the multiply-accumulate with data move (MACD) instruction following it to be repeated N
times. The MACD instruction performs a number of operations in one cycle:
(1) multiplies the data sample, $x(n-k)$, in the data memory by the coefficient, $h(k)$, in the
program memory;
(2) adds previous product to the accumulator;
(3) implements the unit delay, symbolized by $z^{-1}$, by shifting the data sample, $x(n-k)$, up to update
the tapped delay line.
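The three operations listed above can be mimicked in software. The Python sketch below is a simplified model of one repeated multiply-accumulate-with-data-move loop, not a cycle-accurate emulation of the TMS320C50; the function name `fir_macd` and the use of Python lists for the coefficient and delay-line memories are assumptions made for illustration.

```python
def fir_macd(h, delay_line, new_sample):
    """Compute one FIR output sample; delay_line[k] holds x(n-k) and is updated in place."""
    n_taps = len(h)
    delay_line[0] = new_sample            # x(n) enters the tapped delay line
    acc = 0                               # accumulator
    product = 0                           # product register (holds the previous product)
    # Work from the oldest sample, x(n-(N-1)), down to the newest, x(n).
    for k in range(n_taps - 1, -1, -1):
        acc += product                    # (2) add previous product to the accumulator
        product = h[k] * delay_line[k]    # (1) multiply x(n-k) by h(k)
        if k + 1 < n_taps:
            delay_line[k + 1] = delay_line[k]  # (3) shift the sample up (unit delay)
    acc += product                        # add the final product after the loop
    return acc


h = [1, 2, 3]
line = [0, 0, 0]
print([fir_macd(h, line, x) for x in (1, 0, 0, 0)])  # [1, 2, 3, 0]
```

Feeding a unit impulse returns the coefficients in order, which is a quick check that the delay line is shifted correctly. Starting from the oldest sample means each value is copied upwards before it is overwritten.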
In the Motorola DSP56000 DSP processor family, as in the TMS320 family, the MAC instruction,
together with the repeat instruction (REP) may be used to implement an FIR filter efficiently:
REP  #N-1
MAC  X0,Y0,A   X:(R0)+,X0   Y:(R4)+,Y0
Here the repeat instruction is used with the MAC instruction to perform sustained multiplication and sums
of product operations. Again, notice the ability to perform multiple operations with one instruction, made
possible by having multiple data paths.
The contents of the registers X0 and Y0 are multiplied together and the product added to the accumulator.
At the same time, the next data sample and corresponding coefficient are fetched from the X and Y
memories for multiplication.
In most modern DSP processors, the concept of instruction repeat has been taken further by providing
instructions that allow a block of code, not just a single instruction, to be repeated a specified number of
times. In the TMS320 family (e.g. TMS320C50, TMS320C54 and TMS320C30), the format for repeat
execution of a block of instructions, with a zero-overhead loop, is:
      RPTB loop
      :
      :
loop  (last instruction)
Repeat instructions provided by some DSP processors have high level language features. In Motorola
DSP56000 and DSP56300 families zero-overhead DO loops are provided which may also be nested. The
example below illustrates a nested DO loop in which the outer loop is executed N times and the inner loop
N x M times in total.
DO #N, LOOP1
:
DO #M, LOOP2
:
LOOP2 (last instruction is placed here)
:
LOOP1 (last instruction in the outer loop is placed here)
Nested loops are useful for efficient implementation of DSP functions such as FFT algorithms and 2-D
signal processing.
Analog Devices DSP processors (e.g. ADSP-2115 and SHARC processors) also have nested-looping
capability. The ADSP-2115 supports up to 4 levels of nested loops. The format for looping is:
CNTR = N
DO LOOP UNTIL CE
:
:
LOOP: (last instruction in the loop)
The loop is repeated until the counter expires. The loop can contain a large block of instructions, not just a
single instruction. The format for nested looping is essentially the same as for DSP56000 family.
Modern DSP processors also feature application-oriented instructions for applications such as speech
coding (e.g. those for codebook search), digital audio (e.g. those for surround sound) and
telecommunications (e.g. those for Viterbi decoding). Other application oriented instructions include those
that support coefficient update for adaptive filters and bit reverse addressing for FFTs (see later).
3.5 Extended parallelism - SIMD, VLIW and static superscalar processing
The trend in DSP processor architecture design is to increase both the number of instructions executed in
each cycle and the number of operations performed per instruction to enhance performance. In newer DSP
processor architectures, parallel processing techniques are extensively used to achieve increased
computational performance. The three techniques that are used, often in combination, are:
Single instruction, multiple data (SIMD) processing.

Very-long-instruction-word (VLIW) processing
Superscalar processing
Figure 10 An illustration of the use of SIMD processing and multiple data size capability to extend the
number of multiplier/accumulators (MACs) from one to four in a TigerSHARC DSP processor.
Note: SIMD processing is used to increase the number of operations performed per instruction. Typically, in DSP
processors with SIMD architectures the processor has multiple data paths and multiple execution units. Thus, a
single instruction may be issued to the multiple execution units to process blocks of data simultaneously and in this
way the number of operations performed in one cycle is increased.
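The idea of one instruction operating on several data items at once can be sketched in Python by packing four 16-bit values into a single 64-bit word and adding two such words lane by lane. This is only a software illustration of the SIMD principle, not a model of the TigerSHARC data path; the function names `pack16`, `unpack16` and `simd_add16`, the lane width, and the wrap-around behaviour on overflow are assumptions made for the example.

```python
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack16(values):
    """Pack signed 16-bit integers into one word, lane 0 in the least significant bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (LANE_BITS * i)
    return word


def unpack16(word, n_lanes=4):
    """Split a packed word back into signed 16-bit integers."""
    lanes = []
    for i in range(n_lanes):
        v = (word >> (LANE_BITS * i)) & LANE_MASK
        # Re-interpret the 16-bit pattern as two's complement
        lanes.append(v - (1 << LANE_BITS) if v & (1 << (LANE_BITS - 1)) else v)
    return lanes


def simd_add16(word_a, word_b, n_lanes=4):
    """One 'instruction': add all lanes of two packed words (wraps on overflow)."""
    a, b = unpack16(word_a, n_lanes), unpack16(word_b, n_lanes)
    return pack16([x + y for x, y in zip(a, b)])


result = simd_add16(pack16([1, 2, 3, 4]), pack16([10, 20, 30, -5]))
print(unpack16(result))  # [11, 22, 33, -1]
```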
Figure 11 Principles of very long instruction word (VLIW) architecture and data flow in the
advanced, fixed point DSP processor, TMS320C62x.
Note: Very-long-instruction-word processing is an important approach for substantially increasing the number of
instructions that are processed per cycle. A very long instruction word is essentially a concatenation of several short
instructions and requires multiple execution units, running in parallel, to carry out the instructions in a single cycle. In
the TMS320C62x, the CPU contains two data paths and eight independent execution units, organised in two sets
(L1, S1, M1 and D1) and (L2, S2, M2 and D2). In this case, each short instruction is 32 bits wide and eight of these
are linked together to form a very long instruction word packet which may be executed in parallel. The VLIW
processing starts when the CPU fetches an instruction packet (eight 32-bit instructions) from the on-chip program
memory. The eight instructions in the fetch packet are formed into an execute packet, if they can be executed in
parallel, and then dispatched to the eight execution units as appropriate. The next 256-bit instruction packet is
fetched from the program memory while the execute packet is decoded and executed. If the eight instructions in a
fetch packet are not executable in parallel, then several execute packets will be formed and dispatched to the
execution units, one at a time. A fetch packet is always 256 bits wide (eight instructions), but an execute packet may
vary between 1 and 8 instructions.
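The relationship between fetch packets and execute packets can be illustrated with a toy Python model. The grouping rule used here (start a new execute packet whenever an instruction needs an execution unit that is already taken) is an assumption chosen for simplicity and is not a description of how the TMS320C62x actually decides which instructions run in parallel; the function name `split_fetch_packet` and the mnemonics are placeholders.

```python
def split_fetch_packet(fetch_packet):
    """Split a list of (unit, mnemonic) pairs into execute packets.

    Toy rule (an assumption): instructions are taken in order, and a new
    execute packet starts when the required execution unit is already in use.
    """
    execute_packets = []
    current, busy = [], set()
    for unit, mnemonic in fetch_packet:
        if unit in busy:                  # unit conflict: close the current packet
            execute_packets.append(current)
            current, busy = [], set()
        current.append((unit, mnemonic))
        busy.add(unit)
    if current:
        execute_packets.append(current)
    return execute_packets


fetch = [("L1", "ADD"), ("M1", "MPY"), ("M1", "MPY"), ("D1", "LDW"),
         ("L1", "ADD"), ("S1", "SHL"), ("M2", "MPY"), ("D2", "STW")]
print([len(p) for p in split_fetch_packet(fetch)])  # [2, 6]
```

A fetch packet that uses each of the eight units once yields a single execute packet of eight instructions, whereas eight instructions that all need the same unit yield eight execute packets of one instruction each.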
Figure 12 Principles of superscalar architecture and data flow in the TigerSHARC DSP processor
Note: Superscalar processing is used to increase the instruction rate of a DSP processor by exploiting instruction-level
parallelism. Traditionally, the term superscalar refers to computer architectures that enable multiple
instructions to be executed in one cycle. Such architectures are widely used in general purpose processors, such as
PowerPC and Pentium processors. In superscalar DSP processors, multiple execution units are provided and several
instructions may be issued to the units for concurrent execution. Extensive use is also made of pipelining techniques
to increase performance further. The TigerSHARC is described as a static superscalar DSP processor because
parallelism in the instructions is determined before run-time. In fact, the TigerSHARC processor combines SIMD,
VLIW and superscalar concepts. This advanced DSP processor has multiple data paths and two sets of independent
execution units, each with a multiplier, ALU, a 64-bit shifter and a register file. TigerSHARC is a floating point
processor, but it supports fixed point arithmetic with multiple data types (8-, 16-, and 32-bit numbers). The instruction
width is not fixed in the TigerSHARC processor. In each cycle, up to four 32-bit instructions are fetched from the
internal program memory and issued to the two sets of execution units in parallel. An instruction may be issued to
both units in parallel (SIMD instructions) or to each execution unit independently. Each execution unit (ALU,
multiplier or shifter) takes its inputs from and returns its results to the register file. The register files are connected to
the three data paths and so can simultaneously read two inputs and write an output to memory in a cycle. This
load/store architecture is suited to basic DSP operations, which often take two inputs and compute an output.
Because the processor can work on several data sizes, the execution units allow further levels of parallel
computation. Thus, in each cycle the TigerSHARC can execute up to eight add/subtract operations and eight
multiply-accumulate operations with 16-bit inputs, instead of two multiply-accumulate operations with 32-bit
inputs.
4. General purpose fixed point DSP processors
General-purpose DSP processors have evolved substantially over the last decade as a result of the never-ending
quest to find better ways to perform DSP operations, in terms of computational efficiency,
ease of implementation, cost, power consumption, size, and application-specific needs. The
insatiable appetite for improved computational efficiency has led to substantial reductions in
instruction cycle times and, more importantly, to increasing sophistication in the hardware and
software architectures. It is now common to have dedicated, on-chip arithmetic hardware units
(e.g. to support fast multiply/accumulate operations), large on-chip memory with multiple access
and special instructions for efficient execution of inner core computations in DSP. We have also
seen a trend towards increased data word sizes (e.g. to maintain signal quality) and increased
parallelism (to increase both the number of instructions executed in one cycle and the number of
operations performed per instruction). Thus, we find that in newer general purpose DSP processors
increasing use is made of multiple data paths/arithmetic units to support parallel operations. DSP
processors based on SIMD, VLIW and superscalar architectures are being introduced to support
efficient parallel processing. In some DSP processors, performance is enhanced further by using
specialised, on-chip co-processors to speed up specific DSP algorithms such as FIR filtering and
Viterbi decoding. The explosive growth in communications and digital audio technologies has had
a major influence on the evolution of DSP processors, as has growth in embedded DSP processor
applications.
A summary of key features of four generations of fixed-point DSP processors from four leading
semiconductor manufacturers is given in Table 1. The classification of DSP processors into the four
generations is partly based on historical reasons, architectural features and computational performance.
The basic architecture of the first generation fixed point DSP processor family (TMS320C1x), first
introduced in 1982 by Texas Instruments, is depicted in Figure 13. A typical second generation DSP
processor is depicted in Figure 14 (Motorola DSP5600x).
Figure 13 A simplified architecture of a first generation fixed point DSP processor (Texas Instruments
TMS320C10)
Figure 14 A simplified architecture of a second generation fixed point DSP (Motorola DSP56000).
Third generation fixed point DSP processors are essentially enhancements of second generation DSP
processors. Compared to the second generation DSP processors, features of the third generation DSP
processors include more data paths (typically three compared to two in the second generation), wider data
paths, larger on-chip memory and instruction cache and in some cases a dual MAC. As a result, the third
generation DSP processors have performance that is typically 2 or 3 times that of the second
generation DSP processors of the same family. Simplified architectures of two third generation DSP
processors are depicted in Figure 14 (TMS320C54x) and Figure 15 (DSP563x). Most of the third
generation fixed-point DSP processors are aimed at applications in digital communications and digital
audio, reflecting the enormous growth and influence of these application areas on DSP processor
development. Thus, we find features in some of the processors that support these applications.
The TMS320C54x, for example, includes special instructions for adaptive filtering (which is often used
for echo cancellation and adaptive equalisation in telecommunications) and to support Viterbi decoding.
In the third generation processors, semiconductor manufacturers have also taken the issue of power
consumption seriously (because of its importance in portable and hand held devices such as the mobile
phone). Most of the third generation DSP processors are low power and have power management facilities.
Fourth generation fixed point DSP processors with their new architectures are primarily aimed at large
and/or emerging multichannel applications, such as digital subscriber loops, remote access server
modems, wireless base stations, third generation mobile systems and medical imaging. The new fixed
point architecture that has attracted a great deal of attention in the DSP community is the very long
instruction word (VLIW). The new architecture makes extensive use of parallelism whilst retaining some
of the good features of previous DSP processors. Compared to previous generations, fourth generation
fixed point DSP processors, in general, have wider instruction words, wider data paths, more registers,
larger instruction cache and multiple arithmetic units, enabling them to execute many more instructions
and operations per cycle. The Texas Instruments TMS320C62x family of fixed point DSP processors is
based on the VLIW architecture. The core processor has two independent arithmetic paths, each with four
execution units: a logic unit (Li), a shifter/logic unit (Si), a multiplier (Mi) and a data address unit (Di).
Typically, the core processor fetches eight 32-bit instructions at a time, giving an instruction width of 256
bits (and hence the term very long instruction word). With a total of eight execution units, four in each
data path, the TMS320C62x can execute up to eight instructions in parallel in one cycle. The processor
has large program and data cache memories (typically, 4 Kbyte of level 1 program/data caches and 64
Kbyte of level 2 program/data cache). Each data path has its own register file (sixteen 32-bit registers), but
can also access registers on the other data path. Advantages of VLIW architectures include simplicity and
high computational performance. Disadvantages include increased program memory usage (organisation
of codes to match the inherent parallelism of the processor may lead to inefficient use of memory).
Further, optimum processor performance can only be achieved when all the execution units are busy,
which is not always possible because of data dependencies, instruction delays and restrictions in the use of
the execution units. However, sophisticated programming tools are available for code packing, instruction
scheduling, resource assignment and in general to exploit the vast potential of the processor.
5. Floating-point DSP processors
The ability of DSP processors to perform high speed, high precision DSP operations using floating point
arithmetic has been a welcome development. This minimises finite word length effects such as overflows,
round off errors, and coefficient quantization errors inherent in DSP. It also facilitates algorithm
development, as a designer can develop an algorithm on a large computer in a high level language and
then port it to a DSP device more readily than with fixed point.
Floating point DSP processors retain key features of fixed point processors such as special instructions for
DSP operations and multiple data paths for multiple operations. As in the case of fixed point DSP
processors, floating point DSP processors available are significantly different architecturally. Some of the
key features of the three generations of floating point DSP processors from Texas Instruments and Analog
Devices are summarised in Table 2.
Table 1. Features of general purpose fixed-point DSPs from Texas Instruments, Motorola and Analog Devices.
Table 2 Features of general purpose floating-point DSPs from Texas Instruments, Motorola and Analog
Devices.
6. Selecting DSP Processors
The choice of a DSP processor for a given application is an important issue because of the wide range of
processors available. Specific factors that may be considered when selecting a DSP processor for an application
include architectural features, execution speed, type of arithmetic and word length:
(1). Architectural features: Most DSP processors available today have good architectural features, but these
may not be adequate for a specific application. Key features of interest include size of on-chip memory, special
instructions and I/O capability. On-chip memory is an essential requirement in most real-time DSP applications
for fast access to data and rapid program execution. For memory hungry applications (e.g. digital audio Dolby
AC-2, FAX/Modem, MPEG coding/decoding), the size of internal RAM may become an important
distinguishing factor. Where internal memory is insufficient this can be augmented by high speed, off-chip
memory, although this may add to system costs. For applications that require fast and efficient communication
or data flow with the outside world, I/O features such as interfaces to ADCs and DACs, DMA capability and support
for multiprocessing may be important. Depending on the application, a rich set of special instructions to support
DSP operations is important, e.g. zero-overhead looping capability, dedicated DSP instructions, and circular
addressing.
(2). Execution speed: The speed of DSP processors is an important measure of performance because of the
time critical nature of most DSP tasks. Traditionally, the two main units of measurement for this are the clock
speed of the processor, in MHz, and the number of instructions performed, in millions of instructions per second
(MIPS) or, in the case of floating point DSP processors, in millions of floating point operations per second
(MFLOPS). However, such measures may be inappropriate in some cases because of significant differences in
the way different DSP processors operate, with most able to perform multiple operations in one machine
instruction. For example, the C62x family of processors can execute as many as eight instructions in a cycle.
The number of operations performed in each cycle also differs from processor to processor. Thus, comparison of
execution speed of processors based on such measures may not be meaningful. An alternative measure is based
on the execution speed of benchmark algorithms, e.g. DSP kernels such as FFT, FIR and IIR filters. In Tables
1 and 2, performance indices based on such benchmarks are given to give an indication of the relative
performance of a number of popular DSP processors.
(3). Type of arithmetic: The two most common types of arithmetic used in modern DSP processors are fixed
and floating point arithmetic. Floating point arithmetic is the natural choice for applications with wide and variable
dynamic range requirements (dynamic range may be defined as the difference between the largest and smallest
signal levels that can be represented or the difference between the largest signal and the noise floor, measured in
decibels). Fixed point processors are favoured in low cost, high volume applications (e.g. cellular phones and
computer disk drives). The use of fixed point arithmetic raises issues associated with dynamic range constraints
which the designer must address (see later). In general, floating point processors are more expensive than fixed point
processors, although the cost difference has fallen significantly in recent years. Most floating point DSP
processors available today also support fixed point arithmetic.
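As a rough numerical illustration of dynamic range in fixed point arithmetic, the Python sketch below uses the common approximation that a B-bit word spans a ratio of about 2^B between its largest and smallest non-zero magnitudes, i.e. about 6.02 dB per bit. The function name `dynamic_range_db` is an illustrative choice, and the figures are idealised values rather than measured device performance.

```python
import math


def dynamic_range_db(word_length_bits):
    """Approximate dynamic range of a fixed point word in dB (about 6.02 dB per bit)."""
    return 20 * math.log10(2 ** word_length_bits)


for bits in (16, 24, 32):
    print(bits, round(dynamic_range_db(bits), 1))
# 16 96.3
# 24 144.5
# 32 192.7
```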
(4). Word length: Processor data word length is an important parameter in DSP as it can have a significant
impact on signal quality. It determines how accurately parameters and results of DSP operations can be
represented (see later for details). In general, the longer the data word the lower the errors that are introduced by
digital signal processing. In fixed point audio processing, for example, a processor word length of at least 24 bits
is required to keep the smallest signal level sufficiently above the noise floor generated by signal processing
to maintain CD quality. A variety of processor word lengths are used in fixed point DSP processors, depending on
application (see Table 1). Fixed point DSP processors aimed at telecommunications markets tend to use a 16-bit
word length (e.g. TMS320C54x), whereas those aimed at high quality audio applications tend to use 24 bits
(e.g. DSP56300). In recent years, we have seen a trend towards the use of more bits for the ADC and DAC (e.g.
Cirrus 24-bit audio codec, CS4228) as the cost of these devices falls to meet the insatiable demand for increased
quality. Thus, we are likely to see an increased demand for larger processor word lengths for audio processing.
In fixed point processors, it may also be necessary to provide guard bits (typically 1 to 8 bits) in the
accumulators to prevent arithmetic overflows during extended multiply and accumulate operations. The extra
bits effectively extend the dynamic range available in the DSP processor. In most floating point DSP
processors, a 32-bit data size (24-bit mantissa and 8-bit exponent) is used for single-precision arithmetic.
This size is also compatible with the IEEE floating point format (IEEE 754). Most floating point DSP
processors also have fixed point arithmetic capability, and often support variable data size, fixed point
arithmetic.
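The role of guard bits can be made concrete with a rule of thumb: summing n products can increase the magnitude of the result by a factor of up to n, so roughly ceil(log2 n) extra accumulator bits are needed to rule out overflow. The Python sketch below applies this rule; the function name `guard_bits_needed` is hypothetical, and the estimate is a worst-case approximation rather than a property of any particular processor.

```python
def guard_bits_needed(n_terms):
    """Rule-of-thumb guard bits for summing n_terms products: ceil(log2(n_terms))."""
    if n_terms < 1:
        raise ValueError("n_terms must be at least 1")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)) exactly.
    return (n_terms - 1).bit_length()


for n in (32, 256, 1000):
    print(n, guard_bits_needed(n))
# 32 5
# 256 8
# 1000 10
```

On this estimate, 8 guard bits cover about 256 successive multiply-accumulate operations in the worst case.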
In practice, factors such as experience/familiarity with a particular DSP processor family, ease of use, time to
market and costs may be the over-riding factors in selecting a given processor.
Problems
1. Analogue I/O
2. IIR filter design - amplitude distortion and filter order
3. FIR filter design - half band filters.
4. DSP processors and DSP implementation
5. Multirate systems
6. Adaptive systems.