Types of DSP Architectures

Types of Architectures
Von Neumann Architecture
Harvard Architecture
Super Harvard Architecture
VLIW Architecture
The Von Neumann Architecture
Developed in 1946 by John von Neumann.
Development took place at the University of Pennsylvania's
Moore School of Electrical Engineering.
This is the most widely used architecture today, and it is implemented by
the majority of microprocessors on the market.
In this architecture, all the elements of the computer are
interconnected by a single system of three buses: the data,
address, and control buses.
The Von Neumann Architecture
Model for designing and building computers,
based on the following three characteristics:
1) The computer consists of four main sub-systems:
Memory
ALU (Arithmetic/Logic Unit)
Control Unit
Input/output System (I/O)
2) Program is stored in memory during execution.
3) Program instructions are executed sequentially.
The Von Neumann Architecture
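[Figure: block diagram of the Von Neumann architecture. A bus connects three blocks: Memory (stores data and program); Processor (CPU), made up of the Control Unit (executes the program) and the ALU (does the arithmetic/logic operations requested by the program); and Input-Output (communicates with the "outside world", e.g. screen, keyboard, storage devices, ...).]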
Von Neumann Architecture
The Data Bus:
Transports data between the CPU and its peripherals.
It is bi-directional.
The CPU can read or write data in the peripherals.
The Address Bus:
The CPU uses the address bus to indicate which peripherals it
wants to access, and within each peripheral which specific register.
The address bus is unidirectional.
The CPU always writes the address, which is read by the
peripherals.
The Control Bus:
This bus carries signals that are used to manage and
synchronize the exchanges between the CPU and its peripherals.
Special lines of the control bus, such as interrupt or wait-state
signals, carry information from the peripherals to the CPU.
Memory Subsystem
Memory, also called RAM (Random Access
Memory)
Consists of many memory cells (storage units) of a fixed
size. Each cell has an address associated with it: 0, 1, ...
All accesses to memory are to a specified address.
A cell is the minimum unit of access (fetch/store a
complete cell).
The time it takes to fetch/store a cell is the same for all
cells.
When the computer is running, both
Program
Data (variables)
are stored in the memory.
Memory Size / Speed
Typical memory in a personal computer (PC):
64MB - 256MB
Memory sizes:
Kilobyte (KB) = 2^10 = 1,024 bytes ~ 1 thousand
Megabyte (MB) = 2^20 = 1,048,576 bytes ~ 1 million
Gigabyte (GB) = 2^30 = 1,073,741,824 bytes ~ 1 billion
Memory Access Time (read from/ write to memory)
50-75 nanoseconds (1 nsec. = 0.000000001 sec.)
RAM is
volatile (can only store when power is on)
relatively expensive.
Operations on Memory
Fetch (address):
Fetch a copy of the content of memory cell with the
specified address.
Non-destructive, copies value in memory cell.
Store (address, value):
Store the specified value into the memory cell specified by
address.
Destructive, overwrites the previous value of the memory
cell.
The memory system is interfaced via:
Memory Address Register (MAR)
Memory Data Register (MDR)
Fetch/Store signal
Structure of the Memory Subsystem
Fetch(address)
Load address into MAR.
Decode the address in MAR.
Copy the content of memory cell
with specified address into
MDR.
Store(address, value)
Load the address into MAR.
Load the value into MDR.
Decode the address in MAR
Copy the content of MDR into
memory cell with the specified
address.
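[Figure: structure of the memory subsystem, showing the MAR, the MDR, the memory decoder circuit, the memory cells, and the Fetch/Store controller driven by the F/S signal.]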
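The Fetch and Store sequences above can be sketched in Python. This is a minimal illustrative model, not a description of real hardware: the class name `MemorySubsystem` and its method names are assumptions made for this example, and address decoding is reduced to a list index.

```python
class MemorySubsystem:
    """Toy model of a memory accessed through MAR and MDR (hypothetical names)."""

    def __init__(self, size):
        self.cells = [0] * size   # every cell has an address 0..size-1
        self.mar = 0              # Memory Address Register
        self.mdr = 0              # Memory Data Register

    def fetch(self, address):
        self.mar = address                 # load the address into MAR
        self.mdr = self.cells[self.mar]    # decode MAR, copy the cell into MDR
        return self.mdr                    # non-destructive: the cell is unchanged

    def store(self, address, value):
        self.mar = address                 # load the address into MAR
        self.mdr = value                   # load the value into MDR
        self.cells[self.mar] = self.mdr    # decode MAR, copy MDR into the cell


mem = MemorySubsystem(256)
mem.store(99, 7)
print(mem.fetch(99))  # 7
print(mem.fetch(99))  # 7 again, because Fetch only copies the value
```

The second fetch returns the same value because Fetch copies the cell, while a second Store to address 99 would overwrite it.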
Input/output Subsystem
Handles devices that allow the computer system to:
Communicate and interact with the outside world
Screen, keyboard, printer, ...
Store information (mass-storage)
Hard-drives, floppies, CD, tapes, ...
Mass-Storage Device Access Methods:
Direct Access Storage Devices (DASDs)
Hard-drives, floppy-disks, CD-ROMs, ...
Sequential Access Storage Devices (SASDs)
Tapes (for example, used as backup devices)
The ALU Subsystem
The ALU (Arithmetic/Logic Unit) performs
mathematical operations (+, -, x, /, ...)
logic operations (=, <, >, and, or, not, ...)
In today's computers, the ALU is integrated into the CPU.
Consists of:
Circuits to do the arithmetic/logic operations.
Registers (fast storage units) to store intermediate
computational results.
Bus that connects the two.
Structure of the ALU
Registers:
Very fast local memory cells,
that store operands of
operations and intermediate
results.
CCR (condition code register),
a special purpose register that
stores the result of <, = , >
operations.
ALU circuitry:
Contains an array of circuits to
do mathematical/logic
operations.
Bus: Data path interconnecting
the registers to the ALU circuitry.
[Figure: structure of the ALU, showing registers R0, R1, R2, ..., Rn, the ALU circuitry, and the condition code bits GT, EQ, and LT.]
The Control Unit
Program is stored in memory
as machine language instructions, in binary.
The task of the control unit is to execute programs
by repeatedly:
Fetch from memory the next instruction to be
executed.
Decode it, that is, determine what is to be done.
Execute it by issuing the appropriate signals to the
ALU, memory, and I/O subsystems.
Continues until the HALT instruction.
Machine Language Instructions
A machine language instruction consists of:
Operation code, telling which operation to perform
Address field(s), telling the memory addresses of the
values on which the operation works.
Example: ADD X, Y (Add content of memory locations X
and Y, and store back in memory location Y).

Assume: opcode for ADD is 9, and addresses X=99, Y=100
00001001          0000000001100011      0000000001100100
Opcode (8 bits)   Address 1 (16 bits)   Address 2 (16 bits)
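The encoding above can be reproduced with a short Python sketch. The function name `encode` is an assumption made for this illustration; the field widths (8, 16, and 16 bits) and the values 9, 99, and 100 come from the example.

```python
def encode(opcode, addr1, addr2):
    """Pack an 8-bit opcode and two 16-bit addresses into a bit string."""
    if not 0 <= opcode < 2**8:
        raise ValueError("opcode must fit in 8 bits")
    for addr in (addr1, addr2):
        if not 0 <= addr < 2**16:
            raise ValueError("address must fit in 16 bits")
    # Zero-padded binary fields, separated by spaces for readability.
    return f"{opcode:08b} {addr1:016b} {addr2:016b}"


# ADD X, Y with opcode 9, X = 99, Y = 100
print(encode(9, 99, 100))
# 00001001 0000000001100011 0000000001100100
```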
Instruction Set Design
Two different approaches:

Reduced Instruction Set Computers (RISC)
Instruction set as small and simple as possible.
Minimizes amount of circuitry --> faster computers

Complex Instruction Set Computers (CISC)
More instructions, many very complex
Each instruction can do more work, but requires more
circuitry.
Structure of the Control Unit
PC (Program Counter):
stores the address of next instruction to fetch
IR (Instruction Register):
stores the instruction fetched from memory
Instruction Decoder:
Decodes instruction and activates necessary circuitry
[Figure: structure of the control unit, showing the PC with its +1 incrementer, the IR, and the instruction decoder.]
Von Neumann Architecture
How does this all work together?
Program Execution:

PC is set to the address where the first program
instruction is stored in memory.

Repeat until HALT instruction or fatal error
Fetch instruction
Decode instruction
Execute instruction
End of loop

Program Execution (cont.)
Fetch phase
PC --> MAR (put address in PC into MAR)
Fetch signal (signal memory to fetch value into MDR)
MDR --> IR (move value to Instruction Register)
PC + 1 --> PC (Increase address in program counter)
Decode Phase
IR --> Instruction decoder (decode instruction in IR)
Instruction decoder will then generate the signals to
activate the circuitry to carry out the instruction.
Program Execution (cont.)
Execute Phase
Differs from one instruction to the next.
Example:
LOAD X (load value in addr. X into register)
IR_address --> MAR
Fetch signal
MDR --> R
ADD X
left as an exercise
Instruction Set for Our Von Neumann Machine
Opcode  Operation    Meaning
0000    LOAD X       CON(X) --> R
0001    STORE X      R --> CON(X)
0010    CLEAR X      0 --> CON(X)
0011    ADD X        R + CON(X) --> R
0100    INCREMENT X  CON(X) + 1 --> CON(X)
0101    SUBTRACT X   R - CON(X) --> R
0110    DECREMENT X  CON(X) - 1 --> CON(X)
0111    COMPARE X    If CON(X) > R then GT = 1 else 0
                     If CON(X) = R then EQ = 1 else 0
                     If CON(X) < R then LT = 1 else 0
1000    JUMP X       Get next instruction from memory location X
1001    JUMPGT X     Get next instruction from memory loc. X if GT=1
...     JUMPxx X     xx = LT / EQ / NEQ
1101    IN X         Input an integer value and store in X
1110    OUT X        Output, in decimal notation, content of mem. loc. X
1111    HALT         Stop program execution
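To show how the fetch, decode, and execute cycle drives this instruction set, here is a small interpreter sketch in Python. Several things are assumptions made for illustration: the function name `run`, the use of a dictionary as memory, instructions stored as `(mnemonic, address)` tuples rather than binary words, and the mnemonics `JUMPEQ`, `JUMPLT`, and `JUMPNEQ` as the expansion of JUMPxx. Mnemonics are used instead of binary opcodes because the table does not list every opcode.

```python
def run(memory, inputs=()):
    """Execute the program stored in `memory` until HALT; return the outputs."""
    pc, r = 0, 0                  # program counter and the single register R
    gt = eq = lt = False          # condition codes set by COMPARE
    inputs, outputs = list(inputs), []
    while True:
        op, x = memory[pc]        # fetch the instruction at address PC
        pc += 1                   # PC + 1 --> PC
        if op == "LOAD":          # decode and execute
            r = memory[x]
        elif op == "STORE":
            memory[x] = r
        elif op == "CLEAR":
            memory[x] = 0
        elif op == "ADD":
            r = r + memory[x]
        elif op == "INCREMENT":
            memory[x] += 1
        elif op == "SUBTRACT":
            r = r - memory[x]
        elif op == "DECREMENT":
            memory[x] -= 1
        elif op == "COMPARE":
            gt, eq, lt = memory[x] > r, memory[x] == r, memory[x] < r
        elif op == "JUMP":
            pc = x
        elif op in ("JUMPGT", "JUMPEQ", "JUMPLT", "JUMPNEQ"):
            flags = {"JUMPGT": gt, "JUMPEQ": eq, "JUMPLT": lt, "JUMPNEQ": not eq}
            if flags[op]:
                pc = x
        elif op == "IN":
            memory[x] = inputs.pop(0)
        elif op == "OUT":
            outputs.append(memory[x])
        elif op == "HALT":
            return outputs
        else:
            raise ValueError(f"unknown operation {op!r}")


# Program and data share one memory: instructions at 0..6, data at 10..12.
program = {
    0: ("IN", 10),
    1: ("IN", 11),
    2: ("LOAD", 10),
    3: ("ADD", 11),
    4: ("STORE", 12),
    5: ("OUT", 12),
    6: ("HALT", None),
    10: 0, 11: 0, 12: 0,
}
print(run(program, inputs=[3, 4]))  # [7]
```

Keeping instructions and data in the same dictionary mirrors the stored-program idea: both live in one memory and are reached by address.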
HARVARD ARCHITECTURE
A DSP chip is a microprocessor
specially designed for DSP
applications.
The Harvard architecture allows
multiple simultaneous memory reads.
The architecture is optimized to provide
rapid processing of discrete-time
signals, e.g. Multiply and
Accumulate (MAC) in one cycle.
[Figure: Harvard architecture block diagram, showing program memory on the program memory data bus, data memory on the data memory data bus, two multiplexers, the ALU, and the accumulator.]
The Harvard architecture, as the name implies, was
developed at Harvard University.
In contrast to the Von Neumann architecture, it uses two
separate bus systems to transport the instruction codes
and the data being processed by the CPU.
The Program Bus System: Is used exclusively to transport
instruction codes from the program memory to the CPU
during the fetch cycle.
The Data Bus System: Is used exclusively to transport
data from/to the CPU, to/from the memory and
peripherals.
PROS AND CONS OF EACH ARCHITECTURE
The Harvard architecture (HA) is capable of simultaneously reading an
instruction code and reading or writing a memory or peripheral as
part of the execution of the previous instruction.
This gives it a speed advantage over the Von Neumann
architecture.
The HA is also safer, since it is not possible for the CPU to
mistakenly write codes into the program memory and
therefore corrupt the code while it is executing.
HA is less flexible. It needs two independent memory
banks (one for program and another one for data). These
two resources are not interchangeable.
For an embedded system that always runs the same
application, the Harvard architecture is most suitable, because
its memory needs can be easily anticipated.
PROS AND CONS OF EACH ARCHITECTURE
A computer system, however, may run a wide variety of
applications, requiring large data and program memories.
The Von Neumann architecture is better suited for this
type of situation, because program and data memories are
interchangeable, which leads to better usage of the
memory resources.
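The speed advantage can be illustrated with a deliberately idealized cycle count in Python. The model and the function names are assumptions made for this sketch: every instruction needs one instruction fetch and one data access, each access takes one bus cycle, and there are no stalls or caches. Real processors are more complicated.

```python
def von_neumann_cycles(n_instructions):
    """One shared bus: instruction fetches and data accesses take turns."""
    return 2 * n_instructions


def harvard_cycles(n_instructions):
    """Two buses: the data access of one instruction overlaps the next fetch."""
    if n_instructions == 0:
        return 0
    # One cycle for the first fetch, then one cycle per instruction, in which
    # its data access runs alongside the fetch of the following instruction.
    return n_instructions + 1


n = 1000
print(von_neumann_cycles(n))  # 2000
print(harvard_cycles(n))      # 1001
```

Under these assumptions the Harvard machine approaches twice the throughput for long programs, which is the overlap of fetch and data access described above.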
The SHARC
Super Harvard Architecture Computer

The SHARC
Developed by Analog Devices.
Optimized for demanding DSP and imaging
applications.
32 Bit floating point, with 40 bit extended floating
point capabilities.
Large on-chip memory.
Ideal for scalable multi-processing applications.
Super Harvard Architecture
Many processors employ a Harvard Architecture by
having two separate memories or caches integrated
into the processor chip.

The SHARC is unique in that its internal memory
is capable of holding a large program as well as a
large amount of data. This is what makes it
SUPER!
SHARC DSP
The SHARC incorporates features aimed at
optimizing the computational loops of DSP algorithms.
High-Speed Floating Point Capability.
Extended Floating Point.
These features are DSP specific, meaning that
performance may be less than optimal when they are
applied to a non-DSP application.
Floating Point and Extended Floating Point
The SHARC supports Floating, Extended-Floating
and Non-Floating point operations.
No additional clock cycles for floating point
computations.
Data is automatically truncated and zero-padded
when moved between 32-bit memory and internal
registers.
Excellent signal to noise ratio.
SHARC's Internal Memory
Makes SHARC unique.
Size
Allows many complex functions to be performed on-chip,
eliminating the need to move data between
internal and external memory.
Memory size is significantly larger than most other
high speed computational devices.
Dual-block, Dual-port
Optimizes the Harvard Architecture by allowing the
fetch of instructions while performing data memory
accesses.

Multiply and Accumulate Instructions on the
SHARC
Like most DSPs, the SHARC is able to compute a
product and add the product to a running total in a
single clock cycle.
What makes the SHARC's instructions "super" is that it can
multiply and accumulate while adding, subtracting,
or averaging data in two other registers.
These instructions give the SHARC its 120
megaflop rating.
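The multiply-and-accumulate pattern is easiest to see in a FIR filter output, which is a sum of products. The Python sketch below only illustrates the arithmetic: the function name `fir_output` and the sample values are assumptions, and each loop iteration here costs many machine cycles, whereas the text states that a DSP completes each multiply-accumulate in a single clock cycle.

```python
def fir_output(coefficients, samples):
    """Compute one FIR filter output as a sum of products (one MAC per tap)."""
    if len(coefficients) != len(samples):
        raise ValueError("need one sample per coefficient")
    acc = 0.0
    for c, x in zip(coefficients, samples):
        acc += c * x   # multiply, then accumulate into the running total
    return acc


print(fir_output([0.5, 0.25, 0.25], [4.0, 8.0, 12.0]))  # 7.0
```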

DAGs on the SHARC
Data Address Generators (DAGs) are integer computation
units that manage the indexing of registers.
They allow the SHARC to fetch a value and update
the index value.
If the updated value exceeds a limit, the DAG
adjusts the index so that it wraps.
This occurs in the same clock cycle as the read or
write.
DAG Capabilities
Circular Buffering
Rather than actually moving data in and out of a
vector, circular buffers are used.
By updating the index modulo the buffer length, the oldest
entry can be conveniently replaced by the newest entry.
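A software sketch of circular buffering in Python follows. The class `CircularBuffer` and its method names are assumptions made for this illustration. On the SHARC the wrap-around is done by the DAG hardware in the same cycle as the access, whereas here it is an explicit modulo operation.

```python
class CircularBuffer:
    """Fixed-length buffer in which the newest entry overwrites the oldest."""

    def __init__(self, length):
        self.data = [0] * length
        self.index = 0   # position of the oldest entry, which is written next

    def push(self, value):
        self.data[self.index] = value                    # replace the oldest entry
        self.index = (self.index + 1) % len(self.data)   # update the index, wrapping

    def newest_first(self):
        """Return the entries ordered from newest to oldest."""
        n = len(self.data)
        return [self.data[(self.index - 1 - k) % n] for k in range(n)]


buf = CircularBuffer(3)
for sample in (1, 2, 3, 4):
    buf.push(sample)
print(buf.newest_first())  # [4, 3, 2]
```

No element is ever shifted: only one cell and one index change per new sample, regardless of the buffer length.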

Bit Reverse Addressing
The bit pattern of a vector index is reversed.
Done automatically by the SHARC.
Required for Fast Fourier Transform (FFT), which is
often critical to DSP applications.
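Bit-reversed indexing can be sketched in a few lines of Python. The function name `bit_reverse` is an assumption for this example; the SHARC produces these addresses in hardware, while this sketch computes them in software for an 8-point transform (3 index bits).

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit of index to result
        index >>= 1
    return result


# Access order for an 8-point FFT: index 1 (001) maps to 4 (100), and so on.
print([bit_reverse(i, 3) for i in range(8)])
# [0, 4, 2, 6, 1, 5, 3, 7]
```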
SHARC DSP
What Makes the SHARC unique?
It also has some features not directly related to
optimizing numeric computations.
Pipelining
Handling Branches.

Why has this not emerged sooner?
Technology has only recently become available to
make it economical to integrate general single
computing devices.
SHARC's Pipeline
3 stages
1 Instruction Fetch
2 Decode
3 Execution
It takes three clock cycles for an instruction to
propagate through the pipeline.
Once the pipeline is full, the processor completes one
instruction per clock cycle, even though each instruction
requires three clock cycles.
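The timing of a three-stage pipeline can be checked with a small Python sketch. It assumes an ideal pipeline with no stalls or branches; the function names `pipeline_schedule` and `total_cycles` are hypothetical and used only for this illustration.

```python
def pipeline_schedule(n_instructions, stages=("Fetch", "Decode", "Execute")):
    """Map each clock cycle (starting at 1) to the (instruction, stage) pairs active in it."""
    schedule = {}
    for i in range(n_instructions):
        for s, stage in enumerate(stages):
            # Instruction i enters the first stage in cycle i + 1.
            schedule.setdefault(i + s + 1, []).append((i, stage))
    return schedule


def total_cycles(n_instructions, n_stages=3):
    """Cycles needed to complete all instructions in an ideal pipeline."""
    return 0 if n_instructions == 0 else n_instructions + n_stages - 1


for cycle, work in sorted(pipeline_schedule(5).items()):
    print(cycle, work)
print(total_cycles(5))  # 7
```

Five instructions finish in 7 cycles rather than 15, and from cycle 3 onward one instruction completes in every cycle.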
Multi-processing
SHARC is uniquely equipped for multi-processing.
Link ports provide very powerful multi-processing
capabilities.
Two main program models depending on the
application.
Adapts well to different multi-processing
architectures.
Multi-processing SHARC Links
SHARC has 6 link ports that can transport data at
rates up to 40 Mbytes/sec.

Links designed for point-to-point connections.

Data can be transmitted in either direction but not
both simultaneously.
Multi-processing Program Model MIMD
Multiple instruction, multiple data.

Good for applications that require multiple
instruction threads to execute concurrently.

Processors operate individually.
Each processor executes different code.

Typically used for image reconstruction and multi-
channel DSP.
Multi-processing Program Model SIMD
Single instruction, multiple data.
Works best when all processors execute identical
instruction sequences.
Does not require overhead for inter-processor
synchronization.
Typically used for synthetic aperture radar (SAR)
and automatic target recognition.
(SAR is a form of radar that is used to create
images of an object, such as a landscape.)
Summary of what makes the SHARC Super
It performs excellently for DSP applications.

Employs a Harvard Architecture with very
large on chip memory.

Respectable Megaflop rating.

Its multiprocessing capabilities.
VLIW (Very Long Instruction Word) Architecture
VLIW architectures execute multiple instructions/cycle
and use simple, regular instruction sets
More parallelism, higher performance.
Better compiler target.

Multiple independent instructions per cycle, packed into a
single large "instruction word" or "packet".

Large, uniform register sets.
Wide program and data buses.
Invented by Josh Fisher in his research group at Yale University in
the early 1980s.
Very Long Instruction Word (VLIW)
A technique for instruction-level parallelism by executing instructions
without dependencies (known at compile-time) in parallel.
Example of a single VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
[Figure: the VLIW instruction F=a+b | c=e/g | d=x&y | w=z*h is split across four processing units (PU), which take the operands a, b, e, g, x, y, z, h and produce F, c, d, w.]
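The dependency check that makes such a bundle legal can be sketched in Python. This is a simplified, conservative illustration rather than a real compiler algorithm: the function name `pack_bundles`, the `(destination, sources)` representation, and the greedy in-order packing are all assumptions made for this example.

```python
def pack_bundles(ops, width=4):
    """Greedily pack (dest, sources) operations, in program order, into bundles.

    An operation starts a new bundle if the current one is full or if it
    conflicts with an operation already in the current bundle.
    """
    bundles, current = [], []
    for dest, sources in ops:
        conflict = any(
            dest == d          # both write the same variable
            or d in sources    # reads a value written in this bundle
            or dest in s       # overwrites a value read in this bundle
            for d, s in current
        )
        if conflict or len(current) == width:
            bundles.append(current)
            current = []
        current.append((dest, sources))
    if current:
        bundles.append(current)
    return bundles


# The four operations from the example are independent: one bundle.
independent = [("F", ("a", "b")), ("c", ("e", "g")),
               ("d", ("x", "y")), ("w", ("z", "h"))]
print(len(pack_bundles(independent)))  # 1

# If c needs F, the two operations cannot share a bundle.
dependent = [("F", ("a", "b")), ("c", ("F", "g"))]
print(len(pack_bundles(dependent)))    # 2
```

In a VLIW design this analysis happens at compile time, which is why the hardware can issue the whole bundle without checking dependencies itself.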
VLIW Simplified Architecture Example
[Figure: program memory supplies a 256-bit instruction word consisting of 8 instructions of 32 bits each. The word feeds eight execution units, each unit executing one instruction.]
