
GENERAL-PURPOSE PROCESSOR

Introduction

- General-purpose processor
  - Processor designed for a variety of computation tasks
  - Low unit cost, in part because the manufacturer spreads NRE over a large number of units
    - Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
  - Carefully designed, since a higher NRE is acceptable
    - Can yield good performance, size and power
  - Low NRE cost for the embedded system designer, short time-to-market/prototype, high flexibility
    - User just writes software; no processor design

Basic Architecture

- Processor = control unit + datapath
  - Datapath: ALU and registers
  - Control unit: controller, control/status registers, PC, IR
- Similar to a single-purpose processor
- Key differences
  - The datapath is general
  - The control unit doesn't store the algorithm; the algorithm is programmed into memory

[Figure: processor block diagram; the control unit (controller, control/status registers, PC, IR) drives the datapath (ALU, registers), both connected to memory and I/O]

Datapath Operations

- Load: read a memory location into a register
- ALU operation: input certain registers through the ALU, store the result back in a register
- Store: write a register to a memory location

[Figure: the three operations on the block diagram, e.g. load the value 10 from memory into a register, pass it through the ALU (+1) to get 11, and store 11 back to memory]

Control Unit

- Control unit: configures the datapath operations
  - The sequence of desired operations (instructions) is stored in memory: the program
- Instruction cycle: broken into several sub-operations, each taking one clock cycle, e.g.:
  - Fetch: get the next instruction into the IR
  - Decode: determine what the instruction means
  - Fetch operands: move data from memory to a datapath register
  - Execute: move data through the ALU
  - Store results: write data from a register to memory

Instruction Cycles

[Figure: first instruction cycle, PC = 100. Program in memory: 100 load R0, M[500]; 101 inc R1, R0; 102 store M[501], R1. Data: M[500] = 10. The instruction passes through Fetch, Decode, Fetch ops, Exec. and Store results; afterwards IR holds "load R0, M[500]" and R0 holds 10.]


Instruction Cycles (continued)

[Figure: second cycle, PC = 101, IR = "inc R1, R0"; after Execute, R1 holds 11. Third cycle, PC = 102, IR = "store M[501], R1"; after Store results, M[501] holds 11.]
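To make the trace above concrete, here is a minimal C sketch of the fetch-decode-execute loop for this three-instruction program. It is only an illustrative model: the opcodes and memory contents mirror the figure, while the types and helper names are assumed.

    #include <stdio.h>
    #include <stdint.h>

    /* Minimal model of the example program:
       100: load R0, M[500]   101: inc R1, R0   102: store M[501], R1 */
    enum op { LOAD, INC, STORE, HALT };
    struct inst { enum op op; int a, b; };

    int main(void) {
        uint32_t mem[1024] = {0};
        mem[500] = 10;                             /* data, as in the figure         */
        struct inst prog[] = {
            {LOAD, 0, 500}, {INC, 1, 0}, {STORE, 501, 1}, {HALT, 0, 0}
        };
        uint32_t R[2] = {0, 0};
        int pc = 0;
        for (;;) {
            struct inst ir = prog[pc++];           /* fetch into IR, increment PC    */
            if (ir.op == HALT) break;              /* decode                         */
            switch (ir.op) {                       /* fetch operands, execute, store */
            case LOAD:  R[ir.a] = mem[ir.b];   break;
            case INC:   R[ir.a] = R[ir.b] + 1; break;
            case STORE: mem[ir.a] = R[ir.b];   break;
            default:                           break;
            }
        }
        printf("M[501] = %u\n", (unsigned)mem[501]);  /* prints 11, matching the trace */
        return 0;
    }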

Architectural Considerations

- N-bit processor: N-bit ALU, registers, buses, memory data interface
  - Embedded: 8-bit, 16-bit and 32-bit are common
  - Desktops/servers: 32-bit, even 64-bit
- PC size determines the address space (an N-bit PC can address 2^N locations)
- Clock frequency: the inverse of the clock period
  - The clock period must be longer than the longest register-to-register delay in the entire processor
  - Memory access is often the longest such delay
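As a hedged worked example (the numbers are assumed, not from the slide): if the longest register-to-register path is a memory access taking 20 ns, the clock period must be at least 20 ns, so the maximum clock frequency is f = 1/T = 1/(20 ns) = 50 MHz.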

ARM: Introduction

ARM RISC Design Philosophy

- Smaller die size
- Shorter development time
- Higher performance
  - Analogy: insects flap their wings faster than small birds
  - A complex instruction makes some high-level function more efficient, but it slows down the clock for all instructions


ARM Design Philosophy

- Instruction set designed for embedded systems
  - Reduce power consumption and extend battery life
  - High code density (embedded systems prefer slow, low-cost memory)
  - Low price
  - Reduce the die area taken by the embedded processor, leaving space for specialized processors
  - Hardware debug capability
- ARM is not a pure RISC architecture; it was designed primarily for embedded systems
  - Variable-cycle execution for certain instructions: multi-register load-store instructions are faster if memory access is sequential and give higher code density (a common operation at the start and end of a function)
  - Inline barrel shifting leads to more complex instructions but improves code density, e.g. ADD r0, r1, r1, LSL #1
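For instance, the shifted-operand add above folds a multiply-by-three into a single instruction. A rough C equivalent (illustrative only, not ARM source):

    #include <stdint.h>

    /* ADD r0, r1, r1, LSL #1  ->  r0 = r1 + (r1 << 1) = 3 * r1 */
    static inline uint32_t add_shifted(uint32_t r1) {
        return r1 + (r1 << 1);
    }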

ARM-Based Embedded Device

[Figure: block diagram of a typical ARM-based embedded device]

Instruction Set for Embedded Systems

- Thumb 16-bit instruction set
  - The core can execute both 16-bit and 32-bit instructions
  - Improved code density
- Conditional execution
  - Improved code density
  - Reduces branch instructions, e.g. (see the C sketch after this list):
      CMP   r1, r2
      SUBGT r1, r1, r2
      SUBLT r2, r2, r1
- Enhanced DSP instructions
  - Use one processor instead of the traditional combination of two
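A rough C equivalent of the conditionally executed pair above; the comparison selects exactly one of the two subtractions, and in the ARM version no branch is needed (illustrative sketch only):

    #include <stdint.h>

    /* CMP r1, r2 / SUBGT r1, r1, r2 / SUBLT r2, r2, r1
       (one step of the classic subtraction-based GCD loop) */
    static void cond_sub_step(int32_t *r1, int32_t *r2) {
        if (*r1 > *r2)      *r1 -= *r2;   /* SUBGT */
        else if (*r1 < *r2) *r2 -= *r1;   /* SUBLT */
    }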

Peripherals

- All ARM peripherals are memory mapped
- Interrupt controllers
  - Standard interrupt controller
    - Sends an interrupt signal to the processor core
    - Can be programmed to ignore or mask an individual device or set of devices
    - The interrupt handler reads a device bitmap register to determine which device requires servicing
  - VIC (vectored interrupt controller)
    - Assigns a priority and an ISR handler to each device
    - Depending on its type, either calls the standard interrupt handler or jumps directly to the specific device handler

ARM Datapath: Registers

- General-purpose registers R0-R15
  - R13: stack pointer
  - R14: link register
  - R15: program counter
  - R0-R13 are orthogonal
- Two program status registers: CPSR and SPSR


ARM's Visible Registers

- 37 registers in total
  - Usable in user mode: r0-r15 (r15 is the PC) and the CPSR
  - System modes only: r8_fiq-r14_fiq; r13_svc, r14_svc; r13_abt, r14_abt; r13_irq, r14_irq; r13_und, r14_und; SPSR_fiq, SPSR_svc, SPSR_abt, SPSR_irq, SPSR_und
- 20 registers are hidden from the program at any given time
  - Also called banked registers
  - Available only when the processor is in a certain mode
  - The mode can be changed by the program or on an exception: reset, interrupt request, fast interrupt request, software interrupt, data abort, prefetch abort or undefined instruction
- Modes: user, fiq, svc, abort, irq, undefined
- No SPSR access in user mode

CPSR

- Condition flags: N, Z, C, V
- Interrupt masks: I, F
- Thumb state: T; Jazelle state: J
- Mode bits: bits 4-0 select the processor mode
- Six privileged modes
  - Abort: entered on a failed attempt to access memory
  - Fast interrupt request (fiq)
  - Interrupt request (irq)
  - Supervisor: entered after reset; the kernel works in this mode
  - System: a special version of user mode with full read/write access to the CPSR
  - Undefined: entered when an undefined or unsupported instruction is executed

[CPSR layout: bits 31-28 = N Z C V, bits 27-8 unused, bits 7-6 = I F, bit 5 = T, bits 4-0 = mode]
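A small C sketch of the CPSR layout above; the macro names are mine, the bit positions are those shown in the slide:

    #include <stdint.h>

    /* Condition flags, bits 31-28 */
    #define CPSR_N  (1u << 31)    /* negative    */
    #define CPSR_Z  (1u << 30)    /* zero        */
    #define CPSR_C  (1u << 29)    /* carry       */
    #define CPSR_V  (1u << 28)    /* overflow    */
    /* Interrupt masks and state bits */
    #define CPSR_I  (1u << 7)     /* IRQ disable */
    #define CPSR_F  (1u << 6)     /* FIQ disable */
    #define CPSR_T  (1u << 5)     /* Thumb state */
    /* Mode field, bits 4-0 */
    #define CPSR_MODE_MASK  0x1Fu

    static inline unsigned cpsr_mode(uint32_t cpsr) { return cpsr & CPSR_MODE_MASK; }
    static inline int      cpsr_z(uint32_t cpsr)    { return (cpsr & CPSR_Z) != 0; }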

3-Stage Pipeline ARM Organization

- Fetch: the instruction is fetched from memory and placed in the instruction pipeline
- Decode: the instruction is decoded and the datapath control signals are prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
- Execute: the instruction owns the datapath; the register bank is read, an operand is shifted, and the ALU result is generated and written back into the destination register

ARM7 Core Diagram

[Figure: ARM7 core block diagram]


3-Stage Pipeline: Single-Cycle and Multi-Cycle Instructions

[Figures: pipeline behaviour for single-cycle and for multi-cycle instructions]

PC Behaviour

- R15 is incremented twice before an instruction executes, due to the pipeline
  - R15 = current instruction address + 8
  - The offset is +4 for Thumb instructions

To Get Higher Performance

- Tprog = (Ninst × CPI) / fclk, where Ninst is the number of instructions executed for the program (constant for a given program); a worked example follows this list
- Increase the clock rate
  - The clock rate is limited by the slowest pipeline stage
  - Decrease the logic complexity per stage
  - Increase the pipeline depth
- Improve the CPI
  - Instructions that take more than one cycle are re-implemented to occupy fewer cycles
  - Pipeline stalls are reduced
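A hedged worked example with assumed numbers (not from the slide): for Ninst = 10^6, CPI = 1.5 and fclk = 100 MHz, Tprog = (10^6 × 1.5) / (100 × 10^6) s = 15 ms; halving the CPI or doubling fclk halves Tprog.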

Typical Dynamic Instruction Usage

- Statistics for a print-preview program in an ARM instruction emulator:

  Instruction Type        Dynamic Usage
  Data movement           43%
  Control flow            23%
  Arithmetic operations   15%
  Comparisons             13%
  Logical operations       5%
  Other                    1%

Memory Bottleneck

- Von Neumann bottleneck
  - A single memory for instructions and data
  - Performance is limited by the available memory bandwidth
  - A 3-stage ARM core accesses memory on (almost) every clock cycle
- Higher-performance ARM cores therefore use a Harvard architecture


The 5-Stage Pipeline

- Fetch: the instruction is fetched and placed in the instruction pipeline
- Decode: the instruction is decoded and the register operands are read from the register file
- Execute: an operand is shifted and the ALU result is generated; for loads and stores the memory address is computed
- Buffer/Data: data memory is accessed if required; otherwise the ALU result is simply buffered
- Write-back: the results are written back to the register file

Data Forwarding

- Read-after-write pipeline hazard
  - An instruction needs the result of one of its predecessors before that result has returned to the register file, e.g.:
      ADD r1, r2, r3
      ADD r4, r5, r1
- Data forwarding is used to eliminate the stall
- In the following case, even with forwarding, a pipeline stall cannot be avoided:
      LDR rN, [..]     ; load rN from somewhere
      ADD r2, r1, rN   ; and use it immediately
  - The processor cannot avoid a one-cycle stall
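A hedged C sketch of the forwarding check implied above; the structure and field names are illustrative, not taken from any real ARM implementation:

    #include <stdint.h>

    /* Result latched at the end of the Execute and Buffer/Data stages. */
    typedef struct { int writes_reg; int dest; uint32_t value; } stage_result;

    /* Operand read in Decode: prefer the newest in-flight value over the register
       file, which removes the read-after-write stall for ALU results. A load's
       value only exists after Buffer/Data, hence the one-cycle stall noted above. */
    static uint32_t read_operand(int src, const uint32_t regfile[16],
                                 stage_result ex, stage_result buf) {
        if (ex.writes_reg && ex.dest == src)   return ex.value;   /* forward from Execute     */
        if (buf.writes_reg && buf.dest == src) return buf.value;  /* forward from Buffer/Data */
        return regfile[src];
    }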

Data Hazards

- Handling a data hazard in software
  - Solution: encourage the compiler not to place a dependent instruction immediately after a load instruction
- Complex addressing modes
  - Require more complex hardware to decode and execute
  - Cause the pipeline to stall
  - A complex addressing mode does not necessarily lead to faster execution, e.g.
      Load (X(R1)), R2
    versus
      Add  #X, R1, R2
      Load (R2), R2
      Load (R2), R2
- Side effects
  - A side effect occurs when a location other than the one explicitly named in the instruction as the destination operand is affected

Pipelining Features

- Access to an operand does not require more than one access to memory
- Only load and store instructions access memory
- The addressing modes used do not have side effects
  - Register, register-indirect and index modes satisfy these conditions
- Condition codes
  - Flags are modified by as few instructions as possible
  - The compiler should be able to specify in which instructions of the program the flags are affected and in which they are not

Complex Addressing Mode

- Load (X(R1)), R2

[Figure (a): clock-cycle timing for the complex mode; the Load spends three execute cycles forming X + [R1], then [X + [R1]], then [[X + [R1]]] before writing back, and the next instruction is delayed]

Simple Addressing Mode

- Add  #X, R1, R2
- Load (R2), R2
- Load (R2), R2

[Figure (b): the same effect with three simple instructions, each flowing through F, D, E, W; forwarding keeps the pipeline moving]
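In C terms both fragments compute the same doubly indirect load; the sketch below uses a hypothetical word-addressed memory array mem:

    #include <stdint.h>

    /* Complex mode:  Load (X(R1)), R2  =>  R2 = M[ M[X + R1] ]  (one long instruction)
       Simple modes:  Add #X,R1,R2; Load (R2),R2; Load (R2),R2   (three short ones)     */
    static uint32_t load_double_indirect(const uint32_t mem[], uint32_t X, uint32_t r1) {
        uint32_t r2 = X + r1;   /* Add  #X, R1, R2 */
        r2 = mem[r2];           /* Load (R2), R2   */
        r2 = mem[r2];           /* Load (R2), R2   */
        return r2;
    }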


ARM 5-Stage Pipeline

[Figure: ARM 5-stage pipeline organization]

Instruction Hazards: Overview

- Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls
  - Cache miss
  - Branch

Unconditional Branches

[Figure 8.8: an idle cycle caused by a branch instruction. I1 and I2 (the branch) are fetched and executed; the already-fetched I3 is discarded, leaving the execution unit idle for one cycle until the branch target Ik arrives.]

Branch Timing

- Branch penalty
- Reducing the penalty

[Figure 8.9: branch timing. (a) Branch address computed in the Execute stage: the two instructions fetched after the branch (I3, I4) are discarded, a two-cycle penalty. (b) Branch address computed in the Decode stage: only I3 is discarded, a one-cycle penalty.]

Instruction Queue and Prefetching

[Figure 8.10: use of an instruction queue in the hardware organization of Figure 8.2b. The fetch unit (F) fills an instruction queue; the dispatch/decode unit (D) draws instructions from it; E executes and W writes results.]

Branch Timing with an Instruction Queue

[Figure 8.11: branch timing in the presence of an instruction queue; the branch target address is computed in the D stage. While I1 spends extra cycles in the execute stage, the fetch unit keeps filling the queue (its length grows from 1 to 3), so the branch I5 and the discarded fetch of I6 cause no idle execution cycle: branch folding.]


Branch Folding

- Branch folding: executing the branch instruction concurrently with the execution of other instructions
- Branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue
- It is therefore desirable to keep the queue full most of the time, to ensure an adequate supply of instructions for processing; this can be achieved by increasing the rate at which the fetch unit reads instructions from the cache
- Having an instruction queue is also beneficial in dealing with cache misses

Conditional Branches

- A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction
- The decision to branch cannot be made until the execution of that instruction has been completed
- Branch instructions represent about 20% of the dynamic instruction count of most programs

Delayed Branch

- The instructions in the delay slots are always fetched; therefore we would like to arrange for them to be fully executed whether or not the branch is taken
- The objective is to place useful instructions in these slots
- The effectiveness of the delayed-branch approach depends on how often it is possible to reorder instructions

[Figure 8.12: reordering of instructions for a delayed branch]
  (a) Original program loop          (b) Reordered instructions
      LOOP  Shift_left  R1               LOOP  Decrement   R2
            Decrement   R2                     Branch=0    LOOP
            Branch=0    LOOP                   Shift_left  R1
      NEXT  Add         R1,R3          NEXT  Add         R1,R3

[Figure 8.13: execution timing showing the delay slot being filled during the last two passes through the loop of Figure 8.12; the Shift_left executes in the delay slot after the branch, and on the final pass (branch not taken) the Add follows it]

Branch Prediction

- Predict whether or not a particular branch will be taken
- Simplest form: assume the branch will not take place and continue to fetch instructions in sequential address order
- Until the branch is evaluated, instruction execution along the predicted path must be done on a speculative basis
  - Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence
  - Care must be taken so that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed


Incorrectly Predicted Branch

[Figure 8.14: timing when a branch decision has been incorrectly predicted as not taken. I1 (Compare) produces the condition; I2 (Branch>0) is decoded and its prediction made in D2/P2; the speculatively fetched I3 and I4 are discarded when the branch resolves, and fetching restarts at the target Ik.]

Branch Prediction (continued)

- Better performance can be achieved if we arrange for some branch instructions to be predicted as taken and others as not taken
  - Use hardware to observe whether the target address is lower or higher than that of the branch instruction
  - Let the compiler include a branch prediction bit
- So far the branch prediction decision is the same every time a given instruction is executed: static branch prediction
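A hedged back-of-the-envelope example (the 20% branch frequency comes from the Conditional Branches slide; the taken rate and the penalty are assumed): if 20% of instructions are branches, 60% of them are taken, and each taken branch under static predict-not-taken costs 2 extra cycles, the added cost is 0.20 × 0.60 × 2 = 0.24 cycles per instruction.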

Superscalar Operation

- Maximum throughput of a simple pipeline: one instruction per clock cycle
- Superscalar: multiple processing units, so more than one instruction can complete per cycle

[Figure 8.19: a processor with two execution units. The instruction fetch unit (F) fills an instruction queue; the dispatch unit issues instructions to a floating-point unit and an integer unit; W writes results.]

Timing

[Figure 8.20: an example of instruction execution flow in the processor of Figure 8.19, assuming no hazards are encountered. The floating-point instructions I1 (Fadd) and I3 (Fsub) each spend three cycles in execution, while the integer instructions I2 (Add) and I4 (Sub) complete their single execute cycle in parallel.]

ALU

- Logic operations: OR, AND, XOR, NOT, NAND, NOR, etc.
  - No dependencies among bits, so every result bit can be calculated in parallel
- Arithmetic operations: ADD, SUB, INC, DEC, MUL, DIVIDE
  - Involve a long carry-propagation chain
  - A major source of delay; require optimization
- The suitability of an algorithm is judged by
  - Resource usage: physical space on the silicon die
  - Turnaround time


The Original ARM1 Ripple-Carry Adder Circuit

[Figure: one bit of the ARM1 ripple-carry adder, producing sum and Cout from A, B and Cin]

The ARM2 4-Bit Carry Look-Ahead Scheme

[Figure: a 4-bit adder block takes A[3:0], B[3:0] and Cin[0], produces sum[3:0] plus group generate (G) and propagate (P) signals used to form Cout[3]]

The ARM2 ALU Logic for One Result Bit

[Figure: one ALU bit slice; the function-select lines fs5-fs0 control the NA/NB operand buses, the G/P carry logic and the result driven onto the ALU bus]

ARM2 ALU Function Codes

  fs5 fs4 fs3 fs2 fs1 fs0   ALU output
   0   0   0   1   0   0    A and B
   0   0   1   0   0   0    A and not B
   0   0   1   0   0   1    A xor B
   0   1   1   0   0   1    A plus not B plus carry
   0   1   0   1   1   0    A plus B plus carry
   1   1   0   1   1   0    not A plus B plus carry
   0   0   0   0   0   0    A
   0   0   0   0   0   1    A or B
   0   0   0   1   0   1    B
   0   0   1   0   1   0    not B
   0   0   1   1   0   0    zero
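A hedged C sketch of the two adder styles above (a bit-level behavioural model, not the ARM transistor-level circuits): ripple carry lets the carry travel one bit at a time, while a 4-bit carry look-ahead group computes generate/propagate signals so the group carry-out does not wait for the internal ripple.

    #include <stdint.h>

    /* 4-bit ripple-carry addition: the carry ripples through each bit position. */
    static void ripple_add4(uint8_t a, uint8_t b, int cin, uint8_t *sum, int *cout) {
        int carry = cin;
        *sum = 0;
        for (int i = 0; i < 4; i++) {
            int ai = (a >> i) & 1, bi = (b >> i) & 1;
            *sum |= (uint8_t)((ai ^ bi ^ carry) << i);
            carry = (ai & bi) | (ai & carry) | (bi & carry);  /* ripple to next bit */
        }
        *cout = carry;
    }

    /* Group generate/propagate for a 4-bit block: Cout = G | (P & Cin). */
    static void group_gp(uint8_t a, uint8_t b, int *G, int *P) {
        int g[4], p[4];
        for (int i = 0; i < 4; i++) {
            g[i] = ((a >> i) & 1) & ((b >> i) & 1);
            p[i] = ((a >> i) & 1) | ((b >> i) & 1);
        }
        *G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0]);
        *P = p[3] & p[2] & p[1] & p[0];
    }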

The ARM6 Carry-Select Adder Scheme

[Figure: each 4-bit slice, from a,b[3:0] up to a,b[31:28], is added twice, once assuming carry-in 0 (s) and once assuming carry-in 1 (s+1); multiplexers driven by the real carries select sum[3:0], sum[7:4], sum[15:8] and sum[31:16]]

Conditional Sum Adder

- An extension of the carry-select adder
- Carry-select adder
  - One level: built from k/2-bit adders
  - Two levels: built from k/4-bit adders
  - Three levels: built from k/8-bit adders, etc.
- Assuming k is a power of two, we eventually reach the extreme of log2(k) levels using 1-bit adders
  - This is the conditional sum adder
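A hedged behavioural C sketch of the carry-select idea above: each 4-bit group is added twice, assuming carry-in 0 and carry-in 1, and the real carry merely selects between the precomputed results.

    #include <stdint.h>

    static uint32_t carry_select_add32(uint32_t a, uint32_t b, int cin) {
        uint32_t result = 0;
        int carry = cin;
        for (int g = 0; g < 32; g += 4) {
            uint32_t ag = (a >> g) & 0xF, bg = (b >> g) & 0xF;
            uint32_t s0 = ag + bg;                 /* group sum assuming carry-in = 0  */
            uint32_t s1 = ag + bg + 1;             /* group sum assuming carry-in = 1  */
            uint32_t sel = carry ? s1 : s0;        /* mux controlled by the real carry */
            result |= (sel & 0xF) << g;
            carry = (int)(sel >> 4) & 1;           /* group carry-out */
        }
        return result;
    }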


Conditional Sum: Example

[Figure: worked example of conditional-sum addition]

Conditional Sum Adder: Top-Level Block for One Bit Position

[Figure: each bit position produces sum and carry for both possible carry-ins; multiplexers combine the pairs level by level]

The ARM6 ALU Organization

[Figure: A and B operand latches feed XOR gates that can invert A or B; a function block selects between the logic functions and the adder; the result multiplexer drives the zero detect; the flags N, Z, C and V are produced alongside the result]

The Cross-Bar Switch Barrel Shifter Principle

[Figure: a 4x4 cross-bar of switches connects in[3:0] to out[3:0]; enabling one diagonal gives no shift, left 1-3 or right 1-3]

Shift Implementation

- For a left or right shift, one diagonal of the cross-bar is turned on
- The shifter operates in negative logic; precharging sets all outputs to logic 0
- For a rotate right, the right-shift diagonal is enabled together with the complementary left-shift diagonal
- An arithmetic shift uses sign extension rather than zero fill

Multiplier

- ARM includes hardware support for integer multiplication
- Older ARM cores include low-cost multiplication hardware
  - Supports only a 32-bit result multiply and multiply-accumulate
  - Uses the main datapath iteratively: the barrel shifter and ALU generate 2 bits of product in each cycle
  - Employs a modified Booth's algorithm to produce the 2-bit product
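A hedged behavioural C sketch of the four shift types handled by the barrel shifter (it models the results, not the cross-bar or precharge logic; shift amounts are assumed to be 0-31):

    #include <stdint.h>

    static uint32_t lsl(uint32_t x, unsigned n) { return x << n; }           /* logical shift left  */
    static uint32_t lsr(uint32_t x, unsigned n) { return x >> n; }           /* logical shift right */
    static uint32_t asr(uint32_t x, unsigned n) {
        return (uint32_t)((int32_t)x >> n);        /* sign extension rather than zero fill */
    }
    static uint32_t ror(uint32_t x, unsigned n) {
        n &= 31;                                   /* right shift combined with the        */
        return n ? (x >> n) | (x << (32 - n)) : x; /* complementary left shift             */
    }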


Multiplier

- Radix-2 multiplication
- Radix-4 multiplication
- Radix-2 Booth algorithm
- Radix-4 Booth algorithm

Modified Booth's Recoding

  x(i+1) x(i) x(i-1)   y(i+1) y(i)   z(i/2)   Explanation
    0     0     0         0    0        0     No string of 1s in sight
    0     0     1         0    1        1     End of string of 1s
    0     1     0         1   -1        1     Isolated 1
    0     1     1         1    0        2     End of string of 1s
    1     0     0        -1    0       -2     Beginning of string of 1s
    1     0     1        -1    1       -1     End one string, begin a new one
    1     1     0         0   -1       -1     Beginning of string of 1s
    1     1     1         0    0        0     Continuation of string of 1s

  (x(i+1) x(i) x(i-1) is the multiplier bit triple; y(i+1) y(i) are the recoded radix-2 digits; z(i/2) is the radix-4 digit.)

[Worked example: a 16-bit operand x, its radix-2 recoded version y, and its radix-4 recoded version z]

[Worked example of radix-4 Booth multiplication: multiplicand a = 0110, multiplier x = 1010 (radix-4 digits -1 and -2); the product p is accumulated in two steps, adding z0*a and then 4*z1*a]
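A hedged C sketch of the radix-4 (modified Booth) recoding tabulated above: each overlapping bit triple x(i+1) x(i) x(i-1) of the multiplier maps to a digit in {-2, -1, 0, 1, 2}. The function name and interface are mine.

    #include <stdint.h>

    /* Writes the radix-4 Booth digits of x (nbits wide) into digits[]
       and returns how many digits were produced. */
    static int booth_radix4_digits(uint32_t x, int nbits, int digits[]) {
        int count = 0;
        int x_prev = 0;                                /* x(-1) = 0                        */
        for (int i = 0; i < nbits; i += 2) {
            int xi  = (int)((x >> i) & 1);
            int xi1 = (i + 1 < nbits) ? (int)((x >> (i + 1)) & 1) : 0;
            digits[count++] = x_prev + xi - 2 * xi1;   /* reproduces the table row by row  */
            x_prev = xi1;                              /* becomes x(i-1) for the next pair */
        }
        return count;
    }

For the multiplier 1010 from the worked example this yields the digits -2 and -1 (least significant first), i.e. the value -6.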

High-Speed Multiplier

- Recent cores have high-performance multiplication hardware
  - Support a 64-bit result multiply and multiply-accumulate

Multiplier: Carry-Save Addition

- In multiplication, multiple partial products must be added; with ordinary 2-operand adders the time-consuming carry propagation is repeated several times (k operands need k-1 propagations)
- Techniques for lowering this penalty exist, e.g. carry-save addition
  - The carry propagates only in the last step; the other steps generate a partial sum and a sequence of carries
- A basic CSA accepts 3 n-bit operands and generates a 2n-bit result: an n-bit partial sum and an n-bit carry
- A second CSA accepts these two sequences and another input operand, and generates a new partial sum and carry
- A CSA thus reduces the number of operands to be added from 3 to 2 without carry propagation

CSA Basic Unit: the (3,2) Counter

- Simplest implementation: a full adder (FA) with three inputs x, y, z
  - x + y + z = 2c + s (s and c are the sum and carry outputs)
- The outputs are a weighted binary representation of the number of 1s among the inputs, hence the FA is called a (3,2) counter
- An n-bit CSA consists of n (3,2) counters in parallel, with no carry links between them
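A hedged C sketch of one carry-save addition step: three operands are reduced to a partial sum and a carry word with no carry propagation (each bit behaves as an independent (3,2) counter), and a single ordinary addition at the end yields x + y + z.

    #include <stdint.h>

    static void csa(uint32_t x, uint32_t y, uint32_t z,
                    uint32_t *sum, uint32_t *carry) {
        *sum   = x ^ y ^ z;                           /* bitwise sum, no ripple     */
        *carry = ((x & y) | (y & z) | (z & x)) << 1;  /* carries shifted into place */
    }
    /* Afterwards, x + y + z == *sum + *carry, where that final '+' is the one
       carry-propagating addition left at the end. */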


(a) Carry-Propagate vs (b) Carry-Save Addition

[Figure: (a) a chain of full adders in which each Cout feeds the next Cin (carry-propagate); (b) the same full adders with no carry links between them (carry-save)]

Cascaded CSA for Four 4-Bit Operands

- Two levels of 4-bit CSAs
- The upper (third) level is a 4-bit carry-propagating adder (CPA)

Wallace Tree

- A better organization of the CSAs, giving a faster operation time

ARM High-Speed Multiplier Organization

[Figure: the Rs register supplies 8 bits of the multiplier per cycle (shifted right); Rm and the rotating partial-sum and partial-carry registers feed an array of carry-save adders (with initialization for MLA); the final partial sum and partial carry are added in the ALU]
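A hedged C sketch of the cascaded-CSA structure above: four operands are reduced 3-to-2 twice by carry-save steps, and only the last level performs a carry-propagating addition.

    #include <stdint.h>

    static uint32_t add4_with_csa(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
        /* level 1: a + b + c -> partial sum s1 and carry word c1 (no ripple) */
        uint32_t s1 = a ^ b ^ c;
        uint32_t c1 = ((a & b) | (b & c) | (c & a)) << 1;
        /* level 2: s1 + c1 + d -> s2 and c2 */
        uint32_t s2 = s1 ^ c1 ^ d;
        uint32_t c2 = ((s1 & c1) | (c1 & d) | (d & s1)) << 1;
        /* final level: one carry-propagating adder (CPA) */
        return s2 + c2;
    }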
