Lectures
Instructor: Chen, Chang-jiu
1. Computer Abstractions and Technology
2. The Role of Performance
3. Instructions: Language of the Machine
4. Arithmetic for Computers
5. The Processor: Datapath and Control
6. Enhancing Performance with Pipelining
7. Large and Fast: Exploiting Memory Hierarchy
8. Interfacing Processors and Peripherals
9. Multiprocessors
Chapter 1
Computer Abstractions and Technology
1. Introduction
2. Below Your Program
3. Under the Cover
4. Integrated Circuits: Fueling Innovation
5. Real Stuff: Manufacturing Pentium Chips
6. Fallacies and Pitfalls
Introduction
Rapidly changing field:
vacuum tube -> transistor -> IC -> VLSI (see section 1.4)
doubling every 1.5 years: memory capacity, processor speed (due to advances in technology and organization)
Things you'll be learning:
  how computers work, a basic foundation
  how to analyze their performance (or how not to!)
What is a computer?
Components:
  input (mouse, keyboard)
  output (display, printer)
  memory (disk drives, DRAM, SRAM, CD)
  network
Our primary focus: the processor (datapath and control)
  implemented using millions of transistors
  impossible to understand by looking at each transistor
We need...
Abstraction
Delving into the depths reveals more information
An abstraction omits unneeded detail, helps us cope with complexity
High-level language program (in C)
    swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }
C compiler
Assembly language program (for MIPS)
    swap: muli $2, $5, 4
          add  $2, $4, $2
          lw   $15, 0($2)
          lw   $16, 4($2)
          sw   $16, 0($2)
          sw   $15, 4($2)
          jr   $31
What are some of the details that appear in these familiar abstractions?
Assembler
Modern instruction set architectures: 80x86/Pentium/K6, PowerPC, DEC Alpha, MIPS, SPARC, HP
Chapter 2
The Role of Performance
1. Introduction
2. Measuring Performance
3. Relating the Metrics
4. Choosing Programs to Evaluate Performance
5. Comparing and Summarizing Performance
6. Real Stuff: The SPEC95 Benchmarks and Performance of Recent Processors
7. Fallacies and Pitfalls
Performance
Measure, Report, and Summarize
Make intelligent choices
See through the marketing hype
Key to understanding underlying organizational motivation
Why is some hardware better than others for different programs? What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?)
[Table: airplanes compared by passenger capacity; surviving Passengers values: 101, 470, 132, 146]
Execution Time
Elapsed Time
  counts everything (disk and memory accesses, I/O, etc.)
  a useful number, but often not good for comparison purposes
CPU time
  doesn't count I/O or time spent running other programs
  can be broken up into system time and user time
Our focus: user CPU time
  time spent executing the lines of code that are "in" our program
Problem:
  machine A runs a program in 20 seconds
  machine B runs the same program in 25 seconds
Clock Cycles
Instead of reporting execution time in seconds, we often use cycles
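\[
\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}
\]
[Figure: clock ticks along a time axis]
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
A 200 MHz clock has a cycle time of
\[
\frac{1}{200 \times 10^{6}} \times 10^{9} = 5 \text{ nanoseconds}
\]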
So, to improve performance (everything else being equal) you can either
[Figure: instructions (1st, 3rd, 4th, 5th, 6th, ...) laid out along a time axis]
This assumption is incorrect: different instructions take different amounts of time on different machines.
Why? Hint: remember that these are machine instructions, not lines of C code.
Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)
Example
Our favorite program runs in 10 seconds on computer A, which has a 400 MHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?
Don't panic: we can easily work this out from basic principles.
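One way to work the example from basic principles is sketched below in Python. It only applies cycles = seconds x clock rate; the function and variable names are illustrative and are not taken from the slides.

```python
def required_clock_rate_hz(time_a_s, rate_a_hz, cycle_ratio, target_time_s):
    """Clock rate machine B needs to finish the program in target_time_s."""
    cycles_a = time_a_s * rate_a_hz      # cycles the program takes on A
    cycles_b = cycle_ratio * cycles_a    # B needs 1.2x as many cycles
    return cycles_b / target_time_s      # cycles per second required on B


# Machine A: 10 s at 400 MHz; machine B: 1.2x the cycles, target 6 s.
rate_b = required_clock_rate_hz(10, 400e6, 1.2, 6)
print(f"target clock rate: {rate_b / 1e6:.0f} MHz")
```

Under these numbers machine B must run at twice machine A's clock rate.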
Performance
Performance is determined by execution time
Do any of the other variables equal performance?
  # of cycles to execute program?
  # of instructions in program?
  # of cycles per second?
  average # of cycles per instruction?
  average # of instructions per second?
Common pitfall: thinking one of the variables is indicative of performance when it really isn't.
CPI Example
Suppose we have two implementations of the same instruction set architecture (ISA).
For some program,
  Machine A has a clock cycle time of 10 ns and a CPI of 2.0
  Machine B has a clock cycle time of 20 ns and a CPI of 1.2
What machine is faster for this program, and by how much?
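A short Python sketch of this comparison follows. It assumes CPU time = instruction count x CPI x cycle time; the instruction count `n` is a hypothetical value that cancels out because both machines implement the same ISA.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instructions x cycles per instruction x time per cycle."""
    return instruction_count * cpi * cycle_time_ns


n = 1_000_000                       # hypothetical instruction count
time_a = cpu_time_ns(n, 2.0, 10)    # machine A: CPI 2.0, 10 ns cycle
time_b = cpu_time_ns(n, 1.2, 20)    # machine B: CPI 1.2, 20 ns cycle
speedup_a_over_b = time_b / time_a  # > 1 means A is faster
print(f"A is {speedup_a_over_b:.2f} times faster than B")
```

Machine A takes 20 ns per instruction on average and machine B takes 24 ns, so A is faster by a factor of 1.2.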
If two machines have the same ISA which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
# of Instructions Example
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? How much? What is the CPI for each sequence?
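The two sequences can be compared with a few lines of Python. This is a minimal sketch that assumes total cycles is the sum of per-class counts times per-class cycle costs; the names `CYCLES_PER_CLASS` and `cycles_and_cpi` are illustrative.

```python
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def cycles_and_cpi(mix):
    """Return (total cycles, average CPI) for a {class: count} instruction mix."""
    instructions = sum(mix.values())
    cycles = sum(CYCLES_PER_CLASS[cls] * count for cls, count in mix.items())
    return cycles, cycles / instructions


seq1 = cycles_and_cpi({"A": 2, "B": 1, "C": 2})   # 5 instructions
seq2 = cycles_and_cpi({"A": 4, "B": 1, "C": 1})   # 6 instructions
print("sequence 1 (cycles, CPI):", seq1)
print("sequence 2 (cycles, CPI):", seq2)
```

The second sequence executes more instructions but needs fewer cycles (9 versus 10), so it is faster by a factor of 10/9, roughly 1.1.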
MIPS example
Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time?
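The following Python sketch evaluates both compilers. It assumes MIPS here means millions of instructions per second (instruction count / (execution time x 10^6)); the helper name `time_and_mips` is illustrative.

```python
CLOCK_HZ = 100e6                           # 100 MHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def time_and_mips(mix_millions):
    """Return (execution time in seconds, MIPS rating) for a mix given in millions."""
    instructions = sum(mix_millions.values()) * 1e6
    cycles = sum(CYCLES_PER_CLASS[c] * n for c, n in mix_millions.items()) * 1e6
    seconds = cycles / CLOCK_HZ
    mips = instructions / (seconds * 1e6)
    return seconds, mips


time1, mips1 = time_and_mips({"A": 5, "B": 1, "C": 1})
time2, mips2 = time_and_mips({"A": 10, "B": 1, "C": 1})
print(f"compiler 1: {time1:.2f} s, {mips1:.0f} MIPS")
print(f"compiler 2: {time2:.2f} s, {mips2:.0f} MIPS")
```

The second compiler's code earns the higher MIPS rating, yet the first compiler's code finishes sooner, which is the point of the example.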
Benchmarks
Performance best determined by running a real application
  Use programs typical of expected workload
  Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
Small benchmarks
  nice for architects and designers
  easy to standardize
  can be abused
SPEC (System Performance Evaluation Cooperative)
  companies have agreed on a set of real programs and inputs
  can still be abused (Intel's "other" bug)
  valuable indicator of performance (and compiler technology)
SPEC 89
Compiler enhancements and performance
[Figure: bar chart, vertical axis 0 to 800, one group of bars per benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv), comparing compilers]
SPEC 95
Benchmark   Description
go          Artificial intelligence; plays the game of Go
m88ksim     Motorola 88k chip simulator; runs test program
gcc         The Gnu C compiler generating SPARC code
compress    Compresses and decompresses file in memory
li          Lisp interpreter
ijpeg       Graphic compression and decompression
perl        Manipulates strings and prime numbers in the special-purpose programming language Perl
vortex      A database program
tomcatv     A mesh generation program
swim        Shallow water model with 513 x 513 grid
su2cor      Quantum physics; Monte Carlo simulation
hydro2d     Astrophysics; hydrodynamic Navier-Stokes equations
mgrid       Multigrid solver in 3-D potential field
applu       Parabolic/elliptic partial differential equations
turb3d      Simulates isotropic, homogeneous turbulence in a cube
apsi        Solves problems regarding temperature, wind velocity, and distribution of pollutant
fpppp       Quantum chemistry
wave5       Plasma physics; electromagnetic particle simulation
SPEC 95
Does doubling the clock rate double the performance? Can a machine with a slower clock rate have better performance?
[Figure: two plots, SPECint and SPECfp (vertical axis 0 to 10) versus clock rate (50 to 250 MHz), for the Pentium and Pentium Pro]
SPEC CPU2000
http://www.spec.org/cpu2000/docs/readme1st.html
CINT2000 contains eleven applications written in C and 1 in C++ (252.eon) that are used as benchmarks:

Name          Ref Time  Remarks
164.gzip      1400      Data compression utility
175.vpr       1400      FPGA circuit placement and routing
176.gcc       1100      C compiler
181.mcf       1800      Minimum cost network flow solver
186.crafty    1000      Chess program
197.parser    1800      Natural language processing
252.eon       1300      Ray tracing
253.perlbmk   1800      Perl
254.gap       1100      Computational group theory
255.vortex    1900      Object Oriented Database
256.bzip2     1500      Data compression utility
300.twolf     3000      Place and route simulator

CFP2000 contains 14 applications (6 Fortran-77, 4 Fortran-90 and 4 C) that are used as benchmarks:

Name          Ref Time  Remarks
168.wupwise   1600      Quantum chromodynamics
171.swim      3100      Shallow water modeling
172.mgrid     1800      Multi-grid solver in 3D potential field
173.applu     2100      Parabolic/elliptic partial differential equations
177.mesa      1400      3D Graphics library
178.galgel    2900      Fluid dynamics: analysis of oscillatory instability
179.art       2600      Neural network simulation; adaptive resonance theory
183.equake    1300      Finite element simulation; earthquake modeling
187.facerec   1900      Computer vision: recognizes faces
188.ammp      2200      Computational chemistry
189.lucas     2000      Number theory: primality testing
191.fma3d     2100      Finite element crash simulation
200.sixtrack  1100      Particle accelerator model
301.apsi      2600      Solves problems regarding temperature, wind, velocity and distribution of pollutants
Amdahl's Law
\[
\text{Execution Time After Improvement} = \text{Execution Time Unaffected} + \frac{\text{Execution Time Affected}}{\text{Amount of Improvement}}
\]
Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster?
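The multiply example can be checked by solving Amdahl's Law for the amount of improvement. The Python sketch below does that; the function name is illustrative, and returning `None` for an unreachable target is a convention chosen here, not something the slides specify.

```python
def required_improvement(total_s, affected_s, target_speedup):
    """Solve: total/target = unaffected + affected/n for n.

    Returns None when no finite improvement can reach the target.
    """
    unaffected = total_s - affected_s
    target_time = total_s / target_speedup
    remaining = target_time - unaffected   # time budget left for the affected part
    if remaining <= 0:
        return None
    return affected_s / remaining


four_times = required_improvement(100, 80, 4)   # multiply must be this much faster
five_times = required_improvement(100, 80, 5)   # None: the other 20 s already use the budget
print(four_times, five_times)
```

Running 4 times faster needs multiplication to be 16 times faster; running 5 times faster is impossible because the 20 seconds not spent multiplying already equal the 20-second target.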
Example
Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?
We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?
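Both questions follow from the Amdahl's Law formula given earlier. A minimal Python sketch, with illustrative function names, is:

```python
def speedup(total_s, affected_s, improvement):
    """Overall speedup when only affected_s of total_s runs `improvement` times faster."""
    new_time = (total_s - affected_s) + affected_s / improvement
    return total_s / new_time


def affected_time_needed(total_s, improvement, target_speedup):
    """Solve total/target = (total - x) + x/improvement for x."""
    return (total_s - total_s / target_speedup) / (1 - 1 / improvement)


first = speedup(10, 5, 5)                  # half of 10 s is floating point
second = affected_time_needed(100, 5, 3)   # seconds of floating point needed
print(f"speedup: {first:.2f}")
print(f"floating-point time needed: {second:.1f} s of 100 s")
```

The first benchmark speeds up by about 1.67; the second would need floating point to account for about 83.3 of its 100 seconds.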
Remember
Performance is specific to a particular program/s
  Total execution time is a consistent summary of performance
For a given architecture, performance increases come from:
  increases in clock rate (without adverse CPI effects)
  improvements in processor organization that lower CPI
  compiler enhancements that lower CPI and/or instruction count
Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance
You should not always believe everything you read! Read carefully!
Chapter 3
Instructions: Language of the Machine
1. Introduction
2. Operations of the Computer Hardware
3. Operands of the Computer Hardware
4. Representing Instructions in the Computer
5. Instructions for Making Decisions
6. Supporting Procedures in Computer Hardware
7. Beyond Numbers
8. Other Styles of MIPS Addressing
9. Starting a Program
10. An Example to Put It Together
11. Arrays versus Pointers
12. Real Stuff: PowerPC and 80x86 Instructions
13. Fallacies and Pitfalls
Instructions:
Language of the Machine
More primitive than higher level languages, e.g., no sophisticated control flow
Very restrictive, e.g., MIPS Arithmetic Instructions
We'll be working with the MIPS instruction set architecture
  similar to other architectures developed since the 1980's
  used by NEC, Nintendo, Silicon Graphics, Sony
Design goals: maximize performance and minimize cost, reduce design time
MIPS arithmetic
All instructions have 3 operands
Operand order is fixed (destination first)
Example:
    C code:     A = B + C
    MIPS code:  add $s0, $s1, $s2
(associated with variables by compiler)
MIPS arithmetic
Design Principle: simplicity favors regularity. Why?
Of course this complicates some things...
    C code:     A = B + C + D;
                E = F - A;
    MIPS code:  add $t0, $s1, $s2
                add $s0, $t0, $s3
                sub $s4, $s5, $s0
Operands must be registers; only 32 registers provided
Design Principle: smaller is faster. Why?
Arithmetic instructions' operands must be registers; only 32 registers provided
Compiler associates variables with registers
What about programs with lots of variables?
[Figure: computer components, with Input, Output, and I/O labeled]
Memory Organization
Viewed as a large, single-dimension array, with an address.
A memory address is an index into the array
"Byte addressing" means that the index points to a byte of memory.
    0    8 bits of data
    1    8 bits of data
    2    8 bits of data
    3    8 bits of data
    4    8 bits of data
    5    8 bits of data
    6    8 bits of data
    ...
Memory Organization
Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.
     0    32 bits of data
     4    32 bits of data
     8    32 bits of data
    12    32 bits of data
    ...
Registers hold 32 bits of data
2^32 bytes with byte addresses from 0 to 2^32 - 1
2^30 words with byte addresses 0, 4, 8, ... 2^32 - 4
Words are aligned
  i.e., what are the least 2 significant bits of a word address?
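The alignment question can be explored with a small Python sketch. It assumes 4-byte words at byte addresses that are multiples of 4, as stated above; the helper names are illustrative.

```python
WORD_BYTES = 4


def is_word_aligned(byte_address):
    """A word address is a multiple of 4, so its two low-order bits are 00."""
    return byte_address & 0b11 == 0


def word_index(byte_address):
    """Index of the word that starts at an aligned byte address."""
    if not is_word_aligned(byte_address):
        raise ValueError("address is not word aligned")
    return byte_address >> 2          # same as dividing by WORD_BYTES


print(is_word_aligned(8), is_word_aligned(6))
print(word_index(12))                 # fourth word: byte addresses 12..15
```

The last word starts at byte address 2^32 - 4, which gives word index 2^30 - 1, matching the 2^30 words counted above.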
Instructions
Load and store instructions Example:
C code: A[8] = h + A[8];
MIPS code:
Store word has destination last
Remember arithmetic operands are registers, not memory!
Machine Language
Instructions, like registers and words of data, are also 32 bits long
Example: add $t0, $s1, $s2
  registers have numbers, $t0=8, $s1=17, $s2=18
Instruction Format:
    000000  10001  10010  01000  00000  100000
    op      rs     rt     rd     shamt  funct
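To make the field layout concrete, the Python sketch below packs the six R-type fields (6, 5, 5, 5, 5, and 6 bits) into one 32-bit word. It uses the standard MIPS register number $t0 = 8, which is what the rd field 01000 encodes; the function name is illustrative.

```python
def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack R-type fields: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t0, $s1, $s2  ->  rd = $t0 (8), rs = $s1 (17), rt = $s2 (18), funct = 32
word = encode_r_type(op=0, rs=17, rt=18, rd=8, shamt=0, funct=32)
print(f"{word:032b}")
print(f"0x{word:08X}")
```

Note that the destination register comes first in assembly but sits in the third register field (rd) of the machine word.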
42
Machine Language
Consider the load-word and store-word instructions,
  What would the regularity principle have us do?
  New principle: Good design demands a compromise
Introduce a new type of instruction format
  I-type for data transfer instructions
  other format was R-type for register
Example: lw $t0, 32($s2)
    35   18   8    32
    op   rs   rt   16 bit number
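The I-type layout can be sketched the same way in Python: a 6-bit op, two 5-bit register fields, and a 16-bit immediate. The sketch assumes the standard numbering $t0 = 8 and $s2 = 18, and treats the immediate as a signed two's complement value when decoding; the function names are illustrative.

```python
def encode_i_type(op, rs, rt, imm):
    """Pack I-type fields: op(6) rs(5) rt(5) immediate(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def decode_i_type(word):
    """Unpack an I-type word, sign-extending the 16-bit immediate."""
    imm = word & 0xFFFF
    if imm & 0x8000:
        imm -= 0x10000
    return (word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F, imm


# lw $t0, 32($s2)  ->  op = 35, rs = $s2 (18), rt = $t0 (8), offset = 32
lw_word = encode_i_type(35, 18, 8, 32)
print(f"0x{lw_word:08X}", decode_i_type(lw_word))
```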
[Figure: Processor connected to Memory]
Fetch & Execute Cycle
  Instructions are fetched and put into a special register
  Bits in the register "control" the subsequent actions
  Fetch the next instruction and continue
Control
Decision making instructions alter the control flow, i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
    bne $t0, $t1, Label
    beq $t0, $t1, Label
Example: if (i==j) h = i + j;
           bne $s0, $s1, Label
           add $s3, $s0, $s1
    Label: ....
Control
So far:
Instruction            Meaning
add $s1,$s2,$s3        $s1 = $s2 + $s3
sub $s1,$s2,$s3        $s1 = $s2 - $s3
lw $s1,100($s2)        $s1 = Memory[$s2+100]
sw $s1,100($s2)        Memory[$s2+100] = $s1
bne $s4,$s5,L          Next instr. is at Label if $s4 != $s5
beq $s4,$s5,L          Next instr. is at Label if $s4 = $s5
j Label                Next instr. is at Label

Formats:
    R   op  rs  rt  rd  shamt  funct
    I   op  rs  rt  16 bit address
    J   op  26 bit address
Control Flow
We have: beq, bne, what about Branch-if-less-than?
New instruction:
    slt $t0, $s1, $s2      if $s1 < $s2 then $t0 = 1 else $t0 = 0
Can use this instruction to build "blt $s1, $s2, Label"
  can now build general control structures
Note that the assembler needs a register to do this,
  there are policy of use conventions for registers
Constants
Small constants are used quite frequently (50% of operands)
  e.g., A = A + 5;
        B = B + 1;
        C = C - 18;
Solutions? Why not?
  put 'typical constants' in memory and load them.
  create hard-wired registers (like $zero) for constants like one.
MIPS Instructions:
    addi $29, $29, 4
    slti $8, $18, 10
    andi $29, $29, 6
    ori  $29, $29, 4
Then must get the lower order bits right, i.e.,
    ori $t0, $t0, 1010101010101010

           1010101010101010  0000000000000000
    ori    0000000000000000  1010101010101010
           ----------------------------------
           1010101010101010  1010101010101010
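The two-step construction of a 32-bit constant can be modelled in Python. This sketch assumes the upper half was loaded first with a load-upper-immediate (lui), which places a 16-bit value in the upper 16 bits and zeros below it; the functions `lui` and `ori` here are small models written for illustration, not a simulator API.

```python
MASK32 = 0xFFFFFFFF


def lui(imm16):
    """Model of load upper immediate: constant goes in the upper 16 bits."""
    return ((imm16 & 0xFFFF) << 16) & MASK32


def ori(register_value, imm16):
    """Model of or immediate: the 16-bit constant is zero-extended, then ORed."""
    return (register_value | (imm16 & 0xFFFF)) & MASK32


t0 = lui(0b1010101010101010)        # upper half set, lower half zero
t0 = ori(t0, 0b1010101010101010)    # fill in the lower half
print(f"{t0:032b}")
```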
Other Issues
Things we are not going to cover
  support for procedures
  linkers, loaders, memory layout
  stacks, frames, recursion
  manipulating strings and pointers
  interrupts and exceptions
  system calls and conventions
Some of these we'll talk about later
We've focused on architectural issues
  basics of MIPS assembly language and machine code
  we'll build a processor to execute these instructions.
Overview of MIPS
simple instructions all 32 bits wide
very structured, no unnecessary baggage
only three instruction formats
    R   op  rs  rt  rd  shamt  funct
    I   op  rs  rt  16 bit address
    J   op  26 bit address
rely on compiler to achieve performance
  what are the compiler's goals?
help compiler where we can
Next instruction is at Label if $t4 != $t5
Next instruction is at Label if $t4 = $t5
Next instruction is at Label
26 bit address
Addresses are not 32 bits
How do we handle this with load and store instructions?
Addresses in Branches
Instructions: bne $t4,$t5,Label beq $t4,$t5,Label
Formats: I op rs rt 16 bit address
Could specify a register (like lw and sw) and add it to address
  use Instruction Address Register (PC = program counter)
  most branches are local (principle of locality)
Jump instructions just use high order bits of PC
  address boundaries of 256 MB
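The two addressing schemes can be sketched in Python as follows. The sketch assumes the usual MIPS convention that the 16-bit branch field and the 26-bit jump field count words (so they are shifted left by 2) and that both are applied relative to PC + 4; the function names are illustrative.

```python
def branch_target(pc, offset_words):
    """PC-relative: target = (PC + 4) + signed word offset * 4."""
    return (pc + 4 + (offset_words << 2)) & 0xFFFFFFFF


def jump_target(pc, address26):
    """Pseudo-direct: upper 4 bits of PC + 4, then the 26-bit field, then 00."""
    return ((pc + 4) & 0xF0000000) | ((address26 & 0x03FFFFFF) << 2)


print(hex(branch_target(0x00400000, 3)))    # branch forward over 3 instructions
print(hex(branch_target(0x00400010, -2)))   # branch backward
print(jump_target(0x00400000, 2500))        # a jump field of 2500 targets byte 10000
```

A 26-bit word address shifted left by 2 spans 2^28 bytes, which is where the 256 MB boundary comes from.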
To summarize:
MIPS operands

Name: 32 registers
Example: $s0-$s7, $t0-$t9, $zero, $a0-$a3, $v0-$v1, $gp, $fp, $sp, $ra, $at
Comments: Fast locations for data. In MIPS, data must be in registers to perform arithmetic. MIPS register $zero always equals 0. Register $at is reserved for the assembler to handle large constants.

Name: 2^30 memory words
Example: Memory[0], ...
Comments: Accessed only by data transfer instructions. MIPS uses byte addresses, so sequential words differ by 4. Memory holds data structures, such as arrays, and spilled registers, such as those saved on procedure calls.

MIPS assembly language

Category            Instruction              Example               Meaning                                  Comments
Arithmetic          add                      add $s1, $s2, $s3     $s1 = $s2 + $s3                          Three operands; data in registers
                    subtract                 sub $s1, $s2, $s3     $s1 = $s2 - $s3
                    add immediate            addi $s1, $s2, 100    $s1 = $s2 + 100                          Used to add constants
Data transfer       load word                lw $s1, 100($s2)      $s1 = Memory[$s2 + 100]                  Word from memory to register
                    store word               sw $s1, 100($s2)      Memory[$s2 + 100] = $s1                  Word from register to memory
                    load byte                lb $s1, 100($s2)      $s1 = Memory[$s2 + 100]                  Byte from memory to register
                    store byte               sb $s1, 100($s2)      Memory[$s2 + 100] = $s1                  Byte from register to memory
                    load upper immediate     lui $s1, 100          $s1 = 100 * 2^16                         Loads constant in upper 16 bits
Conditional branch  branch on equal          ...                   if ($s1 == $s2) go to PC + 4 + 100
                    branch on not equal      ...                   if ($s1 != $s2) go to PC + 4 + 100
                    set on less than         ...                   if ($s2 < $s3) $s1 = 1; else $s1 = 0
                    set less than immediate  slti $s1, $s2, 100    if ($s2 < 100) $s1 = 1; else $s1 = 0
Unconditional jump  jump                     j 2500                go to 10000                              Jump to target address
                    jump register            jr $ra                go to $ra                                For switch, procedure return
                    jump and link            jal 2500              $ra = PC + 4; go to 10000                For procedure call
[Figure: MIPS addressing modes; surviving labels: Register, Byte, Halfword, Word, PC]
Alternative Architectures
Design alternative:
provide more powerful operations
goal is to reduce number of instructions executed
danger is a slower cycle time and/or a higher CPI
Sometimes referred to as RISC vs. CISC
  virtually all new instruction sets since 1982 have been RISC
  VAX: minimize code size, make assembly language easy
    instructions from 1 to 54 bytes long!
We'll look at PowerPC and 80x86
PowerPC
Indexed addressing
  example: lw $t1,$a0+$s3      # $t1 = Memory[$a0+$s3]
  What do we have to do in MIPS?
Update addressing
  update a register as part of load (for marching through arrays)
  example: lwu $t0,4($s3)      # $t0 = Memory[$s3+4]; $s3 = $s3+4
  What do we have to do in MIPS?
Others:
  load multiple/store multiple
  a special counter register
    bc Loop                    decrement counter, if not 0 goto loop
80x86
1978: The Intel 8086 is announced (16 bit architecture)
1980: The 8087 floating point coprocessor is added
1982: The 80286 increases address space to 24 bits, +instructions
1985: The 80386 extends to 32 bits, new addressing modes
1989-1995: The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance)
1997: MMX is added
This history illustrates the impact of the golden handcuffs of compatibility
  adding new features as someone might add clothing to a packed bag
  an architecture that is difficult to explain and impossible to love
what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective
Summary
Instruction complexity is only one variable
  lower instruction count vs. higher CPI / lower clock rate
Design Principles:
  simplicity favors regularity
  smaller is faster
  good design demands compromise
  make the common case fast
Instruction set architecture
  a very important abstraction indeed!
Chapter Four
Arithmetic for Computers
1. Introduction
2. Signed and Unsigned Numbers
3. Addition and Subtraction
4. Logical Operations
5. Constructing an Arithmetic Logic Unit
6. Multiplication
7. Division
8. Floating Point
9. Real Stuff: Floating Point in the PowerPC and 80x86
10. Fallacies and Pitfalls
Arithmetic
Where we've been:
  Performance (seconds, cycles, instructions)
  Abstractions: Instruction Set Architecture, Assembly Language and Machine Language
What's up ahead:
  Implementing the Architecture
[Figure: ALU with 32-bit inputs a and b, an operation input, and a 32-bit result]
Numbers
Bits are just bits (no inherent meaning)
  conventions define relationship between bits and numbers
Binary numbers (base 2)
  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001...
  decimal: 0...2^n - 1
Of course it gets more complicated:
  numbers are finite (overflow)
  fractions and real numbers
  negative numbers (e.g., no MIPS subi instruction; addi can add a negative number)
How do we represent negative numbers?
  i.e., which bit patterns will represent which numbers?
Possible Representations
Sign Magnitude: One's Complement Two's Complement
Issues: balance, number of zeros, ease of operations
Which one is best? Why?
MIPS
32 bit signed numbers:
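    0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
    0000 0000 0000 0000 0000 0000 0000 0001two = + 1ten
    0000 0000 0000 0000 0000 0000 0000 0010two = + 2ten
    ...
    0111 1111 1111 1111 1111 1111 1111 1110two = + 2,147,483,646ten
    0111 1111 1111 1111 1111 1111 1111 1111two = + 2,147,483,647ten    (maxint)
    1000 0000 0000 0000 0000 0000 0000 0000two = - 2,147,483,648ten    (minint)
    1000 0000 0000 0000 0000 0000 0000 0001two = - 2,147,483,647ten
    1000 0000 0000 0000 0000 0000 0000 0010two = - 2,147,483,646ten
    ...
    1111 1111 1111 1111 1111 1111 1111 1101two = - 3ten
    1111 1111 1111 1111 1111 1111 1111 1110two = - 2ten
    1111 1111 1111 1111 1111 1111 1111 1111two = - 1ten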
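The mapping between 32-bit patterns and signed values can be sketched in Python, assuming two's complement, where the top bit carries weight -2^31. The function names are illustrative.

```python
def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement number."""
    bits &= 0xFFFFFFFF
    return bits - (1 << 32) if bits & 0x80000000 else bits


def to_bits32(value):
    """32-bit two's complement pattern for a signed value."""
    return value & 0xFFFFFFFF


print(to_signed32(0x7FFFFFFF))      # maxint
print(to_signed32(0x80000000))      # minint
print(f"{to_bits32(-3):032b}")
```

There is one more negative number than positive numbers: minint has no positive counterpart.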
Detecting Overflow
No overflow when adding a positive and a negative number
No overflow when signs are the same for subtraction
Overflow occurs when the value affects the sign:
  overflow when adding two positives yields a negative
  or, adding two negatives gives a positive
  or, subtract a negative from a positive and get a negative
  or, subtract a positive from a negative and get a positive
Consider the operations A + B, and A - B
  Can overflow occur if B is 0?
  Can overflow occur if A is 0?
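The sign rules above translate directly into bit tests. The Python sketch below works on 32-bit patterns and reports overflow from the sign bits alone; the names `add32` and `sub32` are illustrative.

```python
MASK32 = 0xFFFFFFFF
SIGN = 0x80000000


def add32(a, b):
    """Add two 32-bit patterns; overflow if both operands differ in sign from the result."""
    result = (a + b) & MASK32
    overflow = ((a ^ result) & (b ^ result) & SIGN) != 0
    return result, overflow


def sub32(a, b):
    """Subtract b from a; overflow if operand signs differ and the result sign differs from a."""
    result = (a - b) & MASK32
    overflow = ((a ^ b) & (a ^ result) & SIGN) != 0
    return result, overflow


print(add32(0x7FFFFFFF, 1))        # maxint + 1 overflows
print(sub32(5, 0))                 # B = 0 never overflows
print(sub32(0, 0x80000000))        # A = 0: 0 - minint overflows
```

The last call answers one of the questions above: with A = 0, the subtraction 0 - B overflows when B is the most negative number.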
Effects of Overflow
An exception (interrupt) occurs
  Control jumps to predefined address for exception
  Interrupted address is saved for possible resumption
Details based on software system / language
  example: flight control vs. homework assignment
Don't always want to detect overflow
  new MIPS instructions: addu, addiu, subu
  note: addiu still sign-extends!
  note: sltu, sltiu for unsigned comparisons
Different Implementations
Not easy to decide the best way to build something
Don't want too many inputs to a single gate
Don't want to have to go through too many gates
For our purposes, ease of comprehension is important
Let's look at a 1-bit ALU for addition:
[Figure: 1-bit adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
How could we build a 1-bit ALU for add, and, and or? How could we build a 32-bit ALU?
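One possible answer to both questions is sketched below in Python: a 1-bit cell computes AND, OR, and the full-adder sum in parallel and an operation code selects one of them, as a multiplexor would; 32 cells are then chained through their carries. The operation encoding (0, 1, 2) and the function names are assumptions made for this sketch.

```python
AND, OR, ADD = 0, 1, 2   # assumed encoding of the Operation select lines


def alu1(a, b, carry_in, op):
    """1-bit ALU: returns (result bit, carry out)."""
    total = a ^ b ^ carry_in                                 # full-adder sum
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)    # full-adder carry
    result = (a & b, a | b, total)[op]                       # multiplexor
    return result, carry_out


def alu32(a, b, op):
    """32-bit ALU built by rippling the carry through 32 one-bit ALUs."""
    result, carry = 0, 0
    for i in range(32):
        bit, carry = alu1((a >> i) & 1, (b >> i) & 1, carry, op)
        result |= bit << i
    return result


print(alu32(5, 3, ADD), alu32(0b1100, 0b1010, AND), alu32(0b1100, 0b1010, OR))
```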
[Figure: 1-bit ALU with Operation and CarryIn inputs, Result and CarryOut outputs, replicated for a0/b0 through a31/b31 to form a 32-bit ALU]
[Figure: 1-bit ALU extended with a Binvert input]
Supporting slt
[Figure: 1-bit ALUs with Binvert, Operation, and CarryIn inputs; one version also produces Overflow]
[Figure: 32-bit ALU built from 1-bit ALUs (a0/b0, a1/b1, a2/b2, ... a31/b31; Result0, Result1, Result2, ...) with Binvert, CarryIn, and Operation inputs]
[Figure: the same 32-bit ALU with a Zero output]
Conclusion
We can build an ALU to support the MIPS instruction set
  key idea: use multiplexor to select the output we want
  we can efficiently perform subtraction using two's complement
  we can replicate a 1-bit ALU to produce a 32-bit ALU
Our primary focus: comprehension; however,
  clever changes to organization can improve performance (similar to using better algorithms in software)
  we'll look at two examples for addition and multiplication
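Can you see the ripple? How could you get rid of it?
\[
\begin{aligned}
c_1 &= b_0 c_0 + a_0 c_0 + a_0 b_0 \\
c_2 &= b_1 c_1 + a_1 c_1 + a_1 b_1 \qquad c_2 = \underline{\hspace{4em}} \\
c_3 &= b_2 c_2 + a_2 c_2 + a_2 b_2 \\
c_4 &= b_3 c_3 + a_3 c_3 + a_3 b_3
\end{aligned}
\]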
Carry-lookahead adder
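An approach in-between our two extremes
Motivation:
  If we didn't know the value of carry-in, what could we do?
  When would we always generate a carry? \( g_i = a_i b_i \)
  When would we propagate the carry? \( p_i = a_i + b_i \)
Did we get rid of the ripple?
\[
\begin{aligned}
c_1 &= g_0 + p_0 c_0 \\
c_2 &= g_1 + p_1 c_1 \qquad c_2 = \underline{\hspace{4em}} \\
c_3 &= g_2 + p_2 c_2 \qquad c_3 = \underline{\hspace{4em}} \\
c_4 &= g_3 + p_3 c_3 \qquad c_4 = \underline{\hspace{4em}}
\end{aligned}
\]
Feasible! Why?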
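One way to see that the ripple disappears is to substitute each carry into the next until every carry depends only on g, p, and c0. The Python sketch below writes out those expanded forms for 4 bits and checks them against a ripple-carry computation; the expansions are worked out here and the function names are illustrative.

```python
def bit(x, i):
    return (x >> i) & 1


def cla_carries(a, b, c0=0):
    """Carries c1..c4 computed directly from generate/propagate signals and c0."""
    g = [bit(a, i) & bit(b, i) for i in range(4)]   # g_i = a_i b_i
    p = [bit(a, i) | bit(b, i) for i in range(4)]   # p_i = a_i + b_i
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a, b, c0=0):
    """Reference: each carry waits for the previous one."""
    carries, c = [], c0
    for i in range(4):
        c = (bit(a, i) & bit(b, i)) | (bit(a, i) & c) | (bit(b, i) & c)
        carries.append(c)
    return carries


print(cla_carries(0b1111, 0b0001))
```

Each expanded carry is a two-level sum of products, so no carry has to wait for the one before it; the cost is that the terms grow with every bit position.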
Can't build a 16 bit adder this way... (too big)
Could use ripple carry of 4-bit CLA adders
Better: use the CLA principle again!
Multiplication
More complicated than addition
  accomplished via shifting and addition
More time and more area
Let's look at 3 versions based on the grade-school algorithm
      0010    (multiplicand)
    x 1011    (multiplier)
Negative numbers: convert and multiply
  there are better techniques, we won't look at them
Multiplication: Implementation
[Figure: first-version hardware with a 64-bit ALU and a Control test]
Start
1. Test Multiplier0
   Multiplier0 = 1: 1a. Add multiplicand to product and place the result in Product register
   Multiplier0 = 0: no add
32nd repetition?
   Yes: 32 repetitions
Done
Second Version
[Figure: second-version hardware: Multiplicand (32 bits), 32-bit ALU, Multiplier (32 bits, shift right), Product (64 bits, shift right, write), Control test]
Start
1. Test Multiplier0
   Multiplier0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
   Multiplier0 = 0: no add
32nd repetition?
   Yes: 32 repetitions
Done
Final Version
[Figure: final-version hardware: Multiplicand (32 bits), 32-bit ALU, Product (64 bits), Control test]
Start
1. Test Product0
   Product0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
   Product0 = 0: no add
32nd repetition?
   Yes: 32 repetitions
Done
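The final version can be simulated in a few lines of Python. The flowchart as it survives here shows only the test and the add; this sketch assumes the usual remaining step, shifting the Product register right by one bit each repetition, and assumes the multiplier starts in the right half of the Product register. Operands are treated as unsigned 32-bit values and the function name is illustrative.

```python
def multiply_final(multiplicand, multiplier):
    """Shift-and-add multiply with a single 64-bit Product register."""
    product = multiplier & 0xFFFFFFFF          # assumed: multiplier in the right half
    for _ in range(32):
        if product & 1:                        # 1. Test Product0
            # 1a. Add multiplicand to the left half (carry out is kept)
            left = (product >> 32) + multiplicand
            product = (left << 32) | (product & 0xFFFFFFFF)
        product >>= 1                          # assumed step: shift Product right 1 bit
    return product


print(multiply_final(0b0010, 0b1011))          # the 0010 x 1011 example
```

Each multiplier bit is consumed from the right end of the register just before the growing product needs that position, which is why no separate Multiplier register is required.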
4.7 Division
(p.265)
(p.266)
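[Figure: first-version division hardware, with the Divisor register and the Remainder register holding the Dividend]
The 32-bit divisor starts in the left half of the Divisor register. The remainder is initialized with the dividend.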
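A Python sketch of this first version follows. The register setup matches the description above; the loop body (subtract, test the sign, restore on a negative result, shift the quotient left and the divisor right, 33 repetitions) is the standard restoring scheme and is an assumption here, since the flowchart itself did not survive. Operands are unsigned and the function name is illustrative.

```python
def divide_first_version(dividend, divisor):
    """Restoring division; returns (quotient, remainder) for unsigned 32-bit inputs."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    remainder = dividend            # Remainder register initialized with the dividend
    divisor_reg = divisor << 32     # divisor starts in the left half of a 64-bit register
    quotient = 0
    for _ in range(33):             # assumed: 33 repetitions for 32-bit operands
        remainder -= divisor_reg
        if remainder >= 0:
            quotient = (quotient << 1) | 1
        else:
            remainder += divisor_reg    # restore the old value
            quotient = quotient << 1
        divisor_reg >>= 1
    return quotient & 0xFFFFFFFF, remainder


print(divide_first_version(7, 2))
```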
Figure 4.37
(p.268)
<Ans>
[Figure: worked example showing the Divisor, Dividend, Quotient, and Remainder]
Second Version
(p.268)
[Figure: second-version division hardware; the Remainder register initially holds the Dividend]
* 32-bit divisor, 32-bit ALU
* 32-bit dividend starts in the right half of the Remainder reg.
* The remainder will be found in the left half of the Remainder reg.
* # of iterations of the algorithm: 32
(p.269)
[Figure: third-version division hardware; the Remainder register holds the Dividend, then the Remainder and Quotient]
* The quotient will be found in the right half of the Remainder reg.
[Figure: division flowchart that tests Remainder63 (Remainder63 = 0 or Remainder63 = 1)]
(p.271)
<Ans>
[Figure: worked example showing the Divisor, Dividend, Remainder, and Quotient]
Signed Division
(p.272)
Simplest solution: remember the signs of the divisor and dividend and then negate the quotient if the signs disagree
Note: The dividend and the remainder must have the same signs!
\[
\text{Dividend} \div \text{Divisor} = \text{Quotient} \;\text{rem}\; \text{Remainder}
\]
\[
\text{Dividend} = \text{Quotient} \times \text{Divisor} + \text{Remainder}
\]
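The sign rule can be sketched in Python. Python's own `//` operator rounds toward negative infinity, so the sketch divides magnitudes and fixes the signs afterwards, as the simplest solution describes; the function name is illustrative.

```python
def signed_divide(dividend, divisor):
    """Divide magnitudes, negate the quotient if signs disagree, derive the remainder."""
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    # Dividend = Quotient x Divisor + Remainder fixes the remainder's sign
    remainder = dividend - quotient * divisor
    return quotient, remainder


for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(a, b, signed_divide(a, b))
```

In all four cases the remainder takes the sign of the dividend.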
Nonrestoring Division
(Exercise 5.54, p.333)
1. Rem(L) ← Rem(L) - Divisor
Test Rem
   Rem >= 0: 2a. shl Rem, Rem0 ← 1
   Rem < 0:  2b. shl Rem, Rem0 ← 0
Representation:
sign, exponent, significand: \( (-1)^{\text{sign}} \times \text{significand} \times 2^{\text{exponent}} \)
  more bits for significand gives more accuracy
  more bits for exponent increases range
IEEE 754 floating point standard:
  single precision: 8 bit exponent, 23 bit significand
  double precision: 11 bit exponent, 52 bit significand
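The single precision layout can be inspected with Python's standard `struct` module. The sketch below splits a value into its 1-bit sign, 8-bit exponent, and 23-bit fraction fields, and rebuilds a normalized value assuming the IEEE 754 exponent bias of 127 and the implicit leading 1; the function names are illustrative.

```python
import struct


def fields_single(x):
    """Return (sign, biased exponent, fraction) of x stored in single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def value_from_fields(sign, exponent, fraction):
    """Value of a normalized single precision number (bias 127, implicit leading 1)."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


print(fields_single(-0.75))      # -0.75 = -1.1two x 2^-1
print(fields_single(0.0625))     # 0.0625 = 1.0two x 2^-4
```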
Floating-Point Representation
IEEE 754 floating point standard:
(p.276)
Floating-Point Addition
(p.280)
-0.4375ten = -7/16ten = -7/2^4ten = -0.0111two = -0.0111two x 2^0 = -1.110two x 2^-2
Now we follow the algorithm:
Step 1. The significand of the number with the lesser exponent (-1.11two x 2^-2) is shifted right until its exponent matches the larger number:
    -1.110two x 2^-2 = -0.111two x 2^-1
Step 2. Add the significands:
    1.0two x 2^-1 + (-0.111two x 2^-1) = 0.001two x 2^-1
Step 3. Normalize the sum, checking for overflow and underflow:
    0.001two x 2^-1 = 0.010two x 2^-2 = 0.100two x 2^-3 = 1.000two x 2^-4
  Since 127 >= -4 >= -126, there is no overflow or underflow. (The biased exponent would be -4 + 127, or 123, which is between 1 and 254, the smallest and largest unreserved biased exponents.)
Step 4. Round the sum:
    1.000two x 2^-4
  The sum already fits exactly in 4 bits, so there is no change to the bits due to rounding. This sum is then
    1.000two x 2^-4 = 0.0001000two = 0.0001two = 1/2^4ten = 1/16ten = 0.0625ten
This sum is what we would expect from adding 0.5ten to -0.4375ten.
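The steps of this example can be traced with a small Python sketch. It stores a 4-bit significand as a signed integer scaled by 2^3 (so 1.000two is 8 and -1.110two is -14), drops any bits shifted out during alignment, and skips the rounding and overflow/underflow checks; these simplifications and all names are assumptions of the sketch, not the hardware algorithm in full.

```python
FRACTION_BITS = 3   # 4-bit significands: 1 bit before the binary point, 3 after


def fp_add(sig_a, exp_a, sig_b, exp_b):
    """Add two numbers given as (scaled signed significand, exponent) pairs."""
    # Step 1: shift the significand with the smaller exponent right
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    magnitude = abs(sig_b) >> (exp_a - exp_b)
    sig_b = -magnitude if sig_b < 0 else magnitude
    # Step 2: add the significands
    total, exponent = sig_a + sig_b, exp_a
    if total == 0:
        return 0, 0
    # Step 3: normalize so the magnitude is back in the range 1.000two .. 1.111two
    sign, magnitude = (-1 if total < 0 else 1), abs(total)
    while magnitude >= (2 << FRACTION_BITS):
        magnitude >>= 1
        exponent += 1
    while magnitude < (1 << FRACTION_BITS):
        magnitude <<= 1
        exponent -= 1
    return sign * magnitude, exponent


def fp_value(sig, exponent):
    return sig * 2.0 ** (exponent - FRACTION_BITS)


result = fp_add(8, -1, -14, -2)     # 1.000two x 2^-1  +  (-1.110two x 2^-2)
print(result, fp_value(*result))
```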
(p.285)
Figure 4.45
Floating-Point Multiplication
(p.283)
Figure 4.46
(p.287)
1.110 23
(p.288)
Accurate Arithmetic
(p.297)
Rounding: FP numbers are normally approximations for a number they can't really represent.
  requires the hardware to include extra bits in the calculation
Measurement for the accuracy in floating point:
  the number of bits in error in the LSBs of the significand
  i.e., the number of units in the last place (ulp)
Rounding in IEEE 754 keeps 2 extra bits on the right during intermediate calculations:
(p.297)
Implementing the standard can be tricky
Not using the standard can be even worse
  see text for description of 80x86 and Pentium bug!
We are ready to move on (and implement the processor)
  you may want to look back (Section 4.12 is great reading!)