
CS 465

Computer Architecture
Fall 2009

Lecture 01: Introduction
Daniel Barbará (cs.gmu.edu/~dbarbara)
[Adapted from Computer Organization and Design,
Patterson & Hennessy, 2005, UCB]
Course Administration
Instructor: Daniel Barbará
dbarbara@gmu.edu
4420 Eng. Bldg.


Text (required): Computer Organization & Design:
The Hardware/Software Interface, Patterson &
Hennessy, 4th Edition

Grading Information
Grade determinants
Midterm Exam ~25%

Final Exam ~35%
Homeworks ~40%
- Due at the beginning of class (or, if it's code to be submitted
electronically, by 17:00 on the due date). No late assignments
will be accepted.
Course prerequisites
grade of C or better in CS 367
Acknowledgements

Slides adapted from Dr. Zhong
Contributions from Dr. Setia
Slides also adapt materials from many other universities
IMPORTANT:
- Slides are not intended as replacement for the text
- You spent the money on the book, please read it!


Course Topics (Tentative)
Instruction set architecture (Chapter 2)
MIPS
Arithmetic operations & data (Chapter 3)
System performance (Chapter 4)
Processor (Chapter 5)
Datapath and control
Pipelining to improve performance (Chapter 6)
Memory hierarchy (Chapter 7)
I/O (Chapter 8)

Focus of the Course
How computers work
MIPS instruction set architecture
The implementation of the MIPS instruction set architecture: MIPS
processor design
Issues affecting modern processors
Pipelining: processor performance improvement
Cache memory system, I/O systems
Why Learn Computer Architecture?
You want to call yourself a computer scientist
Computer architecture impacts every other aspect of computer science
You need to make a purchasing decision or offer expert advice
You want to build software people use and sell many, many copies
(you need performance)
Both hardware and software affect performance
- Algorithm determines number of source-level statements
- Language/compiler/architecture determine machine instructions (Chapter 2
and 3)
- Processor/memory determine how fast instructions are executed (Chapter 5,
6, and 7)
- Assessing and understanding performance (Chapter 4)
Outline Today
Course logistics
Computer architectures overview
Trends in computer architectures
Computer Systems
Software
Application software: Word Processors, Email, Internet
Browsers, Games
Systems software: Compilers, Operating Systems
Hardware
CPU
Memory
I/O devices (mouse, keyboard, display, disks, networks,..)
[Figure: software layers. Applications software (e.g., LaTeX) sits on top of systems software: compilers (gcc), assemblers (as), and the operating system (virtual memory, file system, I/O device drivers).]
Instruction Set Architecture
[Diagram: software above, hardware below, with the instruction set as the interface between them.]
One of the most important abstractions is the ISA
A critical interface between HW and SW
Example: MIPS
Desired properties
Convenience (from the software side)
Efficiency (from the hardware side)
What is Computer Architecture?
Programmer's view: a pleasant environment
Operating system's view: a set of resources (HW & SW)
System architect's view: a set of components
Compiler's view: an instruction set architecture with OS help
Microprocessor architect's view: a set of functional units
VLSI designer's view: a set of transistors implementing logic
Mechanical engineer's view: a heater!
What is Computer Architecture?
Patterson & Hennessy: Computer architecture =
Instruction set architecture + Machine organization + Hardware
For this course, computer architecture mainly refers
to the ISA (Instruction Set Architecture)
Programmer-visible, it serves as the boundary
between the software and hardware
Modern ISA examples: MIPS, SPARC, PowerPC, DEC Alpha
Organization and Hardware
Organization: high-level aspects of a computer's design
Principal components: memory, CPU, I/O, ...
How components are interconnected
How information flows between components
E.g. AMD Opteron 64 and Intel Pentium 4: same ISA
but different organizations
Hardware: detailed logic design and the
packaging technology of a computer
E.g. Pentium 4 and Mobile Pentium 4: nearly identical
organizations but different hardware details
Types of computers and their applications
Desktop
Run third-party software
Office to home applications
30 years old
Servers
Modern version of what used to be called mainframes,
minicomputers and supercomputers
Large workloads
Built using the same technology as in desktops, but with higher capacity
- Expandable
- Scalable
- Reliable
Large spectrum: from low-end (file storage, small businesses) to
supercomputers (high end scientific and engineering
applications)
- Gigabytes to Terabytes to Petabytes of storage
Examples: file servers, web servers, database servers

Types of computers
Embedded
Microprocessors everywhere! (washing machines, cell phones,
automobiles, video games)
Run one or a few applications
Specialized hardware integrated with the application (not your
common processor)
Usually stringent limitations (battery power)
Low tolerance for failure (don't want your airplane avionics to
fail!)
Becoming ubiquitous
Engineered using processor cores
- The core allows the engineer to integrate other functions into the
processor for fabrication on the same chip
- Using hardware description languages: Verilog, VHDL

Where is the Market? (millions of computers sold per year)

Category    1998   1999   2000   2001   2002
Embedded     290    488    892    862   1122
Desktop       93    114    135    129    131
Servers        3      3      4      4      5

In this class you will learn
How programs written in a high-level language (e.g.,
Java) translate into the language of the hardware and
how the hardware executes them.
The interface between software and hardware and how
software instructs hardware to perform the needed
functions.
The factors that determine the performance of a program
The techniques that hardware designers employ to
improve performance.
As a consequence, you will understand what features may
make one computer design better than another for a
particular application
High-level to Machine Language

High-level language program (in C)
   | Compiler
   v
Assembly language program (for MIPS)
   | Assembler
   v
Binary machine language program (for MIPS)
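
As a concrete illustration of these two steps, the C function below is annotated with one plausible MIPS translation (register choices are illustrative, not actual compiler output):

    /* C source: one high-level statement. */
    int f(int g, int h, int i, int j) {
        return (g + h) - (i + j);
        /* One plausible MIPS translation (arguments arrive in $a0-$a3,
         * the result is returned in $v0):
         *   add $t0, $a0, $a1   # $t0 = g + h
         *   add $t1, $a2, $a3   # $t1 = i + j
         *   sub $v0, $t0, $t1   # $v0 = $t0 - $t1
         *   jr  $ra             # return to caller
         * The assembler then encodes each line as a 32-bit machine word. */
    }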
Evolution
In the beginning there were only bits, and people spent
countless hours trying to program in machine language:
01100011001 011001110100
Finally, before everybody went insane, the assembler
was invented: write in mnemonics called assembly
language and let the assembler translate (a one-to-one
translation):
Add A,B
This wasn't for everybody, obviously (imagine how
modern applications would have been possible in
assembly), so high-level languages were born (and with
them compilers to translate to assembly, a one-to-many
translation: one statement becomes many instructions)
C = A*(SQRT(B)+3.0)


THE BIG IDEA
Levels of abstraction: each layer provides its own
(simplified) view and hides the details of the next.
Instruction Set Architecture (ISA)
ISA: An abstract interface between the hardware and
the lowest level software of a machine that encompasses
all the information necessary to write a machine
language program that will run correctly, including
instructions, registers, memory access, I/O, and so on.
... the attributes of a [computing] system as seen by the
programmer, i.e., the conceptual structure and functional
behavior, as distinct from the organization of the data flows and
controls, the logic design, and the physical implementation.
Amdahl, Blaauw, and Brooks, 1964
Enables implementations of varying cost and performance to run
identical software
ABI (application binary interface): The user portion of the
instruction set plus the operating system interfaces used
by application programmers. Defines a standard for
binary portability across computers.
ISA Type Sales
[Figure: PowerPoint comic bar chart with approximate values (see text for correct values) of processor sales by ISA, 1998-2002, in millions of processors. ISAs: ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, Other.]
Organization of a Computer
Anatomy of a computer: the 5 classic components
Processor
  Control (the brain)
  Datapath (the brawn)
Memory (where programs and data live when running)
Input devices (keyboard, mouse)
Output devices (display, printer)
Disk (where programs and data live when not running)
Datapath: performs the arithmetic operations
Control: guides the operation of the other components based on the program's instructions
PC Motherboard Closeup
Inside the Pentium 4
Moore's Law
In 1965, Gordon Moore predicted that the number of
transistors that can be integrated on a die would double
every 18 to 24 months (i.e., grow exponentially with
time).

Amazingly visionary: the million-transistor/chip barrier was
crossed in the 1980s.
2300 transistors, 1 MHz clock (Intel 4004) - 1971
16 million transistors (UltraSPARC III)
42 million transistors, 2 GHz clock (Intel Xeon) - 2001
55 million transistors, 3 GHz, 130 nm technology, 250 mm² die
(Intel Pentium 4) - 2004
140 million transistors (HP PA-8500)
Processor Performance Increase
[Figure: performance (SPECint) vs. year, 1987-2003, log scale from 1 to 10,000. Machines plotted: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000.]
Trend: Microprocessor Capacity (Moore's Law)
[Figure: transistors per chip vs. year of introduction, 1970-2000, log scale from 1,000 to 100,000,000: i4004, i8080, i8086, i80286, i80386, i80486, Pentium.]
CMOS improvements:
Die size: 2X every 3 years
Line width: halve every 7 years
Transistor counts: Itanium II: 241 million; Pentium 4: 55 million; Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million

Moore's Law
Cramming More Components onto Integrated Circuits
Gordon Moore, Electronics, 1965
# of transistors per cost-effective integrated circuit doubles every 18 months
Transistor capacity doubles every 18-24 months
Speed: 2x / 1.5 years (since '85);
100X performance in the last decade

Trend: Microprocessor Performance

Memory
Dynamic Random Access Memory (DRAM)
The choice for main memory
Volatile (contents go away when power is lost)
Fast
Relatively small
DRAM capacity: 2x / 2 years (since '96);
64x size improvement in the last decade
Static Random Access Memory (SRAM)
The choice for cache
Much faster than DRAM, but less dense and more costly
Magnetic disks
The choice for secondary memory
Non-volatile
Slower
Relatively large
Capacity: 2x / 1 year (since '97);
250X size in the last decade
Solid state (Flash) memory
The choice for embedded computers
Non-volatile

Memory
Optical disks
Removable, therefore aggregate capacity is very large
Slower than disks
Magnetic tape
Even slower
Sequential (non-random) access
The choice for archival

DRAM Capacity Growth
[Figure: DRAM capacity (Kbit) vs. year of introduction, 1976-2002, log scale: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M.]
Trend: Memory Capacity (growth of DRAM capacity per chip)

Year   Size (Mbit)
1980   0.0625
1983   0.25
1986   1
1989   4
1992   16
1996   64
1998   128
2000   256
2002   512
2006   2048

Now 1.4X/yr, or 2X every 2 years;
more than 10,000X since 1980!
Prefixes: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta (Yotta = 10²⁴)
Come up with a clever mnemonic, fame!
Dramatic Technology Change
State-of-the-art PC when you graduate:
(at least)
Processor clock speed: 5000 MegaHertz
(5.0 GigaHertz)
Memory capacity: 4000 MegaBytes
(4.0 GigaBytes)
Disk capacity: 2000 GigaBytes
(2.0 TeraBytes)
New units! Mega => Giga, Giga => Tera
Example Machine Organization
Workstation design target
25% of cost on processor
25% of cost on memory (minimum memory size)
Rest on I/O devices, power supplies, box
[Diagram: computer = CPU (control + datapath) + memory + devices (input, output).]
MIPS R3000 Instruction Set Architecture
Instruction Categories
Load/Store
Computational
Jump and Branch
Floating Point (coprocessor)
Memory Management
Special
Registers: R0 - R31, PC, HI, LO
3 instruction formats, all 32 bits wide:
R-format: OP | rs | rt | rd | sa | funct
I-format: OP | rs | rt | immediate
J-format: OP | jump target
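
To make these formats concrete, here is a minimal sketch in C (my own illustration, not from the slides) that packs the six R-format fields into a 32-bit word using the standard MIPS field widths of 6/5/5/5/5/6 bits:

    #include <stdint.h>
    #include <stdio.h>

    /* Pack the six R-format fields (6/5/5/5/5/6 bits) into one 32-bit word. */
    static uint32_t encode_rtype(uint32_t op, uint32_t rs, uint32_t rt,
                                 uint32_t rd, uint32_t sa, uint32_t funct) {
        return (op << 26) | (rs << 21) | (rt << 16) |
               (rd << 11) | (sa << 6) | funct;
    }

    int main(void) {
        /* add $8, $9, $10  ->  op = 0, funct = 0x20, rd = 8, rs = 9, rt = 10 */
        uint32_t word = encode_rtype(0, 9, 10, 8, 0, 0x20);
        printf("0x%08x\n", (unsigned) word);   /* prints 0x012a4020 */
        return 0;
    }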
Defining Performance
Which airplane has the best performance?
[Figure: four bar charts comparing the Douglas DC-8-50, BAC/Sud Concorde, Boeing 747, and Boeing 777 on passenger capacity, cruising range (miles), cruising speed (mph), and passengers × mph.]

§1.4 Performance
Response Time and Throughput
Response time
How long it takes to do a task
Throughput
Total work done per unit time
- e.g., tasks/transactions/ per hour
How are response time and throughput affected by
Replacing the processor with a faster version?
Adding more processors?
We'll focus on response time for now
Relative Performance
Define Performance = 1 / Execution Time
"X is n times faster than Y" means:

n = Performance_X / Performance_Y = Execution Time_Y / Execution Time_X

Example: time taken to run a program is 10 s on A, 15 s on B
Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
So A is 1.5 times faster than B
Measuring Execution Time
Elapsed time
Total response time, including all aspects
- Processing, I/O, OS overhead, idle time
Determines system performance
CPU time
Time spent processing a given job
- Discounts I/O time, other jobs' shares
Comprises user CPU time and system CPU time
Different programs are affected differently by CPU and system
performance
CPU Clocking
Operation of digital hardware is governed by a constant-rate clock
[Diagram: clock waveform; data transfer and computation happen within each cycle, and state is updated at the cycle boundary.]
Clock period: duration of a clock cycle
e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
Clock frequency (rate): cycles per second
e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz
CPU Time

CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate

Performance improved by
Reducing number of clock cycles
Increasing clock rate
Hardware designer must often trade off clock rate against cycle
count
CPU Time Example
Computer A: 2 GHz clock, 10 s CPU time
Designing Computer B
Aim for 6 s CPU time
Can use a faster clock, but it causes 1.2× as many clock cycles
How fast must the Computer B clock be?

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10⁹
Clock Rate_B = Clock Cycles_B / CPU Time_B
             = 1.2 × 20×10⁹ / 6 s = 24×10⁹ / 6 s = 4 GHz
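
A quick sketch in C that reproduces this calculation (variable names are mine):

    #include <stdio.h>

    int main(void) {
        double clock_rate_a = 2e9;    /* Computer A: 2 GHz */
        double cpu_time_a   = 10.0;   /* seconds */
        double cpu_time_b   = 6.0;    /* target for Computer B */
        double cycle_growth = 1.2;    /* B needs 1.2x as many cycles */

        double cycles_a = cpu_time_a * clock_rate_a;            /* 20e9 cycles */
        double rate_b   = cycle_growth * cycles_a / cpu_time_b;

        printf("Computer B clock rate: %.1f GHz\n", rate_b / 1e9);  /* 4.0 */
        return 0;
    }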
Instruction Count and CPI
Instruction Count for a program
Determined by program, ISA and compiler
Average cycles per instruction (CPI)
Determined by CPU hardware
If different instructions have different CPI
- Average CPI affected by instruction mix

Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
CPI Example
Computer A: Cycle Time = 250 ps, CPI = 2.0
Computer B: Cycle Time = 500 ps, CPI = 1.2
Same ISA
Which is faster, and by how much?

CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = I × 500 ps
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = I × 600 ps
CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2

A is faster by a factor of 1.2
CPI in More Detail
If different instruction classes take different numbers of
cycles:

Clock Cycles = Σ (CPI_i × Instruction Count_i), for i = 1..n

Weighted average CPI:

CPI = Clock Cycles / Instruction Count
    = Σ (CPI_i × Instruction Count_i / Instruction Count), for i = 1..n

where Instruction Count_i / Instruction Count is the relative frequency of class i
CPI Example
Alternative compiled code sequences using instructions in classes A,
B, C:

Class             A   B   C
CPI for class     1   2   3
IC in sequence 1  2   1   2
IC in sequence 2  4   1   1

Sequence 1: IC = 5
Clock Cycles = 2×1 + 1×2 + 2×3 = 10
Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
Clock Cycles = 4×1 + 1×2 + 1×3 = 9
Avg. CPI = 9/6 = 1.5
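
A small sketch in C of the weighted-CPI calculation, with arrays mirroring the table above:

    #include <stdio.h>

    /* Average CPI = sum(CPI_i * IC_i) / sum(IC_i) over instruction classes. */
    static double avg_cpi(const int cpi[], const int ic[], int classes) {
        int cycles = 0, count = 0;
        for (int i = 0; i < classes; i++) {
            cycles += cpi[i] * ic[i];
            count  += ic[i];
        }
        return (double) cycles / count;
    }

    int main(void) {
        int cpi[]  = {1, 2, 3};   /* classes A, B, C */
        int seq1[] = {2, 1, 2};   /* instruction counts, sequence 1 */
        int seq2[] = {4, 1, 1};   /* instruction counts, sequence 2 */

        printf("Sequence 1 avg CPI: %.1f\n", avg_cpi(cpi, seq1, 3));  /* 2.0 */
        printf("Sequence 2 avg CPI: %.1f\n", avg_cpi(cpi, seq2, 3));  /* 1.5 */
        return 0;
    }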
Performance Summary
Performance depends on
Algorithm: affects IC, possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction set architecture: affects IC, CPI, Tc

The BIG Picture:

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
Power Trends
§1.5 The Power Wall
In CMOS IC technology:

Power = Capacitive load × Voltage² × Frequency

Across recent generations: voltage fell from 5 V to 1 V, frequency grew ×1000, and power grew ×30.
Reducing Power
Suppose a new CPU has
85% of the capacitive load of the old CPU
15% voltage reduction and 15% frequency reduction

P_new / P_old = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85) / (C_old × V_old² × F_old)
             = 0.85⁴ = 0.52
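
The same ratio computed in a few lines of C (a sketch; names are mine):

    #include <stdio.h>

    /* Dynamic power in CMOS: P = C * V^2 * f (up to a constant factor). */
    static double dynamic_power(double c, double v, double f) {
        return c * v * v * f;
    }

    int main(void) {
        double c = 1.0, v = 1.0, f = 1.0;   /* old CPU, normalized units */
        double p_old = dynamic_power(c, v, f);
        double p_new = dynamic_power(0.85 * c, 0.85 * v, 0.85 * f);
        printf("P_new / P_old = %.2f\n", p_new / p_old);   /* 0.52 = 0.85^4 */
        return 0;
    }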
The power wall
We can't reduce voltage further
We can't remove more heat
How else can we improve performance?
Uniprocessor Performance
§1.6 The Sea Change: The Switch to Multiprocessors
Constrained by power, instruction-level parallelism,
memory latency
Multiprocessors
Multicore microprocessors
More than one processor per chip
Requires explicitly parallel programming
Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
SPEC CPU Benchmark
Programs used to measure performance
Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, Web, ...
SPEC CPU2006
Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)

Geometric mean = n-th root of ( Π Execution time ratio_i ), for i = 1..n
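
A sketch in C of the geometric-mean summary, computed via logarithms to avoid overflow on long products (the sample ratios are the first four SPECratios from the table below):

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean of n ratios: exp(mean of logs) == nth root of product. */
    static double geometric_mean(const double ratio[], int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(ratio[i]);
        return exp(log_sum / n);
    }

    int main(void) {
        double ratios[] = {15.3, 11.8, 11.1, 6.8};
        int n = (int)(sizeof ratios / sizeof ratios[0]);
        printf("Geometric mean: %.1f\n", geometric_mean(ratios, n));
        return 0;
    }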
CINT2006 for Opteron X4 2356

Name        Description                    IC×10⁹   CPI    Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing   2,118   0.75   0.40       637      9,777      15.3
bzip2       Block-sorting compression       2,389   0.85   0.40       817      9,650      11.8
gcc         GNU C Compiler                  1,050   1.72   0.47       724      8,050      11.1
mcf         Combinatorial optimization        336  10.00   0.40     1,345      9,120       6.8
go          Go game (AI)                    1,658   1.09   0.40       721     10,490      14.6
hmmer       Search gene sequence            2,783   0.80   0.40       890      9,330      10.5
sjeng       Chess game (AI)                 2,176   0.96   0.48       837     12,100      14.5
libquantum  Quantum computer simulation     1,623   1.61   0.40     1,047     20,720      19.8
h264avc     Video compression               3,102   0.80   0.40       993     22,130      22.3
omnetpp     Discrete event simulation         587   2.94   0.40       690      6,250       9.1
astar       Games/path finding              1,082   1.79   0.40       773      7,020       9.1
xalancbmk   XML parsing                     1,058   2.70   0.40     1,143      6,900       6.0
Geometric mean                                                                            11.7

High cache miss rates explain the large CPI values (e.g., mcf)
SPEC Power Benchmark
Power consumption of a server at different workload levels
Performance: ssj_ops/sec
Power: Watts (Joules/sec)

Overall ssj_ops per Watt = ( Σ ssj_ops_i ) / ( Σ power_i ), summed over the 11 load levels i = 0..10
SPECpower_ssj2008 for X4

Target Load %   Performance (ssj_ops/sec)   Average Power (Watts)
100%                  231,867                      295
90%                   211,282                      286
80%                   185,803                      275
70%                   163,427                      265
60%                   140,160                      256
50%                   118,324                      246
40%                    92,035                      233
30%                    70,500                      222
20%                    47,126                      206
10%                    23,066                      180
0%                          0                      141
Overall sum         1,283,590                    2,605
ssj_ops / power                                    493
Pitfall: Amdahl's Law
§1.8 Fallacies and Pitfalls
Improving an aspect of a computer and expecting a proportional
improvement in overall performance.

T_improved = T_affected / improvement factor + T_unaffected

Example: multiply accounts for 80 s out of a 100 s total.
How much improvement in multiply performance is needed to make
the program 5× faster overall?
Running 5× faster means finishing in 100 s / 5 = 20 s, so we need:

20 = 80/n + 20

Can't be done! Even an infinite speedup of multiply leaves the other 20 s.
Corollary: make the common case fast
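
A short C sketch of Amdahl's Law, showing the overall speedup saturating as the multiply improvement factor grows (names are mine):

    #include <stdio.h>

    /* Amdahl's Law: T_improved = T_affected / factor + T_unaffected */
    static double improved_time(double t_affected, double t_unaffected,
                                double factor) {
        return t_affected / factor + t_unaffected;
    }

    int main(void) {
        double t_mult = 80.0, t_rest = 20.0;   /* 80 s multiply, 20 s other */
        double factors[] = {2, 10, 100, 1e9};
        for (int i = 0; i < 4; i++) {
            double t = improved_time(t_mult, t_rest, factors[i]);
            printf("factor %10.0f: total %6.2f s, overall speedup %.2fx\n",
                   factors[i], t, 100.0 / t);
        }
        /* Even an infinite multiply speedup gives at most 100/20 = 5x,
           and that bound is never reached: 5x overall "can't be done". */
        return 0;
    }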
Fallacy: Low Power at Idle
Look back at X4 power benchmark
At 100% load: 295W
At 50% load: 246W (83%)
At 10% load: 180W (61%)
Google data center
Mostly operates at 10%-50% load
At 100% load less than 1% of the time
Consider designing processors to make power
proportional to load
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
Doesn't account for
- Differences in ISAs between computers
- Differences in complexity between instructions

MIPS = Instruction count / (Execution time × 10⁶)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
     = Clock rate / (CPI × 10⁶)

CPI varies between programs on a given CPU
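
A small C sketch with hypothetical numbers showing how a machine can post a higher MIPS rating yet run a program more slowly, because its ISA needs more instructions:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical machines running the same program, different ISAs. */
        double clock = 4e9;               /* both clocked at 4 GHz        */
        double ic_a = 1e9, cpi_a = 2.0;   /* A: fewer, complex instructions */
        double ic_b = 3e9, cpi_b = 1.0;   /* B: more, simple instructions   */

        double time_a = ic_a * cpi_a / clock;    /* 0.50 s */
        double time_b = ic_b * cpi_b / clock;    /* 0.75 s */
        double mips_a = ic_a / (time_a * 1e6);   /* 2000 MIPS */
        double mips_b = ic_b / (time_b * 1e6);   /* 4000 MIPS */

        printf("A: %.0f MIPS, %.2f s\n", mips_a, time_a);
        printf("B: %.0f MIPS, %.2f s\n", mips_b, time_b);
        /* B has twice the MIPS rating but is 1.5x slower on this program. */
        return 0;
    }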
Concluding Remarks
Cost/performance is improving
Due to underlying technology development
Hierarchical layers of abstraction
In both hardware and software
Instruction set architecture
The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
Use parallelism to improve performance
