
Computer Architecture

Based on slides by C. Kozyrakis


Introduction

What Computer Architecture is About


Many different views:
  How to build programmable digital systems
  Introduction to processor architecture
  Understanding why your programs sometimes run much more slowly than you expect, or don't run at all
Bottom line: digital systems are ubiquitous
  Processors are one of the most common idioms in digital design
  Can't avoid them these days: they are in your computers, TVs, cars, phones, door locks, ...

It pays to understand how they work


To understand what they can and can't do


Major Topics
Hardware-software interface
Machine language and assembly language programming
Processor design
Pipelined processor design
Memory hierarchy
Virtual memory & operating systems support
I/O devices

Instruction sets: RISC vs. CISC


Course Information
Instructor: Yosi Keller
E-mail: yosi.keller@gmail.com
Office: Room 427
Office Hours: by appointment
TA: Tal Darom


Other Course Info


Course text: Computer Organization & Design, 4th Edition, by D. Patterson & J. Hennessy
  The companion CD includes manuals, appendices, simulators, CAD tools, ...

Website: ask Tal

Online material: Google

Grading:
  Homework Sets  0%
  Final          100%


Lecture 1: Introduction to Programmable Digital Systems


Current State of the World


Electronic systems dominate almost everything, and most of these systems use processors and memory
Why? Break this question into three questions:
  Why electronics?
  Why use ICs to build electronics?
  Why use processors in ICs?

Why use electronics?
  Electrons are easy to move and control, easier than the current alternatives
  Result: we move information, not real physical stuff
  Think phone, email, fax, TV, WWW, etc.


Mechanical Alternative to Electronics


[Picture of a version of the Babbage Difference Engine built by the Science Museum, UK]

The calculating section of Difference Engine No. 2 has 4,000 moving parts (excluding the printing mechanism) and weighs 2.6 tons. It is seven feet high, eleven feet long, and eighteen inches deep.


Electronics
Building electronics:
  Started with tubes, then miniature tubes
  Transistors, then miniature transistors
  Components were getting cheaper and more reliable, but...
    There is a minimum cost per component (storage, handling, ...)
    Total system cost was proportional to complexity

Integrated circuits changed that:
  Devices that integrate multiple transistors
  Printed as a circuit, like you print a picture
    Creates components in parallel
    Cost no longer depends on # of devices

What happens as resolution goes up?


The Famous Moore's Law


Devices get smaller
  Get more devices on a chip
  Devices get faster
Initial graph from the 1965 paper
  Prediction: 2x per year
  Not too many data points
Has slowed down to 2x every 1.5 to 2 years
Is Moore's Law really a law?
What does it say about performance?


Sense of Scale
What fits on a chip today? A mainstream logic chip:
  10mm on a side (100mm², about 47,000 wire pitches across)
  90nm drawn gate length
  210nm wire pitch
  10 wire levels

For comparison:
  32b RISC integer processor: about 1K x 2K wire grids, so ~1,100 processors fit
  SRAM: about 4 x 4 grids per bit, so ~138M SRAM cells fit
  DRAM: about 1 x 2 grids per bit, so ~1.1B cells fit

[Figure: chip floorplan showing a 64b FP processor and a 32b RISC processor at scale]


Technology Scaling

[Figure: technology scaling illustrated with designs from 1998, 2004, and 2010]

Chip density doubles every 3 years
What can you do with this?
More devices, harder to design


The Complexity Problem


Complexity is the limiting factor in modern chip design
Two problems:

1. How do you make use of all that space?
   The uber-appliance: cellphone, PDA, iPod, mobile TV, video camera, ...
   Too many applications to cast them all into hardware logic
   Takes too long to finish the design

2. How do you make sure it works? The verification problem

The only way to survive complexity:
   Hide complexity in general-purpose components
   Reuse components


Programmable Components aka Processors


An old approach to solving the complexity problem:
  Build a generic device and customize it with memory (a program)
  The best way to do this is with a general-purpose processor
Processor complexity grows with technology, but the software model stays roughly the same
  C, C++, and Java run on Pentium 2, 3, and 4
  True for sequential programs

This is getting much tougher to do
  Recent hardware developments require software model changes: multi-core processors


Microprocessor Complexity
The programming model has hidden the scaling of technology, efficiently transforming transistors into performance:
  8080: 3,500 transistors, ran at 200kHz (1975)
  Pentium 4: 42M transistors, runs at 3+GHz (2003)
  Performance changed from 0.06 MIPS to >1,000 MIPS
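As a back-of-the-envelope check, using only the figures above, the transistor counts imply a doubling time of about two years:

$$\frac{42\times 10^{6}}{3{,}500} = 12{,}000 \approx 2^{13.6}
\qquad\Rightarrow\qquad
\frac{2003-1975}{13.6} \approx 2.1\ \text{years per doubling}$$

which is consistent with the 1.5 to 2 year rate quoted on the Moore's Law slide.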



Key to Complexity: Nice Interfaces


Use abstraction to hide complexity
Define an interface that allows people to use features without needing to understand all the implementation details
Works for hardware and software
Stable interfaces allow people to optimize below and above them

The layered stack:
  Applications
  C, C++
  Instruction Set Architecture
  Functional Units
  Logic Gates
  Transistors


But I Never Want to Build Hardware


Why should I care about how a computer works?
And why should I have to learn about assembly code? No one codes in assembly any more, right?
  Unfortunately, that is not correct
  E.g. compilers, operating system kernels
  E.g. embedded systems, video games

It is still useful to look inside the box
  Understand the limitations of the programmer's model
  Understand strange performance issues
    Efficiency and performance issues will become more important
  Helps you when things go wrong


Reality #1: Ints are not Integers, Floats are not Reals


Examples:

Is x² ≥ 0?
  Floats: Yes!
  32b Ints:
    40,000 * 40,000 --> 1,600,000,000
    50,000 * 50,000 --> ??

Is (x + y) + z = x + (y + z)?
  Unsigned & Signed Ints: Yes!
  Floats:
    (1e20 + -1e20) + 3.14 --> 3.14
    1e20 + (-1e20 + 3.14) --> ??
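Both effects are easy to reproduce. A minimal C sketch (the values in the comments assume 32-bit ints and IEEE-754 doubles, which is typical but not guaranteed by the C standard):

```c
#include <stdio.h>

int main(void) {
    /* 50,000^2 = 2.5e9 exceeds INT_MAX (~2.15e9): signed overflow.
       This is undefined behavior; it typically wraps to a negative value. */
    int a = 40000 * 40000;   /* 1,600,000,000: fits in 32 bits */
    int b = 50000 * 50000;   /* typically -1,794,967,296       */
    printf("%d %d\n", a, b);

    /* Floating-point addition is not associative. */
    double x = 1e20, y = -1e20, z = 3.14;
    printf("%g\n", (x + y) + z);   /* 3.14                        */
    printf("%g\n", x + (y + z));   /* 0: the 3.14 is lost in 1e20 */
    return 0;
}
```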


Reality #2: You've got to know assembly


Chances are, you'll never write a program in assembly
  Compilers are much better & more patient than you are
Understanding assembly is key to the machine-level execution model:
  Behavior of programs in the presence of bugs
    The high-level language model breaks down (see the sketch below)
  Tuning program performance
    Understanding sources of program inefficiency
  Implementing system software
    A compiler has machine code as its target
    Operating systems must manage process state
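As an illustration of the high-level model breaking down, consider this hypothetical off-by-one bug (a sketch: whether `safe` is actually corrupted depends on the stack layout the compiler chooses):

```c
#include <stdio.h>

int main(void) {
    int safe = 42;
    int buf[4];

    /* Bug: writes buf[0]..buf[4], one element past the end.
       C says nothing useful about the result; at the machine level,
       buf[4] may land on top of 'safe' in the stack frame. */
    for (int i = 0; i <= 4; i++)
        buf[i] = 0;

    printf("safe = %d\n", safe);   /* may print 0 instead of 42 */
    return 0;
}
```

Only the machine-level view (stack frames, memory layout) explains what this program actually does.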


Reality #3: Memory Matters


Memory is not unbounded
  It must be allocated and managed
  Many applications are memory dominated
Memory referencing bugs are especially pernicious
  Effects are distant in both time and space
  Security implications
Memory performance is not uniform
  Cache and virtual memory can greatly affect program performance
  Adapting a program to the characteristics of the memory system can lead to major speed improvements
    10x to 100x in several cases (see the sketch below)
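A classic illustration of non-uniform memory performance is loop traversal order. The sketch below assumes a matrix too large for the cache; the row-major loop commonly runs several times faster than the column-major one, even though both do identical arithmetic:

```c
#include <stdio.h>

#define N 2048
static double a[N][N];

/* Row-major traversal: consecutive accesses fall in the same cache
   line, exploiting spatial locality. */
double sum_by_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: each access touches a different cache line,
   so the same computation can run many times slower. */
double sum_by_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_by_rows(), sum_by_cols());
    return 0;
}
```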


Class Goal
Provide a better understanding of modern digital systems design
  These systems almost always have a programmable processor
  Processors are a good example of a complex system
    Pipelining and caches
Tie the hardware to the software
  Most people use processors and don't build them
  The interaction of HW and SW is fundamental to computer systems
  Write better software
Provide a foundation for other classes in systems
  Networking, OS, Compilers, Embedded Systems, etc.
  Understand the capabilities of compilers and operating systems


What is a Computer System?


Depends (a little) on the type of computer system
We probably mostly think about PC systems


What is a Computer System?


Actually, most computers look like this


5 components of any Computer

[Figure: a personal computer, annotated with the five components]

Computer:
  Processor
    Control (the brain)
    Datapath
  Memory (where programs and data live when running)
  Devices
    Input: keyboard, mouse
    Disk (where programs and data live when not running)
    Output: display, printer


What is in a Computer System?


Each system is different, but they generally have similar parts:
  Must have: processor, memory, an interface to the outside world (I/O)
  Generally have: cache memory, system bus, memory controller, I/O bus


Example Processor Based Systems


MIPS processor board
  MIPS = Microprocessor without Interlocked Pipeline Stages
Other examples: DSP boards, PC board, digital cell phone, game console


PC Motherboard
[Motherboard photo, labeled: input/output interfaces, power regulation/supply, PCI slots, AGP, memory controller, Pentium 4 socket, DIMM slots]


PC System
Pentium 4, 2.66 GHz
  8 KB data cache, 12 KB instruction cache
  512 KB L2 cache
  533 MHz system bus
  68 Watts
Memory system
  4 DDR DIMM slots, up to 4 GB
I/O interfaces
  Ethernet, USB, Serial ATA (disk), serial port, parallel port, FireWire


Digital Cell Phone (Nokia 8260) Front Side


Battery: 900 mAh
  3.5 hr talk time at ~1 W
  8 days standby at ~1 mW
ARM processor


Digital Cell Phone (Nokia 8260) Back Side


PS2 Motherboard

[Board photo, labeled:]
  64b MIPS CPU, 300 MHz: behavioral synthesis, geometry processing, main system control
  Graphics: rendering, texture, framebuffer ops
  R3000 CPU (120K transistors), R3010 FPU, 32 KB instruction cache, 32 KB data cache, 256 KB secondary cache, memory controller chips
  32b MIPS CPU, 34 MHz: IO processing, PS1 emulation



What do Computer Architects Do?


[Diagram: the computer architect mediates between applications and software requirements, technology, machine organization, and measurement & analysis, working through interfaces such as the API, ISA, links, I/O channels, IR, and registers]

The science/art of constructing efficient systems for computing tasks


Application: Constraints & Opportunities


Applications drive machine balance:
  Scientific computations: floating-point performance, main memory bandwidth
  Transaction/web processing: ??
  Multimedia processing: ??
  Embedded control: ??

Architecture concepts typically exploit application behavior



Applications Change over Time


Data-sets & memory requirements grow larger
  Cache & memory architecture become more critical
Standalone --> networked
  IO integration & system software become more critical
Single task --> multiple tasks
  Parallel architectures become critical
Limited IO requirements --> rich IO requirements
  60s: tapes & punch cards
  70s: character-oriented displays
  80s: video displays, audio, hard disks
  90s: 3D graphics, networking, high-quality audio
  00s: real-time video, immersion, ...


Application Properties to Exploit in Computer Design


Locality in memory/IO references
  Programs work on a subset of instructions/data at any point in time
  Both spatial and temporal locality

Parallelism (see the sketch after this list)
  Data-level (DLP): same operation on every element of a data sequence
  Instruction-level (ILP): independent instructions within a sequential program
  Thread-level (TLP): parallel tasks within one program
  Multi-programming: independent programs
  Pipelining

Predictability
  Control-flow direction, memory references, data values
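To make the first two forms concrete, a minimal C sketch (hypothetical functions, not from the text):

```c
#include <stdio.h>

/* Data-level parallelism (DLP): the same operation applies independently
   to every element, so iterations can be vectorized or run in parallel. */
void scale(float *v, float s, int n) {
    for (int i = 0; i < n; i++)
        v[i] *= s;                  /* no dependence across iterations */
}

/* Instruction-level parallelism (ILP): u and v do not depend on each
   other, so a pipelined/superscalar CPU can overlap their execution. */
int ilp_example(int x, int y) {
    int u = x + 1;                  /* independent of v */
    int v = y * 3;                  /* independent of u */
    return u + v;                   /* depends on both  */
}

int main(void) {
    float data[4] = { 1, 2, 3, 4 };
    scale(data, 2.0f, 4);
    printf("%g %d\n", data[0], ilp_example(1, 2));   /* 2 8 */
    return 0;
}
```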


Technology Trends & Constraints: Yearly Improvement


Integrated circuits: logic
  60% more devices per chip
  15% faster devices
  Long wires don't improve

Integrated circuits: DRAM
  60% more devices per chip
  7% reduction in latency
  14% increase in bandwidth

Magnetic disks
  60% to 100% increase in density

IO/networking
  Little improvement in latency
  Large improvements in bandwidth through fast/wide signaling

[Figure: die photos from 1992, 1995, 1998, and 2001: 64x more devices and 4x faster devices since 1992]


Changes in Technology & Applications lead to Changes in Architecture


1970s
  Multi-chip CPUs
  Semiconductor memory very expensive
  Complex instruction sets (good code density)
  Microcoded control

1980s
  5K - 500K transistors
  Single-chip, pipelined CPUs
  On-chip memory possible
  Simple, hard-wired control
  Simple instruction sets
  Small on-chip caches

1990s
  1M - 64M transistors, 64b CPUs
  Complex control to exploit instruction-level parallelism
  Deep pipelines
  Multi-level caches

2000s
  100M - 5B transistors
  Slow wires, power consumption, design complexity, memory latency, IO bottlenecks, ...
  Multiprocessors & parallel systems
  Support & programming for parallelism?

Keeps computer architecture interesting and challenging



Architects use a Quantitative Approach

An iterative process, using tools that help us analyze, estimate, and compare efficiency:
  New concepts are created, then sorted:
    Good ideas: worth implementing
    Mediocre ideas
    Bad ideas


Metrics of Efficiency
Desktop computing ($500 - $3K)
  Metrics: ??
  Prominent processors: Intel Pentium, AMD Athlon, PowerPC G5

Server computing ($3K - $1M)
  Metrics: ??
  Prominent processors: IBM Power5, Sun UltraSPARC, AMD Opteron

Embedded computing ($10 - $500)
  Metrics: ??
  Prominent processors: ARM, MIPS, Motorola 68K, many others

Diversity in requirements leads to diversity in architectures


Performance Metrics
Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAD/Sud Concorde   3 hours       1350 mph   132          178,200

Latency (or execution time, or response time)
  Wall-clock time to complete a task
  Important if all we have to run is a single task, or a time-critical one

Bandwidth (or throughput, or execution rate)
  Number of tasks completed per unit of time
    Bandwidth = total amount of work / total execution time
  The metric is independent of the exact number of tasks executed
  Important when we have many tasks to run


Examples
Latency metric: program execution time in seconds

$$\text{CPUtime} = \frac{\text{Seconds}}{\text{Program}}
= \frac{\text{Cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Cycle}}
= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
= IC \times CPI \times CCT$$

Your system architecture can affect all three:
  IC (instruction count): OS overhead, ...
  CPI (cycles per instruction): memory latency, IO latency, ...
  CCT (clock cycle time): cache organization, ...

Bandwidth metrics:
  Network bandwidth: 1 Gb/s Ethernet
  Database server throughput: 10^6 transactions/sec
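A quick worked example with hypothetical numbers, just to exercise the formula: a program that executes $10^9$ instructions with CPI = 1.5 on a 2 GHz clock (CCT = 0.5 ns) takes

$$\text{CPUtime} = 10^{9} \times 1.5 \times 0.5\,\text{ns} = 0.75\ \text{s}$$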


Cycles Per Instruction


Average cycles per instruction:

$$CPI(\text{machine, program}) = \frac{\text{total number of clock cycles}}{\text{number of instructions executed}}$$

$$\text{CPUtime} = \text{CycleTime} \times \sum_{i=1}^{n} CPI_i \times I_i$$

Instruction frequency:

$$CPI = \sum_{i=1}^{n} CPI_i \times F_i \qquad\text{where}\qquad F_i = \frac{I_i}{\text{instruction count}}$$
Invest Resources where time is Spent!


Example: Calculating CPI


Base machine (Reg / Reg), typical mix in code:

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        0.5      (33%)
Load     20%    2        0.4      (27%)
Store    10%    2        0.2      (13%)
Branch   20%    2        0.4      (27%)
                Total    1.5
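A small C sketch that reproduces the table (frequencies and cycle counts taken from the rows above):

```c
#include <stdio.h>

int main(void) {
    /* Instruction mix from the example: per-class frequency and cycles */
    const char *op[] = { "ALU", "Load", "Store", "Branch" };
    double freq[]    = { 0.50, 0.20, 0.10, 0.20 };
    double cycles[]  = { 1, 2, 2, 2 };

    /* CPI = sum over classes of freq_i * cycles_i */
    double cpi = 0.0;
    for (int i = 0; i < 4; i++)
        cpi += freq[i] * cycles[i];
    printf("CPI = %.2f\n", cpi);          /* 1.50 */

    /* Fraction of execution time spent in each class */
    for (int i = 0; i < 4; i++)
        printf("%-6s %3.0f%% of time\n", op[i],
               100.0 * freq[i] * cycles[i] / cpi);
    return 0;
}
```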


A is Faster than B?
Given the CPUtime for machines A and B, "A is X times faster than B" means:

$$X = \frac{\text{CPUtime}_B}{\text{CPUtime}_A}$$

Example: if CPUtime_A = 3.4 sec and CPUtime_B = 5.3 sec, then A is 5.3/3.4 = 1.55 times faster than B, or 55% faster

If you start with bandwidth metrics of performance, use the inverse ratio:

$$X = \frac{\text{Bandwidth}_A}{\text{Bandwidth}_B}$$

Speedup and Amdahl's Law


Speedup = CPUtime_old / CPUtime_new
Given an optimization x that accelerates fraction f_x of a program by a factor of S_x, how much is the overall speedup?

$$\text{Speedup} = \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{new}}
= \frac{\text{CPUtime}_{old}}{\text{CPUtime}_{old}\left[(1-f_x) + \frac{f_x}{S_x}\right]}
= \frac{1}{(1-f_x) + \frac{f_x}{S_x}}$$

Lessons from Amdahl's law (see the sketch below):
  Make common cases fast: as f_x --> 1, speedup --> S_x
  But don't over-optimize the common case: as S_x --> infinity, speedup --> 1 / (1 - f_x)
    Speedup is limited by the fraction of the code that can be accelerated
    The uncommon case will eventually become the common one
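A minimal C sketch of the law, with assumed values chosen to show both lessons:

```c
#include <stdio.h>

/* Amdahl's law: overall speedup when fraction f of execution time
   is accelerated by a factor s. */
double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    printf("%.2f\n", amdahl(0.80, 10.0));   /* 3.57                  */
    printf("%.2f\n", amdahl(0.80, 1e9));    /* ~5.00 = 1/(1 - 0.80): */
    return 0;                               /* the S_x -> inf limit  */
}
```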


Amdahl's Law Example


If S_x = 100, what is the overall speedup as a function of f_x?

[Chart: "Speedup vs Optimized Fraction": speedup (0 to 100) against fraction of code optimized (0 to 1); the curve stays low until f_x approaches 1, then rises steeply toward 100]


Performance Sensitivity
Definitions:
  C_i : contributor i to performance
  P : performance
  Δx : absolute change in x
  Δx / x : relative change in x

Sensitivity of P to a change in C_i:

  to an absolute change in C_i:
  $$\frac{\partial P(C_1,\ldots,C_i,\ldots,C_N)}{\partial C_i}$$

  to a relative change in C_i:
  $$\frac{\partial P(C_1,\ldots,C_i,\ldots,C_N)}{\partial C_i}\, C_i$$

Relative Sensitivity

Relative sensitivity of P to relative changes in C_i versus its relative sensitivity to relative changes in C_j:

$$relSens = \frac{\dfrac{\partial P(C_1,\ldots,C_i,\ldots,C_N)}{\partial C_i}\, C_i}{\dfrac{\partial P(C_1,\ldots,C_j,\ldots,C_N)}{\partial C_j}\, C_j}$$


Effect of Changes on Relative Sensitivity: Example


Let P = C_1 + C_2 (larger P is better)
How does increasing C_1 affect the relative benefit of increasing C_1 rather than C_2 by the same percentage?

Solution: relSens = C_1 / C_2, so the relative benefit of increasing C_1 grows. Underlying theory: Amdahl's law.
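Filling in the one step the slide skips: with $P = C_1 + C_2$, both partial derivatives equal 1, so the ratio from the previous slide reduces to

$$relSens = \frac{\dfrac{\partial P}{\partial C_1}\, C_1}{\dfrac{\partial P}{\partial C_2}\, C_2} = \frac{1 \cdot C_1}{1 \cdot C_2} = \frac{C_1}{C_2}$$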


Aspects of CPU Performance


$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}}
= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                Inst Count   CPI   Clock Rate
Program             X         X
Compiler            X         X
Inst. Set           X         X
Organization                  X        X
Technology                             X


Evaluating Performance
What do we mean by performance?
How do we select benchmark programs?
How do we summarize performance across a suite of programs?
  When to use the different types of means
  Statistics for architects


Choosing Benchmark Programs


Criteria
  Representative of real workloads in some way
  Hard to cheat (i.e. to get deceptively good performance that will never be seen in real life)

Best solution: run substantial, real-world programs
  Representative because they are real
  Improvements on these programs = improvements in the real world
  But they require more effort than toy benchmarks

Examples:
  SPEC CPU integer/floating-point suites
  TPC transaction processing benchmarks


How do you summarize performance?


Combining different benchmark results into 1 number: sometimes misleading, always controversial... and inevitable
  Arithmetic mean: for times

Statistics for architects: treat benchmark suites as samples of a population
  Distributions
  Confidence intervals


(Weighted) Arithmetic Mean


$$\text{WAM} = \frac{1}{n}\sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$$

              Machine A   Machine B   Speedup (B over A)
Prog. 1 (sec)    1           10            0.1
Prog. 2 (sec)    1000        100           10
Mean (50/50)     500.5       55            9.1
Mean (75/25)     250.75      32.5          7.7
If you know your exact workload (benchmarks & relative frequencies), this is the right way to summarize performance.
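A small C sketch of the computation (times from the table; here the weights are written as fractions that sum to 1, which absorbs the 1/n factor):

```c
#include <stdio.h>

/* Weighted arithmetic mean of execution times; weights sum to 1. */
double wam(const double *w, const double *t, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += w[i] * t[i];
    return sum;
}

int main(void) {
    double timesA[] = { 1.0, 1000.0 };
    double timesB[] = { 10.0, 100.0 };
    double w[]      = { 0.75, 0.25 };        /* the 75/25 workload */

    double a = wam(w, timesA, 2), b = wam(w, timesB, 2);
    printf("A: %.2f  B: %.2f  speedup (B over A): %.1f\n",
           a, b, a / b);                     /* 250.75  32.50  7.7 */
    return 0;
}
```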

Statistics for Architects


Means are nice, but they don't tell you the whole truth
  More info when you run 1,000 programs on one machine
  More info when you run one program on 1,000 machine configurations

Next few slides: basic statistics tools for computer architects
  How to observe large collections of experiment results
  How to represent large collections of experiment results


Populations and Samples


Population: the set of observations measured for ALL members of a group
  Forms a distribution
  Uncertainty: individual measurement errors

Sample: a subset of the population
  Compute statistics on it
  Extra uncertainty: small samples or selection bias

[Diagram: a population (size N, described by parameters) is sampled into a sample (size n, described by statistics); sample size and representativeness determine how well we can estimate the population mean and std. dev. and form a confidence interval for the mean]



Basic Assumptions
Measurements are repeatable
  Same program + input gives the same performance
  Valid for most programs/machines, but worth verifying
  Watch out for non-deterministic programs

Choice of input doesn't change the relative performance of different machines
  Usually true... counterexample?

The number of benchmarks in the suite (sample size) is large enough to yield good conclusions
  Confidence intervals help verify this

Benchmarks are representative and not a biased sample
  Can only be addressed qualitatively

Data Distributions with the Same Arithmetic Mean

[Figure: six distributions plotted on a log scale from .001 to 1000, with the geometric mean (GM) marked]

  Multi-modal (here, left-skewed): awful, but hope
  Right-skewed: uncertain
  Uniform: OK, not much central tendency
  Symmetric triangular: good, more central tendency
  Normal (+): symmetric; terrific! The statistics toolkit applies
  Lognormal (*): log-symmetric; terrific! The statistics toolkit applies

General Distribution Descriptions


Mean: a measure of central tendency, the 1st moment
Variance: a measure of dispersion, the 2nd moment
Standard deviation: a measure of dispersion, on the same scale as the mean

$$\text{AVERAGE}: \quad AM = \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \quad \text{(arithmetic mean)}$$

$$\text{VARP}: \quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

$$\text{SDEVP}: \quad \sigma = \sqrt{\sigma^2}$$
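A direct C translation of the three population formulas, over a small made-up data set:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double x[] = { 2, 4, 4, 4, 5, 5, 7, 9 };   /* assumed sample data */
    int n = sizeof x / sizeof x[0];

    /* AVERAGE: mu = (1/N) * sum(x_i) */
    double mu = 0.0;
    for (int i = 0; i < n; i++) mu += x[i];
    mu /= n;

    /* VARP: sigma^2 = (1/N) * sum((x_i - mu)^2) */
    double var = 0.0;
    for (int i = 0; i < n; i++) var += (x[i] - mu) * (x[i] - mu);
    var /= n;

    /* SDEVP: sigma = sqrt(sigma^2) */
    printf("mean %.2f  variance %.2f  stddev %.2f\n", mu, var, sqrt(var));
    return 0;            /* prints: mean 5.00  variance 4.00  stddev 2.00 */
}
```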


The Familiar Normal (Gaussian) Distribution


Arises from a large number of small additive effects
Completely specified by the mean μ and standard deviation σ
Familiar, useful properties (never automatically assume normal, but hope):
  68% of values within μ ± σ
  95% within μ ± 2σ
  99.7% within μ ± 3σ
Symmetric around the mean = an intuitive measure of central tendency

[Figure: bell curve over 0.0 to 4.0 with μ ± σ, μ ± 2σ, and μ ± 3σ marked, covering 68%, 95%, and 99.7% of the area]
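For reference, the density behind the figure, in terms of the μ and σ defined above (a standard formula, not on the original slide):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$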
