
Performance

What is performance?
Which computer performs better?
It depends on what you want to perform.

What does faster mean?


response time vs. throughput

You go to the grocery store
What is important?
response time: how fast one customer is
processed
throughput: how many customers are processed
per hour.

For now focus on response time.
Response time is also called execution time.
Maximum performance means minimum execution time:

$$\text{Performance}_x = \frac{1}{\text{Execution time}_x}$$

Comparing Performance
x is n times faster than y:

$$\frac{\text{Performance}_x}{\text{Performance}_y} = n$$

Time
Elapsed Time: wall clock time.
CPU time: the time the CPU spends on the
program.
user CPU time: the time spent in my program.
system CPU time: time spent in the O.S. (in
service of my program).
CPU times do not include time waiting for I/O.

The Unix time command
The time command breaks a program's run time down into user, system, and elapsed times.
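The same three quantities can be observed from inside a program. The sketch below is an illustration, not part of the original slides: it uses Python's `os.times()` and `time.perf_counter()`, and the helper names `measure`, `busy_loop`, and `sleepy` are made up for this example.

```python
import os
import time


def measure(func):
    """Run func once and return elapsed, user CPU, and system CPU seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    func()
    cpu_end = os.times()
    wall_end = time.perf_counter()
    return {
        "elapsed": wall_end - wall_start,
        "user": cpu_end.user - cpu_start.user,
        "system": cpu_end.system - cpu_start.system,
    }


def busy_loop():
    # CPU-bound work: shows up mostly as user CPU time.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def sleepy():
    # Waiting: counted in elapsed time but not in CPU time.
    time.sleep(0.2)


busy_times = measure(busy_loop)
sleep_times = measure(sleepy)
print("busy :", busy_times)
print("sleep:", sleep_times)
```

For the sleeping function the elapsed time should be far larger than user plus system time, which mirrors the point that CPU times do not include time spent waiting.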

User vs. Designer
Users worry about how long a program will
take.

Designers worry about how long individual operations (addition, multiplication, etc.) take.

Programmers must worry about both!


Programs and cycles
$$\text{CPU execution time} = \text{CPU cycles in program} \times \text{cycle time}$$

CPU cycles in program: how many cycles will the program take?
Cycle time: how long does each cycle take?

Quiz
Computer A has a 400MHz clock.
Our program takes 10 seconds on A.

We want to make the program run in 6 seconds. How fast should the clock be?

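Find the total number of clock cycles

$$\#\,\text{cycles} = \text{CPU time} \times \text{clock rate}$$

$$\#\,\text{cycles} = 10\ \text{s} \times 400 \times 10^{6}\ \text{cycles/s}$$

$$\#\,\text{cycles} = 4000 \times 10^{6}$$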

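Use the same equation and solve for the new clock rate

$$\#\,\text{cycles} = \text{CPU time} \times \text{clock rate}$$

$$4000 \times 10^{6} = 6\ \text{s} \times \text{clock rate}_{new}$$

$$\text{clock rate}_{new} = \frac{4000 \times 10^{6}}{6\ \text{s}} \approx 666\ \text{MHz}$$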
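The two steps of the quiz can be checked with a few lines of Python. This is a minimal sketch; the function names `total_cycles` and `clock_rate_for` are invented for the illustration.

```python
def total_cycles(cpu_time_s, clock_rate_hz):
    """Number of cycles = CPU time * clock rate."""
    return cpu_time_s * clock_rate_hz


def clock_rate_for(cycles, target_time_s):
    """Clock rate needed to run the given number of cycles in target_time_s."""
    return cycles / target_time_s


cycles = total_cycles(10, 400e6)       # program takes 10 s at 400 MHz
new_rate = clock_rate_for(cycles, 6)   # we want it to take 6 s
print(f"{cycles:.0f} cycles, new clock rate {new_rate / 1e6:.1f} MHz")
```

This only holds if the faster machine needs the same number of cycles for the program.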

A realistic twist on the problem
Suppose in order to move to a faster clock we
need to make a change in CPU design.

What does the clock rate need to be to make the program run in 6 seconds?
Should this be more or less than 666 MHz?

Instruction Set
Each processor has a set of instructions that
it can execute.
programs are just sequences of these
instructions.
Some instructions take longer than others
Instruction time is often measured in cycles.

CPI: Cycles per Instruction
CPI is the average number of cycles per
instruction.
average over the instructions executed in a
specific program.
You can't just take the average over all the
instructions in the instruction set!

IPS: Items per Scan
Average # of items the cashier can process per scan.
2 identical cans of a particular item might take only 1
scan (for an experienced scanner).
a bunch of bananas might take only 1 scan.
Given a cart full of items, the IPS is the total
number of items divided by the total number of
scans.
IPS will depend on what is in the cart!
CPI depends on what is in the program!
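To make the dependence concrete, here is a small sketch that computes CPI for two programs on the same machine. The instruction classes, their cycle costs, and the two instruction mixes are all hypothetical values chosen for the illustration.

```python
# Hypothetical cycle cost of each instruction class on one machine.
CYCLES_PER_CLASS = {"add": 1, "load": 2, "mul": 4}


def average_cpi(executed_counts):
    """CPI = total cycles / total instructions executed by the program."""
    total_instructions = sum(executed_counts.values())
    total_cycles = sum(
        CYCLES_PER_CLASS[name] * count for name, count in executed_counts.items()
    )
    return total_cycles / total_instructions


# Hypothetical counts of instructions actually executed by two programs.
program_1 = {"add": 80, "load": 15, "mul": 5}
program_2 = {"add": 20, "load": 30, "mul": 50}

# Averaging over the instruction set ignores how often each class is used.
naive_average = sum(CYCLES_PER_CLASS.values()) / len(CYCLES_PER_CLASS)

print("CPI of program 1:", average_cpi(program_1))
print("CPI of program 2:", average_cpi(program_2))
print("Average over the instruction set:", round(naive_average, 2))
```

The two programs get different CPI values on the same hardware, and neither matches the plain average over the instruction set.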

New Relationship

$$\text{CPU execution time} = \frac{\#\,\text{instructions in program} \times \text{CPI}}{\text{clock rate}}$$

Another Problem
2 Computers: A and B.
Same instruction set (same program will run on
both computers).
A has cycle time of 1ns and CPI of 2.0 for
program P.
B has cycle time of 2ns and CPI of 1.2 for
program P.
Which machine is faster (and by how much)?

Computer A
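$$\text{CPU execution time} = \frac{\#\,\text{instructions in program} \times \text{CPI}}{\text{clock rate}} = \#\,\text{instructions in program} \times \text{CPI} \times \text{clock cycle time}$$

$$\text{Time}_A = p \times 2.0 \times 1\ \text{ns} = 2.0p\ \text{ns}$$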

NOTE: # instructions in program P is p

Computer B
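$$\text{CPU execution time} = \frac{\#\,\text{instructions in program} \times \text{CPI}}{\text{clock rate}} = \#\,\text{instructions in program} \times \text{CPI} \times \text{clock cycle time}$$

$$\text{Time}_B = p \times 1.2 \times 2\ \text{ns} = 2.4p\ \text{ns}$$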

A vs. B on program P
A: 2.0p ns    B: 2.4p ns

$$\frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Time}_B}{\text{Time}_A} = \frac{2.4p\ \text{ns}}{2.0p\ \text{ns}} = 1.2$$

A is 1.2 times faster than B on program P.
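The comparison can be reproduced numerically. In this sketch the instruction count `p` is set to an arbitrary value (an assumption for the illustration); it cancels out of the ratio, so any positive value gives the same answer.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


p = 1_000_000  # arbitrary instruction count for program P (assumption)

time_a = cpu_time_ns(p, cpi=2.0, cycle_time_ns=1.0)
time_b = cpu_time_ns(p, cpi=1.2, cycle_time_ns=2.0)

# Performance is the reciprocal of time, so Perf_A / Perf_B = Time_B / Time_A.
speedup_a_over_b = time_b / time_a
print(f"Time A = {time_a:.0f} ns, Time B = {time_b:.0f} ns")
print(f"A is {speedup_a_over_b:.1f} times faster than B")
```

Note that B has the lower CPI and A the shorter cycle time; only the product of the factors decides which machine wins.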

Important Relationship!

$$\text{Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Knowing only one of these doesn't tell you all you need to know!

My computer has a 100GHz clock!

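$$\text{Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

The first factor, instructions per program, is called the Instruction Count or IC.

It depends on the instruction set architecture and on the compiler.

The same C program could result in dramatically different numbers of instructions on different machines/compilers!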

A Relevant Question
Given a program, how can we tell what the
instruction count is?

Instruction Count
IC is a count of the instructions executed at
runtime!
You have to watch the program flow when
running (called profiling) or simulate the
entire program.
You can't tell just by looking at the program
(even at the machine code).
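A toy illustration of why the count has to be taken at runtime: the loop below stands in for a program of five "instructions", and the counter records how many of them are executed for a given input. The five-instruction breakdown is an assumption made for this sketch, not a model of any real machine.

```python
def dynamic_instruction_count(n):
    """Count the toy 'instructions' executed when summing 0..n-1."""
    executed = 0

    total = 0
    executed += 1      # instruction 1: initialise the accumulator
    i = 0
    executed += 1      # instruction 2: initialise the loop counter
    while True:
        executed += 1  # instruction 3: compare i with n
        if i >= n:
            break
        total += i
        executed += 1  # instruction 4: add to the accumulator
        i += 1
        executed += 1  # instruction 5: increment the counter
    return executed


STATIC_INSTRUCTION_COUNT = 5  # instructions written in the toy program

for n in (0, 10, 1000):
    print(f"n={n}: static={STATIC_INSTRUCTION_COUNT}, "
          f"executed={dynamic_instruction_count(n)}")
```

The program text never changes, yet the executed count grows with the input, which is why profiling or simulation is needed.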

Improving Performance

Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a computer, B, that will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should we tell the designer to target?
Solution:

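(1) Let's first find the number of clock cycles required for the program on A:

$$\text{CPU time}_A = \frac{\text{CPU clock cycles}_A}{\text{Clock rate}_A}$$

$$10\ \text{seconds} = \frac{\text{CPU clock cycles}_A}{4 \times 10^{9}\ \frac{\text{cycles}}{\text{second}}}$$

$$\text{CPU clock cycles}_A = 10\ \text{seconds} \times 4 \times 10^{9}\ \frac{\text{cycles}}{\text{second}} = 40 \times 10^{9}\ \text{cycles}$$

(2) CPU time for B can be found using this equation:

$$\text{CPU time}_B = \frac{1.2 \times \text{CPU clock cycles}_A}{\text{Clock rate}_B}$$

$$6\ \text{seconds} = \frac{1.2 \times 40 \times 10^{9}\ \text{cycles}}{\text{Clock rate}_B}$$

$$\text{Clock rate}_B = \frac{1.2 \times 40 \times 10^{9}\ \text{cycles}}{6\ \text{seconds}} = \frac{8 \times 10^{9}\ \text{cycles}}{\text{second}} = 8\ \text{GHz}$$

Computer B must therefore have twice the clock rate of A to run the program in 6 seconds.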
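A short sketch shows how the required clock rate moves with the cycle penalty of the new design. The function name and the extra penalty values 1.0 and 1.5 are assumptions added for comparison; only the factor 1.2 comes from the problem.

```python
def required_clock_rate_hz(time_a_s, clock_a_hz, cycle_factor, target_time_s):
    """Clock rate B needs when it uses cycle_factor times the cycles of A."""
    cycles_a = time_a_s * clock_a_hz
    cycles_b = cycle_factor * cycles_a
    return cycles_b / target_time_s


rate_b = required_clock_rate_hz(10, 4e9, cycle_factor=1.2, target_time_s=6)
print(f"Required clock rate for B: {rate_b / 1e9:.2f} GHz")

# How the target changes with the cycle penalty (1.0 and 1.5 are hypothetical).
for factor in (1.0, 1.2, 1.5):
    rate = required_clock_rate_hz(10, 4e9, factor, 6)
    print(f"cycle factor {factor}: {rate / 1e9:.2f} GHz")
```

The more cycles the redesigned CPU needs, the higher the clock rate must be to reach the same 6 seconds.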
CPI Example
Suppose we have two implementations of the same instruction set architecture (ISA).

For some program,

Machine A has a clock cycle time of 250 ps and a CPI of 2.0
Machine B has a clock cycle time of 500 ps and a CPI of 1.2

Which machine is faster for this program, and by how much?

If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
Solution:

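Let I be the number of instructions executed for the program. It is the same on both machines because they implement the same ISA.

$$\text{CPU clock cycles}_A = I \times 2.0$$

$$\text{CPU clock cycles}_B = I \times 1.2$$

Now we can compute the CPU time for each computer:

$$\text{CPU time}_A = \text{CPU clock cycles}_A \times \text{Clock cycle time}_A = I \times 2.0 \times 250\ \text{ps} = 500\,I\ \text{ps}$$

$$\text{CPU time}_B = \text{CPU clock cycles}_B \times \text{Clock cycle time}_B = I \times 1.2 \times 500\ \text{ps} = 600\,I\ \text{ps}$$

Clearly, computer A is faster. The amount faster is given by the ratio of the execution times:

$$\frac{\text{CPU performance}_A}{\text{CPU performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{600\,I\ \text{ps}}{500\,I\ \text{ps}} = 1.2$$

We can conclude that computer A is 1.2 times as fast as computer B for this program.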
Figure 4.2 The basic components of performance and how each is measured.

| Components of performance | Units of measure |
|---|---|
| CPU execution time for a program | Seconds for the program |
| Instruction count | Instructions executed for the program |
| Clock cycles per instruction (CPI) | Average number of clock cycles per instruction |
| Clock cycle time | Seconds per clock cycle |
# of Instructions Example
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.

Which sequence will be faster? By how much?

What is the CPI for each sequence?
Solution:
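We can use the equation for CPU clock cycles based on instruction count and CPI to find the total number of clock cycles for each sequence:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i)$$

This yields

$$\text{CPU clock cycles}_1 = (2 \times 1) + (1 \times 2) + (2 \times 3) = 2 + 2 + 6 = 10\ \text{cycles}$$

$$\text{CPU clock cycles}_2 = (4 \times 1) + (1 \times 2) + (1 \times 3) = 4 + 2 + 3 = 9\ \text{cycles}$$

Sequence 2 takes fewer cycles, so it is faster even though it executes one more instruction.

The CPI values can be computed by

$$\text{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count}}$$

$$\text{CPI}_1 = \frac{\text{CPU clock cycles}_1}{\text{Instruction count}_1} = \frac{10}{5} = 2.0$$

$$\text{CPI}_2 = \frac{\text{CPU clock cycles}_2}{\text{Instruction count}_2} = \frac{9}{6} = 1.5$$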
MIPS example
Two different compilers are being tested for a 4 GHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.

The first compiler's code uses 5 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions.

The second compiler's code uses 10 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions.

Which sequence will be faster according to MIPS?

Which sequence will be faster according to execution time?
4.3 Evaluating Performance
Keywords
Workload  A set of programs run on a computer that is either the actual collection of applications run by a user or is constructed from real programs to approximate such a mix. A typical workload specifies both the programs and their relative frequencies.

Arithmetic mean  The average of the execution times that is directly proportional to total execution time.

$$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

Weighted arithmetic mean  An average of the execution time of a workload with weighting factors designed to reflect the presence of the programs in a workload; computed as the sum of the products of weighting factors and execution times.
Figure 4.3 System description of a desktop system using the fastest Pentium 4 available in 2003.

| Hardware | |
|---|---|
| Hardware vendor | Dell |
| Model number | Precision Workstation 360 (3.2 GHz Pentium 4 Extreme Edition) |
| CPU | Intel Pentium 4 (800 MHz system bus) |
| CPU MHz | 3200 |
| FPU | Integrated |
| CPU(s) enabled | 1 |
| CPU(s) orderable | 1 |
| Parallel | No |
| Primary cache | 12K(I) micro-ops + 8KB(D) on chip |
| Secondary cache | 512KB (I+D) on chip |
| L3 cache | 2048KB (I+D) on chip |
| Other cache | N/A |
| Memory | 4 x 512MB ECC DDR400 SDRAM CL3 |
| Disk subsystem | 1 x 80GB ATA/100 7200 RPM |
| Other hardware | |

| Software | |
|---|---|
| Operating system | Windows XP Professional SP1 |
| Compiler | Intel C++ Compiler 7.1 (20030402Z); Microsoft Visual Studio.NET (7.0.9466); MicroQuill SmartHeap Library 6.01 |
| File system type | NTFS |
| System state | Default |
Figure 4.4 Execution times of two programs on two different computers.

| | Computer A | Computer B |
|---|---|---|
| Program 1 (seconds) | 1 | 10 |
| Program 2 (seconds) | 1000 | 100 |
| Total time (seconds) | 1001 | 110 |
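The figure's numbers can be used to compare the plain and weighted arithmetic means. In the sketch below the execution times come from Figure 4.4, while the weights (program 1 making up 99.9% of the workload) are a hypothetical choice for the illustration.

```python
def arithmetic_mean(times):
    """AM = (1/n) * sum of the execution times."""
    return sum(times) / len(times)


def weighted_arithmetic_mean(times, weights):
    """Sum of weight * time; the weights are expected to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * t for w, t in zip(weights, times))


times_a = [1, 1000]   # programs 1 and 2 on computer A (seconds)
times_b = [10, 100]   # programs 1 and 2 on computer B (seconds)
weights = [0.999, 0.001]  # hypothetical workload mix

print("AM  A:", arithmetic_mean(times_a), " B:", arithmetic_mean(times_b))
print("WAM A:", round(weighted_arithmetic_mean(times_a, weights), 3),
      " B:", round(weighted_arithmetic_mean(times_b, weights), 3))
```

With equal weighting B looks better, as the total times suggest; with a mix dominated by program 1, A comes out ahead. The summary depends on the workload that is assumed.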
4.4 Real Stuff: Two SPEC Benchmarks and the Performance of Recent Intel Processors
Keywords
System performance evaluation cooperative (SPEC) benchmark  A set of standard CPU-intensive, integer and floating point benchmarks based on real programs.
Benchmarks
Performance is best determined by running a real application
    Use programs typical of expected workload
    Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
Small benchmarks
    nice for architects and designers
    easy to standardize
    can be abused
SPEC (System Performance Evaluation Cooperative)
    companies have agreed on a set of real programs and inputs
    valuable indicator of performance (and compiler technology)
    can still be abused
Benchmark Games
An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error ... was a sad commentary on a common industry practice of cheating on standardized performance tests ... The error was pointed out to Intel two days ago by a competitor, Motorola ... came in a test known as SPECint92 ... Intel acknowledged that it had optimized its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing ... At the heart of Intel's problem is the practice of tuning compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code ...

Saturday, January 6, 1996, New York Times


SPEC 89
[Figure: Compiler enhancements and performance. SPEC performance ratio (y-axis, 0 to 800) for each benchmark (x-axis: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv), comparing "Compiler" with "Enhanced compiler".]
SPEC CPU2000
Does doubling the clock rate double the performance?
Can a machine with a slower clock rate have better performance?
[Figure: SPEC 2000. Results (y-axis, 0 to 1400) plotted against clock rate in MHz (x-axis, 500 to 3500) for four series: Pentium 4 CFP2000, Pentium 4 CINT2000, Pentium III CINT2000, and Pentium III CFP2000.]
Figure 4.7 SPECweb99 performance for a variety of Dell PowerEdge systems using the Xeon versions of the Pentium III and Pentium 4 microprocessors.

| System | Processor | Number of disk drives | Number of CPUs | Number of networks | Clock rate (GHz) | Result |
|---|---|---|---|---|---|---|
| 1550/1000 | Pentium III | 2 | 2 | 2 | 1 | 2765 |
| 1650 | Pentium III | 3 | 2 | 1 | 1.4 | 1810 |
| 2500 | Pentium III | 8 | 2 | 4 | 1.13 | 3435 |
| 2550 | Pentium III | 1 | 2 | 1 | 1.26 | 1454 |
| 2650 | Pentium 4 Xeon | 5 | 2 | 4 | 3.06 | 5698 |
| 4600 | Pentium 4 Xeon | 10 | 2 | 4 | 2.2 | 4615 |
| 6400/700 | Pentium III Xeon | 5 | 4 | 4 | 0.7 | 4200 |
| 6600 | Pentium 4 Xeon XP | 8 | 4 | 8 | 2 | 6700 |
| 8450/700 | Pentium III Xeon | 7 | 8 | 8 | 0.7 | 8001 |
[Figure: SPECINT2000 and SPECFP2000 results (y-axis, 0.0 to 1.6) for the Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz in three power modes: always on/maximum clock, laptop mode/adaptive clock, and minimum power/minimum clock. x-axis: benchmark and power mode.]
Experiment
Phone a major computer retailer and tell them you are having trouble deciding between two different computers; specifically, you are confused about the processors' strengths and weaknesses

(e.g., Pentium 4 at 2 GHz vs. Celeron M at 1.4 GHz)

What kind of response are you likely to get?

What kind of response could you give a friend with the same question?
4.5 Fallacies and Pitfalls
Keywords
Amdahl's law  A rule stating that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used.

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

Million instructions per second (MIPS)  A measurement of program execution speed based on the number of millions of instructions. MIPS is computed as the instruction count divided by the product of the execution time and $10^{6}$.

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}$$
Amdahl's Law
Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

Example:

"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"

How about making it 5 times faster?

Principle: Make the common case fast
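The multiplication example can be worked through with a small sketch of Amdahl's law. The function names are invented for this illustration; `required_improvement` returns `None` when no finite improvement can reach the target.

```python
def time_after_improvement(unaffected_s, affected_s, improvement):
    """Amdahl's law: unaffected time + affected time / amount of improvement."""
    return unaffected_s + affected_s / improvement


def required_improvement(total_s, affected_s, target_speedup):
    """Improvement factor needed on the affected part, or None if unreachable."""
    unaffected_s = total_s - affected_s
    target_time_s = total_s / target_speedup
    remaining_s = target_time_s - unaffected_s
    if remaining_s <= 0:
        return None  # the unaffected part alone already uses up the target time
    return affected_s / remaining_s


four_times = required_improvement(100, 80, 4)
five_times = required_improvement(100, 80, 5)
print("Improvement needed for 4x:", four_times)
print("Improvement needed for 5x:", five_times)
print("Check:", time_after_improvement(20, 80, 16), "seconds")
```

Running 4 times faster needs multiplication to be 16 times faster. Running 5 times faster would need the 80 seconds of multiplication to shrink to zero, because the other 20 seconds already equal the 20-second target.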


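Solution (MIPS example):
Here we use three equations:

$$\text{(1)}\quad \text{Execution time} = \frac{\text{CPU clock cycles}}{\text{Clock rate}}$$

$$\text{(2)}\quad \text{CPU clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i)$$

$$\text{(3)}\quad \text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}$$

Then:

$$\text{CPU clock cycles}_1 = (5 \times 1 + 1 \times 2 + 1 \times 3) \times 10^{9} = 10 \times 10^{9}$$

$$\text{CPU clock cycles}_2 = (10 \times 1 + 1 \times 2 + 1 \times 3) \times 10^{9} = 15 \times 10^{9}$$

$$\text{Execution time}_1 = \frac{10 \times 10^{9}}{4 \times 10^{9}} = 2.5\ \text{seconds}$$

$$\text{Execution time}_2 = \frac{15 \times 10^{9}}{4 \times 10^{9}} = 3.75\ \text{seconds}$$

Now let's compute the MIPS rate for each version of the program:

$$\text{MIPS}_1 = \frac{(5 + 1 + 1) \times 10^{9}}{2.5 \times 10^{6}} = 2800$$

$$\text{MIPS}_2 = \frac{(10 + 1 + 1) \times 10^{9}}{3.75 \times 10^{6}} = 3200$$

So, the code from compiler 2 has a higher MIPS rating, but the code from compiler 1 runs faster!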
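The disagreement between MIPS and execution time can be reproduced with a short script. It is a sketch that uses the instruction counts from the solution above (in units of 10^9); the function and variable names are made up for the illustration.

```python
CLOCK_RATE_HZ = 4e9                       # 4 GHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def execution_time_and_mips(counts):
    """Return (execution time in seconds, MIPS rating) for an instruction mix."""
    instruction_count = sum(counts.values())
    clock_cycles = sum(CYCLES_PER_CLASS[c] * n for c, n in counts.items())
    execution_time_s = clock_cycles / CLOCK_RATE_HZ
    mips = instruction_count / (execution_time_s * 1e6)
    return execution_time_s, mips


compiler_1 = {"A": 5 * 10**9, "B": 1 * 10**9, "C": 1 * 10**9}
compiler_2 = {"A": 10 * 10**9, "B": 1 * 10**9, "C": 1 * 10**9}

time_1, mips_1 = execution_time_and_mips(compiler_1)
time_2, mips_2 = execution_time_and_mips(compiler_2)
print(f"Compiler 1: {time_1} s, {mips_1:.0f} MIPS")
print(f"Compiler 2: {time_2} s, {mips_2:.0f} MIPS")
```

Compiler 2 adds many cheap Class A instructions, which raises the MIPS rating while also lengthening the run.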
Example
Suppose we enhance a machine, making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?

We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?
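Both questions can be answered with the Amdahl's law formula from this section. The steps below are a worked sketch; the symbol $T_{fp}$ is introduced here to stand for the floating-point time in the second benchmark.

For the first question, 5 seconds are affected and 5 seconds are not:

$$\text{Execution time after improvement} = 5\ \text{s} + \frac{5\ \text{s}}{5} = 6\ \text{s}, \qquad \text{Speedup} = \frac{10\ \text{s}}{6\ \text{s}} \approx 1.67$$

For the second question, a speedup of 3 on 100 seconds means a target time of $100/3$ seconds:

$$\frac{100}{3} = (100 - T_{fp}) + \frac{T_{fp}}{5} \;\Rightarrow\; \frac{4}{5}\,T_{fp} = 100 - \frac{100}{3} = \frac{200}{3} \;\Rightarrow\; T_{fp} = \frac{250}{3} \approx 83.3\ \text{s}$$

So floating-point instructions would have to account for about 83.3 of the 100 seconds.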
Remember
Performance is specific to a particular program or set of programs
    Total execution time is a consistent summary of performance

For a given architecture, performance increases come from:
    increases in clock rate (without adverse CPI effects)
    improvements in processor organization that lower CPI
    compiler enhancements that lower CPI and/or instruction count
    algorithm/language choices that affect instruction count

Pitfall: expecting an improvement in one aspect of a machine's performance to improve total performance by a proportional amount
4.6 Concluding Remarks
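The execution time is related to other important measurements we can make by the following equation:

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$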
