
Measuring CPU Performance
How do we measure and predict the performance of a new computer design?

There are two routes: build a prototype, implement a compiler, and take some real timings; or write an instruction-accurate, cycle-accurate simulator like WinMIPS64, implement a compiler, count instructions and cycles, and estimate timings. The simulator will run much more slowly than real hardware, but the second route is quicker and cheaper. In either case implementing a compiler is a priority. Measuring CPI is more difficult: the number of cycles an instruction takes can depend on its context, for example on whether or not the instruction before it used the same register. So it is vital that the simulator be cycle accurate. WinMIPS64 tells us exactly how many clock cycles a program needs; it is cycle accurate, displaying the status of the processor after every clock tick.
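Once the instruction count (IC) and the CPI are known, the timing estimate follows from CPU time = IC*CPI/Clock rate, the formula used in the cache examples below. A minimal sketch in C, with made-up figures:

    #include <stdio.h>

    int main(void)
    {
        /* illustrative figures only, not measurements */
        double IC         = 1e9;    /* instructions executed           */
        double CPI        = 2.0;    /* average clock cycles per instr. */
        double clock_rate = 1e9;    /* clock rate in Hz (1 GHz)        */

        /* CPU time = IC * CPI / clock rate */
        double cpu_time = IC * CPI / clock_rate;
        printf("CPU time: %.2f s\n", cpu_time);   /* prints 2.00 s */
        return 0;
    }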

Coherence
A program spends 90% of its time running 10% of its code. If the current instruction has a particular address, the next instruction to be fetched and executed will almost certainly come from the next address, since branches are rare. If a program reads data from a particular address, the next data to be read will most likely be from a nearby address.

Spatial coherence: data with addresses close together tend to be referenced close together in time. Temporal coherence: recently accessed data is likely to be accessed again soon.
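Both kinds of coherence appear in even the simplest code. A minimal C sketch (the function is illustrative only):

    /* Summing an array scans addresses in sequence (spatial coherence),
       while total and i are re-used on every iteration (temporal coherence). */
    int sum(const int a[], int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += a[i];    /* a[i+1] sits right beside a[i] in memory */
        return total;
    }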

Memory Hierarchy
On-chip memory access is much faster than off-chip access. Clearly all of a computer's memory plus the CPU cannot be put on a single chip (except in small embedded processors). A major design requirement is to increase memory bandwidth and decrease memory access time. Basic DRAM is quite slow; much faster memory can be built, but it is more expensive and bulky. The solution: on-chip cache memory, fast memory built on the same piece of silicon as the CPU. The idea is to use coherence to keep in cache the memory (code and data) most likely to be accessed next. The resulting arrangement of storage levels is called the memory hierarchy.

[Diagram: the memory hierarchy]

    CPU registers   (32 registers,  ~2 ns)
        |  internal bus
    Cache memory    (256 kilobytes, ~5 ns)
        |  memory bus
    Main memory     (256 megabytes, ~100 ns)
        |  I/O bus
    Hard disk       (16 gigabytes,  ~3 ms)

When the CPU finds requested data in the cache it is called a cache hit; if it doesn't, it's a cache miss. If cache misses are rare, programs run as if all memory were as fast as cache memory. When a miss does occur the CPU stalls while a fixed-size block of memory containing the missing instruction or data is read into the cache. Latency: the time to complete a single memory access. Bandwidth: the rate at which data can be accessed from memory. Bandwidth is more than just the reciprocal of latency (we can, for example, access many memory locations simultaneously). The bandwidth determines the time it takes to retrieve a block from main memory.
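As a rough illustration of how latency and bandwidth combine (the figures below are assumptions, not taken from these notes), the time to fill one cache block is roughly the latency plus the block size divided by the bandwidth:

    #include <stdio.h>

    int main(void)
    {
        /* assumed figures, for illustration only */
        double latency_ns  = 100.0;   /* time until the first word arrives */
        double bandwidth   = 8.0;     /* bytes delivered per ns after that */
        double block_bytes = 64.0;    /* size of one cache block           */

        /* block fill time = latency + block size / bandwidth */
        double fill_ns = latency_ns + block_bytes / bandwidth;
        printf("time to fill one block: %.1f ns\n", fill_ns);   /* 108.0 ns */
        return 0;
    }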

Movement between main memory and the hard disk uses an analogous virtual memory system. Again fixed-size blocks (pages) are read into memory when required. If the required information is not currently in memory a page fault occurs, and a new page is read in from the disk.

Cache Performance
Suppose a cache is 10 times faster than main memory, and that the cache is used 90% of the time. What's the speed-up? S = 1/(0.1 + 0.9/10) = 5.3. When a cache miss occurs there is typically a fixed penalty to be paid. A typical instruction requires one memory access (to be fetched itself) and, if it is a load or store instruction, a second memory access (to load or store the data). CPU time = (CPU clock cycles + Memory stall cycles)/Clock rate. The number of memory stall cycles depends on the number of misses and the cost per miss. Let MR, the miss rate, be the average number of misses per memory access, i.e. the fraction of cache accesses that result in a miss. Assume CPI = 2 for cache hits, that 40% of instructions are loads and stores, a miss rate of 2%, and a miss penalty of 25 cycles. How much faster will a program run if there are no cache misses? If there are no misses: CPU time = IC*CPI/Clock rate = IC*2/Clock rate.

With 2% misses: CPU time = IC*(2 + (1+0.4)*0.02*25)/Clock rate = IC*2.7/Clock rate, since each instruction makes 1 + 0.4 = 1.4 memory accesses on average. So it's 35% slower! Note that this puts an onus on the compiler (and the programmer) to produce code that minimises cache misses. Performance-aware programmers try to write code in such a way as to minimise cache misses; this is quite complex and requires an insight into the machine architecture, as the sketch below shows.
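A classic illustration (standard textbook material, not from these notes): traversing a 2-D array in row order follows the row-major layout C uses and so exploits spatial coherence, while traversing it in column order jumps N*4 bytes between consecutive accesses and misses far more often.

    #define N 1024
    static int m[N][N];

    /* Row order: consecutive accesses fall within the same cache block. */
    long sum_by_rows(void)
    {
        long total = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                total += m[i][j];
        return total;
    }

    /* Column order: each access lands N*4 bytes away from the last,
       so almost every access can miss the cache. */
    long sum_by_cols(void)
    {
        long total = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                total += m[i][j];
        return total;
    }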

Compilers and Compiler Optimisation


    #include <stdio.h>

    int main()
    {
        int i, total, average;
        int a[20];
        for (i = 0; i < 20; i++)
            a[i] = i;
        total = 0;
        for (i = 0; i < 20; i++) {
            total += a[i];
        }
        average = total / 20;
        return 0;
    }

Compiling this program with the Microsoft C compiler, without optimisation produces the following Assembler program. Note in particular how space for the local variables is created on the stack.
    ; with no optimization
    PUBLIC  _main
    ; positions of variables on the stack
    _i$ = -92
    _total$ = -88
    _average$ = -84
    _a$ = -80                              ; a[20]
    _main   PROC NEAR
    ; Line 5
        push    ebp
        mov     ebp, esp
        sub     esp, 92                       ; create space for 23 4-byte integers
        mov     DWORD PTR _i$[ebp], 0         ; i=0
        jmp     SHORT $L340
    $L341:
        mov     eax, DWORD PTR _i$[ebp]       ; eax=i
        add     eax, 1                        ; eax=eax+1
        mov     DWORD PTR _i$[ebp], eax       ; i=eax
    $L340:
        cmp     DWORD PTR _i$[ebp], 20        ; if i==20 finished
        jge     SHORT $L342
        mov     ecx, DWORD PTR _i$[ebp]       ; ecx = i
        mov     edx, DWORD PTR _i$[ebp]       ; edx = i
        mov     DWORD PTR _a$[ebp+ecx*4], edx ; a[i]=i
        jmp     SHORT $L341
    $L342:
        mov     DWORD PTR _total$[ebp], 0     ; total=0
        mov     DWORD PTR _i$[ebp], 0         ; i=0
        jmp     SHORT $L343
    $L344:
        mov     eax, DWORD PTR _i$[ebp]
        add     eax, 1                        ; i=i+1
        mov     DWORD PTR _i$[ebp], eax
    $L343:
        cmp     DWORD PTR _i$[ebp], 20        ; if i==20 finish
        jge     SHORT $L345
        mov     ecx, DWORD PTR _i$[ebp]       ; ecx=i
        mov     edx, DWORD PTR _total$[ebp]   ; edx=total
        add     edx, DWORD PTR _a$[ebp+ecx*4] ; edx = edx + a[i]
        mov     DWORD PTR _total$[ebp], edx   ; total = edx
        jmp     SHORT $L344
    $L345:
        mov     eax, DWORD PTR _total$[ebp]   ; eax=total
        cdq
        mov     ecx, 20
        idiv    ecx
        mov     DWORD PTR _average$[ebp], eax ; average = total/20
        xor     eax, eax
        mov     esp, ebp                      ; restore stack pointer
        pop     ebp
        ret     0
    _main   ENDP
    _TEXT   ENDS
    END

Now compile the same program again, this time with /O2 optimisation.
    TITLE average.c
    ; with /O2 optimization
    PUBLIC  _main
    _a$ = -80                              ; Only a[20] on the stack!
    _main   PROC NEAR
        sub     esp, 80                       ; create space for 20 integers
        xor     eax, eax                      ; set i=0; NOTE i is kept in register NOT memory
        lea     ecx, DWORD PTR _a$[esp+80]    ; ecx = address of a[.]
    $L397:
        mov     DWORD PTR [ecx], eax          ; a[i] = i
        inc     eax                           ; i=i+1
        add     ecx, 4                        ; ecx moves to next address
        cmp     eax, 20                       ; are we finished?
        jl      SHORT $L397
        push    esi                           ; save esi
        xor     eax, eax                      ; re-use eax for total
        lea     ecx, DWORD PTR _a$[esp+84]    ; ecx = address of a[.]
        mov     edx, 20                       ; use edx as loop counter
    $L400:
        mov     esi, DWORD PTR [ecx]          ; get a[i]
        add     ecx, 4                        ; move pointer on to a[i+1]
        add     eax, esi                      ; total+=a[i]
        dec     edx                           ; count down loop
        jne     SHORT $L400
        xor     eax, eax
        pop     esi
        add     esp, 80                       ; restore stack
        ret     0
    _main   ENDP
    _TEXT   ENDS

Note that the code generated is for the 32-bit Pentium rather than the 16-bit 8086, so the familiar 16-bit ax, dx and cx registers are now referred to as the 32-bit eax, edx and ecx registers. The stack for the un-optimized version is used like this:

    Before                 After
                           EBP ->  saved EBP
                                   a[20]      (EBP-80 ... EBP-4)
                                   average    (EBP-84)
                                   total      (EBP-88)
    ESP ->                 ESP ->  i          (EBP-92)

    (memory addresses increase towards the top of the diagram)

After studying this code carefully you should have a clear insight into:
- how space for local variables is created on the stack
- how the generated assembly language relates to the high-level C
- how memory is accessed, and which memory addressing modes are useful
- the role and effectiveness of an optimizer

Now let's try that again. This time we generate code for use by the Microsoft .NET environment.
    ; Listing generated by Microsoft (R) Optimizing Compiler
    ; Generated by VC++ for Common Language Runtime
    ; Function Header:
    ;   max stack depth = 2
    ;   function size = 24 bytes

    ; .proc.beg
    ; Line 8
        ldc.i.0   0             ; i32 0x0
        stloc.0                 ; i$
        ldloca.s  2             ; a$
        stloc.1                 ; $T7642
    $LL6@main:
        ldloc.1                 ; $T7642
        ldloc.0                 ; i$
        stind.i4
        ldloc.0                 ; i$
        ldc.i.1   1             ; i32 0x1
        add
        stloc.0                 ; i$
        ldloc.1                 ; $T7642
        ldc.i.4                 ; i64 0x4
        conv.i8
        add
        stloc.1                 ; $T7642
        ldloc.0                 ; i$
        ldc.i4.s  20            ; i32 0x14
        blt.s     $LL6@main
    ; Line 18
        ldc.i.0   0             ; i32 0x0
        ret

In fact this time the compiler has converted the code into an intermediate language, to run on Microsoft's own virtual machine. This is the assembly language of the virtual machine, not of the real hardware. The virtual machine is not a real computer at all, but a computer program which simulates a computer that understands these instructions.

The ARM processor: some new ideas


The ARM processor is a more modern 32-bit processor which illustrates some new ideas in instruction set design compared with the 8086/Pentium.

1. It has many more registers, r0-r15. r13 is used as the stack pointer, r14 as the link register (to store the return address for a subroutine), and r15 is the PC, but all the rest are available to the programmer.

2. All instructions encode as 32 bits; in the Pentium, instruction length in bytes varies from instruction to instruction.

3. Three-argument instructions, for example

       add   r0,r1,r2   ; r0=r1+r2

4. Conditional execution. The ARM supports the usual CPU flags Carry, Zero, Overflow and Negative. However the programmer decides whether or not an instruction is to set the flags. For example

       add   r0,r1,r2   ; does not set flags
       adds  r0,r1,r2   ; does set flags

Furthermore the programmer can also specify that the instruction is to be executed only if a certain flag is set.

       addeq r0,r1,r2   ; execute if Z=1

These can be combined:

       addeqs r0,r1,r2  ; add if Z=1. Set flags

Example:
This example uses two implementations of Euclid's Greatest Common Divisor (GCD) algorithm. It demonstrates how you can use conditional execution to improve code density and execution speed.

In C the algorithm can be expressed as:


    int gcd(int a, int b)
    {
        while (a != b) {
            if (a > b)
                a = a - b;
            else
                b = b - a;
        }
        return a;
    }

You can implement the gcd function with conditional execution of branches only, in the following way:

    gcd:    cmp   r0,r1      ; a=r0, b=r1
            beq   end
            blt   less
            sub   r0,r0,r1   ; a = a - b
            b     gcd
    less:   sub   r1,r1,r0   ; b = b - a
            b     gcd
    end:

By using the conditional execution feature of the ARM instruction set, you can implement the gcd function in only four instructions:

    gcd:    cmp   r0,r1
            subgt r0,r0,r1
            sublt r1,r1,r0
            bne   gcd
