
Lecture 4

Instruction Level Parallelism
Vectorization, SSE
Optimizing for the memory hierarchy

Announcements
Partners?

Scott B. Baden / CSE 160 / Winter 2011

Today's lecture
Why multicore?
Instruction Level Parallelism
Vectorization, SSE
Optimizing for the memory hierarchy


Why multicore processors?


We can't keep increasing the clock speed
If we simplify the processor, we can reduce power consumption and pack more processors in a given amount of space
Some capabilities, like multi-issue, can be costly in terms of chip real estate and power consumption

http://forums.vr-zone.com/

realworldtech.com

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)



What makes a processor run faster?


Registers and cache
Pipelining
Instruction level parallelism
Vectorization, SSE


Pipelining
Assembly line processing - an auto plant
Dave Patterson's laundry example: 4 people doing laundry
wash (30 min) + dry (40 min) + fold (20 min) = 90 min

[Timeline: four loads A-D processed from 6 PM to 9 PM; stage times 30, 40, 40, 40, 40, 20 min]

Sequential execution takes 4 * 90 min = 6 hours
Pipelined execution takes 30 + 4*40 + 20 min = 3.5 hours
Bandwidth = loads/hour
Pipelining helps bandwidth but not latency (90 min)
Bandwidth is limited by the slowest pipeline stage
Potential speedup = number of pipe stages
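Spelling out the arithmetic (all numbers are the stage times from the example above):

    T_{\text{seq}} = 4 \times 90 = 360 \text{ min}, \qquad
    T_{\text{pipe}} = 30 + 4 \times 40 + 20 = 210 \text{ min} = 3.5 \text{ h}, \qquad
    \text{speedup} = \tfrac{360}{210} \approx 1.7

The first load fills the washer (30 min), the 40-minute dryer then paces one finished load every 40 min, and the last load still needs its 20-minute fold; the speedup stays below the 3-stage ideal because the stages are unequal.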


Instruction level parallelism


Execute more than one instruction at a time
    x = y / z;   a = b + c;
Dynamic techniques
    Allow stalled instructions to proceed
    Reorder instructions

    Before:  x = y / z;  a = x + c;  t = c - q;
    After:   x = y / z;  t = c - q;  a = x + c;
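A minimal C sketch of the reordered sequence (variable values are made up for illustration; on a modern core the hardware performs this reordering dynamically):

    #include <stdio.h>

    int main(void) {
        double y = 6.0, z = 3.0, c = 2.0, q = 0.5;
        double x, a, t;

        x = y / z;   /* long-latency divide issues first                    */
        t = c - q;   /* independent of x: can execute while the divide runs */
        a = x + c;   /* depends on x: must wait for the divide to finish    */

        printf("x=%g t=%g a=%g\n", x, t, a);
        return 0;
    }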


Multi-execution pipeline
MIPS R4000
[Pipeline diagram: shared IF and ID stages feed an integer unit (EX), an FP/int multiplier (M1-M7), an FP adder (A1-A4), and an FP/int divider (Div, latency = 25); all paths rejoin at MEM and WB.]

    x = y / z;  a = b + c;  s = q * r;  i++;
    x = y / z;  t = c - q;  a = x + c;

David Culler

Scoreboarding, Tomasulo's algorithm


Hardware data structures keep track of
    When instructions complete
    Which instructions depend on the results
    When it's safe to write a register

[Diagram: Instruction Fetch feeds the Scoreboard, which issues and resolves operand fetch and execution for each functional unit (FU).]

Deals with data hazards
    WAR (write after read)
    RAW (read after write)
    WAW (write after write)

a = b*c;  x = y - b;  q = a/y;  y = x + b
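Labeling the hazards in that sequence (a sketch; each statement is treated as one instruction):

    a = b * c;   // writes a
    x = y - b;   // writes x, reads y
    q = a / y;   // RAW on a (produced by line 1); also reads y
    y = x + b;   // RAW on x (produced by line 2); WAR on y, since
                 // lines 2 and 3 must read y before it is overwritten
                 // (no WAW here: y is written only once)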



What makes a processor run faster?


Registers and cache
Pipelining
Instruction level parallelism
Vectorization, SSE


SIMD (Single Instruction Multiple Data)


Operate on regular arrays of data
Two landmark SIMD designs
    ILLIAC IV (1960s)
    Connection Machine 1 and 2 (1980s)
Vector computer: Cray-1 (1976)
Intel and others support SIMD for multimedia and graphics

[Diagram: elementwise multiplication of two short vectors]

SSE (Streaming SIMD Extensions), Altivec
Operations defined on vectors
    forall i = 0:N-1
        p[i] = a[i] * b[i]

GPUs, Cell Broadband Engine
Reduced performance on data dependent or irregular computations
    forall i = 0 : n-1
        if ( x[i] < 0 ) then
            y[i] = -x[i]
        else
            y[i] = x[i]
        end if
    end forall
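Short-vector hardware typically handles such a branch by computing both sides and blending them with a comparison mask. A minimal SSE sketch of the forall above (reading the branch as y[i] = -x[i] when x[i] < 0, i.e. absolute value; assumes n is a multiple of 4 and the arrays do not overlap):

    #include <xmmintrin.h>   /* SSE intrinsics */

    void abs_sse(const float *x, float *y, int n) {
        __m128 zero = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 vx   = _mm_loadu_ps(&x[i]);
            __m128 vneg = _mm_sub_ps(zero, vx);        /* -x[i]                 */
            __m128 mask = _mm_cmplt_ps(vx, zero);      /* lanes where x[i] < 0  */
            __m128 vy   = _mm_or_ps(_mm_and_ps(mask, vneg),
                                    _mm_andnot_ps(mask, vx));
            _mm_storeu_ps(&y[i], vy);                  /* no branch in the loop */
        }
    }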

Streaming SIMD Extensions (SSE)


SSE (SSE4 on Intel Nehalem), Altivec
Short vectors: 128 bits (256 bits coming)

[Diagram: four lanes of a and b multiplied in parallel]
    r[0:3] = a[0:3]*b[0:3]

Courtesy of Mercury Computer Systems, Inc.

4 floats or 2 doubles
Jim Demmel
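A minimal intrinsics version of r[0:3] = a[0:3]*b[0:3] (a sketch using the standard SSE intrinsics; unaligned loads are used so no particular alignment is assumed):

    #include <xmmintrin.h>   /* SSE: a 128-bit register holds 4 floats */

    void mul4(const float *a, const float *b, float *r) {
        __m128 va = _mm_loadu_ps(a);      /* load a[0:3]          */
        __m128 vb = _mm_loadu_ps(b);      /* load b[0:3]          */
        __m128 vr = _mm_mul_ps(va, vb);   /* elementwise multiply */
        _mm_storeu_ps(r, vr);             /* store r[0:3]         */
    }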

How do we use the SSE instructions?


Low level: assembly language or libraries
Higher level: a vectorizing compiler
Lilliput.ucsd.edu (Intel Gainestown, Nehalem)

gcc -O3

    float a[N], b[N], c[N];
    for (int i=0; i<N; i++)
        a[i] = b[i] / c[i];

t2a.cpp:11: note: created 2 versioning for alias checks.
t2a.cpp:11: note: LOOP VECTORIZED.
t2a.cpp:9: note: vectorized 1 loops in function.

(gcc v 4.4.3; won't work on ieng6-203 until we get new compilers)
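One way to ask the compiler for the report shown above (a sketch; -ftree-vectorizer-verbose is the GCC 4.x flag, newer GCC versions use -fopt-info-vec instead):

    gcc -O3 -ftree-vectorizer-verbose=2 t2a.cpp -o t2a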

Performance
    Single precision: with vectorization 0.52 sec, without vectorization 1.9 sec
    Double precision: with vectorization 1.5 sec, without vectorization 3.0 sec


How does the vectorizer work?


Transformed code
    for (i = 0; i < 1024; i+=4)
        a[i:i+3] = b[i:i+3] + c[i:i+3];

Vector instructions
    for (i = 0; i < 1024; i+=4) {
        vB = vec_ld( &b[i] );
        vC = vec_ld( &c[i] );
        vA = vec_add( vB, vC );
        vec_st( vA, &a[i] );
    }


What prevents vectorization


Data dependencies
    for (int i = 1; i < N; i++)
        b[i] = b[i-1] + 2;

    b[1] = b[0] + 2;
    b[2] = b[1] + 2;
    b[3] = b[2] + 2;

note: not vectorized: unsupported use in stmt.
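For contrast, a loop with no loop-carried dependence does vectorize; a small sketch (the compiler may still add a runtime alias check, as in the earlier example, since b and c could overlap):

    /* Each iteration reads only b[i-1] and writes only c[i]; no result of
       one iteration feeds a later one, so the iterations are independent. */
    void shift_add(const float *b, float *c, int N) {
        for (int i = 1; i < N; i++)
            c[i] = b[i-1] + 2;
    }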


Blocking for Cache

Matrix Multiplication
An important core operation in many numerical algorithms
Given two conforming matrices A and B, form the matrix product A × B
    A is m × n, B is n × p
Operation count: O(n^3) multiply-adds for an n × n square matrix
Discussion follows Demmel:
www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect02.html


Unblocked Matrix Multiplication

    for i := 0 to n-1
        for j := 0 to n-1
            for k := 0 to n-1
                C[i,j] += A[i,k] * B[k,j]

[Diagram: C[i,j] += A[i,:] · B[:,j]]
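A plain C version of the triple loop (a sketch: matrices are stored row-major as n × n arrays of doubles, and C is assumed to be zeroed by the caller):

    void matmul_unblocked(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];   /* C[i,j] += A[i,k]*B[k,j] */
    }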


Analysis of performance
    for i = 0 to n-1              // for each iteration i, load all of B into cache
        for j = 0 to n-1          // for each iteration (i,j), load A[i,:] into cache
                                  // for each iteration (i,j), load and store C[i,j]
            for k = 0 to n-1
                C[i,j] += A[i,k] * B[k,j]



Analysis of performance
    for i = 0 to n-1              // B[:,:] : n × n^2/L loads = n^3/L   (L = cache line size)
        for j = 0 to n-1          // A[i,:] : n^2/L loads
                                  // C[i,j] : n^2/L loads + n^2/L stores = 2n^2/L
            for k = 0 to n-1
                C[i,j] += A[i,k] * B[k,j]

    Total: (n^3 + 3n^2) / L
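Plugging in concrete numbers (assumed here, not from the slide), say n = 512 and a line of L = 8 words:

    \frac{n^3 + 3n^2}{L} = \frac{512^3 + 3\cdot 512^2}{8} \approx 1.69\times 10^{7} \text{ cache lines}

for 2n^3 = 2 · 512^3 ≈ 2.7 × 10^8 flops — roughly 2 flops per word of memory traffic, which is the ratio q derived on the next slide.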


Flops to memory ratio
Let q = # flops / main memory reference


    q = 2n^3 / (n^3 + 3n^2)  →  2  as n → ∞


Blocked Matrix Multiply
Divide A, B, C into N × N sub-blocks


Assume we have a good quality library to perform matrix multiplication on subblocks

Each sub-block is b × b
    b = n/N is called the block size
How do we establish b?

[Diagram: block C[i,j] += A[i,k] · B[k,j]]


Blocked Matrix Multiplication


    for i = 0 to N-1
        for j = 0 to N-1
            // load each block C[i,j] into cache, once : n^2
            // b = n/N = block size
            for k = 0 to N-1
                // load each block A[i,k] and B[k,j], N^3 times each:
                //     2 N^3 (n/N)^2 = 2Nn^2
                C[i,j] += A[i,k] * B[k,j]    // do the matrix multiply
            // write each block C[i,j] once : n^2

    Total: (2N + 2) n^2
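A C sketch of the blocked loop nest (the block size b is assumed to divide n evenly; the innermost three loops stand in for the library multiply on b × b sub-blocks, and matrices are row-major):

    void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += b)           /* block row of C    */
            for (int jj = 0; jj < n; jj += b)       /* block column of C */
                for (int kk = 0; kk < n; kk += b)   /* block of A and B  */
                    /* C[ii:ii+b, jj:jj+b] += A[ii:ii+b, kk:kk+b] * B[kk:kk+b, jj:jj+b] */
                    for (int i = ii; i < ii + b; i++)
                        for (int j = jj; j < jj + b; j++)
                            for (int k = kk; k < kk + b; k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }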


Flops to memory ratio
Let q = # flops / main memory reference


    q = 2n^3 / ((2N + 2) n^2) = n / (N + 1)  ≈  n/N = b  as n → ∞
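Since q ≈ b, the block size should be as large as possible. The usual constraint (an assumption here, not stated on the slide) is that the three b × b sub-blocks being worked on — one each of A, B and C — fit in cache together:

    3b^2 \le M \quad\Rightarrow\quad b \approx \sqrt{M/3}, \qquad M = \text{cache capacity in words}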


The results

    n, b        Unblocked time    Blocked time
    256, 64     0.6               0.002
    512, 128    15                0.24

Amortize memory accesses by increasing memory reuse


More on blocked algorithms


Data in the sub-blocks are contiguous within rows only
We may incur conflict cache misses
Idea: since re-use is so high, let's copy the sub-blocks into contiguous memory before passing them to our matrix multiply routine (see the sketch below)
The Cache Performance and Optimizations of Blocked Algorithms, M. Lam et al., ASPLOS IV, 1991

http://www-suif.stanford.edu/papers/lam91.ps
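A minimal sketch of the copy step (a hypothetical helper, not from the paper): gather one b × b sub-block of a row-major n × n matrix into a contiguous buffer before handing it to the sub-block multiply:

    /* Copy the b x b sub-block of A whose top-left corner is (i0, j0)
       into the contiguous buffer blk (blk must hold b*b doubles). */
    void copy_block(int n, int b, int i0, int j0, const double *A, double *blk) {
        for (int i = 0; i < b; i++)
            for (int j = 0; j < b; j++)
                blk[i*b + j] = A[(i0 + i)*n + (j0 + j)];
    }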


Fin

Partitioning
Splits up the data over processors
Different partitionings according to the processor geometry
For P processors, geometries are of the form p0 × p1, where P = p0 × p1
For P = 4 there are 3 possible geometries: 1 × 4, 4 × 1, 2 × 2
[Diagram: processors 0-3 arranged as 1 × 4, 4 × 1, and 2 × 2 grids]
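A small sketch of how a processor finds its grid coordinates from its rank for a p0 × p1 geometry (the row-major rank ordering is an assumed convention, not specified on the slide):

    /* Map rank r in [0, P) to (row, col) on a p0 x p1 grid, ranks laid
       out row by row: for P = 4 and a 2 x 2 geometry, rank 3 -> (1, 1). */
    void rank_to_coords(int r, int p1, int *row, int *col) {
        *row = r / p1;
        *col = r % p1;
    }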

