
Data-Parallel Algorithms and Data

Structures

John Owens
UC Davis
Programming a GPU for Graphics
• Application specifies geometry -> rasterized
• Each fragment is shaded with a SIMD program
• Shading can use values from texture memory
• Image can be used as texture on future passes
Programming a GPU for GP Programs (Old-School)

• Draw a screen-sized quad
• Run a SIMD program over each fragment
• “Gather” is permitted from texture memory
• Resulting buffer can be treated as texture on next pass
Programming a GPU for GP Programs (New School)

• Define a computation domain over elements
• Run a SIMD program over each element/thread
• Global memory supports both gather and scatter
• Resulting buffer can be used in other general-purpose kernels or as texture
One Slide Summary of Today
• GPUs are great at running many closely-coupled but independent threads in parallel
  – The programming model is SPMD: single program, multiple data
• GPU computing boils down to:
  – Define a computation domain that generates many parallel threads
    • This is the data structure
  – Iterate in parallel over that computation domain, running a program over all threads
    • This is the algorithm
Outline
• Data Structures
– GPU Memory Model
– Taxonomy
• Algorithmic Building Blocks
– Sample Application
– Map
– Gather & Scatter
– Reductions
– Scan (parallel prefix)
– Sort, search, …
Memory Hierarchy
• CPU and GPU Memory Hierarchy

[Figure: side-by-side memory hierarchies.
 CPU: Disk → CPU Main Memory → CPU Caches → CPU Registers.
 GPU: GPU Main (Video) Memory → GPU Caches (Texture, Shared) → GPU Constant Registers and GPU Temporary Registers.]
Big Picture: GPU Memory Model
• GPUs are a mix of:
– Historical, fixed-function capabilities
– Newer, flexible, programmable capabilities
• Fixed-function:
– Known access patterns, behaviors
– Accelerated by special-purpose hardware
• Programmable:
– Unknown access patterns
– Generality good
• Memory model must account for both
– Consequence: Ample special-purpose functionality
– Consequence: Restricting flexibility may improve performance
» Take advantage of special-purpose functionality!
Taxonomy of Data Structures
• Dense arrays
– Address translations: n-dimension to m-dimension
• Sparse arrays
– Static sparse arrays (e.g. sparse matrix)
– Dynamic sparse arrays (e.g. page tables)
• Adaptive data structures
– Static adaptive data structures (e.g. k-d trees)
– Dynamic adaptive data structures (e.g. adaptive
multiresolution grids)
• Non-indexable structures (e.g. stack)
Details? Lefohn et al.’s Glift (TOG Jan. ‘06), GPGPU survey
GPU Memory Model
• More restricted memory access than CPU
– Allocate/free memory only before computation
– Transfers to and from CPU are explicit
– GPU is controlled by CPU, can’t initiate transfers,
access disk, etc.
• To generalize …
– GPUs are better at accessing data structures
– CPUs are better at building data structures
– As CPU-GPU bandwidth improves, consider doing
data structure tasks on their “natural” processor
GPU Memory Model
• Limited memory access during computation
(kernel)
– Registers (per fragment/thread)
• Read/write
– Local memory (shared among threads)
• Does not exist in general
• CUDA allows access to shared memory btwn threads
– Global memory (historical)
• Read-only during computation
• Write-only at end of computation (precomputed address)
– Global memory (new)
• Allows general scatter/gather (read/write)
– No collision rules!
– Exposed in AMD R520+ GPUs, NVIDIA G80+ GPUs
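As a concrete, purely illustrative CUDA sketch of the “new” model's memory spaces (all names and sizes below are my assumptions, not part of the talk): the kernel keeps scalars in registers, stages data in per-block shared memory, and gathers/scatters through global memory.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: registers, shared memory, and global gather/scatter.
__global__ void memorySpacesDemo(const float* in, const int* indices,
                                 float* out, int n)
{
    __shared__ float tile[256];          // shared among threads in a block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[indices[i]];            // gather: read from a computed address
    tile[threadIdx.x] = x;               // stage in per-block shared memory
    __syncthreads();

    float r = 2.0f * tile[threadIdx.x];  // 'x' and 'r' live in registers
    out[indices[i]] = r;                 // scatter: write to a computed address
}

int main()
{
    const int n = 256;
    float *in, *out; int *idx;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    // (i * 7) % 256 is a permutation, so the scatter has no collisions
    for (int i = 0; i < n; ++i) { in[i] = float(i); idx[i] = (i * 7) % n; }

    memorySpacesDemo<<<1, 256>>>(in, idx, out, n);
    cudaDeviceSynchronize();
    printf("out[idx[3]] = %f (expect %f)\n", out[idx[3]], 2.0f * in[idx[3]]);

    cudaFree(in); cudaFree(out); cudaFree(idx);
    return 0;
}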
Data-Parallel Algorithms
• The GPU is a data-parallel processor
– Data-parallel kernels of applications can be accelerated on
the GPU
• Efficient algorithms require efficient building
blocks
• This talk: data-parallel building blocks
– Map
– Gather & Scatter
– Reduce
– Scan
• Next talk (Naga Govindaraju): sorting and searching
Sample Motivating Application
• How bumpy is a surface that we
represent as a grid of samples?
• Algorithm:
– Loop over all elements
– At each element, compare the value of that element
to the average of its neighbors (“difference”).
Square that difference.
– Now sum up all those differences.
• But we don’t want to sum all the diffs that are 0.
• So only sum up the non-zero differences.
– This is a fake application—don’t take it too seriously.
Picture courtesy http://www.artifice.com
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] +
                              value[x+1,y] +
                              value[x,y+1] +
                              value[x,y-1] )
    diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
    result += diff
return result
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] +
                              value[x+1,y] +
                              value[x,y+1] +
                              value[x,y-1] )
    diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
    result += diff
return result
The Map Operation
• Given:
– Array or stream of data elements A
– Function f(x)
• map(A, f): apply f(x) to all a_i ∈ A
• GPU implementation is straightforward
  – Graphics: A is a texture, the a_i are texels
  – General-purpose: A is a computation domain, the a_i are elements in that domain
  – Pixel shader/kernel implements f(x), reading a_i as x
  – (Graphics): draw a quad with as many pixels as texels in A, with the f(x) pixel shader active
  – Output is stored in another texture/buffer
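A minimal CUDA sketch of map (my example, not the talk's): the computation domain is a 1D array, each thread handles one element, and f(x) = x² stands in for an arbitrary per-element function.

#include <cstdio>
#include <cuda_runtime.h>

// f(x): the per-element function being mapped (squaring, as an example).
__device__ float f(float x) { return x * x; }

// map(A, f): one thread per element of the computation domain.
__global__ void mapKernel(const float* A, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(A[i]);
}

int main()
{
    const int n = 1 << 20;
    float *A, *out;
    cudaMallocManaged(&A,   n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) A[i] = 0.5f * i;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    mapKernel<<<blocks, threads>>>(A, out, n);
    cudaDeviceSynchronize();

    printf("out[10] = %f (expect %f)\n", out[10], 5.0f * 5.0f);
    cudaFree(A); cudaFree(out);
    return 0;
}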
Sample Motivating Application
for all samples:
neighbors[x,y] =
0.25 * ( value[x-1,y]+
value[x+1,y]+
value[x,y+1]+
value[x,y-1] ) )
diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
result += diff
return result
Scatter vs. Gather
• Gather: p = a[i]
– Vertex or Fragment programs
• Scatter: a[i] = p
– Historically vertex programs, but exposed in
fragment programs in latest GPUs
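A brief CUDA sketch of the two access patterns (illustrative names, not from the talk). Gather reads through an index array; scatter writes through one. Since there are no collision rules, the example assumes the index array is a permutation.

#include <cuda_runtime.h>

// Gather: p = a[i], where the read address comes from an index array.
__global__ void gatherKernel(const float* a, const int* idx, float* p, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) p[k] = a[idx[k]];
}

// Scatter: a[i] = p, where the write address comes from an index array.
// If two threads scatter to the same address the result is undefined
// ("no collision rules"), so idx is assumed to be a permutation here.
__global__ void scatterKernel(const float* p, const int* idx, float* a, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) a[idx[k]] = p[k];
}

int main()
{
    const int n = 256;
    float *a, *p; int *idx;
    cudaMallocManaged(&a,   n * sizeof(float));
    cudaMallocManaged(&p,   n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = float(i); idx[i] = n - 1 - i; }  // reverse permutation

    gatherKernel<<<1, n>>>(a, idx, p, n);    // p becomes a, reversed
    scatterKernel<<<1, n>>>(p, idx, a, n);   // scattering back restores a
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(p); cudaFree(idx);
    return 0;
}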
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] +
                              value[x+1,y] +
                              value[x,y+1] +
                              value[x,y-1] )
    diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
    result += diff
return result
Parallel Reductions
• Given:
  – Binary associative operator ⊕ with identity I
  – Ordered set s = [a_0, a_1, …, a_{n-1}] of n elements
• reduce(⊕, s) returns a_0 ⊕ a_1 ⊕ … ⊕ a_{n-1}
• Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25
• Reductions are common in parallel algorithms
  – Common reduction operators are +, ×, min, and max
  – Note: floating point is only pseudo-associative
Parallel Reductions on the GPU
• 1D parallel reduction:
  – Add two halves of the texture together, repeatedly…
  – … until we're left with a single row of texels
  – N → N/2 → N/4 → … → 1
• O(log₂ N) steps, O(N) work
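A sketch of this “add two halves” reduction as a CUDA kernel launched from a host loop (my example, assuming a power-of-two N): each pass folds the upper half of the active range onto the lower half, giving O(log₂ N) passes and O(N) adds.

#include <cstdio>
#include <cuda_runtime.h>

// One reduction pass: fold the upper half of the active range onto the lower half.
__global__ void reduceHalves(float* data, int half)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half) data[i] += data[i + half];
}

int main()
{
    const int N = 1 << 20;                 // assume a power-of-two length
    float* data;
    cudaMallocManaged(&data, N * sizeof(float));
    for (int i = 0; i < N; ++i) data[i] = 1.0f;

    for (int half = N / 2; half >= 1; half /= 2) {   // log2(N) passes
        int threads = 256;
        int blocks = (half + threads - 1) / threads;
        reduceHalves<<<blocks, threads>>>(data, half);
    }
    cudaDeviceSynchronize();
    printf("sum = %f (expect %d)\n", data[0], N);    // final sum ends up in element 0

    cudaFree(data);
    return 0;
}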
Multiple 1D Parallel Reductions
• Can run many reductions in parallel
• Use a 2D texture and reduce one dimension
  – M×N → M×N/2 → M×N/4 → … → M×1
• O(log₂ N) steps, O(MN) work
2D reductions
• Like the 1D reduction, only reduce in both directions simultaneously
  – Note: can add more than 2×2 elements per pixel
• Trade per-pixel work for step complexity
• Best performance depends on the specific GPU (cache, etc.)
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] +
                              value[x+1,y] +
                              value[x,y+1] +
                              value[x,y-1] )
    diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
    result += diff
return result
Stream Compaction
• Input: stream of 1s and 0s
[1 0 1 1 0 0 1 0]
• Operation: “sum up all elements before you” (an exclusive scan)
• Output: scatter addresses for the “1” elements
  [0 1 1 2 3 3 3 4]
• Note that the scatter addresses for the “1” elements are packed!
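To make the flags → scan → scatter pipeline concrete, here is a small CUDA sketch (mine, not the talk's implementation) using the example above; for brevity the exclusive scan of the flags is computed sequentially on the CPU, and a scatter kernel then packs the flagged elements.

#include <cstdio>
#include <cuda_runtime.h>

// Scatter each flagged element to its scan address; the kept elements pack tightly.
__global__ void compactKernel(const int* in, const int* flags,
                              const int* scanned, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[scanned[i]] = in[i];
}

int main()
{
    const int n = 8;
    int h_in[n]    = {10, 20, 30, 40, 50, 60, 70, 80};
    int h_flags[n] = { 1,  0,  1,  1,  0,  0,  1,  0};

    int *in, *flags, *scanned, *out;
    cudaMallocManaged(&in,      n * sizeof(int));
    cudaMallocManaged(&flags,   n * sizeof(int));
    cudaMallocManaged(&scanned, n * sizeof(int));
    cudaMallocManaged(&out,     n * sizeof(int));
    for (int i = 0; i < n; ++i) { in[i] = h_in[i]; flags[i] = h_flags[i]; }

    // Exclusive scan of the flags, done on the CPU here for brevity:
    // [1 0 1 1 0 0 1 0] -> [0 1 1 2 3 3 3 4]
    int sum = 0;
    for (int i = 0; i < n; ++i) { scanned[i] = sum; sum += flags[i]; }

    compactKernel<<<1, n>>>(in, flags, scanned, out, n);
    cudaDeviceSynchronize();

    for (int i = 0; i < sum; ++i) printf("%d ", out[i]);   // prints: 10 30 40 70
    printf("\n");
    return 0;
}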
Parallel Scan (aka prefix sum)
• Given:
  – Binary associative operator ⊕ with identity I
  – Ordered set s = [a_0, a_1, …, a_{n-1}] of n elements
• scan(⊕, s) returns
  [I, a_0, (a_0 ⊕ a_1), …, (a_0 ⊕ a_1 ⊕ … ⊕ a_{n-2})]
  (this is the exclusive scan: each output excludes its own element)
• Example:
  scan(+, [3 1 7 0 4 1 6 3]) = [0 3 4 11 11 15 16 22]
(From Blelloch, 1990, “Prefix Sums and Their Applications”)
Applications of Scan
• Radix sort
• Quicksort
• String comparison
• Lexical analysis
• Stream compaction
• Sparse matrices
• Polynomial evaluation
• Solving recurrences
• Tree operations
• Histograms
A Naive Parallel Scan Algorithm
• log(n) iterations
• Note: with a graphics API, can't read and write the same texture, so must “ping-pong” between two buffers (not necessary with newer APIs)

  T0:   3  1  7  0  4  1  6  3
            stride 1
  T1:   3  4  8  7  4  5  7  9
            stride 2
  T0:   3  4 11 11 12 12 11 14
            stride 4
  Out:  3  4 11 11 15 16 22 25

• For i from 1 to log(n)-1:
  – Specify domain from 2^i to n. Element k computes
    v_out = v[k] + v[k - 2^i]
  – Due to ping-pong, specify a 2nd domain from 2^(i-1) to 2^i with a simple pass-through shader: v_out = v_in
A Naive Parallel Scan Algorithm
• Algorithm given in more detail by Horn [‘05]
• Step-efficient, but not work-efficient
– O(log n) steps, but O(n log n) adds
– Sequential version is O(n)
– A factor of log(n) hurts: 20x for 10^6 elements!
• Dig into parallel algorithms literature for a
better solution
– See Blelloch 1990, “Prefix Sums and Their
Applications”
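For reference, a single-block CUDA sketch of this naive scan, adapted by me along the lines of Horn's approach but with the ping-pong done between two shared-memory buffers rather than textures; the kernel name and the power-of-two, single-block assumptions are mine. It produces the inclusive scan shown above and still performs O(n log n) adds.

#include <cstdio>
#include <cuda_runtime.h>

// Naive (Hillis-Steele style) inclusive scan for a single block.
// Ping-pongs between two shared-memory buffers instead of two textures.
__global__ void naiveScan(const float* in, float* out, int n)
{
    extern __shared__ float temp[];      // 2*n floats: two ping-pong buffers
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[tid] = in[tid];                 // load input into buffer 0
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2) {
        pout = 1 - pout;                 // swap ping-pong buffers
        pin  = 1 - pout;
        if (tid >= stride)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - stride];
        else
            temp[pout * n + tid] = temp[pin * n + tid];   // pass-through
        __syncthreads();
    }
    out[tid] = temp[pout * n + tid];
}

int main()
{
    const int n = 8;
    float h[n] = {3, 1, 7, 0, 4, 1, 6, 3};
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = h[i];

    naiveScan<<<1, n, 2 * n * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i) printf("%g ", out[i]);     // 3 4 11 11 15 16 22 25
    printf("\n");
    return 0;
}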
Balanced Trees
• Common parallel algorithms pattern
– Build a balanced binary tree on the input data and
sweep it to and from the root
– Tree is conceptual, not an actual data structure
• For scan:
– Traverse from the leaves to the root, building partial sums at internal nodes in the tree
• Root holds sum of all leaves
– Traverse back up the tree building the scan from the
partial sums
– 2 log n passes, O(n) work
Balanced Tree Scan
• “Balanced tree”: O(n) computation, O(n) space
• First build sums in place up the tree: “reduce”
• Then traverse back down and use the partial sums to generate the scan: “downsweep”
• See Mark Harris's article in GPU Gems 3
[Figure courtesy Shubho Sengupta]
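Below is a minimal single-block CUDA sketch of the balanced-tree exclusive scan in the spirit of the GPU Gems 3 article: an up-sweep (reduce) followed by a down-sweep, O(n) adds in total. The simplifications are mine: n must be a power of two and fit in one block, and bank-conflict avoidance is omitted.

#include <cstdio>
#include <cuda_runtime.h>

// Work-efficient exclusive scan for a single block of power-of-two size n.
__global__ void blellochScan(const float* in, float* out, int n)
{
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    temp[2 * tid]     = in[2 * tid];      // each thread loads two elements
    temp[2 * tid + 1] = in[2 * tid + 1];

    // Up-sweep (reduce): build partial sums in place up the tree.
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    if (tid == 0) temp[n - 1] = 0;        // clear the root (the identity)

    // Down-sweep: use the partial sums to produce the exclusive scan.
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    out[2 * tid]     = temp[2 * tid];
    out[2 * tid + 1] = temp[2 * tid + 1];
}

int main()
{
    const int n = 8;
    float h[n] = {3, 1, 7, 0, 4, 1, 6, 3};
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = h[i];

    blellochScan<<<1, n / 2, n * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i) printf("%g ", out[i]);   // 0 3 4 11 11 15 16 22
    printf("\n");
    return 0;
}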
Scan with Scatter
• Scatter in pixel shaders makes algorithms
that use scan easier to implement
– NVIDIA CUDA and AMD CTM enable this
– NVIDIA CUDA SDK includes example implementation
of scan primitive
– http://www.gpgpu.org/scan-gpugems3
• CUDA Parallel Data Cache improves
efficiency
– All steps executed in a single kernel
– Threads communicate through shared memory
– Drastically reduces bandwidth bottleneck!
Parallel Sorting
• Given an unordered list of elements,
produce list ordered by key value
– Kernel: compare and swap
• GPU’s constrained programming
environment limits viable algorithms
– Bitonic merge sort [Batcher 68]
– Periodic balanced sorting networks [Dowd 89]
• Recent research results impressive
– Naga Govindaraju will cover the algorithms in detail
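To give a flavor of the compare-and-swap kernel, here is a single-block bitonic sort sketch in CUDA shared memory; it follows Batcher's network but is my illustration, not the implementation covered in the next talk, and it assumes n is a power of two and fits in one block.

#include <cstdio>
#include <cuda_runtime.h>

// Bitonic sort of n elements (n a power of two) within one block.
// Each inner step is a data-parallel compare-and-swap.
__global__ void bitonicSortBlock(float* data, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    s[tid] = data[tid];
    __syncthreads();

    for (int k = 2; k <= n; k <<= 1) {           // size of the bitonic subsequences
        for (int j = k >> 1; j > 0; j >>= 1) {   // compare-and-swap distance
            int partner = tid ^ j;
            if (partner > tid) {                 // lower thread of each pair does the swap
                bool ascending = ((tid & k) == 0);
                float a = s[tid], b = s[partner];
                if ((a > b) == ascending) {      // swap if the pair is out of order
                    s[tid] = b;
                    s[partner] = a;
                }
            }
            __syncthreads();
        }
    }
    data[tid] = s[tid];
}

int main()
{
    const int n = 8;
    float h[n] = {3, 7, 1, 0, 6, 4, 3, 1};
    float* d;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) d[i] = h[i];

    bitonicSortBlock<<<1, n, n * sizeof(float)>>>(d, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i) printf("%g ", d[i]);   // 0 1 1 3 3 4 6 7
    printf("\n");
    return 0;
}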
Binary Search
• Find a specific element in an ordered list
• Implement just like CPU algorithm
– Finds the first element of a given value v
• If v does not exist, find the smallest element > v
• Search is sequential, but many searches can be
executed in parallel
– Number of pixels drawn determines number of searches
executed in parallel
• 1 pixel == 1 search
• For details see:
– “A Toolkit for Computation on GPUs”. Ian Buck and Tim
Purcell. In GPU Gems. Randy Fernando, ed. 2004
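A sketch of many searches executed in parallel, written as a CUDA kernel rather than a pixel shader (one thread per query instead of one pixel per search); the names and data are illustrative. Each thread runs the ordinary sequential binary search and returns the index of the first element >= its query, i.e. the first occurrence of v, or the next larger element if v is absent.

#include <cstdio>
#include <cuda_runtime.h>

// One binary search per thread over a sorted array.
__global__ void parallelSearch(const float* data, int n,
                               const float* queries, int* results, int m)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= m) return;

    float v = queries[q];
    int lo = 0, hi = n;                  // search the half-open range [lo, hi)
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (data[mid] < v) lo = mid + 1;
        else               hi = mid;
    }
    results[q] = lo;                     // result n means "no element >= v"
}

int main()
{
    const int n = 8, m = 3;
    float h_data[n] = {1, 3, 3, 4, 7, 9, 9, 12};
    float h_q[m]    = {3, 5, 100};
    float *data, *queries; int *results;
    cudaMallocManaged(&data,    n * sizeof(float));
    cudaMallocManaged(&queries, m * sizeof(float));
    cudaMallocManaged(&results, m * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = h_data[i];
    for (int i = 0; i < m; ++i) queries[i] = h_q[i];

    parallelSearch<<<1, m>>>(data, n, queries, results, m);
    cudaDeviceSynchronize();
    for (int i = 0; i < m; ++i) printf("query %g -> index %d\n", h_q[i], results[i]);
    // expected: 3 -> 1, 5 -> 4, 100 -> 8 (not found)
    return 0;
}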
Summary
• Think parallel!
– Effective programming on GPUs requires mapping
computation & data structures into parallel
formulations
• Future thoughts
– Definite need for:
• Efficient implementations of parallel primitives
• Research on what is the right set of primitives
– How will high-bandwidth connections between CPU
and GPU change algorithm and data structure
design?
References
• “A Toolkit for Computation on GPUs”. Ian Buck
and Tim Purcell. In GPU Gems, Randy Fernando,
ed. 2004.
• “Parallel Prefix Sum (Scan) with CUDA”. Mark
Harris, Shubhabrata Sengupta, John D. Owens. In
GPU Gems 3, Hubert Nguyen, ed. Aug. 2007.
• “Stream Reduction Operations for GPGPU
Applications”. Daniel Horn. In GPU Gems 2, Matt
Pharr, ed. 2005.
• “Glift: Generic, Efficient, Random-Access GPU
Data Structures”. Aaron Lefohn et al., ACM TOG,
Jan. 2006.
