Agenda
o Introduction
o OpenCL Architecture
o OpenCL Application
www.caps-entreprise.com
Introduction
1/25/2011 www.caps-entreprise.com
Before OpenCL
GPGPU
o Vertex / pixel shaders
o Heavily constrained and not well adapted to general computation
Brook
o Then Brook+
o Then CAL/IL
CUDA
o Widely adopted
[Figure: host system architecture: the processor and its DDR memory connect through the north bridge and PCI-e to the GPU and its device memory; the south bridge hosts the peripherals (SATA, ATA, SCSI, USB, 1394, HDA, PCI)]
[Figure: CPU vs. GPU transistor budget: the CPU spends most of its die on control logic and cache around a few ALUs; the GPU spends it on many ALUs]
CPU: general-purpose architecture
GPU: massively data-parallel
OpenCL Devices
Intel & AMD
X86 w/ >= SSE 3.x
NVIDIA
o All CUDA-capable cards
o But not all driver versions
S3
NV1000 5400E
ATI
o Radeon & Radeon HD
o FirePro, FireStream
o Mobility
IBM Cell
Application
o OpenCL kernels
o OpenCL API
o OpenCL C language
Entry point for developers who prefer a low-level API, next to C for CUDA
o Both target the GPU through a shared back-end compiler and optimization technology
OpenCL APIs
Compilation and execution
o Include the header and link with the OpenCL library
o On ATI, export DISPLAY=:0.0
C language API
o C++ bindings (official)
o Java bindings
o Python bindings
Interoperability extensions exist for
o OpenGL
o Direct3D
OpenCL Architecture
Platform Model
The model consists of a host connected to one or more OpenCL devices
Platform Version
Three different kinds of version exist for an OpenCL device
The platform version
o Version of the OpenCL runtime linked with the application
Execution Model
Kernels are submitted by the host application to devices through command queues
Kernel instances, called work-items (WI), are identified by their point in the NDRange index space
o This enables parallel execution of the kernel
So even though there is a single programming model, two different programming approaches exist depending on the paradigm being considered
Programming Model
Data-parallel
o A frequent approach is a one-to-one mapping between the elements of a memory object and the work-item space, so a kernel can process them in parallel
Task parallel
o A single kernel instance is executed, independently of any NDRange
o Equivalent to a work-group containing only one work-item
The programmer then defines the number of work-items to execute and their grouping into work-groups
Two kinds of synchronization exist
o On the hardware: WIs of the same WG can synchronize their execution
o On the host: commands enqueued to command queues of the same context can be executed asynchronously, in order or out of order
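The on-device kind can be sketched with a small OpenCL C kernel (illustrative only; the kernel name, shift_left, and its arguments are not from the course material). Each work-item writes one element to local memory, the whole work-group meets at a barrier, and only then does each work-item read a neighbour's value:

```c
__kernel void shift_left(__global const float *in,
                         __global float *out,
                         __local float *tile)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t lsz = get_local_size(0);

    tile[lid] = in[gid];              // each WI writes its element
    barrier(CLK_LOCAL_MEM_FENCE);     // WG-wide synchronization point
    out[gid] = tile[(lid + 1) % lsz]; // safe: all writes are now visible
}
```

Note that barrier() only synchronizes work-items of the same work-group; synchronization across work-groups must go through the host-side command queue.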
Parallelism Grains
CPU cores can handle only a few tasks
o But more complex ones
Hard control flows
Memory caches
NDRange
NDRange is a N-Dimensional index space
o N is 1, 2 or 3
o NDRange is defined by an integer array of length N specifying the extent of the index space on each dimension
Memory Model
Four distinct memory regions
o Global Memory
o Local Memory
o Constant Memory
o Private Memory
Memory Architecture
Memory Allocation
Allocation capabilities depend on the point of view (host vs. kernel)

         | Global             | Constant           | Local              | Private
Host     | Dynamic allocation | Dynamic allocation | Dynamic allocation | No allocation
         | R/W                | R/W                | No access          | No access
Kernel   | No allocation      | Static allocation  | Static allocation  | Static allocation
         | R/W                | Read-only          | R/W                | R/W
The "__" prefix is not required before the qualifiers
Inside a kernel
o By default, variables are in private memory
o By default, pointer variables point to private memory
Pointers passed as arguments to a kernel function must be of type __global, __constant, or __local
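A small illustrative kernel (not from the course; the name spaces and its arguments are made up) showing one variable per address space:

```c
__kernel void spaces(__global float *g,    // global pointer argument
                     __constant float *c,  // constant pointer argument
                     __local float *l)     // local pointer argument
{
    float p = g[get_global_id(0)];  // no qualifier: private by default
    l[get_local_id(0)] = p + c[0];
    barrier(CLK_LOCAL_MEM_FENCE);
    g[get_global_id(0)] = l[get_local_id(0)];
}
```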
NDRange 1
[Figure: two NDRanges: Kernel 1 is launched on NDRange 1, split into work-groups WG(0,0)...WG(2,1); Kernel 2 runs on NDRange 2; work-group WG(1,1) is expanded into its work-items W.I.(0,0)...W.I.(4,2)]
A work-group is a batch of work-items which can cooperate with each other by synchronizing their execution, for hazard-free local memory accesses
[Figure: a 3D NDRange: work-items W.I.(x,y,z) arranged in work-groups; the work-group grid Group(0,0)...Group(3,1) has num_groups(0) x num_groups(1) groups]
OpenCL Application
OpenCL platform
OpenCL platform = host + OpenCL devices
o Devices can be CPUs, GPUs and so on
o At the moment, the Nvidia SDK does not implement OpenCL on CPUs
ATI allows OpenCL to run either on GPUs or on x86 CPUs (both AMD and Intel)
Drivers for Nvidia and ATI GPUs cannot be installed on the same host
o So it is one platform or the other
o But take care, one day it may become possible
An execution context
o Depends on a list of devices
o clCreateContext
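The usual host-side sequence might look like this (a minimal sketch: error handling is omitted, and it is shown as a fragment because it must be linked against the OpenCL library to run):

```c
#include <CL/cl.h>

cl_platform_id platform;
cl_device_id   device;
cl_int         err;

clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

/* The context depends on a list of devices (here a single GPU). */
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
```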
OpenCL Programs
An OpenCL program consists of a set of kernels
A program object contains
o An associated context
o A program source code
o The associated kernel objects
Kernel Syntax
In the kernel, the language must be OpenCL C
Built-in Functions
o size_t get_local_id(uint dim)
o size_t get_global_id(uint dim)
o size_t get_local_size(uint dim)
o size_t get_global_size(uint dim)
Math functions
o Trigonometric functions
o Mad function
o Min and max functions
o Exponential, logarithm and power
o Modulus
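A minimal kernel using the index built-ins (illustrative; the name vec_add is not from the course): each work-item reads its global id and processes exactly one element.

```c
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t i = get_global_id(0);  // which element this work-item owns
    c[i] = a[i] + b[i];
}
```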
Having multiple command-queues allows applications to queue multiple independent commands without requiring synchronization
Device Memory
Two types of memory objects
o Buffer Objects: 1D memory
o Image Objects: 2D-3D memory
Kernel launch
Set kernel arguments
o Required because kernels are compiled at runtime
o clSetKernelArg
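A sketch of the launch sequence (error handling omitted; kernel, queue and buf are assumed to already exist): one clSetKernelArg call per kernel parameter, then an enqueue with the global and local sizes.

```c
/* Bind argument 0 of the kernel to a buffer object. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

/* 1D launch: 1024 work-items in work-groups of 64. */
size_t global[1] = { 1024 };
size_t local[1]  = { 64 };
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local,
                       0, NULL, NULL);
```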
Agenda
o OpenCL Image Objects
o Local Memory
o Transformations & Optimizations
o Non-portability of performance
Images in kernel
An image memory object argument of a kernel needs an access qualifier
o __read_only (the default)
o __write_only
o A kernel cannot read from and write to the same image memory object
Images can therefore be cached very effectively when read: no cache coherency is needed
o In particular, images are very, very good on ATI 58xx
o Not so much on the Nvidia C2050, as global memory is also cached there
Vectorization
OpenCL specifications, parts 6.1.2 (types) and 6.4 (operations)
Vector data types (float4, ...) represent a pseudo-array of elements (called components in OpenCL)
OpenCL specifies per-component semantics
o So the product of two float4 is another float4, not a scalar float: it is not a dot product
o Scalars are implicitly converted to the proper type by replicating the value
Much more efficient on vectorized architectures, as only a fraction of the peak (1/4 for float on a 4-wide unit) is reachable in scalar mode
Components can be accessed in different ways
o Scalar: temp.s0, temp.s1
o Sub-vector: temp.odd, temp.even, temp.lo, temp.hi
Local Memory
OpenCL specifications, part 3.3
Local memory is local to a work-group
o Shared by all work-items of the work-group
o Can be used for sharing between work-items
o Can be used to cache global memory
2 ways to allocate it
o Statically, inside the kernel
o Dynamically, from the host as a parameter
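Both forms can be sketched as follows (kernel names are illustrative). For the dynamic form, the host sets the size of the __local pointer argument with a NULL value:

```c
// Static allocation inside the kernel (size fixed at compile time):
__kernel void k_static(__global float *g)
{
    __local float tile[64];  // one array per work-group
    tile[get_local_id(0)] = g[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... */
}

// Dynamic allocation: declare a __local pointer parameter and size it
// from the host with:
//   clSetKernelArg(kernel, 1, local_size * sizeof(float), NULL);
__kernel void k_dynamic(__global float *g, __local float *tile)
{
    tile[get_local_id(0)] = g[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ... */
}
```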
Allocation
Inside a kernel
o Declare local memory as a static array
o Use the __local (or local) keyword as a type prefix
Memory Optimization
On Nvidia GPU
o Global memory accesses cost 400-700 cycles
o Prefers coalesced, small-size accesses (4 bytes per thread)
On AMD/ATI GPU
o Data reuse from global memory (reads are cached, writes are not)
o Images can be very fast
o Prefers accesses to global memory in larger chunks (16 bytes, i.e. one float4, per thread)
NDRange Optimization
Optimization of the NDRange is hardware dependent
Required for efficient use of GPUs
o Must combine the global & local sizes for maximum efficiency of the various cores & the memory subsystem
If it works well in CUDA, then it should also work well in OpenCL :-)
o Each generation of hardware has different requirements
NDRange Optimization
On ATI GPUs, the WG size must be a multiple of the wavefront size
o More WIs can be generated
o The GPU can switch when a WI is stalled
o Hiding memory latency
o Optimizing execution
Software pipelining
o Increases the distance between a memory load and its use
o Hides part of the latency
Avoid conditionals
o On GPU, avoid divergent branches
o Use predication rather than control flow when possible
Native functions
Native functions are implementation defined
o Identified by the native_* prefix
Available operations
o Division
o Trigonometric: cos, sin, tan
o Logarithm: base-e, 2, 10
o Exponential: base-e, 2, 10
o Square root and reciprocal
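As a sketch (the kernel name fast_norm is illustrative), trading accuracy for speed simply means calling the native_* variant instead of the standard built-in:

```c
__kernel void fast_norm(__global float *v, float s)
{
    size_t i = get_global_id(0);
    // native_divide and native_sqrt are faster, implementation-defined
    // precision versions of / and sqrt.
    v[i] = native_divide(v[i], native_sqrt(s));
}
```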
NDRange optimization
o Like CUDA grid
NDRange optimization
o 64 work-items per work-group
o Multiple work-items per wavefront
Data reuse
o Reads are cached
o But not writes
Quick study in OpenCL, comparing ATI GPU, Nvidia GPU and IBM Cell B/E
o The CPU is the most exotic of the three :-)
Completely different code in each case!
Kernels validated on GPUs by integration in an HMPP codelet
SGEMM on ATI
Kernel from GATLAS, the OpenCL generator at http://golem5.org/gatlas/
o Generates many forms of vectorized OpenCL SGEMM kernels, adapted to each problem size
Non-vectorized kernels are very inefficient on ATI hardware
8x8 is the default, and the most efficient, work-group size
Grid size is large, and depends on internal unrolling in the generator
Using images for the input matrices A & B is much, much better
On n=m=k=4000 (4096 is noticeably less efficient)
o About 500 Gflops with buffers on an HD5870
o About 1.3 Tflops with images!
SGEMM on NVidia
Using GATLAS-derived kernels (the current version of GATLAS doesn't support non-vectorized kernels, work is in progress) or HMPP-generated kernels
Vectorized kernels are inefficient on the C1060
o Float4 memory accesses don't coalesce very well, and generate lots of local memory bank conflicts
o Execution units are not vectorized
SGEMM on Cell
Best non-OpenCL matmult on Cell at http://www.tu-dresden.de/zih/cell/matmul
o Achieves >99% of peak under ideal conditions
o Assembly code, optimized DMA transfers, double buffering, ...
Current OpenCL kernel reaches only 65% of peak :-(
Requires async_work_group_copy (not used on GPUs) to use DMA for global -> local copies
Requires massively unrolled float4 arithmetic
Also partial double buffering, software pipelining to hide DMA latencies, ...
Much more similar to traditional CPU optimizations
Hands-on : Convolution
Hands-on : Convolution
6 incremental steps :
o Communication part in function.cc
o Naive kernel
o Kernel with static local memory
o Kernel with dynamic local memory
o Fully unrolled loops
o Hard-coded coefficients
Convolution :
B[i][j] = c0 * A[i][j]
        + c1 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1])
        + c2 * (A[i-2][j] + A[i+2][j] + A[i][j-2] + A[i][j+2])
        + c3 * (A[i-3][j] + A[i+3][j] + A[i][j-3] + A[i][j+3])
        + c4 * (A[i-4][j] + A[i+4][j] + A[i][j-4] + A[i][j+4])
Jean-Charles Vasnier
Application Engineer
jvasnier@caps-entreprise.com
+33 (0) 222 511 600