You are on page 1of 60

OpenCL

Parallel Computing on the GPU and CPU

Aaftab Munshi
Opportunity: Processor Parallelism...
• Today’s processors are increasingly parallel
• CPUs
■ Multiple cores are driving performance increases
• GPUs
■ Transforming into general purpose data-parallel computational
coprocessors
■ Improving numerical precision (single and double)

Beyond Programmable Shading: Fundamentals


Challenge: Processor Parallelism
• Writing parallel programs different for the CPU and GPU
■ Differing domain-specific techniques
■ Vendor-specific technologies

• Graphics API is not an ideal abstraction for general


purpose compute

Beyond Programmable Shading: Fundamentals


Introducing OpenCL
• OpenCL – Open Computing Language
• Approachable language for accessing heterogeneous
computational resources
• Supports parallel execution on single or multiple processors
■ GPU, CPU, GPU + CPU or multiple GPUs
• Desktop and Handheld Profiles
• Designed to work with graphics APIs such as OpenGL

Beyond Programmable Shading: Fundamentals


OpenCL = Open Standard
• Specification under review
■ Royalty free, cross-platform, vendor neutral
■ Khronos OpenCL working group (www.khronos.org)

• Based on a proposal by Apple


■ Developed in collaboration with industry leaders
■ Performance-enhancing technology in Mac OS X Snow Leopard

Beyond Programmable Shading: Fundamentals


OpenCL Working Group Members
Broad Industry Support

© Copyright Khronos Group, 2008 - Page

Beyond Programmable Shading: Fundamentals


OpenCL — A Sneak Preview
Design Goals of OpenCL
• Use all computational resources in system
■ GPUs and CPUs as peers
Data- and task- parallel compute model

• Efficient parallel programming model


■ Based on C
■ Abstract the specifics of underlying hardware

• Specify accuracy of floating-point computations


■ IEEE 754 compliant rounding behavior
■ Define maximum allowable error of math functions

• Drive future hardware requirements

Beyond Programmable Shading: Fundamentals


OpenCL Software Stack
• Platform Layer
■ query and select compute devices in the system
■ initialize a compute device(s)

■ create compute contexts and work-queues

• Runtime
■ resource management
■ execute compute kernels

• Compiler
■ A subset of ISO C99 with appropriate language additions
■ Compile and build compute program executables

■ online or offline

Beyond Programmable Shading: Fundamentals


OpenCL Execution Model
• Compute Kernel
■ Basic unit of executable code — similar to a C function
■ Data-parallel or task-parallel

• Compute Program
■ Collection of compute kernels and internal functions
■ Analogous to a dynamic library

• Applications queue compute kernel execution instances


■ Queued in-order
■ Executed in-order or out-of-order

■ Events are used to implement appropriate synchronization of

execution instances

Beyond Programmable Shading: Fundamentals


OpenCL Data-Parallel Execution Model
• Define N-Dimensional computation domain
■ Each independent element of execution in N-D domain is called
a work-item
■ The N-D domain defines the total number of work-items that

execute in parallel — global work size.


• Work-items can be grouped together — work-group
■ Work-items in group can communicate with each other
■ Can synchronize execution among work-items in group to

coordinate memory access


• Execute multiple work-groups in parallel
• Mapping of global work size to work-groups
■ implicit or explicit

Beyond Programmable Shading: Fundamentals


OpenCL Task-Parallel Execution Model
• Data-parallel execution model must be implemented by all
OpenCL compute devices
• Some compute devices such as CPUs can also execute task-
parallel compute kernels
■ Executes as a single work-item
■ A compute kernel written in OpenCL

■ A native C / C++ function

Beyond Programmable Shading: Fundamentals


OpenCL Memory Model
• Implements a relaxed consistency,
shared memory model
• Multiple distinct address spaces
■ Address spaces can be collapsed
depending on the device’s
memory subsystem

Beyond Programmable Shading: Fundamentals


OpenCL Memory Model
• Implements a relaxed consistency,
shared memory model
Private Private Private Private
• Multiple distinct address spaces Memory Memory Memory Memory

■ Address spaces can be collapsed WorkItem 1 WorkItem M WorkItem 1 WorkItem M

depending on the device’s Compute Unit 1 Compute Unit N


memory subsystem
■ Address Qualifiers

■ __private

Beyond Programmable Shading: Fundamentals


OpenCL Memory Model
• Implements a relaxed consistency,
shared memory model
Private Private Private Private
• Multiple distinct address spaces Memory Memory Memory Memory

■ Address spaces can be collapsed WorkItem 1 WorkItem M WorkItem 1 WorkItem M

depending on the device’s Compute Unit 1 Compute Unit N


memory subsystem
■ Address Qualifiers
Local Memory Local Memory

■ __private
■ __local

Beyond Programmable Shading: Fundamentals


OpenCL Memory Model
• Implements a relaxed consistency,
shared memory model
Private Private Private Private
• Multiple distinct address spaces Memory Memory Memory Memory

■ Address spaces can be collapsed WorkItem 1 WorkItem M WorkItem 1 WorkItem M

depending on the device’s Compute Unit 1 Compute Unit N


memory subsystem
■ Address Qualifiers
Local Memory Local Memory

■ __private Global / Constant Memory Data Cache


■ __local
Compute Device

■ __constant and __global

■ Example: Global Memory

■ __global float4 *p; Compute Device Memory

Beyond Programmable Shading: Fundamentals


Language for writing compute kernels
• Derived from ISO C99
• A few restrictions
■ Recursion, function pointers, functions in C99 standard
headers ...
• Preprocessing directives defined by C99 are supported
• Built-in Data Types
■ Scalar and vector data types
■ Pointers

■ Data-type conversion functions

■ convert_type<_sat><_roundingmode>
■ Image types
■ image2d_t, image3d_t and sampler_t

Beyond Programmable Shading: Fundamentals


Language for writing compute kernels

Beyond Programmable Shading: Fundamentals


Language for writing compute kernels
• Built-in Functions — Required
■ work-item functions
■ math.h

■ read and write image

■ relational

■ geometric functions

■ synchronization functions

Beyond Programmable Shading: Fundamentals


Language for writing compute kernels
• Built-in Functions — Required
■ work-item functions
■ math.h

■ read and write image

■ relational

■ geometric functions

■ synchronization functions

• Built-in Functions — Optional


■ double precision
■ atomics to global and local memory

■ selection of rounding mode

■ writes to image3d_t surface

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects


memobjs[0] = clCreateBuffer(context,

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects


memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects


memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);

memobjs[1] = clCreateBuffer(context,

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects


memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);

memobjs[1] = clCreateBuffer(context,
CL_MEM_READ_WRITE,

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create a compute context with GPU device


context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects


memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*2*num_entries, srcA);

memobjs[1] = clCreateBuffer(context,
CL_MEM_READ_WRITE,
sizeof(float)*2*num_entries, NULL);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create the compute program

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create the compute program


program = clCreateProgramFromSource(context, 1,

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create the compute program


program = clCreateProgramFromSource(context, 1,
&fft1D_1024_kernel_src, NULL);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create the compute program


program = clCreateProgramFromSource(context, 1,
&fft1D_1024_kernel_src, NULL);

// build the compute program executable

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create the compute program


program = clCreateProgramFromSource(context, 1,
&fft1D_1024_kernel_src, NULL);

// build the compute program executable


clBuildProgramExecutable(program, false, NULL, NULL);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create the compute program


program = clCreateProgramFromSource(context, 1,
&fft1D_1024_kernel_src, NULL);

// build the compute program executable


clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

// create the compute program


program = clCreateProgramFromSource(context, 1,
&fft1D_1024_kernel_src, NULL);

// build the compute program executable


clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel


kernel = clCreateKernel(program, “fft1D_1024”);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL,

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Host API Code
// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
global_work_size,
local_work_size);

// set the args values


clSetKernelArg(kernel, 0, (void *)&memobjs[0],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL,
sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);

Beyond Programmable Shading: Fundamentals


OpenCL FFT Example - Compute Kernel
// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
__local float *sMemx, __local float *sMemy) {
int tid = get_local_id(0);
int blockIdx = get_group_id(0) * 1024 + tid;
float2 data[16];

// starting index of data to/from global memory


in = in + blockIdx; out = out + blockIdx;

globalLoads(data, in, 64); // coalesced global reads


fftRadix16Pass(data); // in-place radix-16 pass
twiddleFactorMul(data, tid, 1024, 0);

// local shuffle using local memory


localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
fftRadix16Pass(data); // in-place radix-16 pass
twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication

localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));
// four radix-4 function calls
fftRadix4Pass(data); fftRadix4Pass(data + 4);
fftRadix4Pass(data + 8); fftRadix4Pass(data + 12);

// coalesced global writes


globalStores(data, out, 64);
}

Beyond Programmable Shading: Fundamentals


OpenCL and OpenGL
• Sharing OpenGL Resources
■ OpenCL is designed to efficiently share with OpenGL
■ Textures, Buffer Objects and Renderbuffers
■ Data is shared, not copied

• Efficient queuing of OpenCL and OpenGL commands


• Apps can select compute device(s) that will run OpenGL and
OpenCL

Beyond Programmable Shading: Fundamentals


Summary
• A new compute language that works across GPUs and CPUs
■ C99 with extensions
■ Familiar to developers

■ Includes a rich set of built-in functions

■ Makes it easy to develop data- and task- parallel compute


programs
• Defines hardware and numerical precision requirements
• Open standard for heterogeneous parallel computing

Beyond Programmable Shading: Fundamentals

You might also like