OpenCL
Parallel Computing on the GPU and CPU
Aaftab Munshi
Opportunity: Processor Parallelism...
• Today’s processors are increasingly parallel
• CPUs
■ Multiple cores are driving performance increases
• GPUs
■ Transforming into general purpose data-parallel computational
coprocessors
■ Improving numerical precision (single and double)
Beyond Programmable Shading: Fundamentals

Challenge: Processor Parallelism
• Writing parallel programs is different for the CPU and the GPU
■ Differing domain-specific techniques
■ Vendor-specific technologies
• A graphics API is not an ideal abstraction for general-purpose compute

Introducing OpenCL
• OpenCL – Open Computing Language
• Approachable language for accessing heterogeneous
computational resources
• Supports parallel execution on single or multiple processors
■ GPU, CPU, GPU + CPU or multiple GPUs
• Desktop and Handheld Profiles
• Designed to work with graphics APIs such as OpenGL

OpenCL = Open Standard
• Specification under review
■ Royalty free, cross-platform, vendor neutral
■ Khronos OpenCL working group (www.khronos.org)
• Based on a proposal by Apple
■ Developed in collaboration with industry leaders
■ Performance-enhancing technology in Mac OS X Snow Leopard

OpenCL Working Group Members
Broad Industry Support
© Copyright Khronos Group, 2008

OpenCL — A Sneak Preview
Design Goals of OpenCL
• Use all computational resources in system
■ GPUs and CPUs as peers
■ Data- and task-parallel compute model
• Efficient parallel programming model
■ Based on C
■ Abstract the specifics of underlying hardware
• Specify accuracy of floating-point computations
■ IEEE 754 compliant rounding behavior
■ Define maximum allowable error of math functions
• Drive future hardware requirements

OpenCL Software Stack
• Platform Layer
■ Query and select compute devices in the system
■ Initialize compute device(s)
■ Create compute contexts and work-queues
• Runtime
■ Resource management
■ Execute compute kernels
• Compiler
■ A subset of ISO C99 with appropriate language additions
■ Compile and build compute program executables, online or offline

OpenCL Execution Model
• Compute Kernel
■ Basic unit of executable code — similar to a C function
■ Data-parallel or task-parallel
• Compute Program
■ Collection of compute kernels and internal functions
■ Analogous to a dynamic library
• Applications queue compute kernel execution instances
■ Queued in-order
■ Executed in-order or out-of-order
■ Events are used to implement appropriate synchronization of
execution instances
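
The interplay of in-order queuing, out-of-order execution and events can be illustrated with a toy scheduler in plain C. This is only a sketch under stated assumptions: `kernel_instance`, `NO_EVENT` and `run_out_of_order` are hypothetical names invented for the illustration, not part of the OpenCL API, and each instance is assumed to wait on at most one earlier instance.

```c
#include <assert.h>

#define NO_EVENT (-1)

/* One queued kernel execution instance (hypothetical type). */
typedef struct {
    int wait_for;   /* index of the instance whose event must complete
                       first, or NO_EVENT if there is no dependency   */
    int completed;  /* set to 1 once this instance has executed       */
} kernel_instance;

/* Executes all queued instances. To mimic out-of-order execution, the
   most recently queued instance that is ready always runs next; an
   instance is ready only when the event it waits on has completed.
   The execution order is written to `order`; returns how many ran. */
int run_out_of_order(kernel_instance *queue, int count, int *order)
{
    assert(queue != 0 && order != 0 && count >= 0);
    int executed = 0;
    int progress = 1;
    while (executed < count && progress) {
        progress = 0;
        for (int i = count - 1; i >= 0; --i) {
            int dep = queue[i].wait_for;
            int ready = !queue[i].completed &&
                        (dep == NO_EVENT || queue[dep].completed);
            if (ready) {
                queue[i].completed = 1;
                order[executed++] = i;
                progress = 1;
                break;  /* rescan from the newest instance */
            }
        }
    }
    return executed;
}
```

With three instances queued in the order 0, 1, 2, where only instance 1 waits on instance 0, this toy scheduler runs 2 first, then 0, then 1: the event dependency is honored even though the queue order is not.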

OpenCL Data-Parallel Execution Model
• Define N-Dimensional computation domain
■ Each independent element of execution in N-D domain is called
a work-item
■ The N-D domain defines the total number of work-items that
execute in parallel — global work size.

• Work-items can be grouped together — work-group
■ Work-items in group can communicate with each other
■ Can synchronize execution among work-items in group to
coordinate memory access

• Execute multiple work-groups in parallel
• Mapping of global work size to work-groups
■ implicit or explicit
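
As a rough illustration of how a one-dimensional global work size is split into work-groups, the following plain C sketch computes the work-group and the position inside it for a given global index. The type `work_item_pos` and both helper functions are hypothetical names for this example; inside a kernel the corresponding values would come from the built-in work-item functions such as get_global_id, get_group_id and get_local_id. The sketch assumes the global work size is an exact multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item in a 1-D domain (hypothetical helper type). */
typedef struct {
    size_t global_id; /* index within the whole domain             */
    size_t group_id;  /* which work-group the work-item belongs to */
    size_t local_id;  /* index within that work-group              */
} work_item_pos;

/* Number of work-groups needed to cover the domain. Assumes the global
   work size is an exact multiple of the work-group size. */
size_t num_work_groups(size_t global_work_size, size_t local_work_size)
{
    assert(local_work_size > 0 && global_work_size % local_work_size == 0);
    return global_work_size / local_work_size;
}

/* Explicit mapping of a global index onto (work-group, local index). */
work_item_pos locate_work_item(size_t global_id, size_t local_work_size)
{
    assert(local_work_size > 0);
    work_item_pos pos;
    pos.global_id = global_id;
    pos.group_id  = global_id / local_work_size;
    pos.local_id  = global_id % local_work_size;
    return pos;
}
```

For example, a global work size of 1024 with 64 work-items per group yields 16 work-groups, and global index 130 lands in work-group 2 at local index 2.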

OpenCL Task-Parallel Execution Model
• Data-parallel execution model must be implemented by all
OpenCL compute devices
• Some compute devices such as CPUs can also execute task-parallel compute kernels
■ Executes as a single work-item
■ A compute kernel written in OpenCL
■ A native C / C++ function

OpenCL Memory Model
• Implements a relaxed consistency, shared memory model
• Multiple distinct address spaces
■ Address spaces can be collapsed depending on the device’s
memory subsystem
■ Address qualifiers: __private, __local, __constant and __global
■ Example: __global float4 *p;
[Figure: a compute device containing compute units 1..N; each compute unit runs work-items 1..M, each with its own private memory, and has one local memory; a global / constant memory data cache sits between the compute units and compute device memory (global memory)]
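
To make the roles of the address spaces more concrete, here is a plain C emulation of one work-group summing its slice of an input array. It is a sketch, not OpenCL code: the function name `work_group_sum` and the constant `GROUP_SIZE` are assumptions for the example, the work-items are run one after another in a loop, and the comments mark where a real kernel would need a work-group barrier.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* work-items per work-group (illustrative value) */

/* Emulates one work-group summing GROUP_SIZE input values.
   global_in plays the role of __global memory (visible to all groups),
   local_buf plays the role of __local memory (shared within the group),
   and the variable v plays the role of a __private value. */
float work_group_sum(const float *global_in, size_t group_id)
{
    assert(global_in != NULL);
    float local_buf[GROUP_SIZE];

    /* Phase 1: every work-item copies one element from global to local. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        float v = global_in[group_id * GROUP_SIZE + lid]; /* private copy */
        local_buf[lid] = v;
    }
    /* A real kernel needs a work-group barrier here, so that all writes
       to local memory are visible before any work-item reads them. */

    /* Phase 2: tree reduction in local memory. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            local_buf[lid] += local_buf[lid + stride];
        /* Another barrier would follow each reduction step. */
    }
    return local_buf[0];
}
```

Each work-group only touches its own slice of global memory and its own local buffer, which is why work-groups can execute independently of one another.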

Language for writing compute kernels
• Derived from ISO C99
• A few restrictions
■ Recursion, function pointers, functions in C99 standard
headers ...
• Preprocessing directives defined by C99 are supported
• Built-in Data Types
■ Scalar and vector data types
■ Pointers
■ Data-type conversion functions
■ convert_type<_sat><_roundingmode>
■ Image types
■ image2d_t, image3d_t and sampler_t
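
The effect of the _sat suffix in the conversion functions can be sketched in plain C. The helper below, `to_uchar_sat`, is a hypothetical emulation in the spirit of convert_uchar_sat(x), assuming rounding toward zero (which is what a C cast does) and mapping NaN to 0; it is meant to show the clamping behavior, not to reproduce an implementation.

```c
#include <assert.h>

/* Plain C emulation of a saturated float -> 8-bit unsigned conversion.
   Out-of-range inputs clamp to the nearest representable value instead
   of wrapping around. Assumption: rounding is toward zero. */
unsigned char to_uchar_sat(float x)
{
    if (x != x)      return 0;    /* NaN is mapped to 0 in this sketch */
    if (x <= 0.0f)   return 0;    /* clamp to the minimum              */
    if (x >= 255.0f) return 255;  /* clamp to the maximum              */
    return (unsigned char)x;      /* in range: truncate toward zero    */
}
```

A quick check of the three cases (below range, above range, in range):

```c
assert(to_uchar_sat(-3.5f)  == 0);
assert(to_uchar_sat(300.0f) == 255);
assert(to_uchar_sat(127.9f) == 127);
```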

• Built-in Functions: Required
■ work-item functions
■ math.h
■ read and write image
■ relational
■ geometric functions
■ synchronization functions
• Built-in Functions: Optional
■ double precision
■ atomics to global and local memory
■ selection of rounding mode
■ writes to image3d_t surface

OpenCL FFT Example - Host API Code
// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
// memobjs[0]: input, initialized from srcA; memobjs[1]: output
memobjs[0] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context,
                            CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL);

// create the compute program
program = clCreateProgramFromSource(context, 1,
                                    &fft1D_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");

// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1,
                                 global_work_size,
                                 local_work_size);

// set the args values
// args 0 and 1: input and output buffers (in, out)
clSetKernelArg(kernel, 0, (void *)&memobjs[0],
               sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1],
               sizeof(cl_mem), NULL);
// args 2 and 3: local memory scratch areas (sMemx, sMemy)
clSetKernelArg(kernel, 2, NULL,
               sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL,
               sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);

OpenCL FFT Example - Compute Kernel
// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy) {
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx; out = out + blockIdx;

    globalLoads(data, in, 64); // coalesced global reads

    fftRadix16Pass(data); // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data); // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication
    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data); fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8); fftRadix4Pass(data + 12);

    // coalesced global writes
    globalStores(data, out, 64);
}
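
The kernel's work split can be checked with a little arithmetic: 64 work-items each hold 16 points in their private data[] array, and 64 × 16 = 1024. The plain C sketch below models the read pattern under one explicit assumption, namely that globalLoads(data, in, 64) reads data[j] = in[j * 64] after `in` has been offset by tid. The body of that helper is not shown in the slides, so `load_index` and `loads_cover_block_once` are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <string.h>

#define FFT_LEN    1024  /* points per FFT, from the kernel comment   */
#define WORK_ITEMS 64    /* local_work_size[0] in the host code       */
#define PER_ITEM   16    /* size of the private data[] array          */

/* Index, within one 1024-point block, of the j-th element read by
   work-item `tid`. ASSUMPTION: element j is read at stride 64. */
int load_index(int tid, int j)
{
    assert(tid >= 0 && tid < WORK_ITEMS && j >= 0 && j < PER_ITEM);
    return tid + j * WORK_ITEMS;
}

/* Returns 1 if the 64 work-items together touch each of the 1024
   points exactly once, 0 otherwise. */
int loads_cover_block_once(void)
{
    unsigned char seen[FFT_LEN];
    memset(seen, 0, sizeof seen);
    for (int j = 0; j < PER_ITEM; ++j) {
        for (int tid = 0; tid < WORK_ITEMS; ++tid) {
            int idx = load_index(tid, j);
            if (seen[idx]) return 0;  /* element read twice */
            seen[idx] = 1;
        }
    }
    for (int i = 0; i < FFT_LEN; ++i)
        if (!seen[i]) return 0;       /* element never read */
    return 1;
}
```

Under that assumption, for any fixed j the 64 work-items read 64 consecutive elements, which is the access pattern the "coalesced global reads" comment refers to.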

OpenCL and OpenGL
• Sharing OpenGL Resources
■ OpenCL is designed to efficiently share with OpenGL
■ Textures, Buffer Objects and Renderbuffers
■ Data is shared, not copied
• Efficient queuing of OpenCL and OpenGL commands
• Apps can select compute device(s) that will run OpenGL and OpenCL

Summary
• A new compute language that works across GPUs and CPUs
■ C99 with extensions
■ Familiar to developers
■ Includes a rich set of built-in functions
■ Makes it easy to develop data- and task-parallel compute programs
• Defines hardware and numerical precision requirements
• Open standard for heterogeneous parallel computing
