
Rocks 5.4.2: Scheduling GPU jobs via SGE


Disclaimer
The instructions/steps given below worked for me (and Michigan Technological
University) running Rocks 5.4.2 (with CentOS 5.5 and SGE 6.2u5); as has been
common practice here for several years, a full version of the operating system was installed.
These instructions may very well work for you (or your institution), on Rocks-like or
other Linux clusters. Please note that if you decide to use these instructions on your
machine, you are doing so entirely at your own discretion, and neither this site,
sgowtham.com, nor its author (or Michigan Technological University) is responsible for
any damage, intellectual and/or otherwise.
A bit about GPU computing
Citing NVIDIA,
GPU computing is the use of a GPU (graphics processing unit) together with a
CPU to accelerate general-purpose scientific and engineering applications.
Pioneered five years ago by NVIDIA, GPU computing has quickly become an
industry standard, enjoyed by millions of users worldwide and adopted by
virtually all computing vendors.
GPU computing offers unprecedented application performance by offloading
compute-intensive portions of the application to the GPU, while the remainder
of the code still runs on the CPU. From a user's perspective, applications simply
run significantly faster.
CPU + GPU is a powerful combination because CPUs consist of a few cores
optimized for serial processing, while GPUs consist of thousands of smaller,
more efficient cores designed for parallel performance. Serial portions of the
code run on the CPU while parallel portions run on the GPU.
NVIDIA's list of GPU applications is here.
A bit about SGE
From our internal documentation,
Sun Grid Engine [formerly known as Computing in Distributed Networked
Environments (CODINE) or Global Resource Director (GRD) and later known
as the Oracle Grid Engine (OGE)] is an open source queuing system developed
and supported by Sun Microsystems. In December 2011, Oracle officially
passed on the torch for maintaining the Grid Engine open source code base to
the Open Grid Scheduler project. Open Grid Scheduler/Grid Engine is a
commercially supported open source batch queuing system for distributed
resource management. OGS/GE is based on Sun Grid Engine, and maintained
by the same group of external (i.e., non-Sun) developers who have been
contributing code since 2001.
SGE is a highly scalable, flexible and reliable distributed resource manager
(DRM). An SGE cluster consists of worker machines (compute nodes), a
master machine (front end), and zero or more shadow master machines. The
compute nodes run copies of the SGE execution daemon (sge_execd). The
front end runs the SGE qmaster daemon. The shadow front end machines run
the SGE shadow daemon. Often, the number of slots in a compute node is equal
to the number of CPU cores available. Each core can run one job and as such,
represents one slot.
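For instance, the following commands give a quick view of the execution hosts, slots
and queues that the qmaster knows about (a sketch only; consult the qhost and qstat
manual pages for the full option lists):

# List execution hosts known to the qmaster, with CPU count, load and memory
qhost

# Summarize cluster queues and their total/used/available slots
qstat -g c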
Once a job has been submitted to the queue (either using the command line or
the graphical interface), it enters the pending state. During the next scheduling
run, the qmaster ranks the job against the other pending jobs. The relative
importance of a job is decided by the scheduling policies in effect. The most
important pending jobs will be scheduled to available slots. When a job requires
a resource that is currently unavailable, it remains in the pending state.
Once the job has been scheduled to a compute node, it is sent to the execution
daemon in that compute node. sge_execd executes the command specified by
the job, and the job enters the running state. It will remain in the running
state until it completes, fails, is terminated, or is re-queued. The job may also be
suspended, resumed, and/or checkpointed (SGE does not natively checkpoint
any job; it will, however, run a script/program to checkpoint a job, when
available) any number of times.
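As an illustration of this life cycle (the script name below is a placeholder), the
state column of qstat reflects these transitions:

# Submit a job and watch its state; 'my_job.sh' is a placeholder script name
qsub my_job.sh
qstat -u $USER
# 'qw' = pending (queued, waiting), 'r' = running, 's'/'S' = suspended,
# 'Eqw' = an error occurred while the job was pending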
After a job has completed or failed, sge_execd cleans up and notifies the
qmaster. The qmaster records the job's information and drops that job from
its list of active jobs. SGE provides commands with which job information can be
retrieved from the accounting logs, and such information can be used to design
computing policies.
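For example, once a job has finished, its record can be pulled from the accounting
logs with qacct (the job ID below is hypothetical):

# Retrieve the accounting record (wallclock time, CPU time, memory, exit status)
# for a single job; 12345 is a placeholder job ID
qacct -j 12345

# Summarize accounting information for one user across all of their jobs
qacct -o username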
Installation & configuration
A Rocks 5.4.2 installation includes a fully working instance of SGE. Let us suppose that the
cluster has 4 compute nodes, and that each compute node has 8 CPU cores along with 4
NVIDIA GPUs. Also, suppose that the compute nodes are named compute-0-0,
compute-0-1, compute-0-2 and compute-0-3. It is further assumed that each of
these nodes has a relevant, recent and stable version of the NVIDIA driver (and CUDA
Toolkit) installed. The following command
rocks run host compute 'hostname; nvidia-smi -L'
should list 4 NVIDIA GPUs along with the hostname of each compute node.
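The output should look something like the following (the hostnames match the example
cluster; the GPU model and UUID strings are placeholders and depend on the hardware
and driver version):

compute-0-0.local
GPU 0: Tesla M2070 (UUID: GPU-...)
GPU 1: Tesla M2070 (UUID: GPU-...)
GPU 2: Tesla M2070 (UUID: GPU-...)
GPU 3: Tesla M2070 (UUID: GPU-...)
compute-0-1.local
GPU 0: Tesla M2070 (UUID: GPU-...)
...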
[Figure: Schematic representation of a Rocks cluster]
By default, SGE puts all nodes (and the CPU cores or slots therein) in one queue
all.q. Also, SGE is unaware of the GPUs in each node and as such, there is no way to
schedule & monitor jobs on a GPU.
The task
Make SGE aware of the available GPUs; set every GPU in every node to compute
exclusive mode; split all.q into two queues, cpu.q and gpu.q; make sure a
job running on cpu.q does not access GPUs; make sure a job running on
gpu.q uses only one CPU core and one GPU.
Making SGE aware of available GPUs
1. Dump the current complex configuration into a flat text file via the command
qconf -sc > qconf_sc.txt
2. Open the file, qconf_sc.txt, and add the following line at the very end
gpu gpu BOOL == FORCED NO 0 0
3. Save and close the file.
4. Update the complex configuration via the command, qconf -Mc qconf_sc.txt
5. Check: qconf -sc | grep gpu should return the above line (a sample is shown below)
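For reference, the fields in that line follow the column header of the complex
configuration, and the check in step 5 should return the gpu line (whitespace and
alignment may differ):

#name   shortcut   type   relop   requestable   consumable   default   urgency
gpu     gpu        BOOL   ==      FORCED        NO           0         0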
Setting GPUs in compute exclusive mode
Run the following command:
rocks run host compute 'nvidia-smi -c 1'
The manual page for nvidia-smi indicates that this setting does not persist across reboots.
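Since the setting is lost at reboot, one way to reapply it automatically is to append the
command to /etc/rc.local on every compute node (a sketch only; on Rocks clusters a
cleaner long-term solution is to add it to extend-compute.xml so that reinstalled nodes
also pick it up):

# Re-apply compute exclusive mode at every boot; assumes nvidia-smi lives in
# /usr/bin and that /etc/rc.local is executed at the end of the boot sequence
rocks run host compute 'echo "/usr/bin/nvidia-smi -c 1" >> /etc/rc.local'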
Splitting all.q into cpu.q and gpu.q
By default, all 8 CPU cores from each node, for a total of 32 CPU cores, are part of
all.q. The set-up needs:
1. disabling all.q
2. making 4 CPU cores from each node, for a total of 16 CPU cores, part of cpu.q
3. making 4 CPU cores and 4 GPUs from each node, for a total of 16 CPU cores & 16 GPUs,
part of gpu.q; each CPU core in gpu.q will serve as host (or parent) to one GPU
Disabling all.q
Once the current all.q configuration is saved via the command qconf -sq all.q >
all.q.txt, it can be disabled using the command qmod -f -d all.q (a quick check
that the queue is indeed disabled is sketched after the listing). The contents of
all.q.txt should look something like below:
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpich mpi orte
rerun FALSE
slots 1,[compute-0-0.local=8], \
[compute-0-1.local=8], \
[compute-0-2.local=8], \
[compute-0-3.local=8]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
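With all.q disabled, qstat -f should show a d (disabled) in the states column for
each of its queue instances, along these lines (output abbreviated and illustrative):

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/8          0.01     lx26-amd64    d
all.q@compute-0-1.local        BIP   0/0/8          0.02     lx26-amd64    d
...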
Creating cpu.q
Copy all.q.txt as cpu.q.txt and make it look as follows (note the new qname, seq_no and the halved slot counts):
qname cpu.q
hostlist @allhosts
seq_no 10
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpich mpi orte
rerun FALSE
slots 1,[compute-0-0.local=4], \
[compute-0-1.local=4], \
[compute-0-2.local=4], \
[compute-0-3.local=4]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
The command, qconf -Aq cpu.q.txt, should create the new queue cpu.q. One may
run 16 single-processor jobs OR one 16-processor job (OR any plausible combination in
between that brings the total to 16 slots) in this queue at any given time, and these jobs
will not be able to access GPUs. For testing purposes, one may use this Hello, World!
program, using the following submission script:
#! /bin/bash
#
# Save this file as hello_world_cpu.sh and submit it to the queue using the
# command: qsub hello_world_cpu.sh
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe mpich 8
#$ -q cpu.q
#

# This assumes that the PATH variable knows about the 'mpirun' command
mpirun -np $NSLOTS -machinefile $TMPDIR/machines hello_world.x
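A quick way to confirm that such a job lands in the intended queue (the job ID and
queue instance will vary):

# Submit the test job and verify that it is scheduled into cpu.q
qsub hello_world_cpu.sh
qstat -u $USER   # the queue column should read cpu.q@compute-0-X.local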
Creating gpu.q
Copy all.q.txt as gpu.q.txt and make it look as follows (note the new qname, seq_no, the halved slot counts and the complex_values gpu=TRUE entry):
qname gpu.q
hostlist @allhosts
seq_no 20
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpich mpi orte
rerun FALSE
slots 1,[compute-0-0.local=4], \
[compute-0-1.local=4], \
[compute-0-2.local=4], \
[compute-0-3.local=4]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values gpu=TRUE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
The command, qconf -Aq gpu.q.txt, should create the new queue gpu.q. One may
run 16 single CPU+GPU jobs at any given time and, by design (please see the sample job
submission script below), each job will use only one CPU core and one GPU. For testing
purposes, one may use this Hello, World! program, using the following submission script:
#! /bin/bash
#
# Save this file as hello_world_gpu.sh and submit it to the queue using the
# command: qsub hello_world_gpu.sh
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q gpu.q
#$ -hard -l gpu=1
#

./hello_world_cuda.x
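As with the CPU case, the scheduling and the GPU usage can be verified after
submission (hostnames and output are illustrative):

# Submit the GPU test job; the '-l gpu=1' request is mandatory because the
# 'gpu' complex was defined with requestable=FORCED
qsub hello_world_gpu.sh
qstat -u $USER   # the queue column should read gpu.q@compute-0-X.local

# While the job runs, the corresponding GPU shows up in nvidia-smi on that node
rocks run host compute-0-0 'nvidia-smi'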
Thanks be to
The Rocks mailing list, the Grid Engine mailing list, and their participants.
