
GPU Cluster Computing with NVIDIA Jetson TK1 Boards

Ryan Sudhakaran1
Mentor: Weile Wang2
1San Jose State University, 2NASA Ames Research Center

National Aeronautics and Space Administration

San Jose State University CAARE Program


Abstract
The use of GPUs (graphics processing units) for HPC (high-performance computing)
applications is becoming increasingly popular among researchers in the scientific
community. GPUs are designed for efficient graphics rendering, but they are also
optimized for parallel workloads and can be incredibly useful in fields such as
computational fluid dynamics and machine learning. The NVIDIA Jetson TK1
development kit is a miniature computing device built on the NVIDIA Kepler
architecture, which is very efficient at parallel processing. The goal of this project
is to cluster these boards into a small-scale platform for testing future HPC
applications for educational and proof-of-concept purposes.

Background
How does a parallel computation differ from a traditional computation? Imagine a
program with four independent tasks to compute. A serial process performs tasks 1
through 4 sequentially, while a parallel process can compute all four tasks
simultaneously (given four distinct processors). It is fairly intuitive that a
parallel processor allows much faster computation when the number of tasks is
large. Unfortunately, parallelization is not a perfect solution to every
computational problem; it is best suited to problems that can be broken down into
many simple, independent operations (such as large linear-algebra solvers or
sorting algorithms).

Figure 1: A serial processing scheme [1]

Figure 2: A parallel processing scheme [1]
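
To make the serial-vs-parallel contrast in Figures 1 and 2 concrete, here is a
minimal C/OpenMP sketch (illustrative only, not part of the poster's code; the
task body is a hypothetical placeholder) that times four independent tasks run
one after another and then concurrently:

```c
/* Serial vs. parallel execution of four independent tasks.
   Compile with: gcc -O2 -fopenmp tasks.c -o tasks */
#include <stdio.h>
#include <omp.h>

/* Stand-in for an independent unit of work (hypothetical). */
static double task(int id)
{
    double sum = 0.0;
    for (int i = 1; i <= 10000000; i++)
        sum += (double)id / (double)i;
    return sum;
}

int main(void)
{
    double r[5] = {0};

    double t0 = omp_get_wtime();
    for (int id = 1; id <= 4; id++)   /* serial: tasks 1-4 in sequence */
        r[id] = task(id);
    double t_serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for          /* parallel: one task per thread */
    for (int id = 1; id <= 4; id++)
        r[id] = task(id);
    double t_parallel = omp_get_wtime() - t0;

    printf("check %.3f  serial %.3fs  parallel %.3fs\n",
           r[1] + r[2] + r[3] + r[4], t_serial, t_parallel);
    return 0;
}
```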


Tests and Results

To test whether the cluster performed more efficiently than a single node, I used
code provided by John Burkardt [2] that parallelizes the counting of prime numbers
from 1 to N (with N ranging from 1 to 262,144). With parallelization via MPI
(Message Passing Interface), the range can be broken up and distributed among the
nodes, allowing for simultaneous counting. Presented are the results of the
parallel prime program run on a single Jetson vs. the cluster with a varying
number of processes.
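
A minimal sketch of this decomposition (the strided work split and reduction
mirror the approach of Burkardt's prime_mpi, but this is not his exact code, and
the trial-division primality test is a simplification):

```c
/* MPI-parallel prime counting: each rank tests a strided subset of
   candidates, and partial counts are combined with MPI_Reduce. */
#include <mpi.h>
#include <stdio.h>

/* Trial-division primality test; adequate for small N. */
static int is_prime(int n)
{
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

int main(int argc, char *argv[])
{
    int rank, size, n = 262144;   /* largest N used in the tests */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Cyclic distribution: rank r tests 2+r, 2+r+size, 2+r+2*size, ... */
    int local = 0;
    for (int i = 2 + rank; i <= n; i += size)
        local += is_prime(i);

    int total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("primes up to %d: %d\n", n, total);

    MPI_Finalize();
    return 0;
}
```

On the cluster this would be compiled with mpicc and launched with something like
mpirun -np 8 --hostfile hosts ./prime_mpi, where hosts (a hypothetical filename)
lists the Jetson nodes.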



Note that both the multi-node and single-node implementations become more
efficient at the jump from 4 to 8 processes, but the single node becomes less
efficient as the process count grows further. The jump from 16 to 32 processes
has little effect on the runtime.
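
One plausible reading of this plateau (a standard scaling argument, not a
measurement from this work) is Amdahl's law: if a fraction f of the program
parallelizes perfectly and the remainder is serial or communication overhead,
the speedup on p processes is

\[
S(p) = \frac{1}{(1 - f) + f/p},
\]

which approaches 1/(1 - f) as p grows. Once the per-process work becomes small
relative to MPI communication costs, adding processes (e.g., going from 16 to
32) yields little additional speedup.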

Future Improvements
- Develop a simple fluid dynamics simulation or atmospheric model utilizing both GPU and cluster parallelization (CUDA-Aware MPI)
- Test various cluster-based algorithms for machine learning
- Collect data on the most efficient load balancing and optimize workload distribution among nodes and cores

References
[1] https://computing.llnl.gov/tutorials/parallel_comp/
[2] https://people.sc.fsu.edu/~jburkardt/c_src/prime_mpi/prime_mpi.html

Email: ryan.sudhakaran@sjsu.edu
Support provided by the NASA Office of Education's Minority University Research and Education Project,
Contract #NNX15AQ02A
