
Collective Buffering: Improving

Parallel I/O Performance


By
Bill Nitzberg and Virginia Lo

Outline

Introduction
Concepts
Collective parallel I/O algorithms
Collective buffering experiments
Conclusion
Questions

Introduction
Existing parallel I/O systems evolved directly from I/O systems for serial machines
Serial I/O systems are heavily tuned for:
Sequential, large accesses with limited file sharing between processes
A high degree of both spatial and temporal locality

Introduction (cont.)
This paper presents a set of algorithms known as Collective Buffering algorithms
These algorithms seek to improve I/O performance on distributed memory machines by utilizing global knowledge of the I/O operations

Concepts
Global data structure
The global data structure is the logical view of the data from the application's point of view
Scientific applications generally use global data structures consisting of arrays distributed in one, two, or three dimensions

Concepts (cont.)
Data distribution
The global data structure is distributed among
node memories by cutting it into data chunks.
The HPF BLOCK distribution partitions the global data structure into P equally sized pieces
The HPF CYCLIC distribution divides the global data structure into small pieces (of a given block size) and deals these pieces out to the P nodes in a round-robin fashion (see the sketch below)
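As a rough illustration of the two distributions, here is a minimal sketch in Python; the array size, node count, and function names are assumptions for illustration, not taken from the paper.

# Minimal sketch: which node owns element i of a 1-D global array of n elements
# under HPF-style BLOCK and CYCLIC distributions over p nodes.

def block_owner(i, n, p):
    """Owner of element i under a BLOCK distribution (ceiling(n/p) elements per node)."""
    block = (n + p - 1) // p
    return i // block

def cyclic_owner(i, p, block=1):
    """Owner of element i under a CYCLIC(block) distribution."""
    return (i // block) % p

n, p = 16, 4
print([block_owner(i, n, p) for i in range(n)])   # [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
print([cyclic_owner(i, p) for i in range(n)])     # [0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3]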

Concepts (cont.)
File layout
File layout is another form of data distribution
The file represents a linearization of the global data structure, such as the row-major ordering of a three-dimensional array
This linearization is called the canonical file (see the sketch below)
The file is distributed among the I/O nodes
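The offset of any element in the canonical file follows directly from the row-major linearization. The sketch below is illustrative only; the array dimensions, 8-byte element size, and function name are assumptions, not from the paper.

# Minimal sketch: byte offset of element (i, j, k) of an nx x ny x nz array
# in the row-major canonical file, assuming a fixed element size.

def canonical_offset(i, j, k, ny, nz, elem_size=8):
    return ((i * ny + j) * nz + k) * elem_size

# Example: element (1, 2, 3) of a 4 x 4 x 4 array of 8-byte values
print(canonical_offset(1, 2, 3, ny=4, nz=4))   # ((1*4 + 2)*4 + 3) * 8 = 216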

Collective parallel I/O algorithms
Naïve algorithm
The Naïve algorithm treats parallel I/O the same as workstation I/O
The order of writes depends on the data layout in node memory, which has no relation to the layout of the data on disk
The unit of data transferred in each I/O operation is the data block, the smallest unit of local data that is contiguous with respect to the canonical file

Collective parallel I/O algorithms (cont.)
Naïve algorithm (cont.)
The size of a data block is often very small and unrelated to the size of a file block because of the disparity between the data distribution and file layout parameters
The overall effects are:
The network is flooded with many small messages
Messages arrive at the I/O nodes in an uncoordinated fashion, resulting in highly inefficient disk writes (see the sketch below)
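To see why the data blocks are so small, consider a 2-D array with a (BLOCK, BLOCK) distribution: the largest run of a node's data that is contiguous in the row-major canonical file is a single local row. The sketch below is a back-of-the-envelope calculation; the array size, node grid, and element size are assumptions, not the paper's numbers.

# Minimal sketch: number and size of the writes one node issues under the
# Naïve algorithm for an nx x ny array, (BLOCK, BLOCK)-distributed over a
# px x py node grid, where each local row is one data block.

def naive_writes_per_node(nx, ny, px, py, elem_size=8):
    local_rows = nx // px                 # data blocks (writes) per node
    data_block = (ny // py) * elem_size   # contiguous bytes per write
    return local_rows, data_block

# Example: 1024 x 1024 doubles on a 4 x 4 grid of compute nodes
print(naive_writes_per_node(1024, 1024, 4, 4))   # 256 writes of only 2048 bytes each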

Collective parallel I/O algorithms (cont.)
Collective buffering algorithm
This method rearranges the data on the compute nodes, prior to issuing the I/O operations, to minimize the number of disk operations
The permutation can be performed in place, where the compute nodes transpose the data among themselves
It can also be performed on auxiliary nodes, where the compute nodes transpose the data by sending it to a set of auxiliary buffering nodes (see the sketch below)
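The single-process simulation below conveys the idea of the permutation phase: the (BLOCK, BLOCK) pieces of a 2-D array are reassembled into row panels, each of which is contiguous in the row-major canonical file and can be written with one large I/O operation. The array size, node grid, and helper names are assumptions for illustration; this is not the paper's implementation.

import numpy as np

# Minimal sketch: rearrange (BLOCK, BLOCK) pieces into row panels that are
# contiguous in the canonical file (one large sequential write per panel).

def permute_to_row_panels(local_blocks, px, py):
    panels = []
    for pi in range(px):
        # Stitch the py blocks of node row pi side by side; the result is a
        # contiguous slice of the row-major canonical file.
        panel = np.hstack([local_blocks[(pi, pj)] for pj in range(py)])
        panels.append(panel.ravel())
    return panels

# Example: 8 x 8 global array on a 2 x 2 grid of compute nodes
nx, ny, px, py = 8, 8, 2, 2
g = np.arange(nx * ny).reshape(nx, ny)
blocks = {(pi, pj): g[pi*nx//px:(pi+1)*nx//px, pj*ny//py:(pj+1)*ny//py]
          for pi in range(px) for pj in range(py)}
panels = permute_to_row_panels(blocks, px, py)
assert np.array_equal(np.concatenate(panels), g.ravel())   # matches canonical file order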

Collective parallel I/O algorithms (cont.)
Four techniques are developed and evaluated:
1 - All compute nodes are used to permute the data to a simple HPF BLOCK intermediate distribution in a single step
2 - Refines the first technique by realistically limiting the amount of buffer space and using a distribution that matches the file layout (see the sketch after this list)

Collective parallel I/O algorithms (cont.)
Four techniques (cont.):
3 - Uses an HPF CYCLIC intermediate distribution
4 - Uses scatter/gather hardware to eliminate the latency-dominated overhead of the permutation phase
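One way to picture the limited-buffer refinement (technique 2) is as a pipeline of permute-and-write rounds, each round filling roughly one buffer per I/O node. The sketch below is only a rough model; the 1 MB buffer size echoes the figure quoted in the conclusions, while the file size and I/O-node count are assumptions.

# Minimal sketch: how many permute-and-write rounds are needed when each
# I/O node buffers roughly 1 MB of the canonical file per round.

def buffered_write_rounds(file_bytes, io_nodes, buffer_bytes=1 << 20):
    bytes_per_round = io_nodes * buffer_bytes
    return -(-file_bytes // bytes_per_round)    # ceiling division

# Example: a 1 GB canonical file, 6 I/O nodes, 1 MB buffers
print(buffered_write_rounds(1 << 30, io_nodes=6))   # 171 rounds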

Collective buffering experiments


Experiment systems:
The Paragon consists of 224 processing nodes connected in a 16x32 mesh
Applications space-share 208 compute nodes, each with 32 MB of memory
There are nine I/O nodes, each with one SCSI-1 RAID-3 disk array consisting of five disks of 2 gigabytes each
The parallel file system, PFS, is configured to use 6 of the 9 I/O nodes

Collective buffering experiments (cont.)
Experiment systems (cont.):
The SP2 consists of 160 nodes; each node is an IBM RS6000/590 with 128 MB of memory and a SCSI-1 attached 2 GB disk
The parallel file system, IBM AIX Parallel I/O File System (PIOFS), is configured with 8 I/O nodes (semi-dedicated servers) and 150 compute nodes

Conclusion
Collective buffering significantly improves Naïve parallel I/O performance, by two orders of magnitude for small data block sizes
Peak performance can be obtained with minimal buffer space (approximately 1 megabyte per I/O node)
Performance is dependent on the intermediate distribution (by up to a factor of 2)

Conclusion (cont.)
There is no single intermediate distribution
which provides the best performance for
all cases, but a few come close
Collective buffering with scatter/gather can
potentially deliver peak performance for all
data block sizes.

Questions
What are the advantages and disadvantages of the Naïve algorithm?
What is Collective Buffering, and how can this technique improve parallel I/O performance?
