[...] design of PVFS2. As shown in Fig. 1, its [...]

[Fig. 1. PVFS2 over Quadrics: a PVFS2 client and a Quadrics IO server, each layering BMI over lib{elan,elan4}, communicate across the Quadrics interconnect; a meta server and its storage complete the system.]

There is a unique chain DMA mechanism over Quadrics. In this mechanism, one or more DMA operations can be configured as chained operations with a single NIC-based event. When the event is fired, all the DMA operations will be posted to the Quadrics Elan4 DMA engine. Based on this mechanism, the default Quadrics software release provides noncontiguous communication operations in the form of elan_putv and elan_getv. However, these operations are specifically designed for the shared memory programming model (SHMEM) over Quadrics. The final placement of the data still requires a memory copy from the global memory to the application destination memory.
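As a concrete illustration, the following minimal C sketch models the chain DMA idea: several DMA descriptors are attached to one NIC-based event, and firing that single event posts all of them. Every helper name here (nic_alloc_event, nic_chain_dma, nic_fire_event) is a hypothetical stand-in for illustration only, not the actual Elan4 API.

```c
/*
 * Minimal sketch of the chain DMA idea: several DMA descriptors are
 * tied to one NIC-based event, and firing that single event posts
 * every chained DMA to the DMA engine.  All helper names below are
 * hypothetical illustrations, not the actual Elan4 interfaces.
 */
#include <stddef.h>

struct dma_desc {
    void   *src;               /* source address of one fragment */
    void   *dst;               /* destination address            */
    size_t  len;               /* fragment length in bytes       */
};

struct nic_event;              /* opaque NIC-resident event */

extern struct nic_event *nic_alloc_event(void);             /* hypothetical */
extern void nic_chain_dma(struct nic_event *ev,
                          const struct dma_desc *d);        /* hypothetical */
extern void nic_fire_event(struct nic_event *ev);           /* hypothetical */

/* Configure n DMA operations as one chain behind a single event. */
static void chained_dma(const struct dma_desc *frags, int n)
{
    struct nic_event *ev = nic_alloc_event();
    for (int i = 0; i < n; i++)
        nic_chain_dma(ev, &frags[i]);  /* attach DMA i to the event    */
    nic_fire_event(ev);                /* one trigger posts all n DMAs */
}
```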
3. Designing Zero-Copy Quadrics Scatter/Gather for PVFS2 List IO

Noncontiguous IO access is the main access pattern in scientific applications. Thakur et al. [18] noted that it is important to achieve high-performance MPI-IO with native noncontiguous access support in file systems. PVFS2 provides a list IO interface to support such noncontiguous IO accesses. Fig. 2 shows an example of noncontiguous IO with PVFS2.

[Fig. 2. An Example of PVFS2 List IO]

In PVFS2 list IO, communication between clients and servers over noncontiguous memory regions [...] the combined source memory. List IO can be built on top of interconnects with native scatter/gather communication support; otherwise, it often resorts to memory packing and unpacking, as the sketch below illustrates.
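The packing fallback can be sketched in plain C as follows. The iovec_list type is illustrative rather than PVFS2's actual list IO structures; the point is the extra memcpy on each side of the wire.

```c
/*
 * The packing fallback for noncontiguous IO: fragments are staged
 * through one contiguous buffer with an extra memcpy on each side.
 * The iovec_list type is illustrative, not PVFS2's actual structures.
 */
#include <stdlib.h>
#include <string.h>

struct iovec_list {
    void  **addrs;    /* addresses of the noncontiguous fragments */
    size_t *lens;     /* length of each fragment                  */
    int     count;    /* number of fragments                      */
};

/* Gather all fragments into a newly allocated contiguous buffer;
 * the caller transmits it and frees it afterwards. */
static void *pack(const struct iovec_list *l, size_t *total)
{
    size_t sum = 0;
    for (int i = 0; i < l->count; i++)
        sum += l->lens[i];

    char *buf = malloc(sum);
    if (buf == NULL)
        return NULL;

    char *p = buf;
    for (int i = 0; i < l->count; i++) {
        memcpy(p, l->addrs[i], l->lens[i]);   /* extra copy on the sender   */
        p += l->lens[i];
    }
    *total = sum;
    return buf;
}

/* Scatter a received contiguous buffer back into the fragments. */
static void unpack(const struct iovec_list *l, const char *buf)
{
    for (int i = 0; i < l->count; i++) {
        memcpy(l->addrs[i], buf, l->lens[i]); /* extra copy on the receiver */
        buf += l->lens[i];
    }
}
```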
To support zero-copy PVFS2 list IO, we propose a software zero-copy scatter/gather mechanism with a single event chained to multiple RDMA operations. Fig. 3 shows a diagram of this mechanism.

[Fig. 3. Zero-copy scatter/gather with a single event chained to multiple RDMA operations: the host posts the chain through the NIC command port; when the single event fires, multiple RDMA operations carry the source memory fragments directly into the destination memory at the destination NIC.]
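In code, the proposed mechanism might be sketched as below. The descriptor and event helpers are again hypothetical names introduced for this sketch (a real implementation programs the Elan4 command port directly); the key property is that a single chained event triggers one RDMA per fragment pair.

```c
/*
 * Sketch of the proposed zero-copy scatter/gather: one RDMA descriptor
 * per source/destination fragment pair, all chained to a single event.
 * Helper names are hypothetical; a real implementation would program
 * the Elan4 command port directly.
 */
#include <stddef.h>

struct frag { void *addr; size_t len; };        /* one memory fragment */
struct rdma_desc { struct frag src, dst; };     /* one RDMA operation  */

struct nic_event;                               /* opaque NIC event    */
extern struct nic_event *nic_alloc_event(void);           /* hypothetical */
extern void nic_chain_rdma(struct nic_event *ev,
                           const struct rdma_desc *d);    /* hypothetical;
                                                             copies *d    */
extern void nic_fire_event(struct nic_event *ev);         /* hypothetical */
extern void nic_wait_event(struct nic_event *ev);         /* hypothetical */

/* Move n noncontiguous fragments with no intermediate copy: each RDMA
 * transfers one source fragment straight into its destination fragment. */
static void zero_copy_list_io(const struct frag *src,
                              const struct frag *dst, int n)
{
    struct nic_event *ev = nic_alloc_event();
    for (int i = 0; i < n; i++) {
        struct rdma_desc d = { src[i], dst[i] };
        nic_chain_rdma(ev, &d);        /* chain RDMA i behind the event */
    }
    nic_fire_event(ev);                /* single event posts all RDMAs  */
    nic_wait_event(ev);                /* one completion for the chain  */
}
```

Unlike the packing sketch earlier, no staging buffer and no memcpy appear on either side, and the single event also provides one completion point for the entire noncontiguous transfer.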
[Fig. 6. Performance of the MPI-Tile-IO Benchmark: aggregated bandwidth for write and read]

5. Related Work

Optimizing noncontiguous IO has been the topic of various research efforts, covering a spectrum of IO subsystems. For example, data sieving [17] and two-phase IO [15] have been leveraged to improve the client IO accesses of parallel applications in parallel IO libraries such as MPI-IO. Wong et al. [19] optimized Linux vector IO by allowing IO vectors to be processed in a batched manner.

Ching et al. [5] have implemented list IO in PVFS1 to support efficient noncontiguous IO and evaluated its performance over TCP/IP. Wu et al. [20] have studied the benefits of leveraging InfiniBand hardware scatter/gather operations to optimize noncontiguous IO access in PVFS1. In [20], a scheme called Optimistic Group Registration is also proposed to reduce the costs of registration and deregistration needed for communication over the InfiniBand fabric. Our work exploits a communication mechanism with a single event chained to multiple RDMA operations to support zero-copy noncontiguous network IO in PVFS2.

6. Conclusions

[...] IO. We then evaluate the performance benefits of this support to PVFS2 IO accesses and a scientific application, MPI-Tile-IO. Our experimental results indicate that Quadrics scatter/gather support is beneficial to the performance of PVFS2 IO accesses. In terms of aggregated read and write bandwidth, the performance of the MPI-Tile-IO application benchmark can be improved by 113% and 66%, respectively. Moreover, our implementation of PVFS2 over Quadrics significantly outperforms an implementation of PVFS2 over InfiniBand.

In the future, we intend to leverage more features of Quadrics to support PVFS2 and study their possible benefits to different aspects of parallel file systems. For example, we intend to study the benefits of integrating Quadrics NIC memory into the PVFS2 memory hierarchy, such as data caching with client- and/or server-side NIC memory. We also intend to study the feasibility of directed IO over PVFS2/Quadrics using the Quadrics programmable NIC as the processing unit.
References

[1] The Parallel Virtual File System, version 2. http://www.pvfs.org/pvfs2.
[2] J. Beecroft, D. Addison, F. Petrini, and M. McLaren. QsNet-II: An Interconnect for Supercomputing Applications. In Proceedings of Hot Chips '03, Stanford, CA, August 2003.
[3] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29–36, 1995.
[4] P. H. Carns, W. B. Ligon III, R. Ross, and P. Wyckoff. BMI: A Network Abstraction Layer for Parallel I/O, April 2005.
[5] A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp. Noncontiguous I/O through PVFS. In Proceedings of the IEEE International Conference on Cluster Computing, Chicago, IL, September 2002.
[6] D. Nagle, D. Serenyi, and A. Matthews. The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage. In Proceedings of Supercomputing '04, November 2004.
[7] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks: An Engineering Approach. The IEEE Computer Society Press, 1997.
[8] J. Huber, C. L. Elford, D. A. Reed, A. A. Chien, and D. S. Blumenthal. PPFS: A High Performance Portable Parallel File System. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 385–394, Barcelona, Spain, July 1995. ACM Press.
[9] IBM Corp. IBM AIX Parallel I/O File System: Installation, Administration, and Use. Document Number SH34-6065-01, August 1995.
[10] InfiniBand Trade Association. http://www.infinibandta.org.
[11] Intel Scalable Systems Division. Paragon System User's Guide, May 1995.
[12] N. Nieuwejaar and D. Kotz. The Galley Parallel File System. Parallel Computing, 23(4):447–476, June 1997.
[13] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A Parallel File System for Linux Clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317–327, Atlanta, GA, October 2000.
[14] Quadrics, Inc. Quadrics Linux Cluster Documentation.
[15] J. Rosario, R. Bordawekar, and A. Choudhary. Improved Parallel I/O via a Two-Phase Run-Time Access Strategy. In Workshop on Input/Output in Parallel Computer Systems, IPPS '93, Newport Beach, CA, 1993.
[16] R. B. Ross. Parallel I/O Benchmarking Consortium. http://www-unix.mcs.anl.gov/rross/pio-benchmark/html/.
[17] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. PASSION: Optimized I/O for Parallel Applications. IEEE Computer, 29(6):70–78, June 1996.
[18] R. Thakur, W. Gropp, and E. Lusk. On Implementing MPI-IO Portably and with High Performance. In Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pages 23–32. ACM Press, May 1999.
[19] P. Wong, B. Pulavarty, S. Nagar, J. Morgan, J. Lahr, B. Hartner, H. Franke, and S. Bhattacharya. Improving Linux Block I/O Layer for Enterprise Workload. In Proceedings of the Ottawa Linux Symposium, 2002.
[20] J. Wu, P. Wyckoff, and D. K. Panda. Supporting Efficient Noncontiguous Access in PVFS over InfiniBand. In Proceedings of the IEEE International Conference on Cluster Computing, Hong Kong, December 2003.
[21] W. Yu, S. Liang, and D. K. Panda. High Performance Support of Parallel Virtual File System (PVFS2) over Quadrics. In Proceedings of the 19th ACM International Conference on Supercomputing, Boston, MA, June 2005.
[22] W. Yu, T. S. Woodall, R. L. Graham, and D. K. Panda. Design and Implementation of Open MPI over Quadrics/Elan4. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS '05), Denver, CO, April 2005.