Abstract
This paper presents the performance of the Myrinet Gb/s LAN with the TCP/IP communication protocol in a cluster of Pentium based workstations. We systematically measured the bandwidth and latency of the different layers of the TCP/IP software and the Myrinet firmware/hardware, in order to expose bottlenecks in different levels of the communication protocol stack. Our study identified several bottlenecks that could impair the performance of TCP/IP for other Gb/s LANs, e.g., Gigabit Ethernet. The paper presents details of our performance measurements and suggests means to improve the overall performance of the communication subsystem.
1 Introduction
and ways to speed up TCP/IP. Section 5 presents related work dealing with Gigabit LAN software. Our conclusions and possible further optimizations are given in Section 6.
may be blocked (stopped) temporarily by the receiver. This flow control is provided by every link. The packet size is not limited and may be of any reasonable length.
The Myrinet "data link" (network interface) layer includes the device driver in the operating system and the corresponding host network interface board (see Figure 1 for a general layout). Together they handle all the low-level details of the interface with the media.
[Figure 1: General hardware layout: Host 1 and Host 2, each with host RAM and a network interface, connected by the wire.]
[Figure: Transfer rate (Mbit/s) over the wire for DMA vs. PIO as a function of transfer size; panel (a) covers 1024-16384 bytes, panel (b) covers 4-128 bytes.]
[Figure: Bandwidth (Mbit/s) of DMA vs. MCP, and the corresponding latency [msec], for packet sizes of 64-8192 bytes.]
in FM [13].
[Figure: Bandwidth (Mbit/s) of the MCP vs. the link level for packet sizes of 64-8192 bytes.]
for this performance degradation is the "gather" on send and the "scatter" on receive, which imply several DMA transfers for each such operation.
[Figure: The measured protocol stack, from the user level through the kernel and the socket API down to the wire.]
This section presents the performance of the network layer, the transport layers and the socket API. For each layer, we executed a set of tests that measured the bandwidth from that layer down to the wire. More specifically, we injected data directly into each of the tested layers and measured the throughput (data transfer rate) over the wire. Note that each of these tests reflects the maximal available bandwidth to the layer just above the tested layer. For example, any protocol above IP can get no more than the measured bandwidth of the IP layer.
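The idea behind these measurements can be illustrated at the top of the stack with a simple loopback throughput test. This is only an illustrative Python sketch of the method, not the harness used in the paper (the paper's tests injected data directly into kernel layers below the socket API, which requires kernel instrumentation); the function name and parameters are ours:

```python
import socket
import threading
import time

def measure_throughput(total_bytes=4 << 20, chunk=8192):
    """Send total_bytes over a loopback TCP connection and
    return the achieved one-way rate in Mbit/s (wall-clock)."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))   # any free port
    srv.listen(1)
    port = srv.getsockname()[1]

    def sink():
        conn, _ = srv.accept()
        while conn.recv(65536):  # drain until sender closes
            pass
        conn.close()

    t = threading.Thread(target=sink)
    t.start()

    c = socket.socket()
    c.connect(("127.0.0.1", port))
    buf = b"x" * chunk
    start = time.perf_counter()
    sent = 0
    while sent < total_bytes:
        c.sendall(buf)
        sent += len(buf)
    c.close()
    t.join()                     # include receiver drain time
    elapsed = time.perf_counter() - start
    srv.close()
    return sent * 8 / elapsed / 1e6
```

A real per-layer test repeats this with data injected below the socket API, so that each layer's cost is isolated from the layers above it.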
3.1 The network (IP) layer
This layer handles the movement of packets over the network. It performs packet routing, fragmentation and defragmentation of packets to fit the packet size to the underlying network, gateway services, etc. Each packet delivered to this layer includes the data and the destination IP address. The address determines the outgoing interface necessary to deliver the packet to its destination. Then the network layer performs fragmentation (if needed) according to the Maximal Transfer Unit (MTU) supported by the hardware, and sends the fragments in separate IP packets. The receiver is responsible for collecting the different fragments until the original packet is obtained.
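The fragmentation step described above can be sketched as follows. This is a simplified illustration assuming IPv4-style conventions (fragment offsets aligned to 8-byte units) and a nominal 20-byte header; the function and its parameters are ours, not part of any referenced implementation:

```python
def fragment(payload: bytes, mtu: int, header: int = 20):
    """Split a payload into fragments that fit the MTU.
    Each non-final fragment's data size is rounded down to a
    multiple of 8, since offsets are counted in 8-byte units."""
    max_data = (mtu - header) // 8 * 8
    frags = []
    off = 0
    while off < len(payload):
        end = min(off + max_data, len(payload))
        more = end < len(payload)        # "more fragments" flag
        frags.append((off, payload[off:end], more))
        off = end
    return frags
```

For example, a 4000-byte payload with a 1500-byte MTU yields three fragments of 1480, 1480 and 1040 data bytes, which the receiver reassembles by offset.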
The performance of the network layer is affected by the route searching and by the fragmentation and defragmentation operations. The route searching is usually applied only to the first packet of the current connection and therefore has relatively small overhead. In contrast, the fragmentation/defragmentation of the packet may cause significant network overhead.
[Figure 7: Bandwidth (Mbit/s) of the standard IP layer vs. the device driver for packet sizes of 64-8192 bytes.]
Figure 7 shows the bandwidth of the network layer of the Myrinet for different packet sizes. This bandwidth approaches 557.8 Mb/s for the maximal packet size. For reference, the figure also presents the bandwidth of the device driver, which is only a few percent faster than the network layer. This small difference is due to the fact that there is no need for routing, because the Myrinet is a LAN, and to the large MTU (8 KB) used, which eliminates the use of the fragmentation mechanism.
3.2 The transport (TCP) layer
The TCP based transport layer provides a reliable data flow and multiplexing for the application layer. Reliability is achieved by acknowledging received packets, by retransmission of lost packets and by an Internet checksum calculation for data integrity. Multiplexing allows the opening of a number of independent connections between two hosts. Multiplexing incurs a significant latency, to match an incoming packet with the destination user's "port", a user defined number that identifies the connection. This matching may slow down packet receiving in hosts that have a large number of open connections.
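The checksum referred to above is the standard Internet checksum: the 16-bit one's-complement of the one's-complement sum of the data, as specified in RFC 1071. A minimal sketch:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 style Internet checksum over a byte string."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold carry
    return ~total & 0xFFFF
```

Because the sum is one's-complement, appending the checksum to the data and recomputing yields zero, which is how the receiver verifies integrity. Note that this is a data-touching operation: every byte of the packet must be read.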
The transport layer performs two data touching operations: a memory copy that saves the data for possible retransmission, and the checksum calculation. Other time consuming operations of the transport layer support different kinds of synchronization, such as waiting for an acknowledgment, for a retransmission, or for the allocation of space in the receive or send buffer. Each such synchronization is performed by stopping the corresponding process, followed by a context switch, which results in a loss of the remaining time quantum. To get an idea of the bandwidth lost due to such synchronization points, consider a system in which the process time quantum is 20 ms, and communication software that can deliver 518 Mb/s. Assuming that, on average, half of the time quantum is lost at each synchronization point, the process loses 5.18 Mb/s of its bandwidth.
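The arithmetic of this example can be spelled out as follows. The text leaves the frequency of synchronization points implicit; assuming one synchronization point per second reproduces the quoted figure:

```python
quantum = 0.020          # process time quantum: 20 ms
bandwidth = 518e6        # deliverable bandwidth: 518 Mb/s
syncs_per_second = 1.0   # assumption: one synchronization point per second

lost_time = (quantum / 2) * syncs_per_second  # seconds of CPU time lost per second
lost_bw = bandwidth * lost_time               # bandwidth lost to synchronization
```

With these numbers the loss is 1% of the available bandwidth, i.e., 5.18 Mb/s; more frequent synchronization points lose proportionally more.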
[Figure 8: Bandwidth (Mbit/s) of TCP vs. IP for packet sizes of 1024-30000 bytes.]
The bandwidth of TCP asymptotically approaches 412 Mb/s, about a 26% slowdown relative to the IP layer.
In BSD derived UNIX-like systems, the interface between the application layer and the kernel protocol layers is called the "socket" layer. The socket layer, also called the socket API, is the generic single entry point for user programs to all the networking software in the kernel. The socket layer is protocol independent. All network related system calls start at this layer. When a system call includes data movement, the socket layer is responsible for copying the data from user space to kernel space. This copy allows the reuse of the data buffers immediately after a send operation and also provides synchronized send semantics to the application layer.
The bandwidth of the socket API via UDP and via TCP is shown in Figure 9. The corresponding latencies are shown in Figure 10. Since the UDP protocol is much simpler than TCP, its bandwidth is better for packet sizes greater than 4 KB, and its latency is consistently better for all packet sizes. For smaller packets, the bandwidth of TCP is better than that of UDP due to the use of the "delay sending" mechanism. Note that the maximal bandwidths of UDP and TCP for a packet size of 16 KB are 325.2 Mb/s and 250.5 Mb/s, respectively.
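For reference, the "delay sending" behavior is what an application can switch off per connection, via the standard TCP_NODELAY socket option, when small-packet latency matters more than the coalescing benefit. A minimal sketch:

```python
import socket

# Disable the delayed-send (Nagle) coalescing on a TCP socket, so
# small writes go out immediately instead of being batched.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
s.close()
```

With the option left at its default, consecutive small writes are coalesced into fewer, larger packets, which is why TCP outperforms UDP for small packet sizes in the measurements above.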
[Figure 9: Bandwidth (Mbit/s) of the socket API via UDP vs. TCP for packet sizes of 256-30000 bytes.]
[Figure 10: Latency [msec] of the socket API for packet sizes of 64-8192 bytes.]
The TCP/IP protocol suite has four "data-touching" operations: a copy from the user to the kernel, a checksum calculation, a backup copy for retransmission and a copy to the network interface SRAM. To get an idea of the overhead of these operations, we measured the bandwidth of the data copy from the user to the kernel address space, performed by the socket layer. The results, shown in the lower graph of Figure 11, reflect a relatively low bandwidth, which we attributed to the rate of the memory copy operation. To prove this point, we measured the rate of the memory to memory copy, using the same bcopy routine that is used by the socket API. The results of these measurements for different packet sizes are shown in the upper graph of Figure 11. From the figure it follows that the maximal rate of the memory copy is 480 Mb/s, while the maximal bandwidth of the socket layer is 430 Mb/s.
[Figure 11: Bandwidth (Mbit/s) of the memory copy vs. the socket layer for packet sizes of 64-32768 bytes.]
The close proximity, of about 10-20%, between the resulting graphs indicates that most of the overhead of the socket layer is due to the relatively slow rate of the memory copy operation. This implies that in order to provide high bandwidth, future network protocols should reduce the number of memory copy operations. Alternatively, a substantial improvement in the memory subsystem, e.g., a wider bus and faster memory, could solve this problem.
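The memory-to-memory copy measurement can be approximated at user level along these lines. This Python sketch only illustrates the method; its absolute numbers are not comparable to the kernel bcopy rates reported above, and the function name and parameters are ours:

```python
import time

def copy_bandwidth(size=8 << 20, reps=20):
    """Time repeated in-memory copies of a size-byte buffer and
    return the achieved rate in Mbit/s."""
    src = bytearray(size)
    dst = bytearray(size)
    start = time.perf_counter()
    for _ in range(reps):
        dst[:] = src            # bulk memory-to-memory copy
    elapsed = time.perf_counter() - start
    return size * reps * 8 / elapsed / 1e6
```

Comparing such a copy rate with a layer's end-to-end bandwidth, as done with Figure 11, shows how much of the layer's overhead is pure data movement.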
Fig. 12. Bandwidth of the socket API via MNP vs. TCP
4.2 Faster processors
Fig. 13. Bandwidth of MNP vs. TCP for the socket API on Pentium II 300 MHz
The results are presented in Figure 13. In comparison to the results shown in Figure 12, the bandwidth obtained with the faster Pentium II for all of the above protocols is increased by 10-15% over the bandwidth of the Pentium Pro 200 MHz for the different packet sizes. The maximal bandwidth was 320 Mb/s for both MNP and UDP, and 278 Mb/s for TCP.
5 Related Work
PM [16] is a high-performance communication library for parallel processing. Its maximal reported throughput is 294 Mb/s on SUN workstations. PM supports network context switching for a multi-user parallel processing environment and FIFO message delivery. It uses a scalable flow-control algorithm. To decrease the send latency, an immediate sending scheme is used; this scheme is an optimized version of the double-buffering scheme. The PM interface is simple and includes only a few primitives: for sending to and receiving from the network, for initializing the library, and for switching channel contexts. The main drawback of PM is its use of polling.
6 Conclusions
This paper presented the performance of TCP/IP and the data link layers of the Myrinet Gb/s LAN when used for general purpose communication services via the standard protocol stack of an Operating System (OS) kernel. As long as TCP/IP continues to be the dominant communication software, it is necessary to find ways to improve its performance to match that of modern LANs. Special communication libraries that access the network directly from the application layer prove the feasibility of building such software. However, this approach is unlikely to replace TCP/IP, because it is not designed for general purpose communication, it lacks the capability to interact with the rest of the communication subsystem, and it is not transparent to the applications.
We have exposed several bottlenecks in the communication layers. One bottleneck is the bandwidth of the PCI bus, which is more than two times slower than the aggregate bandwidth of the Myrinet and the Gigabit Ethernet. The bandwidth of the memory copy operation was also found to be a major bottleneck for the TCP/IP protocol suite, particularly because this operation is used several times.
The main conclusions of this paper are that in order to deliver Gigabit per second communication to the user, it is necessary to speed up the PCI bus and the memory copy transfer rates. We showed that the checksum calculation and the backup copy of the transport layer could be eliminated. This leaves two mandatory copy operations in the TCP/IP protocol suite, needed to preserve its semantics. Therefore, improvement of the memory copy operation is essential. For the Myrinet LAN, it is necessary to reduce the DMA engine latency and to further optimize the MCP.
The forthcoming upgrade to Gb/s LANs, such as the Gigabit Ethernet, will not achieve its intended performance with existing platforms. This has already been observed in early measurements conducted by several research groups. Further improvements of the buses and memories of the host machine are necessary, as well as optimization of the networking software. The 66 MHz, 64 bit PCI, which is already available from some hardware vendors, is a step in the right direction. Another mandatory improvement is to increase the memory copy bandwidth. Alternatively, the development of a network interface chip that implements the TCP protocol could deliver Gb/s communication.
Acknowledgments
References
[9] J.S. Kay. Path IDs: A Mechanism for Reducing Network Software Latency. PhD thesis, Computer Science and Engineering, University of California, San Diego, 1995.
[10] A. Mainwaring and D.E. Culler. Active Message Application Programming Interface and Communication Subsystem Organization. Technical report, Univ. of California, Berkeley, 1995.
[11] R.P. Martin, A.M. Vahdat, D.E. Culler, and T.E. Anderson. Effects of communication latency, overhead, and bandwidth in a cluster architecture. In Proc. 24th Annual Int. Symp. on Computer Architecture (ISCA), June 1997.
[12] I. Metrik. The Mosix Network Protocol. Master's thesis, Computer Science Institute, The Hebrew University of Jerusalem, May 1998.
[13] S. Pakin, V. Karamcheti, and A.A. Chien. Fast Messages: Efficient, Portable Communication for Workstation Clusters and MPPs. IEEE Concurrency, 5(2):60-73, 1997.
[14] C. Partridge. Gigabit Networking. Addison-Wesley, Reading, MA, 1994.
[15] W.R. Stevens. TCP/IP Illustrated, The Protocols. Addison-Wesley, Reading, MA, 1994.
[16] H. Tezuka, A. Hori, Y. Ishikawa, and M. Sato. PM: An Operating System Coordinated High Performance Communication Library. In Proc. Int. Conf. on High-Performance Computing and Networking (HPCN Europe 1997), pages 708-717, April 1997.
[17] T. von Eicken, A. Basu, and W. Vogels. U-Net: a user level network interface for parallel and distributed computing. In Proc. 15th ACM Symp. on Operating Systems Principles, pages 40-53, 1995.