
Performance of the Communication Layers of

TCP/IP with the Myrinet Gigabit LAN

Amnon Barak 1, Ilia Gilderman, Igor Metrik


Institute of Computer Science, The Hebrew University of Jerusalem,
Jerusalem 91904, Israel

Abstract
This paper presents the performance of the Myrinet Gb/s LAN with the TCP/IP
communication protocol in a cluster of Pentium based workstations. We
systematically measured the bandwidth and latency of the different layers of the
TCP/IP software and the Myrinet firmware/hardware, in order to expose
bottlenecks in different levels of the communication protocol stack. Our study
identified several bottlenecks that could impair the performance of TCP/IP for
other Gb/s LANs, e.g., Gigabit Ethernet. The paper presents details of our
performance measurements and suggests means to improve the overall performance
of the communication subsystem.

Key words: TCP/IP; Myrinet; Gigabit LAN; Communication bottlenecks

1 Introduction

The improved performance of Local Area Network (LAN) hardware, which nowadays
provides Gigabit per second (Gb/s) throughput and low latency [4,14], has not
been matched so far by proportional improvements in the performance of
TCP-UDP/IP [15], the most widely used networking software for general purpose
communication. TCP is known to suffer from major performance degradations due
to inherent overhead and bottlenecks in its layered protocol suite design. For
example, early measurements showed that Gigabit Ethernet improved throughput by
less than a factor of two over Fast Ethernet [8].
1 E-mail: amnon@cs.huji.ac.il
© 1999 Elsevier Science B.V.
In order to gain a better understanding of the causes of the communication
overhead, we measured the performance of the Myrinet [7] Gb/s LAN using TCP/IP.
Our study aimed to identify bottlenecks that could impair the performance of
TCP/IP for other Gb/s LANs, e.g., Gigabit Ethernet [3]. More specifically, we
systematically measured the bandwidth and the latency of the different layers
of the TCP/IP protocol and the Myrinet firmware/hardware for general purpose
applications, starting from the user level down to the wire. These tests were
executed in a cluster of PCI-based, Pentium-Pro 200 MHz workstations running
BSDI's BSD/OS. For each test we generated the maximal amount of data and
measured the resulting bandwidth at the bottom (wire) layer. This process was
carried out for the data-link, network (IP) and transport (TCP) layers and the
socket API. We also conducted tests to measure the latency [11] of the
different layers.
Our measurements expose several bottlenecks in the execution of the TCP/IP
communication protocol for any Gb/s LAN. Unlike recent studies [11,17] which
analyzed the bandwidth and latency of small packets, the emphasis of this
paper is the maximal performance for all practical packet sizes. This includes
small packets, which characterize NFS and RPC network traffic, as observed in
several distributed systems [5,11], as well as larger packets, which
characterize the communication patterns of data intensive numerical
applications and fast remote paging [2].
Our study could benefit operating system [1,6] and application [8] developers
that intend to use the TCP/IP communication software with Gb/s LANs [2]. The
main advantage of TCP/IP is its popularity and the fact that it is transparent
to the application level. Its main disadvantage is its relatively poor
performance. An alternative is to develop a special communication library that
allows direct access to the network from the application layer. Examples of
such packages are Active Messages (AM) [10], Illinois Fast Messages (FM) [13],
PM [16] and U-Net [17]. The advantage of these packages is that the software
can be tuned to specific needs and can achieve high bandwidth and low latency.
The main disadvantages of this approach are that it is not integrated within
the operating system, and thus lacks the capability to interact with the rest
of the communication subsystem, and that it is not transparent to the
applications. Nevertheless, the performance achieved by these packages proves
the feasibility of building high speed communication software whose performance
differs by only a few percent from that of the Gb/s LAN hardware.
The paper is organized as follows: Section 2 presents the Myrinet LAN and the
performance of its data link layer. We point out several hardware and firmware
bottlenecks in the host I/O bus, the Myrinet DMA engine and the firmware.
Section 3 presents the performance of the layers of the TCP/IP protocol for
different packet sizes. Section 4 presents further performance measurements
and ways to speed up TCP/IP. Section 5 presents some related work dealing
with Gigabit LAN software. Our conclusions and possible further optimizations
are given in Section 6.

2 The Myrinet LAN and its Data Link Performance

Myrinet [7] is a Gigabit-per-second (Gb/s) LAN based on the technology used
for packet communication in massively parallel computers. The Myrinet
technology consists of host interfaces, switches, and LAN links that can
operate at 1.28 Gb/s in each direction.
The Myrinet host interface includes a RISC processor with up to 1 MByte of
SRAM. This memory is used for buffering packet data and also for storing the
Myrinet Control Program (MCP). The interface to the Myrinet network, the
processor and a DMA engine are contained in a custom-VLSI chip, called the
LANai. The MCP, which is executed within the network interface, controls the
transfer of packet data between the host's memory and the network. It is also
responsible for network mapping and route discovery. Although the error
rate on the Myrinet links is extremely low, each packet carries a
cyclic-redundancy-check (CRC) byte, to detect packets whose data has been
corrupted by cable or connector faults. The CRC used by Myrinet is the same
CRC-8 that is used in the header of ATM packets. The LANai is also able to
calculate the standard Internet checksum during DMA transfers.
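
As a reference for the kind of computation the LANai can offload, the following
is a minimal C sketch of the standard Internet checksum (RFC 1071) over a
buffer; it is an illustration only, not Myricom's firmware code.

#include <stddef.h>
#include <stdint.h>

/* Sketch of the standard Internet checksum (RFC 1071), the computation
 * that the LANai can perform during DMA transfers; not Myricom's code. */
uint16_t inet_checksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;

    while (len > 1) {                      /* sum 16-bit words */
        sum += (uint32_t)p[0] << 8 | p[1];
        p   += 2;
        len -= 2;
    }
    if (len == 1)                          /* pad an odd trailing byte */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)                      /* fold carries into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;                 /* one's complement */
}
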
The Myrinet switches are multiple-port components that switch (route) a
packet from the incoming channel of a port to the outgoing channel of the port
selected by the packet header. Myrinet switches employ cut-through routing.
If the selected outgoing channel is not already occupied by another packet,
the head of the incoming packet is advanced into this outgoing channel as
soon as the head of the packet is received and decoded. The packet is then
spooled through this established path until the path is broken by the tail of
the packet. If the selected outgoing channel is occupied by another packet
or is blocked, the incoming packet is blocked. The network topology may be
viewed as an undirected graph: any way of linking together computer interfaces
and switches that forms a connected graph is allowed. When a packet enters
a switch, the leading byte of the header determines the outgoing port before
being stripped off the packet. When a packet enters a host interface, the
leading byte identifies the type of the packet, e.g., a mapping packet, a
network management packet, a data packet, etc.
A Myrinet link is composed of a full-duplex pair of Myrinet channels. Channels
convey packets, which are arbitrary-length sequences of packet-data bytes. The
channel maintains the framing of packets. The flow of information on a channel
may be blocked (stopped) temporarily by the receiver. This flow control is
provided by every link. The packet size is not limited and may be of any
reasonable length.

2.1 The Myrinet data link layer performance

The Myrinet "data link" (network interface) layer includes the device driver
in the operating system and the corresponding host network interface board
(see Figure 1 for a general layout). Together they handle all the low level
details of the interface with the medium.
Fig. 1. Myrinet data link layout: on each host, the device driver (software)
and host RAM are connected over the PCI bus to the Myrinet host interface and
its SRAM (hardware), and the two interfaces are connected by the wire.

2.1.1 The link layer hardware performance


The Myrinet LAN is capable of transferring up to 1.28 Gb/s simultaneously in
each direction. To obtain this performance, the host I/O bus should be capable
of transferring at least 2.56 Gb/s, to allow the Myrinet to operate at its
maximal rate. However, the current theoretical PCI bus transfer rate is only
1.056 Gb/s (32-bit words at 33 MHz = 132 MByte/s), almost 2.5 times slower
than the maximal Myrinet transfer rate. The obvious conclusion is that it is
necessary to speed up the PCI bus by at least that factor. This will be
achieved when the PCI bus is 64 bits wide and operates at 66 MHz.
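
The arithmetic behind these figures can be summarized in a short sketch; the
constants are the ones quoted above, and the comparison with a 64-bit, 66 MHz
bus is an extrapolation added here for illustration.

#include <stdio.h>

int main(void)
{
    double myrinet_dir = 1.28;               /* Gb/s per direction            */
    double needed      = 2 * myrinet_dir;    /* both directions: 2.56 Gb/s    */
    double pci_32_33   = 32 * 33e6 / 1e9;    /* 1.056 Gb/s (132 MByte/s)      */
    double pci_64_66   = 64 * 66e6 / 1e9;    /* 4.224 Gb/s                    */

    printf("required host bus rate: %.2f Gb/s\n", needed);
    printf("32-bit/33 MHz PCI:      %.3f Gb/s\n", pci_32_33);
    printf("64-bit/66 MHz PCI:      %.3f Gb/s\n", pci_64_66);
    return 0;
}
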
To measure the performance of the Myrinet hardware, we conducted three
tests. The first test checked the bandwidth of the raw transfer over the wire,
i.e., between two Myrinet interface boards. This test was conducted using
dedicated MCP software, without any other software protocol, memory copy
or PCI transfer. The second test checked the performance of the DMA engine
of the Myrinet host interface over the PCI bus. We tested the raw transfer rate
between the host RAM and the Myrinet interface SRAM (in both directions).
Fig. 2. Myrinet wire, DMA and PIO bandwidth: (a) transfer sizes up to 16 KB,
(b) transfer sizes of 4-128 bytes.


The third test checked the programmed I/O (PIO) performance of the Myrinet
board, using the host processor.
The results of these tests for different data block sizes are shown in
Figure 2(a). From the figure it follows that the Myrinet raw transfer rate over
the wire asymptotically approaches 1 Gb/s in each direction. The corresponding
DMA engine performance asymptotically approaches 910 Mb/s, about 15%
slower than the peak theoretical transfer rate of the 33 MHz PCI bus. We note
that this slowdown is due to the overhead of the bus data transfer setup, which
becomes quite significant for small packets. Figure 2(b) magnifies the results
for packet sizes of 4-128 bytes. From the figure it follows that for packet
sizes of up to 128 bytes, programmed I/O performs better than the DMA engine.
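
This crossover suggests a simple heuristic for the host-side send path. The
sketch below is hypothetical: pio_copy(), dma_start() and the 128-byte
threshold are illustrative names and values, not part of the actual Myrinet
driver.

#include <stddef.h>

/* Hypothetical helpers standing in for the driver's two transfer paths. */
extern void pio_copy(const void *data, size_t len);   /* CPU writes words to board SRAM */
extern void dma_start(const void *data, size_t len);  /* DMA engine pulls from host RAM */

#define PIO_THRESHOLD 128   /* bytes; the crossover observed in Figure 2(b) */

/* Hypothetical send-path helper: small transfers go through programmed
 * I/O, larger ones through the DMA engine. */
void send_to_interface(const void *data, size_t len)
{
    if (len <= PIO_THRESHOLD)
        pio_copy(data, len);
    else
        dma_start(data, len);
}
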

2.1.2 Myricom's firmware performance


The Myrinet Control Program (MCP) is responsible for transferring data between
the host's memory and the network, as well as for performing network mapping
and routing. Its main functions are:
- Hardware initialization;
- Dynamic self mapping, which initiates "scout" messages to all potential ports
in order to create a map of the network;
- Message sending/receiving to/from the network;
- Broadcast/multicast support.
The bandwidth of the MCP for different packet sizes is shown in Figure 3.
For comparison, the bandwidth of the DMA engine is also shown.

Fig. 3. MCP vs. DMA bandwidth


From the figure it follows that the bandwidth of the MCP approaches 728 Mb/s
for the maximal packet size. This bandwidth represents a 20% slowdown relative
to the bandwidth of the DMA engine for this packet size. Observe that the MCP
slowdown is much more significant for small packet sizes. This is due to its
multiple functions, which imply a relatively large overhead for small packets.
Fig. 4. The MCP latency


The one-way latency of the MCP for different packet sizes is shown in Figure 4.
From the corresponding measurements it follows that for packet sizes of up to
128 bytes this latency is approximately 85 μsec. For larger packets the latency
increases linearly, as seen in the figure. For example, the segment size used
by the TCP protocol is 8 KB, so the latency of the MCP for sending such a
segment is approximately 180 μsec.
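
From these two measured points one can fit a simple linear latency model; the
sketch below is only a curve fit to the figures quoted above and is not part
of the MCP.

#include <stdio.h>

/* Linear latency model fitted to the two measured points quoted in the
 * text: ~85 usec up to 128 bytes and ~180 usec for an 8 KB segment. */
double mcp_latency_usec(double bytes)
{
    const double base     = 85.0;                           /* fixed per-packet overhead */
    const double per_byte = (180.0 - 85.0) / (8192 - 128);  /* ~0.0118 usec per byte     */
    return bytes <= 128 ? base : base + per_byte * (bytes - 128);
}

int main(void)
{
    /* Estimated latency for a 4 KB packet (~132 usec under this model). */
    printf("estimated MCP latency for 4 KB: %.0f usec\n", mcp_latency_usec(4096));
    return 0;
}
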
We note that both the bandwidth and the latency overheads of the MCP could
be reduced with a "thin interface" MCP. Such an MCP could be optimized for
specific communication patterns, e.g., small, fixed-size packets, as
implemented in FM [13].

2.1.3 The link layer software performance


The Myrinet device driver is responsible for allocating memory buffers and for
transferring data over the network. It acts as a translator, converting the
generic requests received from the operating system into commands that the
Myrinet board can understand. After initialization, the device driver performs
two operations: injecting outgoing frames into the network and receiving
incoming frames. The send operation is usually synchronous. It requires only a
few setup steps, e.g., setting the buffer address, after which the operation is
performed with almost no additional overhead. The receive operation is
asynchronous. It usually incurs more overhead than the send operation, because
it must handle the triggering interrupt and allocate the memory buffers. In the
case of the Myrinet, the memory buffer must be aligned and contiguous. This
last requirement increases the latency of the receive operation.
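
As an illustration of the alignment requirement, a user-space analogue of such
a buffer allocation might look as follows; this is a sketch using
posix_memalign, whereas the real driver allocates kernel memory, and the
alignment and size values are illustrative.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void  *rx_buf;
    size_t align = 4096;    /* illustrative: page alignment       */
    size_t size  = 8192;    /* illustrative: one MTU-sized packet */

    /* The driver must hand the DMA engine an aligned, contiguous buffer;
     * posix_memalign only models the alignment part of that requirement. */
    int err = posix_memalign(&rx_buf, align, size);
    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", err);
        return 1;
    }
    /* ... hand rx_buf to the receive path ... */
    free(rx_buf);
    return 0;
}
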
Fig. 5. Device driver vs. MCP bandwidth


The bandwidth of the link layer (device driver) is shown in Figure 5. The
bandwidth is shown for packet sizes ranging from 64 bytes to 8 KB, which is the
Myrinet driver's maximal transfer unit (MTU). We note that this is the standard
packet size used by NFS and for most remote bulk transfers of data. For
comparison, the figure also shows the bandwidth of the MCP, previously shown
in Figure 3.
From the figure it follows that for the maximal packet size, the bandwidth of
the device driver is 558.3 Mb/s, about 23% slower than that of the MCP and
about 39% slower than that of the DMA engine. Observe that for packet sizes of
up to 2 KB the bandwidth of the device driver is only a few percent slower than
that of the MCP. The main reason for the performance degradation for large
packets is the time consuming allocation of memory buffers. Another reason
for this performance degradation is the "gather" on send and "scatter" on
receive, which imply several DMA transfers for each such operation.

3 Performance of the TCP/IP Protocol Suite

TCP/IP is the most widely used form of communication between computers. The
TCP/IP protocol suite allows different computers, e.g., from different vendors,
running different operating systems, to communicate with each other [15].
Following the OSI model, each layer of TCP/IP includes one or more protocols
that interact with the lower and higher layers through well defined interfaces.
The TCP/IP protocol suite consists of four main layers: the data link layer,
which was presented in the previous section; the network layer; the transport
layer; and the application layer, which interacts with the transport layer via
the socket interface. See Figure 6 for a general layout of the protocol suite.

Fig. 6. The TCP/IP protocol suite layers: user processes (application) above
the socket API; TCP/UDP at the transport layer; IP at the network layer; the
hardware interface at the link layer, down to the wire.

This section presents the performance of the network layer, the transport layer
and the socket API. For each layer, we executed a set of tests that measured
the bandwidth from that layer down to the wire. More specifically, we injected
data directly into each of the tested layers and measured the throughput (data
transfer rate) over the wire. Note that each of these tests reflects the
maximal bandwidth available to the layer just above the tested layer. For
example, any protocol above IP can get no more than the measured bandwidth of
the IP layer.
3.1 The network (IP) layer

This layer handles the movement of packets over the network. It performs
packet routing, fragmentation and defragmentation of packets to fit them into
the underlying network, gateway services, etc. Each packet delivered to this
layer includes the data and the destination IP address. The address determines
the outgoing interface necessary to deliver the packet to its destination. The
network layer then performs fragmentation (if needed) according to the
Maximal Transfer Unit (MTU) supported by the hardware, and sends the
fragments in separate IP packets. The receiver is responsible for collecting
the different fragments until the original packet is obtained.

The performance of the network layer is affected by route searching and by the
fragmentation and defragmentation operations. Route searching is usually
applied only to the first packet of the current connection and therefore has
relatively small overhead. In contrast, the fragmentation/defragmentation of a
packet may cause significant overhead.
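
As a rough illustration of the fragmentation cost, the sketch below counts the
fragments needed for a datagram payload; the 20-byte IPv4 header and the 8-byte
fragment granularity are the usual IPv4 values, and the payload size is
illustrative.

#include <stdio.h>

/* Number of IP fragments needed for a payload, given the link MTU.
 * Each fragment carries a 20-byte IPv4 header, and the data in every
 * fragment except the last must be a multiple of 8 bytes. */
int ip_fragments(int payload, int mtu)
{
    int per_frag = ((mtu - 20) / 8) * 8;    /* usable payload per fragment */
    return (payload + per_frag - 1) / per_frag;
}

int main(void)
{
    /* An 8000-byte datagram fits in one packet over an 8 KB MTU, but
     * needs six fragments over a 1500-byte Ethernet MTU. */
    printf("%d fragment(s) with an 8 KB MTU\n", ip_fragments(8000, 8192));
    printf("%d fragment(s) with a 1500-byte MTU\n", ip_fragments(8000, 1500));
    return 0;
}
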

Fig. 7. Network vs. device driver bandwidth

Figure 7 shows the bandwidth of the network layer of the Myrinet for different
packet sizes. This bandwidth approaches 557.8 Mb/s for the maximal packet
size. For reference, the figure also presents the bandwidth of the device
driver, which is only a few percent faster than the network layer. This small
difference is due to the fact that there is no need for routing, because the
Myrinet is a LAN, and to the large MTU (8 KB) used, which eliminates the use of
the fragmentation mechanism.
3.2 The transport (TCP) layer

The TCP based transport layer provides reliable data flow and multiplexing
for the application layer. Reliability is achieved by acknowledging received
packets, by retransmitting lost packets and by an Internet checksum
calculation for data integrity. Multiplexing allows a number of independent
connections to be opened between two hosts. Multiplexing incurs a significant
latency, to match an incoming packet with the destination user's "port", a
user defined number that identifies the connection. This matching may slow
down packet receiving in hosts that have a large number of open connections.
The transport layer performs two data touching operations: a memory copy
that saves the data for possible retransmission, and the checksum calculation.
Other time consuming operations of the transport layer support different
kinds of synchronization, such as waiting for an acknowledgment, for a
retransmission, or for the allocation of space in the receive or send buffer.
Each such synchronization is performed by stopping the corresponding process,
followed by a context switch which results in the loss of the remaining time
quantum. To get an idea of the amount of bandwidth lost due to such
synchronization points, consider a system in which the process time quantum is
20 ms, and a communication software layer that can deliver 518 Mb/s. Assuming
that, on average, half of the time quantum (10 ms) is lost at each
synchronization point, the process loses 5.18 Mb/s of its bandwidth.
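
The arithmetic of this example is sketched below; the rate of one
synchronization point per second is an assumption added here for the
illustration only.

#include <stdio.h>

int main(void)
{
    /* Illustration of the example in the text. The rate of one
     * synchronization point per second is an assumption added here. */
    double quantum_s   = 0.020;           /* 20 ms time quantum           */
    double bandwidth   = 518.0;           /* Mb/s deliverable             */
    double lost_s      = quantum_s / 2;   /* half a quantum lost per sync */
    double syncs_per_s = 1.0;             /* assumed synchronization rate */

    double lost_mbps = bandwidth * lost_s * syncs_per_s;
    printf("bandwidth lost: %.2f Mb/s\n", lost_mbps);   /* prints 5.18 */
    return 0;
}
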
Fig. 8. Transport (TCP) vs. network (IP) layer bandwidth


To check the performance of the transport layer for TCP we executed a
server-client data transfer test between two kernel level processes. This test
did not include checksum calculations, because the Myrinet LAN has a per-packet
built-in CRC. The results of this test are shown in Figure 8. For reference,
the figure also presents the IP based network layer, previously shown in
Figure 7. From the measurements it follows that the bandwidth of the transport
layer of TCP asymptotically approaches 412 Mb/s, about a 26% slowdown relative
to the IP layer.

3.3 The socket interface

In BSD derived UNIX-like systems, the interface between the application layer
and the kernel protocol layers is called the "socket" layer. The socket layer,
also called the socket API, is the generic single entry point for user
programs to all the networking software in the kernel. The socket layer is
protocol independent. All network related system calls start at this layer.
When a system call includes data movement, the socket layer is responsible
for copying the data from user space to kernel space. This copy allows the
reuse of the data buffers immediately after a send operation and also provides
synchronized send semantics to the application layer.

Another responsibility of the socket layer is flow control. This is achieved
by a generic data flow mechanism. Each socket has associated send and receive
buffers. Each buffer contains control information as well as pointers to the
data stored in buffer (mbuf) chains. Flow management is performed by low/high
water marks. In this method, data is sent only if the amount of data stored in
the send buffer is above the low-water mark and below the high-water mark.
On the receiver side, a receive is performed if the amount of data stored in
the receive buffer is smaller than the high-water mark. If a reliable protocol
is used then there are two cases where the flow control mechanism stops the
process. One case is when there is no space in the receive buffer; in this
case the receiver blocks the sender. The second case is when the sender moves
data faster than the underlying layers; then the sender is stopped. In both of
these cases, the process performs a voluntary context switch, which causes it
to lose its time slice. To minimize the occurrence of such cases, fine
synchronization and tuning between the sender and the receiver are required.
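
For example, an application can enlarge the socket send and receive buffers,
which bound these water marks, through the standard setsockopt() interface;
the sketch below is illustrative and the 256 KB value is arbitrary.

#include <stdio.h>
#include <sys/socket.h>

/* Enlarge the socket-layer send and receive buffers, which bound the
 * high-water marks of the flow-control mechanism described above. */
int tune_socket_buffers(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    return 0;
}

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);        /* a TCP socket */
    if (s < 0 || tune_socket_buffers(s, 256 * 1024) < 0) {
        perror("socket buffer setup");
        return 1;
    }
    return 0;
}
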

The bandwidth of the socket API via UDP and via TCP is shown in Figure 9. The
corresponding latencies are shown in Figure 10. Since the UDP protocol is much
simpler than TCP, its bandwidth is better for packet sizes greater than 4 KB,
and its latency is consistently better for all packet sizes. For smaller
packets, the bandwidth of TCP is better than that of UDP due to the use of the
"delayed sending" mechanism. Note that the maximal bandwidths of UDP and TCP
for a packet size of 16 KB are 325.2 Mb/s and 250.5 Mb/s, respectively.
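
The user-level tests behind these figures follow the usual pattern of streaming
fixed-size blocks through a connected socket and timing the transfer. A minimal
sketch of the sender side is shown below; the block size and repetition count
are illustrative, and connect_to_peer() is a hypothetical stand-in for the
usual socket()/connect() setup.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

extern int connect_to_peer(void);   /* hypothetical: returns a connected TCP socket */

int main(void)
{
    int    sock  = connect_to_peer();
    size_t bsize = 16384;                     /* one test point, e.g. 16 KB */
    long   count = 20000;                     /* number of blocks to send   */
    char  *buf   = calloc(1, bsize);
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (long i = 0; i < count; i++)
        if (write(sock, buf, bsize) != (ssize_t)bsize)
            return 1;                         /* short write: bail out      */
    gettimeofday(&t1, NULL);

    double sec  = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double mbps = (double)count * bsize * 8 / sec / 1e6;
    printf("%.1f Mb/s for %zu-byte blocks\n", mbps, bsize);
    free(buf);
    close(sock);
    return 0;
}
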

Fig. 9. Bandwidth of the Socket API


Fig. 10. Latency of the Socket API


4 Further Measurements and Optimizations

The TCP/IP protocol suite has four "data-touching" operations: a copy from
user space to the kernel, a checksum calculation, a backup copy for
retransmission and a copy to the network interface SRAM. To get an idea of the
overhead of these operations, we measured the bandwidth of the data copy from
the user to the kernel address space, performed by the socket layer. The
results, shown in the lower graph of Figure 11, reflect a relatively low
bandwidth, which we attributed to the rate of the memory copy operation. To
prove this point, we measured the rate of the memory to memory copy, using the
same bcopy routine that is used by the socket API. The results of these
measurements for different packet sizes are shown in the upper graph of
Figure 11. From the figure it follows that the maximal rate of the memory copy
is 480 Mb/s, while the maximal bandwidth of the socket layer is 430 Mb/s.

Fig. 11. Memory copy vs. socket layer copy

The close proximity (about 10-20%) between the resulting graphs indicates
that most of the overhead of the socket layer is due to the relatively slow
rate of the memory copy operation. This implies that in order to provide
high bandwidth, future network protocols should reduce the number of memory
copy operations. Alternatively, a substantial improvement in the memory
subsystem, e.g., a wider bus and faster memory, could solve this problem.
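
A user-level analogue of this memory copy measurement is sketched below, using
memcpy() rather than the kernel's bcopy(); the block size and repetition count
are illustrative, and cache effects make such a user-level figure optimistic
compared with the in-kernel measurement.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    size_t size = 8192;              /* one block size from the test range */
    long   reps = 200000;
    char  *src  = malloc(size), *dst = malloc(size);
    struct timeval t0, t1;

    memset(src, 0xA5, size);         /* touch the pages before timing */
    gettimeofday(&t0, NULL);
    for (long i = 0; i < reps; i++)
        memcpy(dst, src, size);
    gettimeofday(&t1, NULL);

    double sec  = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double mbps = (double)reps * size * 8 / sec / 1e6;
    printf("memory copy rate: %.1f Mb/s for %zu-byte blocks\n", mbps, size);
    free(src);
    free(dst);
    return 0;
}
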

4.1 Optimization of the TCP/IP protocol

One way to increase the performance of TCP/IP is to eliminate some of its
data-touching operations [9]. For this test we developed a new protocol,
called MNP [12], in which the checksum calculation and the backup copy for
retransmission were eliminated. MNP preserves the socket API interface, and it
is connection oriented and reliable. MNP supports all the functionality of
TCP/IP and coexists with it.
The bandwidth of the socket API using MNP vs. TCP for different packet sizes
is shown in Figure 12. The lower graph presents the bandwidth via TCP/IP,
previously shown in Figure 9. The upper graph presents the corresponding
bandwidth via the MNP protocol.
From the figure it can be seen that the bandwidth of MNP is about 20% higher
than that of TCP, with a peak bandwidth of 302 Mb/s for MNP vs. 250 Mb/s
for TCP.

Fig. 12. Bandwidth of the socket API via MNP vs. TCP

4.2 Faster processors

All the measurements presented so far were performed on Pentium-Pro 200 MHz
computers. In order to check how faster Pentium processors improve the
performance, we measured the bandwidth of the socket API, user level to user
level, of MNP, UDP and TCP on Pentium II 300 MHz computers. We used the same
tests that were previously used to check the performance of the socket layer,
as shown in Figure 9.
Fig. 13. Bandwidth of MNP vs. TCP for the socket API on Pentium II 300 MHz

The results are presented in Figure 13. In comparison to the results shown in
Figure 12, the bandwidth obtained with the faster Pentium II is 10-15% higher
than that obtained with the Pentium-Pro 200 MHz, for all of the above protocols
and across the different packet sizes. The maximal bandwidth was 320 Mb/s for
both MNP and UDP, and 278 Mb/s for TCP.
5 Related Work

There are two alternative approaches to developing communication software for
Gb/s LANs. One approach is to develop a low level driver for the LAN interface
board and to rely on existing standard software, e.g., TCP/IP, for the
remaining communication layers. This approach was taken in the current study
as well as by several development groups, all of which reported similar
performance results, even though different platforms were used [1,2,8]. The
main advantage of this approach is that it is transparent to the application
level. Its main disadvantage is its relatively poor performance.
The second alternative is to develop a "thin interface" communication library
that allows direct access to the network from the application layer. The
advantage of this approach is that it can be tuned to specific needs and can
achieve high bandwidth and low latency. The main disadvantage of this approach
is that it lacks the capability to interact with the rest of the communication
subsystem, and that it is not transparent to the applications. Below we
describe three software packages that implement this approach. In all these
implementations the Myricom MCP was replaced by a proprietary MCP.
Illinois Fast Messages (FM) [13] is a low-level software messaging layer that
enables the simplification and streamlining of higher level communication
layers. It provides buffer management, as well as ordered and reliable
delivery, assuming that the underlying network is reliable. The FM interface
consists of only three functions: two for injecting messages into the network,
and a third for dequeuing pending messages. The Myrinet implementation of FM
includes a proprietary MCP that uses an end-to-end window flow control scheme,
which ensures that no packets are lost due to buffer overflow. The FM
implementation uses programmed I/O on the send side, to eliminate the overhead
caused by the use of the DMA engine. The maximal bandwidth of FM is 440 Mb/s
using a Pentium-Pro 200 MHz. In order to achieve this performance FM uses
128 byte fixed-size packet frames. Larger messages are segmented and
reassembled. The main disadvantage of FM is its inability to multiplex (share)
the communication channels between several users. Other drawbacks are the use
of polling instead of interrupts to receive data from the network, and the
requirement to use a special development environment (preprocessor).
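
To give an idea of how small such an interface is, the sketch below shows a
hypothetical three-function messaging API with the shape described above; the
names and signatures are illustrative, not the actual FM prototypes.

#include <stddef.h>

/* Hypothetical three-function messaging interface with the shape
 * described for FM: two send variants and one extract (dequeue) call.
 * These are illustrative declarations, not the real FM prototypes. */
typedef void (*msg_handler_t)(void *buf, size_t len);

/* Inject a short, fixed-size message (fits in one 128-byte frame). */
int  msg_send_small(int dest_node, msg_handler_t handler, void *args);

/* Inject an arbitrary-length message; the layer segments it into
 * fixed-size frames and reassembles it at the receiver. */
int  msg_send(int dest_node, msg_handler_t handler, void *buf, size_t len);

/* Poll the network and run the handlers of any pending messages. */
void msg_extract(void);
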
Active Messages (AM) [10] is another software package that allows direct
application-to-network-interface interaction. An active message is a network
packet which contains the name of a handler function and data for that
handler. When an active message arrives at its destination, the handler is
invoked with the data carried in the message. AM has been implemented for many
hardware systems and software platforms. It has the same drawbacks as FM.
PM [16] is a high-performance communication library for parallel processing.
Its maximal reported throughput is 294 Mb/s on SUN workstations. PM supports
network context switching for a multi-user parallel processing environment and
FIFO message delivery. It uses a scalable flow-control algorithm. To decrease
the send latency, an immediate sending scheme is used; this scheme is an
optimized version of the double-buffering scheme. The PM interface is simple
and includes only a few primitives: for sending/receiving to/from the network,
for initializing the library and for channel context switching. The main
drawback of PM is the use of polling.

6 Conclusions

This paper presented the performance of TCP/IP and of the data link layer of
the Myrinet Gb/s LAN when used for general purpose communication services
via the standard protocol stack of an Operating System (OS) kernel. As long as
TCP/IP continues to be the dominant communication software, it is necessary
to find ways to improve its performance to match that of modern LANs.
Special communication libraries for direct access to the network from the
application layer prove the feasibility of building such software. However,
this approach is unlikely to replace TCP/IP, because it is not designed for
general purpose communication, it lacks the capability to interact with the
rest of the communication subsystem, and it is not transparent to the
applications.
We have exposed several bottlenecks in the communication layers. One
bottleneck is the bandwidth of the PCI bus, which is more than two times
slower than the aggregate bandwidth of the Myrinet and of Gigabit Ethernet.
The bandwidth of the memory copy operation was also found to be a major
bottleneck for the TCP/IP protocol suite, particularly because this operation
is performed several times.
The main conclusion of this paper is that in order to deliver Gigabit per
second communication to the user, it is necessary to speed up the PCI bus and
the memory copy transfer rates. We showed that the checksum calculation and
the backup copy of the transport layer could be eliminated. This leaves two
mandatory copy operations in the TCP/IP protocol suite, needed to preserve its
semantics. Therefore, improvement of the memory copy operation is essential.
For the Myrinet LAN, it is necessary to reduce the DMA engine latency and to
further optimize the MCP.
The forthcoming upgrade to Gb/s LANs, such as Gigabit Ethernet, will not
achieve its intended performance on existing platforms. This fact has already
been observed in early measurements conducted by several research groups.
Further improvements of the buses and memories of the host machine are
necessary, as well as optimization of the networking software. The 66 MHz,
64 bit PCI bus, which is already available from some hardware vendors, is a
step in the right direction. Another mandatory improvement is to increase the
memory copy bandwidth. Alternatively, the development of a network interface
chip that implements the TCP protocol could deliver Gb/s communication.

Acknowledgments

Special thanks to O. La'adan and A. Shiloh for their valuable contributions.
The advice of R. Felderman from Myricom is gratefully acknowledged.
This research was supported in part by the Ministry of Defense and the
Ministry of Science.

References

[1] Gigabit Ethernet with Linux. NASA, CESDIS,
http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html, 1997.
[2] Duke Myrinet TCP/IP Drivers. Duke University,
http://www.cs.duke.edu/ari/manic/ip, 1998.
[3] IEEE 802.3z. The Emerging Gigabit Ethernet Standard. Gigabit Ethernet
Alliance, 1997.
[4] W. Almesberger. High-Speed ATM Networking on Low-end Computer Systems.
In Proc. IEEE Intl. Phoenix Conf. on Computers and Communications, March 1996.
[5] T.E. Anderson, D.E. Culler, and D.A. Patterson. A Case for NOW (Networks
of Workstations). IEEE Micro, 15(1):54-64, February 1995.
[6] A. Barak and O. La'adan. The MOSIX Multicomputer Operating System for
High Performance Cluster Computing. Future Generation Computer Systems,
13(4-5):361-372, March 1998.
[7] N.J. Boden, D. Cohen, R.E. Felderman, A.K. Kulawik, C.L. Seitz,
J.N. Seizovic, and W-K. Su. Myrinet: A Gigabit-per-Second Local Area Network.
IEEE Micro, 15(1):29-36, February 1995.
[8] S.T. Elbert. Preliminary Performance Results of Gigabit Ethernet Card.
Technical Report IS-5126, Ames Laboratory, July 1997.
[9] J.S. Kay. Path IDs: A Mechanism for Reducing Network Software Latency.
PhD thesis, Computer Science and Engineering, University of California, San
Diego, 1995.
[10] A. Mainwaring and D.E. Culler. Active Message Application Programming
Interface and Communication Subsystem Organization. Technical report, Univ.
of California, Berkeley, 1995.
[11] R.P. Martin, A.M. Vahdat, D.E. Culler, and T.E. Anderson. Effects of
communication latency, overhead, and bandwidth in a cluster architecture. In
Proc. 24th Annual Intl. Symp. on Computer Architecture (ISCA), June 1997.
[12] I. Metrik. The MOSIX Network Protocol. Master's thesis, Computer Science
Institute, The Hebrew University of Jerusalem, May 1998.
[13] S. Pakin, V. Karamcheti, and A.A. Chien. Fast Messages: Efficient,
Portable Communication for Workstation Clusters and MPPs. IEEE Concurrency,
5(2):60-73, 1997.
[14] C. Partridge. Gigabit Networking. Addison-Wesley, Reading, MA, 1994.
[15] W.R. Stevens. TCP/IP Illustrated, The Protocols. Addison-Wesley, Reading,
MA, 1994.
[16] H. Tezuka, A. Hori, Y. Ishikawa, and M. Sato. PM: An Operating System
Coordinated High Performance Communication Library. In Proc. Intl. Conf.
on High-Performance Computing and Networking (HPCN Europe 1997), pages
708-717, April 1997.
[17] T. von Eicken, A. Basu, and W. Vogels. U-Net: a user level network
interface for parallel and distributed computing. In Proc. 15th ACM Symp. on
Operating Systems Principles, pages 40-53, 1995.
