Specification: InfiniBand Architecture Specification, Release 1.2 (Oct. 5, 2004), available from the InfiniBand Trade Association (http://www.infinibandta.org)
Potential improvements
The specification allows InfiniBand to be used as a wide area network, but it is mostly adopted as a system/storage area network.
Topology:
Irregular
Regular: fat tree
Link speed:
2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X).
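The three link widths are multiples of one 2.5 Gbps lane. A minimal sketch of that arithmetic follows. The 8b/10b encoding factor used to derive a data rate is not stated in these notes and is included as an assumption about these first-generation links; the function names are illustrative only.

```python
LANE_SIGNALING_GBPS = 2.5      # per-lane signaling rate, from the notes
ENCODING_EFFICIENCY = 8 / 10   # assumption: 8b/10b line encoding


def signaling_rate_gbps(lanes):
    """Raw signaling rate of a link built from `lanes` lanes."""
    return lanes * LANE_SIGNALING_GBPS


def data_rate_gbps(lanes):
    """Rate left for data after the assumed encoding overhead."""
    return signaling_rate_gbps(lanes) * ENCODING_EFFICIENCY


for lanes in (1, 4, 12):
    print(f"{lanes}X: {signaling_rate_gbps(lanes)} Gbps signaling, "
          f"{data_rate_gbps(lanes):.1f} Gbps data")
```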
Transport layer
Reliable/unreliable, connection/datagram
Packet format:
Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet.
Global Route Header (GRH): 40 bytes. Used for routing between subnets.
Base Transport Header (BTH): 12 bytes. Used for IBA transport.
Reliable Datagram Extended Transport Header (RDETH): 4 bytes. Used only for reliable datagram.
Datagram Extended Transport Header (DETH): 8 bytes.
RDMA Extended Transport Header (RETH): 16 bytes.
Atomic, ACK, Atomic ACK, and Immediate Data extended transport headers: 4 bytes, optimized for small packets.
Invalidate extended transport header.
Invariant CRC and variant CRC: CRCs over the fields that do not change in transit and the fields that do, respectively.
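The header sizes above make it easy to estimate per-packet overhead. The sketch below adds up the headers for a few packet types. The CRC sizes (4-byte invariant CRC, 2-byte variant CRC) and the header combinations chosen for each packet type are assumptions not taken from these notes, and the helper names are hypothetical.

```python
# Header sizes in bytes, as listed in the notes.
HEADER_BYTES = {"LRH": 8, "GRH": 40, "BTH": 12, "RDETH": 4, "DETH": 8, "RETH": 16}

# Assumption: 4-byte invariant CRC and 2-byte variant CRC trail every packet.
CRC_BYTES = {"ICRC": 4, "VCRC": 2}


def overhead_bytes(headers):
    """Total non-payload bytes for a packet carrying the given headers."""
    return sum(HEADER_BYTES[h] for h in headers) + sum(CRC_BYTES.values())


def efficiency(headers, payload_bytes):
    """Fraction of the bytes on the wire that are payload."""
    return payload_bytes / (payload_bytes + overhead_bytes(headers))


# Assumed header combinations, for illustration only.
rdma_write_local = ["LRH", "BTH", "RETH"]            # RDMA WRITE inside one subnet
datagram_routed = ["LRH", "GRH", "BTH", "DETH"]      # datagram crossing subnets

print(overhead_bytes(rdma_write_local))              # 8 + 12 + 16 + 4 + 2
print(overhead_bytes(datagram_routed))               # 8 + 40 + 12 + 8 + 4 + 2
print(round(efficiency(rdma_write_local, 256), 3))
```

The global route header dominates the overhead of a routed packet, which is one reason small messages within a subnet are cheaper than messages between subnets.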
Switching is based on the destination port address (LID).
Multipath switching is supported by allocating multiple LIDs to one port.
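A toy model may help show how multiple LIDs per port give multiple paths. It assumes the LID Mask Control (LMC) scheme, in which a port answers to 2^LMC consecutive LIDs starting at an aligned base LID. The `Switch` class, its method names, and the port numbers are hypothetical and are not part of any real API.

```python
def lids_for_port(base_lid, lmc):
    """Return the 2**lmc consecutive LIDs that address one port."""
    count = 1 << lmc
    if base_lid % count != 0:
        # Assumption: the base LID is aligned so its low `lmc` bits are zero.
        raise ValueError("base LID must be aligned to 2**lmc")
    return list(range(base_lid, base_lid + count))


class Switch:
    """Toy switch that forwards purely on the destination LID."""

    def __init__(self):
        self.forwarding_table = {}  # destination LID -> output port

    def set_route(self, dlid, out_port):
        self.forwarding_table[dlid] = out_port

    def forward(self, dlid):
        return self.forwarding_table[dlid]


# One destination port reachable under two LIDs (LMC = 1).
dest_lids = lids_for_port(base_lid=8, lmc=1)

# Route each LID through a different output port, giving two paths.
switch = Switch()
for out_port, dlid in enumerate(dest_lids, start=1):
    switch.set_route(dlid, out_port)

for dlid in dest_lids:
    print(f"DLID {dlid} -> output port {switch.forward(dlid)}")
```

A sender selects a path simply by choosing which of the destination's LIDs to put in the packet; the switches themselves stay destination-based.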
Verbs
OS/users access the adapter through verbs.
Communication mechanism: Queue Pair (QP)
Supports the four types of services, including reliable connection service.
Each connection takes one QP on each end.
Each QP has a send queue and a receive queue.
Users can post send requests to the send queue and receive requests to the receive queue.
Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), MEMORY-BINDING.
One receive operation (matching SEND).
Queue Pair:
The status of the result of an operation (send/receive) is stored in the completion queue.
Send/receive queues can bind to different completion queues.
To communicate:
Make system calls to set up everything (open QP, bind QP to port, bind completion queues, connect local QP to remote QP, register memory, etc.).
Post send/receive requests.
Check completion.
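The three steps above can be sketched with a small in-memory model. This is not the verbs API: the classes, the `progress` method standing in for the adapter, and the completion tuples are all hypothetical, and memory registration and port binding are omitted.

```python
from collections import deque


class CompletionQueue:
    """Holds completion entries until the user polls them."""

    def __init__(self):
        self.entries = deque()

    def poll(self):
        # Return the oldest completion, or None if nothing has completed.
        return self.entries.popleft() if self.entries else None


class QueuePair:
    """One end of a connection: a send queue plus a receive queue."""

    def __init__(self, send_cq, recv_cq):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.send_cq = send_cq    # the two queues may use different CQs
        self.recv_cq = recv_cq
        self.remote = None

    def connect(self, other):
        self.remote, other.remote = other, self

    def post_send(self, wr_id, data):
        self.send_queue.append((wr_id, data))

    def post_recv(self, wr_id, buffer):
        self.recv_queue.append((wr_id, buffer))

    def progress(self):
        """Stand-in for the adapter: match each SEND with a posted receive."""
        while self.send_queue:
            wr_id, data = self.send_queue.popleft()
            if not self.remote.recv_queue:
                # Toy behaviour only: report an error if no receive is posted.
                self.send_cq.entries.append((wr_id, "error"))
                continue
            recv_id, buffer = self.remote.recv_queue.popleft()
            buffer[:len(data)] = data
            self.remote.recv_cq.entries.append((recv_id, "ok"))
            self.send_cq.entries.append((wr_id, "ok"))


# Step 1: set up QPs and completion queues, then connect the two ends.
send_cq_a, recv_cq_a, cq_b = CompletionQueue(), CompletionQueue(), CompletionQueue()
qp_a = QueuePair(send_cq_a, recv_cq_a)   # separate CQs for send and receive
qp_b = QueuePair(cq_b, cq_b)             # one CQ shared by both queues
qp_a.connect(qp_b)

# Step 2: post a receive on B, then a matching SEND on A.
buf = bytearray(16)
qp_b.post_recv(wr_id=1, buffer=buf)
qp_a.post_send(wr_id=2, data=b"hello")
qp_a.progress()

# Step 3: check completion on both sides.
send_done = send_cq_a.poll()
recv_done = cq_b.poll()
print(send_done, recv_done, bytes(buf[:5]))
```

The receive must be posted before the SEND is processed, which mirrors the note that the single receive operation exists to match SEND.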
SilverStorm 9024:
24 ports 4X (10 Gbps) or 8 ports 12X (30 Gbps)
Switch type: cut-through
Switch latency: < 140 ns
Switch bandwidth: 480 Gbps
Forwarding table size: 48K
VL support: 8 + 1 management
SilverStorm 9240:
24 expansion slots, each expansion module with 12 ports 4X or 4 ports 12X (24 x 12 = 288, a 288 by 288 switch)
Switch type: cut-through
Switch latency: < 140 ns to < 420 ns
Switch bandwidth: 5.76 Tbps
Forwarding table size: 48K
VL support: 8 + 1 management
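The port counts and switch bandwidths quoted for both switches can be cross-checked with a few lines of arithmetic. The sketch assumes the quoted switch bandwidth counts both directions of every port, which is not stated in the notes but is consistent with the numbers; the function name is illustrative.

```python
def aggregate_bandwidth_gbps(ports, port_gbps, full_duplex=True):
    """Sum of port rates, doubled when both directions are counted."""
    return ports * port_gbps * (2 if full_duplex else 1)


# SilverStorm 9024: 24 ports at 4X, or 8 ports at 12X.
bw_9024_4x = aggregate_bandwidth_gbps(24, 10)
bw_9024_12x = aggregate_bandwidth_gbps(8, 30)

# SilverStorm 9240: 24 slots, each with 12 ports at 4X.
ports_9240 = 24 * 12
bw_9240 = aggregate_bandwidth_gbps(ports_9240, 10)

print(bw_9024_4x, bw_9024_12x)       # Gbps
print(ports_9240, bw_9240 / 1000)    # ports, Tbps
```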
Improving the messaging software (software-to-hardware interface): no chance.
Improving the MPI implementation over InfiniBand: similar to our current work on Ethernet.
Message scheduling for collective/point-to-point communications based on the network topology.
Exploring NIC features (buffers in NIC, multicast).
Reducing the number of instructions in a library routine makes sense.
Compiled communication can be used to optimize the MPI library implementation (e.g., reducing the number of message copies, posting requests early, using RDMA).
For a sparse traffic pattern, the maximum channel load can usually be minimized using the minimum interference principle.
Minimum interference routing needs to be extended to provide load-balanced, deadlock-free routing.
The best way to realize the IBA subnet manager (SM) is still unclear at this time, so we can probably do something here.
Irregular network or Fat tree network
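The remark on maximum channel load and minimum interference can be made concrete with a small example. The sketch below counts how many flows cross each channel and uses a simple greedy rule to pick, for each flow, the candidate path that overlaps least with paths already chosen. The greedy rule is an illustrative stand-in, not the routing algorithm these notes have in mind, and the two-spine topology and all names are hypothetical. Deadlock freedom is not addressed.

```python
from collections import Counter


def max_channel_load(paths):
    """Largest number of paths that share any single channel."""
    load = Counter(channel for path in paths for channel in path)
    return max(load.values(), default=0)


def greedy_min_interference(candidates):
    """Pick one path per flow, preferring paths whose channels are least used."""
    load = Counter()
    chosen = []
    for options in candidates:
        best = min(
            options,
            key=lambda path: (max(load[c] for c in path), sum(load[c] for c in path)),
        )
        chosen.append(best)
        load.update(best)   # each channel on the chosen path gains one flow
    return chosen


# Two leaf switches (L0, L1) joined by two spine switches (S0, S1).
via_s0 = [("L0", "S0"), ("S0", "L1")]
via_s1 = [("L0", "S1"), ("S1", "L1")]

# Two flows from L0 to L1, each free to use either spine.
candidates = [[via_s0, via_s1], [via_s0, via_s1]]

naive = [via_s0, via_s0]                      # both flows on the same spine
spread = greedy_min_interference(candidates)  # flows kept apart

print(max_channel_load(naive), max_channel_load(spread))
```

With only two flows the traffic is sparse, so the paths can be kept fully disjoint and no channel carries more than one flow.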