(Lecture #3)
1/04/2010
Professor H. T. Kung
Harvard School
of Engineering and Applied Sciences
Three Approaches
Main References
VL2: A Scalable and Flexible Data Center Network, SIGCOMM 2009 (Lecture #1, 12/21/2009)
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric, SIGCOMM 2009 (Lecture #2, 12/23/2009)
BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers, SIGCOMM 2009 (Lecture #3, today's lecture)
Approach 1:
Virtual Layer Two Approach
[Figure: a multi-rooted tree (fat-tree) topology: a complete bipartite Layer-3 interconnection at the top, aggregation and edge switch layers below, and hosts grouped into Pod 0 through Pod 3]
Approach 3:
Server-centric Source-routing
Not a multi-rooted tree!
Source: http://www.datacenterknowledge.com/archives/
2009/09/30/microsoft-unveils-its-container-powered-cloud
Goals
One-to-one
One-to-several (e.g., distributed file systems)
One-to-all (e.g., application data broadcasting)
All-to-all (e.g., MapReduce)
Approach
Requirement 2:
Use of Low-end Commodity Switches
Requirement 3:
Graceful Performance Degradation
Let n be the expansion factor at each level; that is, the total number of servers
increases by a factor of n (4X for n = 4) with each additional level. Throughout this
class, we assume n = 4, unless stated otherwise
A BCube_k at level k is constructed by connecting n = 4 copies of BCube_{k-1} at
level k-1 using n^k n-port switches
Each switch connects n servers, each in a separate BCube_{k-1}
Each server in BCube_k has k + 1 ports, each connecting to a switch at a
separate level
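The recursive construction can be sketched in a few lines. This is a minimal sketch, not the paper's code; the convention of identifying a level-l switch by the server address with digit l removed is an assumption (it is the usual way to realize "each switch connects the n servers differing in one digit"):

```python
from itertools import product

def bcube_links(n, k):
    """Enumerate server-to-switch links of BCube_k with n-port switches.

    Servers are addressed by k+1 base-n digits (a_k, ..., a_0).
    At level l, a server connects to the switch identified by
    (l, address with digit l removed); each such switch joins the
    n servers that differ only in digit l.
    """
    links = []
    for addr in product(range(n), repeat=k + 1):
        for level in range(k + 1):
            # addr[0] is digit a_k, ..., addr[k] is digit a_0
            pos = k - level  # tuple index of digit `level`
            switch = (level, addr[:pos] + addr[pos + 1:])
            links.append((addr, switch))
    return links

links = bcube_links(n=4, k=1)       # BCube_1 with 4-port switches
servers = {s for s, _ in links}
switches = {sw for _, sw in links}
print(len(servers), len(switches))  # 16 servers, 8 switches (4 per level)
```

With n = 4 and k = 1 this reproduces the counts above: n^(k+1) = 16 servers, each with k + 1 = 2 ports, and n^k = 4 switches per level.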
How to Route from Server 00 to Server 21?
Level 1:
Fix 2nd Digit
Level 0:
Fix 1st Digit
The blue path fixes the 1st digit first and then the 2nd digit, whereas the red path
uses the reverse order
Note that the blue and red paths are node-disjoint. This is not an accident!
Question: Are there other paths from 00 to 21?
There is no magic here: the BCube topology is actually the well-known
hypercube topology. Routing over BCube can be understood by examining the
intuitive routing we can easily see on the hypercube
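The digit-correcting routing above can be sketched as follows. This is a minimal sketch; the tuple encoding of addresses and the position numbering (position 0 for the 1st digit) are illustrative assumptions:

```python
def route(src, dst, order):
    """Hypercube-style route: correct one digit of src per hop, in `order`.

    src, dst: tuples of base-n digits; order: sequence of digit positions.
    Returns the sequence of server addresses visited (each hop traverses
    the switch of the level whose digit is being corrected).
    """
    path = [src]
    cur = list(src)
    for pos in order:
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]
            path.append(tuple(cur))
    return path

# 00 -> 21, addresses written as (1st digit, 2nd digit)
blue = route((0, 0), (2, 1), order=[0, 1])  # fix 1st digit, then 2nd
red  = route((0, 0), (2, 1), order=[1, 0])  # reverse order
print(blue)  # [(0, 0), (2, 0), (2, 1)]
print(red)   # [(0, 0), (0, 1), (2, 1)]
```

The two orders visit the distinct intermediate servers 20 and 01, which is exactly why the blue and red paths are node-disjoint: each permutation of digit-correction order yields a different set of intermediates.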
Hypercube
[Figure: binary hypercubes of increasing size: a 2-node binary 1-cube built of two binary 0-cubes labeled 0 and 1; a 4-node binary 2-cube built of two binary 1-cubes labeled 0 and 1; an 8-node binary 3-cube; and a 16-node binary 4-cube, with each node labeled by its binary address]
The 64-Node Hypercube
[Figure: the 64-node (binary 6-cube) hypercube; only sample wraparound links are shown to avoid clutter]
Isomorphic to the 4 x 4 x 4 3D torus (each has 64 x 6/2 = 192 links)
ID of node x: x_{q-1} x_{q-2} ... x_2 x_1 x_0
The q neighbors of node x: complement one bit of x at a time, i.e.,
x_{q-1} x_{q-2} ... x_2 x_1 x'_0, x_{q-1} x_{q-2} ... x_2 x'_1 x_0, ..., x'_{q-1} x_{q-2} ... x_2 x_1 x_0
[Figure: node x = 0101 in the 16-node hypercube, with its q = 4 neighbors reached along dimensions Dim 0 through Dim 3]
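The neighbor rule (complement one bit at a time) is one line of code; a minimal sketch:

```python
def neighbors(x, q):
    """The q neighbors of node x in a binary q-cube: flip one bit each."""
    return [x ^ (1 << i) for i in range(q)]

# Node 0101 in the 16-node (q = 4) hypercube, neighbors for Dim 0..3
print([format(v, "04b") for v in neighbors(0b0101, 4)])
# ['0100', '0111', '0001', '1101']
```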
[Figure: the 16-node BCube: servers labeled by their binary addresses, connected through switches (Sw1, Sw2, Sw3, ...) in place of direct hypercube links]
Hypercube Routing
Gives BCube Routing
[Figure: side by side, the 16-node hypercube and the 16-node BCube; a digit-correcting route on the hypercube maps directly to a route on the BCube through the corresponding switches]
In BSR (BCube Source Routing), the source server decides which path a packet flow
should traverse by probing the network, and encodes the path in the packet
header
Source routing has the following advantages:
The source can control the routing path without coordination with the
intermediate servers (this is well suited to data center management; why?)
Intermediate servers are not involved in routing; they just forward packets
based on the packet header. This simplifies their functionality
By reactively probing the network, we can avoid link-state broadcasting,
which suffers from scalability concerns when thousands of servers are in
operation
When a new flow arrives, the source sends probe packets over multiple
parallel paths. The intermediate servers process the probe packets to fill
in the needed information, e.g., the minimum available bandwidth of their
input/output links. The destination returns a probe response to the
source. When the source receives the responses, it uses a metric to
select the best path, e.g., the one with the maximum available bandwidth
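The probe-then-select logic can be sketched as follows. This is a minimal sketch of the idea, not the paper's implementation; the candidate path set, the per-link available-bandwidth table, and all names here are illustrative assumptions:

```python
def probe(path, avail_bw):
    """Simulate a probe: intermediate hops fill in the path's bottleneck
    (minimum available) bandwidth as the probe traverses each link."""
    return min(avail_bw[link] for link in zip(path, path[1:]))

def select_path(paths, avail_bw):
    """Source-side metric: pick the path whose probe reports the most
    available bandwidth."""
    return max(paths, key=lambda p: probe(p, avail_bw))

# Two parallel candidate paths from server 00 to server 21 (hypothetical)
paths = [("00", "20", "21"), ("00", "01", "21")]
avail_bw = {("00", "20"): 1.0, ("20", "21"): 0.4,
            ("00", "01"): 0.8, ("01", "21"): 0.9}
print(select_path(paths, avail_bw))  # ('00', '01', '21'): bottleneck 0.8 > 0.4
```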
Path Adaptation
During the lifetime of a flow, its path may break due to various
failures, and the network condition may change significantly as
well. The source therefore periodically (say, every 10 seconds) performs
path selection to adapt to network failures and dynamic network
conditions
When an intermediate server finds that the next hop of a packet
is unavailable, it sends a path-failure message back to the
source. As long as paths remain available, the source does not
probe the network immediately when the message is received.
Instead, it switches the flow to one of the available paths
obtained from the previous probing. When the probing timer
expires, the source performs another round of path selection and
tries its best to maintain k + 1 parallel paths
When multiple flows between two servers arrive simultaneously,
they may select the same path. To make things worse, after their
path-selection timers expire, they will probe the network and
switch to another path simultaneously. This results in path
oscillation. We mitigate this symptom by injecting randomness
into the timeout value of the path-selection timers
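Injecting randomness into the timer is a one-liner; a hedged sketch. The 10-second base period comes from the text, but the uniform distribution and the 25% jitter width are illustrative assumptions:

```python
import random

BASE_PERIOD = 10.0  # seconds between path selections (from the text)

def next_probe_timeout(jitter=0.25):
    """Randomized path-selection timeout in
    [BASE_PERIOD * (1 - jitter), BASE_PERIOD * (1 + jitter)], so that
    concurrent flows between the same server pair do not re-probe and
    switch paths in lockstep (avoiding path oscillation)."""
    return BASE_PERIOD * random.uniform(1 - jitter, 1 + jitter)

timeouts = [next_probe_timeout() for _ in range(3)]
print(all(7.5 <= t <= 12.5 for t in timeouts))  # True
```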
As for wiring, Gigabit Ethernet copper wires can be 100 meters long,
which is much longer than the perimeter of a 40-foot container, and
there is enough space to accommodate these wires. We use the 64 servers
within a rack to form a BCube_1 and 16 8-port switches within the rack to
interconnect them
The wires of the BCube_1 stay inside the rack and do not go out. The
inter-rack wires are level-2 and level-3 wires, and we place them on
top of the racks
We divide the 32 racks into four super-racks. A super-rack forms a
BCube_2, and there are two super-racks in each column. We evenly
distribute the level-2 and level-3 switches across all the racks, so that
there are 8 level-2 and 16 level-3 switches within every rack. The level-2
wires stay within a super-rack, and the level-3 wires run between super-racks
Our calculation shows that the maximum number of level-2 and level-3
wires along a rack column is 768 (256 and 512 for level-2 and level-3,
respectively). The diameter of an Ethernet wire is 0.54 cm, so the
maximum space needed is approximately 176 cm^2 < (20 cm)^2. Since the
available height from the top of the rack to the ceiling is 42 cm, there is
enough space for all the wires
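The space estimate checks out if we model each wire as a circle of diameter 0.54 cm and sum the cross-sectional areas (a simplifying assumption; real cable bundles need some extra packing slack):

```python
import math

# 768 wires along one rack column: 256 level-2 + 512 level-3
wires = 256 + 512
# Cross-sectional area of one 0.54 cm diameter wire, times the wire count
area = wires * math.pi * (0.54 / 2) ** 2
print(round(area, 1))  # 175.9 cm^2, comfortably < (20 cm)^2 = 400 cm^2
```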
Graceful Degradation
Implementation Architecture
Implementation Architecture
(Cont.)
Components:
BSR Protocol for Routing
Neighbor Maintenance Protocol (maintains a neighbor status table)
Packet sending/receiving part (interacts with the TCP/IP stack)
Packet Forwarding Engine (relays packets for other servers)
Header:
Sits between the Ethernet header and the IP header
Contains typical fields
Similar to DCell: 1-to-1 mapping between IP and BCube addresses
Different from DCell: every BCube packet stores the complete path
and a next hop index (NHI)
Forwarding:
Only one table lookup per packet:
Gets the packet and checks the NHA (next hop array) for the status and
MAC of the next hop
Checks the Neighbor Status Table to see if the next hop is alive
Recomputes the checksum
Forwards the packet to the identified output port
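The per-hop forwarding step can be sketched as follows. The header carrying the complete path plus a next-hop index (NHI) follows the text; the field names, the dictionary-based status table, and the port map are illustrative assumptions, not the paper's data structures:

```python
# Minimal sketch of the forwarding engine's per-packet work.
def forward(packet, neighbor_alive, output_port):
    """Advance the NHI, check the next hop's liveness, and pick a port.

    Returns the output port for the packet, or None when the next hop
    is down (which would trigger a path-failure message to the source).
    """
    packet["nhi"] += 1
    next_hop = packet["path"][packet["nhi"]]     # from the path in the header
    if not neighbor_alive.get(next_hop, False):  # Neighbor Status Table check
        return None
    return output_port[next_hop]                 # identified output port

# Hypothetical example: relay a packet along the path 00 -> 20 -> 21
pkt = {"path": ["00", "20", "21"], "nhi": 0}
alive = {"20": True, "21": True}
ports = {"20": 1, "21": 0}
print(forward(pkt, alive, ports))  # 1: forward out port 1 toward server 20
```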
Testbed
16 servers in 4 BCube_0s
No disk I/O
Performance Comparisons
Conclusion