
Data Center Networks

(Lecture #3)
1/04/2010
Professor H. T. Kung
Harvard School of Engineering and Applied Sciences

Copyright 2010 by H. T. Kung

Three Approaches: Main References

- VL2: A Scalable and Flexible Data Center Network, SIGCOMM 2009 (Lecture #1, 12/21/2009)
- PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric, SIGCOMM 2009 (Lecture #2, 12/23/2009)
- BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers, SIGCOMM 2009 (Lecture #3, today's lecture)
2

Approach 1: Virtual Layer Two Approach

[Figure: multi-rooted tree; complete bipartite layer-3 interconnection between the top switch layers]

- Use a highly redundant, multipath layer-3 network as a virtual layer-2 network
3

Approach 2: The PortLand Approach

[Figure: multi-rooted tree with core, aggregation, and edge switch layers; hosts grouped into Pod 0 through Pod 3]

- Switches discover their position in the topology
- Pseudo MAC (PMAC) addresses are assigned to all end hosts to encode their position in the topology
- The hierarchical PMAC addresses enable efficient, provably loop-free forwarding with small switch state
4

Approach 3: Server-centric Source Routing

- Not a multi-rooted tree!
- This is a peer-to-peer approach in which the peer nodes keep state and do the routing
- Can use commodity switches
- Graceful performance degradation under faulty conditions
- Suited to shipping-container based, modular data centers, where physical access by service personnel can be difficult or not allowed due to regulations
5

Review of Last Week's Exam (1/3)

(1) These are true-false questions.
(a) [2] For VL2, the rack and cluster switches in Ref. #1 can actually be IP routers. (True)
(b) [2] For PortLand, the rack and cluster switches in Ref. #1 can actually be IP routers. (False)
(c) [2] When PortLand uses TCP to avoid packet loss in a data center, a TCP header will need to be added to each packet. (True)
(d) [2] In VL2 and PortLand, when a new host is added to the data center, the network will automatically learn the position of the host so it can be reached by other hosts. (True)
(e) [2] Multicast support is useful for GFS. (True)
(f) [4] In both VL2 and PortLand, multi-rooted tree topologies are used. Is it true that the multi-rooted tree topologies are useful for all of the following purposes: scaling network bandwidth, fault tolerance, and multicast? (True)
6

Review of Last Week's Exam (2/3)

(2) [6] We noted in class that putting servers to sleep will save power, but may make local disks unavailable. Give three ideas on how to solve/alleviate this problem.
Answer: data replication, robotic arms for disk drive insertion/removal, software cache, and putting storage on a switching/network fabric rather than CPU buses.

(3) [15] (The first and second correct answers earn 5 and 10 points, respectively) VL2 and PortLand share similar approaches in several aspects in providing large layer-2 networks for data centers. For example, they both use multi-rooted tree topologies. Please describe two other areas where both methods share similar approaches. Please give succinct answers in bullet form. Hint: think about addressing.
Answer:
i. Hierarchical addressing
   - VL2: hierarchical IP addresses
   - PortLand: hierarchical Pseudo MAC (PMAC) addresses
ii. Separation of host identifier and host location
   - VL2: AA vs. LA
   - PortLand: AMAC vs. PMAC

Review of Last Week's Exam (3/3)

(4) [10 points] VL2 and PortLand share drawbacks in some similar ways. Describe one such area where both methods may potentially have similar performance problems. Please use no more than a total of 30 words in your answers. Hints: think about possible congestion or update issues.
Answer:
i. Congestion problem for "elephant flows"
ii. Update delay and overhead for location addresses (LA and PMAC)

(5) [15 points] When discussing PortLand in class, we showed a three-layer multi-rooted tree based on k-port switches with k = 6 (slide 12 of Lecture #2). We noted that the total amount of bandwidth connecting the top two layers of switches is less than that connecting the bottom two layers of switches. As pointed out by someone in class, we can fix this problem by adding some additional switches in the top layer. How many additional switches do we need? Show the resulting drawing, like the one on slide 12. To save time in drawing, you should just add nodes and links on top of the existing drawing of slide 12.
Answer:
Add three additional switches in the top layer.
For the three switches in the middle layer of each pod, connect each switch to a separate added switch.
8

Container-based Datacenter (1/2)

- Placing the server racks (thousands of servers) into a standard shipping container, and integrating heat exchange and power distribution into the container
- Air handling is similar to in-rack cooling and typically allows higher power densities than regular raised-floor datacenters
- The container-based facility has achieved extremely high energy efficiency ratings compared with typical datacenters today

[Photo: Microsoft Data Center near Chicago (9/30/2009)]
Source: http://www.datacenterknowledge.com/archives/2009/09/30/microsoft-unveils-its-container-powered-cloud
9

Container-based Datacenter (2/2)

- A shipping-container based, modular data center (MDC) offers a new way in which data centers are built and deployed. In an MDC, up to a few thousand servers are interconnected via switches to form the network infrastructure, say, a typical two- or three-level tree in current practice. All the servers and switches are then packed into a standard 20- or 40-foot shipping container
- No longer tied to a fixed location, organizations can place the MDC anywhere they intend and then relocate it as their requirements change
- In addition to a high degree of mobility, an MDC has other benefits, including shorter deployment time, higher system and power density, and lower cooling and manufacturing cost
10

BCube: A Network Architecture for Modular Data Centers

- BCube is a network architecture specifically designed for shipping-container based, modular data centers
- At the core of the BCube architecture is its server-centric network structure, where servers with multiple network ports connect to multiple layers of commercial off-the-shelf (COTS) mini-switches. Servers act not only as end hosts, but also as relay nodes for each other. BCube supports various bandwidth-intensive applications
- BCube exhibits graceful performance degradation as the server and/or switch failure rate increases. This property is of special importance for shipping-container data centers, since once the container is sealed and operational, it becomes very difficult to repair or replace its components
11

Goals

- Support bandwidth-intensive traffic patterns among data center servers:
  - One-to-one
  - One-to-several (e.g., distributed file systems)
  - One-to-all (e.g., application data broadcasting)
  - All-to-all (e.g., MapReduce)
- Beyond using commodity servers, go one step further by using only low-end COTS mini-switches. This option eliminates expensive high-end switches
- Different from a traditional data center, it is difficult or even impossible to service an MDC once it is deployed. Therefore, BCube needs to achieve graceful performance degradation in the presence of server and switch failures
12

Approach

- Take the server-centric approach, rather than the switch-oriented practice. It places intelligence on MDC servers and works with commodity switches
- Provide multiple parallel short paths between any pair of servers
  - BCube not only provides high one-to-one bandwidth, but also greatly improves fault tolerance and load balancing
  - BCube accelerates one-to-x traffic by constructing edge-disjoint complete graphs and multiple edge-disjoint server spanning trees. Moreover, due to its low diameter, BCube provides high network capacity for all-to-all traffic such as MapReduce
- BCube runs a source routing protocol called BSR (BCube Source Routing). BSR places routing intelligence solely onto servers. By taking advantage of the multi-path property of BCube and by actively probing the network, BSR balances traffic and handles failures without link-state distribution (this is a typical p2p probing method). With BSR, the capacity of BCube decreases gracefully as the server and/or switch failure rate increases
- BCube uses more wires than the tree structure. But wiring is a solvable issue for containers, which are at most 40 feet long (a strange argument!)

13

Requirement 1: Support for Bandwidth-intensive Traffic

- One-to-one, the basic traffic model in which one server moves data to another server. For example, this takes place between server pairs that exchange large amounts of data, such as disk backup. Good one-to-one support also results in good several-to-one and all-to-one support
- One-to-several, in which one server transfers the same copy of data to several receivers. Current distributed file systems such as GFS, HDFS, and CloudStore replicate the data chunks of a file several times (typically three) at different chunk servers to improve reliability. When a chunk is written into the file system, it needs to be simultaneously replicated to several servers
- One-to-all, in which a server transfers the same copy of data to all the other servers in the cluster. One-to-all happens in several cases: to upgrade the system image, to distribute application binaries, or to distribute specific application data
- All-to-all, in which every server transmits data to all the other servers. The representative example of all-to-all traffic is MapReduce. The reduce phase of MapReduce needs to shuffle data among many servers, thus generating an all-to-all traffic pattern

14

Requirement 2: Use of Low-end Commodity Switches

- Current data centers use commodity PC servers, but high-end switches/routers. We want to use low-end, non-programmable COTS switches instead of the high-end ones, based on the observation that the per-port price of low-end switches is much lower than that of high-end ones
- The COTS switches, however, can speak only the spanning tree protocol, which cannot fully utilize the links in advanced network structures (why?). The switch boxes are generally not as open as the server computers. Re-programming the switches for new routing and packet forwarding algorithms is much harder, if not impossible, compared with programming the servers. This is a challenge we need to address
15

Requirement 3: Graceful Performance Degradation

- Given that we assume only commodity servers and switches in a shipping-container data center, we should assume a failure model of frequent component failures. Moreover, an MDC is prefabricated in a factory, and it is rather difficult, if not impossible, to service an MDC once it is deployed in the field, due to operational and space constraints (a data center in a shipping container is analogous to a system on a chip built with low-power transistors, which may fail)
- Therefore, it is important that we design our network architecture to be fault tolerant and to degrade gracefully in the presence of continuous component failures
16

BCube's Recursively Defined Topology

[Figure: a BCube_1 (i.e., k = 1) with n = 4: 16 servers and two levels of 4-port switches]
- Question: how many paths are there between server 00 and server 21? (see a later slide)

- Let n be the expansion factor at each level; that is, the total number of servers increases by a factor of n (4X here) with each additional level. Throughout this class, we assume n = 4, unless stated otherwise
- BCube_k at level k is constructed by connecting n copies of BCube_{k-1} at level k-1, using n^k n-port switches
- Each switch connects n servers, each in a separate BCube_{k-1}
- Each server in BCube_k has k + 1 ports, each connecting to a switch at a separate level
17

Constructing Level 2 from Level 1

For BCube_k, we have:
- k + 1 levels: level 0 through level k
- The number of servers is n^(k+1)
- The number of n-port switches at each level is the same, namely n^k; thus the total number of switches is (k + 1) n^k
- For example, with n = 8 and k = 3, BCube_3 connects 8^4 = 4096 servers in four levels by using 8^3 = 512 8-port switches at each level (see the sketch below)
- Note that switches only connect to servers and never directly connect to other switches. We can treat the switches as dummy crossbars that connect several neighboring servers, and let the servers relay traffic for each other
18
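To make these counts concrete, here is a minimal Python sketch (not from the paper; the function name is mine) that evaluates the formulas above:

```python
def bcube_counts(n, k):
    """Return (servers, switches per level, total switches) for a BCube_k
    built from n-port switches, using the formulas on this slide."""
    servers = n ** (k + 1)              # each server address has k+1 digits, base n
    switches_per_level = n ** k         # the same number of n-port switches at every level
    total_switches = (k + 1) * n ** k   # k+1 levels in total
    return servers, switches_per_level, total_switches

# Example from the slide: n = 8, k = 3
print(bcube_counts(8, 3))  # (4096, 512, 2048)
```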

How to Route from Server 00 to Server 21?

[Figure: a BCube_1 with two highlighted paths from server 00 to server 21; annotations: "Level 1: fix 2nd digit", "Level 0: fix 1st digit"]

- The blue path fixes the 1st digit first and then the 2nd digit, whereas the red path uses the reverse order
- Note that the blue and red paths are node-disjoint. This is not an accident!
- Question: are there other paths from 00 to 21?
- There is no magic here: the BCube topology is actually the well-known hypercube topology. Routing over BCube can be understood by examining the intuitive routing we can easily see on a hypercube
19

Hypercube

[Figure: binary hypercubes of 2, 4, 8, and 16 nodes]
(a) Binary 1-cube, built of two binary 0-cubes, labeled 0 and 1
(b) Binary 2-cube, built of two binary 1-cubes, labeled 0 and 1
(c) Binary 3-cube, built of two binary 2-cubes, labeled 0 and 1
(d) Binary 4-cube, built of two binary 3-cubes, labeled 0 and 1

Source: slides from Introduction to Parallel Processing: Algorithms and Architectures by Behrooz Parhami

20

The 64-Node Hypercube

[Figure: the 64-node hypercube; only sample wraparound links are shown to avoid clutter]

- Isomorphic to the 4 x 4 x 4 3D torus (each has 64 x 6/2 = 192 links)

Source: slides from Introduction to Parallel Processing: Algorithms and Architectures by Behrooz Parhami

21

Neighbors of a Node in a Hypercube

ID of node x:              x_{q-1} x_{q-2} ... x_2 x_1 x_0
Dimension-0 neighbor:      N_0(x) = x_{q-1} x_{q-2} ... x_2 x_1 x_0'
Dimension-1 neighbor:      N_1(x) = x_{q-1} x_{q-2} ... x_2 x_1' x_0
...
Dimension-(q-1) neighbor:  N_{q-1}(x) = x_{q-1}' x_{q-2} ... x_2 x_1 x_0
(x_i' denotes the complement of bit x_i; node x has q neighbors, one per dimension)

[Figure: the q neighbors of node x in a 4-cube, one per dimension]

- Nodes whose labels differ in k bits (i.e., at Hamming distance k) are connected by a shortest path of length k (see the sketch after this slide)
- The hypercube is both node- and edge-symmetric
- Strengths: symmetry, logarithmic diameter, and linear bisection width
- Weakness: poor scalability due to many long interconnection wires

Source: slides from Introduction to Parallel Processing: Algorithms and Architectures by Behrooz Parhami

22
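As a small illustration of the neighbor rule above, here is a minimal Python sketch (my own, not from the cited slides) that computes dimension neighbors and Hamming distance by bit manipulation:

```python
def neighbor(x, i):
    """Dimension-i neighbor N_i(x) of node x in a binary hypercube:
    complement bit i of the node label."""
    return x ^ (1 << i)

def neighbors(x, q):
    """All q neighbors of node x in a q-cube."""
    return [neighbor(x, i) for i in range(q)]

def hamming_distance(x, y):
    """Number of bit positions in which x and y differ; this is also the
    shortest-path length between x and y in the hypercube."""
    return bin(x ^ y).count("1")

q = 4
x = 0b0100
print([format(n, "04b") for n in neighbors(x, q)])  # ['0101', '0110', '0000', '1100']
print(hamming_distance(0b0000, 0b0111))             # 3
```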

BCube Uses Switches to Implement Hypercube Links

[Figure: a hypercube (left, labeled "16-node Hypercube") and the corresponding BCube (right, labeled "16-node BCube"), in which switches (Sw1, Sw2, Sw3, ...) implement the hypercube links]

23

Hypercube Routing Gives BCube Routing

[Figure: the same 16-node hypercube and 16-node BCube as on the previous slide]

- Thus BCubeRouting is the same as the routing algorithm for the hypercube
24

Single-path Routing in BCube

- In BCubeRouting, A = a_k a_{k-1} ... a_0 is the source server and B = b_k b_{k-1} ... b_0 is the destination server. We systematically build a series of intermediate servers by correcting one digit of the previous server. Hence the path length is at most k + 1 (see the sketch below)
- Note that each intermediate switch on the path is uniquely determined by its two adjacent servers, and hence the switches are omitted from the path
25
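A minimal Python sketch of this digit-correction rule (the function name and address representation are mine; the paper's BCubeRouting is additionally parameterized by the order in which digits are corrected, modeled here by the optional order argument):

```python
def bcube_route(src, dst, order=None):
    """Build a path from server src to server dst by correcting one
    address digit at a time.

    src, dst: tuples of k+1 digits (leftmost digit first).
    order: sequence of digit positions to correct; defaults to left-to-right.
    Intermediate switches are omitted, since each is uniquely determined
    by its two adjacent servers."""
    assert len(src) == len(dst)
    if order is None:
        order = range(len(src))
    path, current = [src], list(src)
    for i in order:
        if current[i] != dst[i]:
            current[i] = dst[i]          # correct one digit -> next-hop server
            path.append(tuple(current))
    return path

# Example from the earlier slide (BCube_1, n = 4): route from 00 to 21.
print(bcube_route((0, 0), (2, 1)))                 # [(0, 0), (2, 0), (2, 1)]
print(bcube_route((0, 0), (2, 1), order=[1, 0]))   # [(0, 0), (0, 1), (2, 1)]
```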

Multi-paths for One-to-one Traffic

- Two parallel paths between a source server and a destination server exist if they are node-disjoint, i.e., the intermediate servers and switches on one path do not appear on the other

- Theorem. There are k + 1 parallel paths between any two servers in a BCube_k (see the sketch below)

- BCube should also support several-to-one and all-to-one traffic patterns well. We can fully utilize the multiple links of the destination server to accelerate these x-to-one traffic patterns
26
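A simplified sketch of where the k + 1 parallel paths come from: start the digit correction at a different position for each path and proceed in cyclic order. This covers only the special case in which the two servers differ in every digit; the paper's BuildPathSet also handles equal digits (by detouring through an extra neighbor), which is omitted here. Names are mine:

```python
def _route(src, dst, order):
    """Digit-correction routing for a given correction order."""
    path, cur = [src], list(src)
    for i in order:
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append(tuple(cur))
    return path

def parallel_paths(src, dst):
    """Build k+1 paths by starting the correction at a different digit for
    each path and proceeding in cyclic order. The paths are node-disjoint
    when src and dst differ in every digit (the only case handled here)."""
    m = len(src)  # m = k + 1 digits
    return [_route(src, dst, [(i - j) % m for j in range(m)]) for i in range(m)]

# The two parallel paths between servers 00 and 21 in a BCube_1:
for p in parallel_paths((0, 0), (2, 1)):
    print(p)
# (0, 0) -> (2, 0) -> (2, 1)   and   (0, 0) -> (0, 1) -> (2, 1)
```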

Speedup for One-to-several Traffic

- Edge-disjoint complete graphs with k + 2 servers can be efficiently constructed in a BCube_k. These complete graphs can speed up data replication in distributed file systems like GFS

27

BCube Source Routing (BSR)

- In BSR, the source server decides which path a packet flow should traverse by probing the network, and encodes the path in the packet header
- Source routing has the following advantages:
  - The source can control the routing path without coordination with the intermediate servers (this is well suited for data center management, why?)
  - Intermediate servers are not involved in routing and just forward packets based on the packet header. This simplifies their functionality
  - By reactively probing the network, we can avoid link-state broadcasting, which suffers from scalability concerns when thousands of servers are in operation
- When a new flow arrives, the source sends probe packets over multiple parallel paths. The intermediate servers process the probe packets to fill in the needed information, e.g., the minimum available bandwidth of their input/output links. The destination returns a probe response to the source. When the source receives the responses, it uses a metric to select the best path, e.g., the one with the maximum available bandwidth (see the sketch below)

28
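A minimal sketch (data structures and names are mine, not the BSR packet format) of the probing logic described above: each hop lowers the probe's available-bandwidth field to the minimum it observes, and the source keeps the path with the largest bottleneck bandwidth:

```python
def probe(path, available_bw):
    """Return the bottleneck (minimum) available bandwidth along a path.
    available_bw maps a directed link (u, v) to its available bandwidth."""
    bw = float("inf")
    for u, v in zip(path, path[1:]):
        bw = min(bw, available_bw[(u, v)])   # each hop updates the field
    return bw

def select_path(paths, available_bw):
    """Source-side selection: pick the probed path with the maximum
    bottleneck bandwidth."""
    return max(paths, key=lambda p: probe(p, available_bw))

# Example with the two parallel paths between 00 and 21 (made-up numbers, in Mb/s):
bw = {((0, 0), (2, 0)): 600, ((2, 0), (2, 1)): 900,
      ((0, 0), (0, 1)): 950, ((0, 1), (2, 1)): 800}
print(select_path([[(0, 0), (2, 0), (2, 1)], [(0, 0), (0, 1), (2, 1)]], bw))
# [(0, 0), (0, 1), (2, 1)]  -- bottleneck 800 vs. 600
```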

The PathSelection Procedure

- A source uses BuildPathSet to obtain k + 1 parallel paths and then probes these paths. If one path is found to be unavailable, the source uses the Breadth-First Search (BFS) algorithm to find another parallel path. For n = 8 and k = 3, the execution time of BFS is less than 1 millisecond
- An intermediate server updates the available-bandwidth field of the probe packet if its own available bandwidth is smaller than the existing value
- A destination server updates the available-bandwidth field of the probe packet if the available bandwidth of the incoming link is smaller than the value carried in the probe packet. It then sends the value back to the source in a probe response message
29

Path Adaptation

- During the lifetime of a flow, its path may break due to various failures, and the network condition may change significantly as well. The source periodically (say, every 10 seconds) performs path selection to adapt to network failures and dynamic network conditions
- When an intermediate server finds that the next hop of a packet is not available, it sends a path-failure message back to the source. As long as there are paths available, the source does not probe the network immediately when the message is received. Instead, it switches the flow to one of the available paths obtained from the previous probing. When the probing timer expires, the source performs another round of path selection and tries its best to maintain k + 1 parallel paths
- When multiple flows between two servers arrive simultaneously, they may select the same path. To make things worse, after the path-selection timers expire, they will probe the network and switch to another path simultaneously. This results in path oscillation. We mitigate this symptom by injecting randomness into the timeout value of the path-selection timers
30

Packaging and Wiring

- We show how packaging and wiring can be addressed for a container with 2048 servers and 1280 8-port switches (a partial BCube with n = 8 and k = 3). The interior size of a 40-foot container is 12 m x 2.35 m x 2.38 m
- In the container, we deploy 32 racks in two columns, with each column having 16 racks. Each rack accommodates 44 rack units (or 1.96 m of height)
- We use 32 rack units to host 64 servers, as current practice can pack two servers into one unit, and 10 rack units to host 40 8-port switches. The 8-port switches are small enough that we can easily put 4 into one rack unit. Altogether, we use 42 rack units and have 2 unused units (see the arithmetic check below)
31
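A quick arithmetic check of the packaging numbers on this slide (a sketch; the layout itself is from the slide):

```python
racks = 32
servers_per_unit, server_units = 2, 32     # 2 servers per rack unit, 32 units per rack
switches_per_unit, switch_units = 4, 10    # 4 eight-port switches per unit, 10 units per rack

servers_per_rack = servers_per_unit * server_units        # 64
switches_per_rack = switches_per_unit * switch_units      # 40
print(servers_per_rack, switches_per_rack, server_units + switch_units)  # 64 40 42
print(racks * servers_per_rack, racks * switches_per_rack)               # 2048 1280
```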

Packaging and Wiring (Cont.)

- As for wiring, Gigabit Ethernet copper wires can be 100 meters long, which is much longer than the perimeter of a 40-foot container, and there is enough space to accommodate these wires. We use the 64 servers within a rack to form a BCube_1, with 16 8-port switches within the rack to interconnect them
- The wires of the BCube_1 stay inside the rack and do not go out. The inter-rack wires are level-2 and level-3 wires, and we place them on top of the racks
- We divide the 32 racks into four super-racks. A super-rack forms a BCube_2, and there are two super-racks in each column. We evenly distribute the level-2 and level-3 switches into all the racks, so that there are 8 level-2 and 16 level-3 switches within every rack. The level-2 wires stay within a super-rack and the level-3 wires run between super-racks
- Our calculation shows that the maximum number of level-2 and level-3 wires along a rack column is 768 (256 and 512 for level-2 and level-3, respectively). The diameter of an Ethernet wire is 0.54 cm. The maximum space needed is approximately 176 cm^2 < (20 cm)^2. Since the available height from the top of the rack to the ceiling is 42 cm, there is enough space for all the wires (see the check below)
32
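A quick check of the wiring space estimate (0.54 cm wire diameter, 256 + 512 wires per rack column, as stated above):

```python
import math

wires = 256 + 512                            # level-2 + level-3 wires along a rack column
area_per_wire = math.pi * (0.54 / 2) ** 2    # cross-section of one 0.54 cm-diameter wire, in cm^2
total_area = wires * area_per_wire
print(round(total_area))                     # ~176 cm^2, which fits within (20 cm)^2 = 400 cm^2
```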

Graceful Degradation

- The aggregate bottleneck throughput (ABT) is the throughput of the bottleneck flow times the total number of flows in the all-to-all traffic model. ABT reflects the all-to-all network capacity (see the sketch below)

[Figure: two plots of ABT, one versus server failure rate (%) and one versus switch failure rate (%)]


33
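The ABT definition above is simple enough to state as a one-liner; the following sketch (the function name and example numbers are mine) just multiplies the bottleneck flow's throughput by the number of flows:

```python
def aggregate_bottleneck_throughput(flow_throughputs):
    """ABT as defined on this slide: the throughput of the bottleneck
    (slowest) flow times the total number of flows in the all-to-all model."""
    return len(flow_throughputs) * min(flow_throughputs)

# Example with made-up per-flow throughputs (in Mb/s):
print(aggregate_bottleneck_throughput([500, 480, 510, 450]))  # 4 * 450 = 1800
```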

Implementation Architecture

- The BCube architecture includes a BCube protocol stack. The BCube stack sits between the TCP/IP protocol driver and the Ethernet NDIS driver. The BCube driver is located at layer 2.5: to the TCP/IP driver, it is an NDIS driver; to the real Ethernet driver, it is a protocol driver
- If we directly used the 32-bit addresses, we would need many bytes to store the complete path. For example, we would need 32 bytes when the maximum path length is 8. We leverage the fact that neighboring servers in BCube differ in only one digit of their address arrays to reduce the space needed for an intermediate server from four bytes to only one byte (see the sketch below)
34
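A sketch of the one-byte-per-hop idea: since neighboring servers differ in exactly one address digit, a hop can be encoded as (digit position, new digit value). The 4-bit/4-bit packing below is my own choice for illustration, not necessarily the packing used in the actual BCube header:

```python
def encode_hop(digit_position, new_value):
    """Compress one hop of a source route into a single byte, exploiting the
    fact that neighboring BCube servers differ in exactly one address digit.
    (Illustrative 4-bit/4-bit split; an assumption, not the paper's format.)"""
    assert 0 <= digit_position < 16 and 0 <= new_value < 16
    return (digit_position << 4) | new_value

def decode_hop(byte, current_address):
    """Recover the next-hop address from the one-byte hop encoding and the
    current server's address (a list of digits)."""
    position, value = byte >> 4, byte & 0x0F
    nxt = list(current_address)
    nxt[position] = value
    return nxt

hop = encode_hop(1, 2)                      # "change digit 1 to the value 2"
print(hop, decode_hop(hop, [0, 0, 0, 0]))   # 18 [0, 2, 0, 0]
```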

Implementation Architecture (Cont.)

Components:
- BSR protocol for routing
- Neighbor maintenance protocol (maintains a neighbor status table)
- Packet sending/receiving part (interacts with the TCP/IP stack)
- Packet forwarding engine (relays packets for other servers)

Header:
- Located between the Ethernet header and the IP header
- Contains typical fields
- Similar to DCell: 1-to-1 mapping between IP and BCube addresses
- Different from DCell: every BCube packet stores the complete path and a next hop index (NHI)
- Using the 1-digit address difference between neighbors, the path is stored efficiently

35

Packet Forwarding Engine

Neighbor status table:
- Maintained by the neighbor maintenance protocol
- Consists of neighbor MACs, connecting output ports, and a status flag indicating availability
- The table is almost static (MACs change when a neighboring NIC is replaced; the status flag changes when the neighbor's status changes)

Forwarding:
- Only one table lookup per packet: checks the NHA (next hop array) for the status and MAC of the next hop
- Checks the neighbor status table to see whether the next hop is alive
- Verifies the checksum
- Forwards the packet to the identified output port
- Because of PCI interface limitations (160 Mb/s), a software implementation is used (see the sketch below)

36
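A minimal sketch (structure and field names are mine) of the forwarding steps listed above, in the order given: one next-hop lookup, a liveness check against the neighbor status table, a checksum check, then forwarding to the identified output port:

```python
def forward(packet, neighbor_status_table):
    """packet: dict with 'nha' (next hop array), 'nhi' (index into it), and
    'checksum_ok' fields; neighbor_status_table maps a neighbor MAC to
    (alive_flag, output_port). All names here are illustrative assumptions."""
    next_hop_mac = packet["nha"][packet["nhi"]]        # single lookup of the next hop
    alive, output_port = neighbor_status_table[next_hop_mac]
    if not alive:
        return None                                    # e.g., report a path failure to the source
    if not packet["checksum_ok"]:
        return None                                    # drop a corrupted packet
    packet["nhi"] += 1                                 # advance to the next hop
    return output_port                                 # hand the packet to this output port

table = {"aa:bb:cc:dd:ee:01": (True, 2)}
pkt = {"nha": ["aa:bb:cc:dd:ee:01"], "nhi": 0, "checksum_ok": True}
print(forward(pkt, table))   # 2
```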

Testbed

- 16 servers + 8 8-port Gigabit Ethernet mini-switches
- A BCube_1 with 4 BCube_0s
- No disk I/O
- No Ethernet flow control

37

CPU Overhead for Packet Forwarding

38

Bandwidth-Intensive Application Support

- MTU: 9 KB
- Tests: 1-1, 1-M, 1-All, All-All
- Topology: [figure of the testbed topology]

39

Bandwidth-Intensive Application Support

40

Bandwidth-Intensive Application Support

41

Performance Comparisons

42

Cost, Power, and Wiring Comparison

43

Conclusion

- By installing a small number of network ports at each server, using COTS mini-switches as crossbars, and putting routing intelligence at the server side, BCube forms a server-centric architecture
- We have shown that BCube significantly accelerates one-to-x traffic patterns and provides high network capacity for all-to-all traffic
- The BSR routing protocol further enables graceful performance degradation
- Future work will study how to scale the current server-centric design from a single container to multiple containers
44
