Explicit Path Control in Commodity Data Centers: Design and Applications

Shuihai Hu1, Kai Chen1, Haitao Wu2, Wei Bai1, Chang Lan3, Hao Wang1, Hongze Zhao4, Chuanxiong Guo2
1 SING Group @ Hong Kong University of Science and Technology, 2 Microsoft, 3 UC Berkeley, 4 Duke University
Abstract

Many data center network (DCN) applications require explicit routing path control over the underlying topologies. This paper introduces XPath, a simple, practical and readily-deployable way to implement explicit path control using existing commodity switches. At its core, XPath explicitly identifies an end-to-end path with a path ID and leverages a two-step compression algorithm to pre-install all the desired paths into the IP TCAM tables of commodity switches. Our evaluation and implementation show that XPath scales to large DCNs and is readily deployable. Furthermore, on our testbed, we integrate XPath into four applications to showcase its utility.

1 Introduction

Driven by modern Internet applications and cloud computing, data centers are being built around the world. To obtain high bandwidth and achieve fault tolerance, data center networks (DCNs) are often designed with multiple paths between any two nodes [3, 4, 13, 17, 18, 31]. Equal Cost Multi-Path routing (ECMP) [23] is the state-of-the-art for multi-path routing and load-balancing in DCNs [5, 17, 31].

In ECMP, a switch locally decides the next hop from multiple equal-cost paths by calculating a hash value, typically from the source and destination IP addresses and transport port numbers. Applications therefore cannot explicitly control the routing path in DCNs.

However, many emerging DCN applications such as provisioned IOPS, fine-grained flow scheduling, bandwidth guarantees, etc. [5, 7, 8, 19, 21, 22, 25, 26, 39, 45], all require explicit routing path control over the underlying topologies (§2).

Many approaches such as source routing [36], MPLS [35], and OpenFlow [29] can enforce explicit path control. However, source routing is not supported in the hardware of data center switches, which typically only support destination IP based routing. MPLS needs a signaling protocol, i.e., the Label Distribution Protocol, to establish the path, and it is typically used only for traffic engineering in core networks instead of application-level or flow-level path control. OpenFlow in theory can establish fine-grained routing paths by installing flow entries in the OpenFlow switches via the controller. But in practice, there are challenges such as limited flow table size and dynamic flow path setup that need to be addressed (see §6 for more details).

In order to address the scalability and deployment challenges faced by the above-mentioned approaches, this paper presents XPath for flow-level explicit path control. XPath addresses the dynamic path setup challenge by giving a positive answer to the following question: can we pre-install all desired routing paths between any two nodes? Further, XPath shows that we can pre-install all these paths using the destination IP based forwarding TCAM tables of commodity switches1.

One cannot enumerate all possible paths in a DCN, as the number can be extremely large. However, we observe that DCNs (e.g., [24, 17, 18, 20]) do not intend to use all possible paths but rather a set of desired paths that are sufficient to exploit the topology redundancy (§2.2). Based on this observation, XPath focuses on pre-installing these desired paths. Even so, the challenge is that the sheer number of desired paths in large DCNs is still large; e.g., a Fattree (k=64) has over 2^32 paths among ToRs (Top-of-Rack switches), exceeding the size of an IP table with 144K entries by many orders of magnitude.

To tackle the above challenge, we introduce a two-step compression algorithm, i.e., paths-to-path-sets aggregation and path ID assignment for prefix aggregation, which is capable of compressing a large number of paths into a practical number of routing entries for commodity switches (§3).

To show XPath's scalability, we evaluate it on various well-known DCNs (§3.3). Our results suggest that XPath effectively expresses tens of billions of paths using only tens of thousands of routing entries. For example, for Fattree(64), we pre-install 4 billion paths using 64K entries2; for HyperX(4,16,100), we pre-install 17 billion paths using 36K entries. With such an algorithm, XPath

1 The recent advances in switching chip technology make it ready to support 144K entries in IP LPM (Longest Prefix Match) tables of commodity switches (e.g., [1, 24]).
2 The largest routing table size among all the switches.


easily pre-installs all desired paths into IP LPM tables with 144K entries, while still reserving space to accommodate more paths.

To demonstrate XPath's deployability, we implement it on both Windows and Linux platforms under the umbrella of SDN, and deploy it on a 3-layer Fattree testbed with 54 servers (§4). Our experience shows that XPath can be readily implemented with existing commodity switches. Through basic experiments, we show that XPath handles failure smoothly.

To showcase XPath's utility, we integrate it into four applications (from provisioned IOPS [25] to Map-reduce) to enable explicit path control and show that XPath directly benefits them (§5). For example, for the provisioned IOPS application, we use XPath to arrange an explicit path with the necessary bandwidth to ensure the IOPS provisioned. For network update, we show that XPath easily assists networks in accomplishing switch upgrades at zero traffic loss. For Map-reduce data shuffle, we use XPath to identify non-contention parallel paths in accord with the many-to-many shuffle pattern, reducing the shuffle time by over 3x compared to ECMP.

In a nutshell, the primary contribution of XPath is that it provides a practical, readily-deployable way to pre-install all the desired routing paths between any s-d pairs using existing commodity switches, so that applications only need to choose which path to use without worrying about how to set up the path, and/or the time cost or overhead of setting up the path.

To access XPath implementation scripts, please visit: http://sing.cse.ust.hk/projects/XPath.

The rest of the paper is organized as follows. §2 overviews XPath. §3 elaborates the XPath algorithm and evaluates its scalability. §4 implements XPath and performs basic experiments. §5 integrates XPath into applications. §6 discusses the related work, and §7 concludes the paper.

Figure 1: Example of the desired paths between two servers/ToRs in a 4-radix Fattree topology.
Case #2: Flow scheduling: Data center networks are
built with multiple paths [4, 17]. To use such multiple paths, state-of-the-art forwarding in enterprise and
data center environments uses ECMP to statically stripe
flows across available paths using flow hashing. Because
ECMP does not account for either current network utilization or flow size, it can waste over 50% of network
bisection bandwidth [5]. Thus, to fully utilize network
bisection bandwidth, we need to schedule elephant flows
across parallel paths to avoid contention as in [5]. Explicit path control is required to enable such fine-grained
flow scheduling, which benefits data-intensive applications such as Map-reduce (§5.4).
Case #3: Virtual network embedding: In cloud computing, virtual data center (VDC) with bandwidth guarantees is an appealing model for cloud tenants due to
its performance predictability in shared environments [7,
19, 45]. To accurately enforce such VDC abstraction
over the physical topology with constrained bandwidth,
one should be able to explicitly dictate which path to use
in order to efficiently allocate and manage the bandwidth
on each path (§5.3).
Besides the above applications, the need for explicit
path control has permeated almost every corner of data
center designs and applications, from traffic engineering
(e.g., [8, 22]), energy-efficiency (e.g., [21]), to network
virtualization (e.g., [7, 19, 45]), and so on. In 5, we will
study four of them.

2 Motivation and Overview


2.1 The need for explicit path control

Case #1: Provisioned IOPS: IOPS are input/output operations per second. Provisioned IOPS are designed to
deliver predictable, high performance for I/O intensive
workloads, such as database applications, that rely on
consistent and fast response times. Amazon EBS provisioned IOPS storage was recently launched to ensure that
disk resources are available whenever you need them regardless of other customer activity [25, 34]. In order to
ensure provisioned IOPS, there is a need for necessary
bandwidth over the network. Explicit path control is required for choosing an explicit path that can provide such
necessary bandwidth (§5.1).

2.2 XPath overview

To enable explicit path control for general DCNs, XPath


explicitly identifies an end-to-end path with a path ID
and attempts to pre-install all desired paths using IP LPM
tables of commodity switches, so that DCN applications
can use these pre-installed explicit paths easily without
setting them up dynamically. In what follows, we first
introduce what the desired paths are, and then overview
the XPath framework.
Desired paths: XPath does not try to pre-install all
possible paths in a DCN because this is impossible and
impractical. We observe that when designing DCNs,

operators do not intend to use all possible paths in


the routing. Instead, they use a set of desired paths
which are sufficient to exploit the topology redundancy.
This is the case for many recent DCN designs such as
[24, 17, 18, 20, 31]. For example, in a k-radix Fattree [4], they exploit k^2/4 parallel paths between any two ToRs for routing (see Fig. 1 for desired/undesired paths on a 4-radix Fattree); whereas in an n-layer BCube [18], they use (n + 1) parallel paths between any two servers. These sets of desired paths already contain sufficient parallel paths between any s-d pair to ensure good
load-balancing and handle failures. As the first step,
XPath focuses on pre-installing all these desired paths.
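As a quick check of the scale involved (our back-of-the-envelope calculation, not from the paper, assuming ordered ToR pairs), the desired-path count for Fattree(k) can be computed as follows:

    # Desired paths in Fattree(k): k^2/2 ToRs, k^2/4 parallel paths per ordered ToR pair.
    def fattree_desired_paths(k):
        tors = k * k // 2
        return tors * (tors - 1) * (k * k // 4)

    print(fattree_desired_paths(64))   # 4,292,870,144 -- about 4.3 billion paths,
                                       # the Fattree(64) figure that later appears in Table 3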

XPath framework: Fig. 2 overviews XPath. As in many


prior DCN designs [11, 17, 18, 31, 40], in our implementation, XPath employs a logically centralized controller, called XPath manager, to control the network.
The XPath manager has three main modules: routing
table computation, path ID resolution, and failure handling. Servers have client modules for path ID resolution
and failure handling.

Figure 2: The XPath system framework.


XPath source server handles failures by simply changing path IDs. This is because it has cached multiple
path IDs for a destination; if one of them fails, it just uses another live one instead. In the meantime, the
source will request, from the manager, the updated
path IDs to the destination. Similarly, upon a link recovery, the recovered paths will be added back to the
IP-to-ID table accordingly. The source is able to use
the recovered paths once the local cache expires and a
new path ID resolution is performed.
We note that XPath leverages failure detection and recovery outputs to handle failures. The detailed failure detection and recovery mechanisms are orthogonal to XPath, which focuses on explicit path control.
In our implementation (§4.2), we adopt a simple TCP sequence based approach for proof-of-concept experiments, and we believe XPath can benefit from the existing advanced failure detection and recovery literature [15, 27].

Routing table computation: This module is the heart


of XPath. The problem is how to compress a large
number of desired paths (e.g., tens of billions) into IP
LPM tables with 144K entries. To this end, we design a two-step compression algorithm: paths to path
sets aggregation (in order to reduce unique path IDs)
and ID assignment for prefix aggregation (in order to
reduce IP prefix based routing entries). We elaborate
the algorithm and evaluate its scalability in §3.

Path ID resolution: In XPath, path IDs (in the format


of 32-bit IP, or called routing IPs3 ) are used for routing to a destination, whereas the server has its own IP
for applications. This entails path ID resolution which
translates application IPs to path IDs. For this, the
XPath manager maintains an IP-to-ID mapping table.
Before a new communication, the source sends a request to the XPath manager resolving the path IDs to
the destination based on its application IP. The manager may return multiple path IDs in response, providing multiple paths to the destination for the source to
select. These path IDs will be cached locally for subsequent communications, but need to be forgotten periodically for failure handling. We elaborate this module and its implementation in §4.1.
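To make the resolution workflow concrete, here is a minimal sketch (ours, not the paper's implementation; class and method names are illustrative) of the manager-side IP-to-ID lookup and the source-side cache with a timeout:

    import time

    class XPathManager:
        """Illustrative manager-side IP-to-ID table: application IP -> list of live path IDs."""
        def __init__(self, ip_to_ids):
            self.ip_to_ids = ip_to_ids            # e.g. {"10.0.3.2": ["172.16.0.5", ...]}

        def resolve(self, dst_app_ip):
            # Return only live path IDs; failed ones are removed on link-failure reports.
            return list(self.ip_to_ids.get(dst_app_ip, []))

    class SourceResolver:
        """Illustrative source-side cache; entries expire so recovered paths get picked up."""
        def __init__(self, manager, timeout=1.0):
            self.manager, self.timeout, self.cache = manager, timeout, {}

        def path_ids(self, dst_app_ip):
            ids, ts = self.cache.get(dst_app_ip, (None, 0))
            if ids is None or time.time() - ts > self.timeout:
                ids = self.manager.resolve(dst_app_ip)    # query the XPath manager
                self.cache[dst_app_ip] = (ids, time.time())
            return ids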

Remarks: In this paper, XPath focuses on how to preinstall the desired paths, but it does not impose any constraint on how to use the pre-installed paths. On top of
XPath, we can either let each server select paths in
a distributed manner, or employ an SDN controller to
coordinate path selection between servers or ToRs in a
centralized way (which we have taken in our implementation of this paper). In either case, the key benefit is that
with XPath we do not need to dynamically modify the
switches.
We also note that XPath is expressive and is able to
pre-install all desired paths in large DCNs into commodity switches. Thus XPath's routing table recomputation
is performed infrequently, and cases such as link failures
or switch upgrade [26] are handled through changing
path IDs rather than switch table reconfiguration. However, table recomputation is necessary for extreme cases
like network-wide expansion where the network topology
has fundamentally changed.

Failure handling: Upon a link failure, the detecting


devices will inform the XPath manager. Then the
XPath manager will in turn identify the affected paths
and update the IP-to-ID table (i.e., disable the affected
paths) to ensure that it will not return a failed path
to a source that performs path ID resolution. The
3 We use routing IPs and path IDs interchangeably in this paper.


Figure 3: Three basic relations between two paths: convergent, disjoint, and divergent.


Path set  | Egress port | ID assignment (bad) | ID assignment (good)
pathset0  | 0           | 0                   | 4
pathset1  | 1           | 1                   | 0
pathset2  | 2           | 2                   | 2
pathset3  | 0           | 3                   | 5
pathset4  | 1           | 4                   | 1
pathset5  | 2           | 5                   | 3
pathset6  | 0           | 6                   | 6
pathset7  | 0           | 7                   | 7

Table 1: Path set ID assignment.

Path set       | ID         | Prefix | Egress port
pathset1,4     | 0, 1       | 00*    | 1
pathset2,5     | 2, 3       | 01*    | 2
pathset0,3,6,7 | 4, 5, 6, 7 | 1*     | 0

Table 2: Compressed table via ID prefix aggregation.

Figure 4: Different ways of path aggregation between sources s1, s2, s3 and destinations d1, d2, d3 over paths p1-p9. Convergent: {p1, p4, p7}, {p2, p5, p8}, {p3, p6, p9}. Disjoint: {p1, p5, p9}, {p2, p6, p7}, {p3, p4, p8}. Mix of the two: {p1, p4, p8}, {p2, p6, p9}, {p2, p5, p7}.

3 XPath Algorithm and Scalability

We elaborate the XPath two-step compression algorithm in §3.1 and §3.2. Then, we evaluate it on various large DCNs to show XPath's scalability in §3.3.

3.1 Paths to path sets aggregation (Step I)

The number of desired paths is large. For example, Fattree(64) has over 2^32 paths between ToRs, more than what a 32-bit IP/ID can express. To reduce the number of unique IDs, we aggregate the paths that can share the same ID without causing routing ambiguity into a non-conflict path set, identified by a unique ID.

Then, what kinds of paths can be aggregated? Without loss of generality, two paths have three basic relations to each other, i.e., convergent, disjoint, and divergent, as shown in Fig. 3. Convergent and disjoint paths can be aggregated using the same ID, while divergent paths cannot. The reason is straightforward: suppose two paths diverge from each other at a specific switch and they have the same ID, path1 = path2 = path_id; then there will be two entries in the routing table, path_id -> port_x and path_id -> port_y (x != y). This clearly leads to ambiguity. Two paths can be aggregated without conflict if they do not cause any routing ambiguity on any switch when sharing the same ID.

Problem 1: Given the desired paths P = {p1, ..., pn} of a DCN, aggregate the paths into non-conflict path sets so that the number of sets is minimized.

We find that the general problem of paths to non-conflict path sets aggregation is NP-hard since it can be reduced from the graph vertex-coloring problem [41]. Thus, we resort to practical heuristics.

Based on the relations in Fig. 3, we can aggregate the convergent paths, the disjoint paths, or a mix of the two into a non-conflict path set as shown in Fig. 4. Following this, we introduce two basic approaches for paths to path sets aggregation: the convergent paths first approach (CPF) and the disjoint paths first approach (DPF). The idea is simple. In CPF, we prefer to aggregate convergent paths into the path set first, until no more convergent paths can be added; then we add disjoint paths, if any exist, until no more paths can be added. In DPF, we prefer to aggregate disjoint paths into the path set first, and add convergent ones, if any exist, at the end.

The obtained CPF or DPF path sets have their own benefits. For example, a CPF path set facilitates many-to-one communication for data aggregation because such an ID naturally defines a many-to-one communication channel. A DPF path set, on the other hand, identifies parallel paths between two groups of nodes, and such an ID identifies a many-to-many communication channel for data shuffle. In practice, users may have their own preferences to define customized path sets for different purposes, as long as the path sets are free of routing ambiguity (see the sketch below).

3.2 ID assignment for prefix aggregation (Step II)

While unique IDs can be much reduced through Step I, the absolute number is still large. For example, Fattree(64) has over 2 million IDs after Step I. We cannot allocate one entry per ID flatly with 144K entries. To address this problem, we further reduce routing entries using ID prefix aggregation. Since a DCN is usually under centralized control and the IDs of paths can be coordinately assigned, our goal in Step II is to assign IDs to paths in such a way that they can be better aggregated using prefixes in the switches.
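Before describing the ID assignment step in detail, here is a minimal sketch (ours, not the paper's code) of the Step-I CPF heuristic described in §3.1, assuming each path is given as a mapping from switch to egress port:

    def conflicts(path, path_set_ports):
        """A path conflicts with a set if it diverges from it at some switch,
        i.e., the set already pins a different egress port on a shared switch."""
        return any(sw in path_set_ports and path_set_ports[sw] != port
                   for sw, port in path.items())

    def is_convergent(path, path_set_ports):
        """Convergent with the set: shares at least one switch, with matching ports."""
        return any(sw in path_set_ports for sw in path) and not conflicts(path, path_set_ports)

    def cpf_aggregate(paths):
        """Greedy CPF: prefer convergent paths, then fill with disjoint ones."""
        remaining, path_sets = list(paths), []
        while remaining:
            seed = remaining.pop(0)
            members, ports = [seed], dict(seed)
            for prefer_convergent in (True, False):
                for p in list(remaining):
                    if conflicts(p, ports):
                        continue
                    if prefer_convergent and not is_convergent(p, ports):
                        continue
                    members.append(p)
                    ports.update(p)
                    remaining.remove(p)
            path_sets.append(members)
        return path_sets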

3.2.1 Problem description

We assign IDs to paths that traverse the same egress port consecutively, so that these IDs can be expressed using one entry via prefix aggregation. For example, in Table 1, 8 path sets go through a switch with 3 ports. A naive (bad) assignment will lead to an uncompressible routing table with 7 entries. However, if we assign the paths that traverse the same egress port consecutive IDs (good), we can obtain a compressed table with 3 entries, as shown in Table 2.

To optimize for a single switch, we can easily achieve the optimum by grouping the path sets according to the egress ports and encoding them consecutively. In this way, the number of entries is equal to the number of ports. However, we optimize for all the switches simultaneously instead of one.

Problem 2: Let T = {t1, t2, ..., t|T|} be the path sets after solving Problem 1. Assign (or order) the IDs for these path sets so that, after performing ID prefix aggregation, the largest number of routing entries among all switches is minimized.

In a switch, a block of consecutive IDs with the same egress port can be aggregated using one entry4. We call this an aggregateable ID block (AIB). The number of such AIBs indicates the routing states in the switch5. Thus, we try to minimize the maximal number of AIBs among all the switches through coordinated ID assignment.

To illustrate the problem, we use a matrix M to describe the relation between path sets and switches. Suppose switches have k ports (numbered 1...k); then we use mij in [0, k] (1 <= i <= |S|, 1 <= j <= |T|) to indicate whether tj goes through switch si and, if yes, which egress port it uses. If 1 <= mij <= k, it means tj goes through si and the egress port is mij; mij = 0 means tj does not appear on switch si. With rows indexed by switches s1, ..., s|S| and columns by path sets t1, ..., t|T|:

M = \begin{bmatrix}
m_{11} & m_{12} & m_{13} & \cdots & m_{1|T|} \\
m_{21} & m_{22} & m_{23} & \cdots & m_{2|T|} \\
m_{31} & m_{32} & m_{33} & \cdots & m_{3|T|} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
m_{|S|1} & m_{|S|2} & m_{|S|3} & \cdots & m_{|S||T|}
\end{bmatrix}

To assign IDs to path sets, we use f(tj) = r (1 <= r <= |T|) to denote that, with an ID assignment f, the ID for tj is r (or ranks r-th among all the IDs). With f, we actually permute the columns of M to obtain N. Column r in N corresponds to column tj in M, i.e., [n_{1r}, n_{2r}, \ldots, n_{|S|r}]^T = [m_{1j}, m_{2j}, \ldots, m_{|S|j}]^T:

N = f(M) = \begin{bmatrix}
n_{11} & n_{12} & n_{13} & \cdots & n_{1|T|} \\
n_{21} & n_{22} & n_{23} & \cdots & n_{2|T|} \\
n_{31} & n_{32} & n_{33} & \cdots & n_{3|T|} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
n_{|S|1} & n_{|S|2} & n_{|S|3} & \cdots & n_{|S||T|}
\end{bmatrix}

With matrix N, we can calculate the number of AIBs on each switch. To compute it on switch si, we only need to sequentially check all the elements on the i-th row. If there exist sequential non-zero elements that are the same, all these consecutive IDs share the same egress port and belong to the same AIB. Otherwise, one more AIB is needed. Thus, the total number of AIBs on switch si is:

AIB(s_i) = 1 + \sum_{r=1}^{|T|-1} \bigl( n_{ir} \oplus n_{i(r+1)} \bigr)    (1)

where u XOR v = 1 if u != v (0 is skipped), and 0 otherwise. With Equation 1, we can obtain the maximal number of AIBs among all the switches, MAIB = \max_{1 \le i \le |S|} AIB(s_i), and our goal is to find an f that minimizes MAIB.

4 The consecutiveness has local significance. Suppose path IDs 4, 6, 7 are on the switch (all exit through port p), but 5 is not present; then 4, 6, 7 are still consecutive and can be aggregated as 1* -> p.
5 Note that the routing entries can be further optimized using subnetting and supernetting [16]; in this paper, we just use AIBs to indicate entries for simplicity. In practice the table size can be even smaller.

3.2.2 Solution

ID assignment algorithm: The above problem is NP-hard, as it can be reduced from the 3-SAT problem [37]. Thus, we resort to heuristics. Our practical solution is guided by the following thought. Each switch si has its own local optimal assignment fi. But these individual local optimal assignments may conflict with each other by assigning different IDs to the same path set on different switches, causing an ID inconsistency on this path set. To generate a globally optimized assignment f from the local optimal assignments fi, we can first optimally assign IDs to path sets on each switch individually, and then resolve the ID inconsistency on each path set in an incremental manner. In other words, we require that each step of ID inconsistency correction introduce a minimal increase in MAIB.

Based on the above consideration, we introduce ID_Assignment() in Algorithm 1. The main idea behind the algorithm is as follows.

First, we assign IDs to path sets on each switch individually. We achieve the optimal result for each switch by simply assigning the path sets that have the same egress ports consecutive IDs (lines 1-2).

Second, we correct inconsistent IDs of each path set incrementally. After the first step, the IDs for a path set on different switches may be different. For any path set having inconsistent IDs, we resolve this as follows: we pick one ID out of all the inconsistent IDs of this path set and make the other IDs consistent with it, provided that such correction leads to the minimal MAIB (lines 4-10). More specifically, in lines 6-9, we try each of the inconsistent IDs, calculate the associated MAIB if we correct the inconsistency with this ID, and finally pick the one that leads to the minimal MAIB. The algorithm terminates after we resolve the ID inconsistencies for all the path sets.

Algorithm 1: Coordinated ID assignment algorithm
ID_Assignment(M) /* M is the initial matrix, N is the output */
1  foreach row i of M (i.e., switch si) do
2      assign path sets tj (1 <= j <= |T|) having the same mij values (i.e., egress ports) consecutive IDs;
   /* path sets are optimally encoded on each switch locally, but one path set may have different IDs assigned with respect to different switches */
3  M' <- M with IDs specified for each tj in each si;
4  foreach column j of M' (i.e., path set tj) do
5      if tj has inconsistent IDs then
6          let C = {c1, c2, ..., ck} (1 < k <= |S|) be the set of inconsistent IDs;
7          foreach c in C do
8              tentatively use c to correct the inconsistency by swapping ci with c on each relevant switch;
9              compute MAIB;
10         ID(tj) <- the c with the minimal MAIB;
11 return N <- f(M'); /* M' is inconsistency-free */

Figure 5: Walk-through example of Algorithm 1: for any element x(y) in Mk (1 <= k <= 6), x is the egress port and y is the ID assigned to path set tj on switch si; red/green y's mark inconsistent/consistent IDs for path sets.

In Fig. 5 we use a simple example to walk readers through the algorithm. Given M with 6 path sets across 3 switches, we first encode each switch optimally. This is achieved by assigning path sets having the same egress port consecutive IDs. For example, on switch s1, path sets t1, t2, t3, t5 exit from port 1 and t4, t6 from port 2; we therefore encode t1, t2, t3, t5 with IDs 1, 2, 3, 4 and t4, t6 with 5, 6 respectively. We repeat this on s2 and s3, and achieve M0 with MAIB = 4. However, we have inconsistent IDs (marked in red in the figure) for all path sets. For example, t1 has different IDs 1, 3, 1 on s1, s2, s3 respectively. Then, we start to correct the inconsistency for each path set. For t1 with inconsistent IDs 1, 3, 1, we try to correct with IDs 1 and 3 respectively. To correct with ID 1, we exchange IDs 3 and 1 for t1 and t2 on switch s2, and get MAIB = 5. To correct with ID 3, we exchange IDs 1 and 3 for t1 and t3 on switches s1 and s3, and get MAIB = 4. We thus choose to correct with ID 3 and achieve M1, as it has the minimal MAIB = 4. We perform the same operation for the remaining path sets one by one and finally achieve M6 with MAIB = 4. Therefore, the final ID assignment is f: (t1, t2, t3, t4, t5, t6) -> (3, 2, 1, 4, 6, 5).

We note that the proposed algorithm is not optimal and has room for improvement. However, it is effective in compressing the routing tables, as we will show in our evaluation. One problem is the time cost, as it works on a large matrix. We intentionally designed Algorithm 1 to have low time complexity, i.e., O(|S|^2 |T|) for the |S| x |T| matrix M. Even so, we find that when the network scales to several thousand nodes, it cannot return a result within 24 hours (see Table 4). Worse, it is possible that |S| is 10^4-10^5 and |T| is 10^6 or more for large DCNs. In such cases, even a linear-time algorithm can be slow, not to mention any more advanced algorithms.
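To make the objective concrete, here is a minimal sketch (ours, not the paper's code) of Equation (1) and MAIB, with each row of N given as the list of egress ports in ID order and 0 meaning the path set does not appear on that switch:

    def aib(row):
        """Number of aggregateable ID blocks (AIBs) on one switch, per Equation (1)."""
        ports = [p for p in row if p != 0]          # zeros (absent path sets) are skipped
        if not ports:
            return 0
        return 1 + sum(1 for a, b in zip(ports, ports[1:]) if a != b)

    def maib(n_matrix):
        """Maximal number of AIBs across all switches -- the quantity Algorithm 1 minimizes."""
        return max(aib(row) for row in n_matrix)

    # Mirroring Table 1 on a single 3-port switch (ports relabeled 1-3):
    print(aib([1, 2, 3, 1, 2, 3, 1, 1]))   # bad ID order  -> 7 entries
    print(aib([2, 2, 3, 3, 1, 1, 1, 1]))   # good ID order -> 3 entries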

Speedup with equivalence reduction: To speed up,


we exploit DCN topology characteristics to reduce the
runtime of our algorithm. The observation is that most
DCN topologies are regular and many nodes are equivalent (or symmetric). These equivalent nodes are likely to
have similar numbers of routing states for any given ID
assignment, especially when the path sets are symmetrically distributed. The reason is that for two equivalent
switches, if some path sets share a common egress port
on one switch, most of these path sets, if not all, are likely
to pass through a common egress port on another switch.
As a result, no matter how the path sets are encoded, the
ultimate routing entries on two equivalent switches tend
to be similar. Thus, our hypothesis is that, by picking
a representative node from each equivalence class,

DCNs              | Nodes #   | Links #   | Original paths # | Max. entries # without compression | Path sets # after Step I compression | Max. entries # after Step I compression | Max. entries # after Step II compression
Fattree(8)        | 208       | 384       | 15,872           | 944       | 512       | 496       | 116
Fattree(16)       | 1,344     | 3,072     | 1,040,384        | 15,808    | 8,192     | 8,128     | 968
Fattree(32)       | 9,472     | 24,576    | 66,977,792       | 257,792   | 131,072   | 130,816   | 7,952
Fattree(64)       | 70,656    | 196,608   | 4,292,870,144    | 4,160,512 | 2,097,152 | 2,096,128 | 64,544
BCube(4, 2)       | 112       | 192       | 12,096           | 576       | 192       | 189       | 108
BCube(8, 2)       | 704       | 1,536     | 784,896          | 10,752    | 1,536     | 1,533     | 522
BCube(8, 3)       | 6,144     | 16,384    | 67,092,480       | 114,688   | 16,384    | 16,380    | 4,989
BCube(8, 4)       | 53,248    | 163,840   | 5,368,545,280    | 1,146,880 | 163,840   | 163,835   | 47,731
VL2(20, 8, 40)    | 1,658     | 1,760     | 31,200           | 6,900     | 800       | 780       | 310
VL2(40, 16, 60)   | 9,796     | 10,240    | 1,017,600        | 119,600   | 6,400     | 6,360     | 2,820
VL2(80, 64, 80)   | 103,784   | 107,520   | 130,969,600      | 4,030,400 | 102,400   | 102,320   | 49,640
VL2(100, 96, 100) | 242,546   | 249,600   | 575,760,000      | 7,872,500 | 240,000   | 239,900   | 117,550
HyperX(3, 4, 40)  | 2,624     | 2,848     | 12,096           | 432       | 192       | 189       | 103
HyperX(3, 8, 60)  | 31,232    | 36,096    | 784,896          | 4,032     | 1,536     | 1,533     | 447
HyperX(4, 10, 80) | 810,000   | 980,000   | 399,960,000      | 144,000   | 40,000    | 39,996    | 8,732
HyperX(4, 16, 100)| 6,619,136 | 8,519,680 | 17,179,607,040   | 983,040   | 262,144   | 262,140   | 36,164

Table 3: Results of XPath on the 4 well-known DCNs.


we can optimize the routing tables for all the nodes in the
topology while spending much less time.
Based on the hypothesis, we improve the runtime of
Algorithm 1 with equivalence reduction. This speedup
makes no change to the basic procedure of Algorithm 1.
Instead of directly working on M with |S| rows, the key idea is to derive a smaller M' with fewer rows from M using equivalence reduction, i.e., for all the equivalent nodes si in M we only pick one of them into M', and then apply ID_Assignment() on M'. To this end, we
first need to compute the equivalence classes among all
the nodes, and there are many fast algorithms available
for this purpose [10, 14, 28]. This improvement enables
our algorithm to finish with much less time for various
well-known DCNs while still maintaining good results
as we will show subsequently.
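A minimal sketch (ours; the equivalence classes are assumed to be produced by one of the cited algorithms) of how M' is derived before running ID_Assignment():

    def reduce_by_equivalence(m_rows, switch_ids, equivalence_classes):
        """Keep one representative row of M per equivalence class of switches.

        m_rows: rows of M, one per switch, in the same order as switch_ids.
        equivalence_classes: list of lists of switch ids considered equivalent.
        Returns the reduced matrix M' and the chosen representatives."""
        index = {sw: i for i, sw in enumerate(switch_ids)}
        reps = [cls[0] for cls in equivalence_classes]     # pick any member as representative
        m_prime = [m_rows[index[sw]] for sw in reps]
        return m_prime, reps

    # Usage: run the ID assignment on M' only; every switch in a class then
    # reuses (an adaptation of) its representative's encoding.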

3.3 Scalability evaluation

Evaluation setting: We evaluate XPath's scalability on 4 well-known DCNs: Fattree [4], VL2 [17], BCube [18], and HyperX [3]. Among these DCNs, BCube is a server-centric structure where servers act not only as end hosts but also as relay nodes for each other. For the other 3 DCNs, switches are the only relay nodes and servers are connected to ToRs at the last hop. For this reason, we consider the paths between servers in BCube and between ToRs in Fattree, VL2 and HyperX.

For each DCN, we vary the network size (Table 3). We consider k^2/4 paths between any two ToRs in Fattree(k), (k + 1) paths between any two servers in BCube(n, k), DA paths between any two ToRs in VL2(DA, DI, T), and L paths between any two ToRs in HyperX(L, S, T)6. These paths do not enumerate all possible paths in the topology; however, they cover all desired paths sufficient to exploit the topology redundancy in each DCN.

Our scalability experiments run on a Windows server with an Intel Xeon E7-4850 2.00GHz CPU and 256GB memory.

Main results: Table 3 shows the results of the XPath algorithm on the 4 well-known DCNs, which demonstrate XPath's high scalability. Here, for paths to path sets aggregation we used CPF.

We find that XPath can effectively pre-install up to tens of billions of paths using tens of thousands of routing entries for very large DCNs. Specifically, for Fattree(64) we express 4 billion paths with 64K entries; for BCube(8,4) we express 5 billion paths with 47K entries; for VL2(100,96,100) we express 575 million paths with 117K entries; for HyperX(4,16,100) we express 17 billion paths with 36K entries. These results suggest that XPath can easily pre-install all desired paths into IP LPM tables with 144K entries, and in the meanwhile XPath is still able to accommodate more paths before reaching 144K.
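As a cross-check (ours, not from the paper), the "Original paths #" column of Table 3 follows directly from these per-topology counts, assuming ordered node pairs; for example:

    def bcube_paths(n, k):        # n^(k+1) servers, (k+1) parallel paths per server pair
        servers = n ** (k + 1)
        return servers * (servers - 1) * (k + 1)

    def vl2_paths(da, di, t):     # da*di/4 ToRs, da parallel paths per ToR pair
        tors = da * di // 4
        return tors * (tors - 1) * da

    def hyperx_paths(l, s, t):    # s^l ToRs, l parallel paths per ToR pair
        tors = s ** l
        return tors * (tors - 1) * l

    print(bcube_paths(8, 4))           # 5,368,545,280  -- matches Table 3
    print(vl2_paths(100, 96, 100))     # 575,760,000    -- matches Table 3
    print(hyperx_paths(4, 16, 100))    # 17,179,607,040 -- matches Table 3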

Understanding the ID assignment: The most difficult part of the XPath compression algorithm is Step II (i.e., ID assignment), which eventually determines whether XPath can pre-install all desired paths using 144K entries. The last two columns of Table 3 contrast the maximal entries before and after our coordinated ID assignment for each DCN.

We find that XPath's ID assignment algorithm can efficiently compress the routing entries by 2x to 32x for different DCNs. For example, before our coordinated ID assignment, there are over 2 million routing entries in the bottleneck switch (i.e., the switch with the largest routing table size) for Fattree(64), and after it, we achieve

6 DCNs use different parameters to describe their topologies. In Fattree(k), k is the number of switch ports; in BCube(n, k), n is the number of switch ports and k is the number of BCube layers; in VL2(DA, DI, T), DA/DI are the numbers of aggregation/core switch ports and T is the number of servers per rack; in HyperX(L, S, T), L is the number of dimensions, S is the number of switches per dimension, and T is the number of servers per rack.


DCNs              | No equivalence reduction | Equivalence reduction
Fattree(16)       | 8191.121                 | 0.078
Fattree(32)       | >24 hours                | 4.696
Fattree(64)       | >24 hours                | 311.909
BCube(8, 2)       | 365.769                  | 0.046
BCube(8, 3)       | >24 hours                | 6.568
BCube(8, 4)       | >24 hours                | 684.895
VL2(40, 16, 60)   | 227.438                  | 0.047
VL2(80, 64, 80)   | >24 hours                | 3.645
VL2(100, 96, 100) | >24 hours                | 28.258
HyperX(3, 4, 40)  | 0.281                    | 0.000
HyperX(4, 10, 80) | >24 hours                | 10.117
HyperX(4, 16, 100)| >24 hours                | 442.379

Table 4: Time cost (seconds) of the ID assignment algorithm with and without equivalence reduction for the 4 DCNs.

Figure 6: Effect of the ID assignment algorithm (number of entries) with and without equivalence reduction for the 4 DCNs.

turns out that the results are similar to those without equivalence reduction. This partially validates our hypothesis in §3.2.2. Furthermore, we note that the algorithm with
equivalence reduction can even slightly outperform that
without it in some cases. This is not a surprising result
since both algorithms are heuristic solutions to the original problem.

64K entries via prefix aggregation. In the worst case, we


still compress the routing states from 240K to 117K in
VL2(100,96,100). Furthermore, we note that the routing entries can be further compressed using traditional
Internet IP prefix compression techniques, e.g., [16], as
a post-processing step. Our ID assignment algorithm
makes this prefix compression more efficient.
We note that our algorithm has different compression effects on different DCNs. For the 4 largest topologies, we achieve a compression ratio of 2,096,128/64,544 = 32.48 for Fattree(64), 262,140/36,164 = 7.25 for HyperX(4,16,100), 163,835/47,731 = 3.43 for BCube(8,4), and 239,900/117,550 = 2.04 for VL2(100,96,100), respectively. We believe one important decisive factor for the compression ratio is the density of the matrix M. According to Equation 1, the number of routing entries is related to the non-zero elements in M. The sparser the matrix, the more likely we achieve better results. For example, in Fattree(64), a typical path set traverses 1/32 of the aggregation switches and 1/1024 of the core switches, while in VL2(100,96,100), a typical path set traverses 1/2 of the aggregation switches and 1/50 of the core switches. This indicates that M_Fattree is much sparser than M_VL2, which explains why the compression on Fattree is better than that on VL2.

Results on randomized DCNs: We note that most


other DCNs such as CamCube [2] and CiscoDCN [13]
are regular and XPath can perform as efficiently as
above. In recent work such as Jellyfish [39] and
SWDC [38], the authors also discussed random graphs
for DCN topologies. XPath's performance is indeed unpredictable for random graphs. But for all the Jellyfish
topologies we tested, in the worst case, XPath still manages to compress over 1.8 billion paths with less than
120K entries. The runtime varies from tens of minutes to
hours or more depending on the degree of symmetry of
the random graph.

4 Implementation and Experiments

We have implemented XPath on both Windows and Linux platforms, and deployed it on a 54-server Fattree testbed with commodity switches for experiments. This paper describes the implementation on Windows. In what follows, we first introduce path ID resolution (§4.1) and failure handling (§4.2). Then, we present the testbed setup and basic XPath experiments (§4.3).

Time cost: In Table 4, we show that equivalence reduction speeds up the runtime of the ID assignment algorithm. For example, without equivalence reduction, it
cannot return an output within 24 hours when the network scales to a few thousands. With it, we can get results for all the 4 DCNs within a few minutes even when
the network becomes very large. This is acceptable because it is a one-time pre-computation and we do not require routing table re-computation as long as the network
topology does not change.

4.1 Path ID resolution

As introduced in §2.2, path ID resolution addresses how


to resolve the path IDs (i.e., routing IPs) for a destination. To achieve fault-tolerant path ID resolution, there
are two issues to consider. First, how to distribute the
path IDs of a destination to the source. The live paths to
the destination may change, for example, due to link failures. Second, how to choose the path for a destination,
and enforce such path selection in existing networks.

Effect of equivalence reduction: In Fig. 6, we compare


the performance of our ID assignment with and without
equivalence reduction. With equivalence reduction, we
use M' (i.e., part of M) to perform ID assignment, and it


servers consists of two parts: a Windows Network Driver


Interface Specification (NDIS) filter driver in kernel
space and an XPath daemon in user space. The software stacks of XPath are shown in Fig. 7. The XPath
filter driver is between the TCP/IP and the Network Interface Card (NIC) driver. We use the Windows filter
driver to parse the incoming/outgoing packets, and to intercept the packets that XPath is interested in. The XPath
user mode daemon is responsible for path selection and
packet header modification. The function of the XPath
filter driver is relatively fixed, while the algorithm module in the user space daemon simplifies debugging and
future extensions.
In Fig. 7, we observe that the packets are transferred
between the kernel and user space, which may degrade
the performance. Therefore, the XPath driver allocates a shared memory pool. With this pool, the packets are not copied, and both the driver and the daemon operate on the same shared buffer. We tested our XPath implementation (with the tunnel) and did not observe any visible impact on TCP throughput at Gigabit line rate.

Figure 7: The software stacks of XPath on servers.


These two issues look similar to the name resolution in
existing DNS. In practice, it is possible to return multiple
IPs for a server, and balance the load by returning different IPs to the queries. However, integrating the path ID
resolution of XPath into existing DNS may challenge the
usage of IPs, as legacy applications (on socket communication) may use IPs to differentiate the servers instead
of routing to them. Thus, in this paper, we develop a
clean-slate XPath implementation on the XPath manager
and end servers. Each server has its original name and IP
address, and the routing IPs for path IDs are not related
to DNS.
To enable path ID resolution, we implemented a XPath
software module on the end server, and a module on the
XPath manager. The end server XPath software queries
the XPath manager to obtain the updated path IDs for a
destination. The XPath manager returns the path IDs by
indexing the IP-to-ID mapping table. From the path IDs
in the query response, the source selects one for the current flow, and caches all (with a timeout) for subsequent
communications.
To maintain the connectivity to legacy TCP/IP stacks,
we design an IP-in-IP tunnel based implementation.
The XPath software encapsulates the original IP packets
within an IP tunnel: the path ID is used for the tunnel IP
header and the original IP header is the inner one. After
the tunnel packets are decapsulated, the inner IP packets
are delivered to destinations so that multi-path routing
by XPath is transparent to applications. Since path IDs
in Fattree end at the last hop ToR, the decapsulation is
performed there. The XPath software may switch tunnel
IP header to change the paths in case of failures, while
for applications the connection is not affected. Such IP-in-IP encapsulation also eases VM migration, as a VM can keep its original IP during migration.
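As an illustration of this encapsulation (a sketch using Scapy rather than the paper's NDIS driver; the addresses are made up), the path ID is carried in the outer IP header and the application IPs in the inner one:

    from scapy.all import IP, TCP, Raw

    def encapsulate(path_id_ip, src_app_ip, dst_app_ip, payload=b"hello"):
        """Outer header routes on the path ID; inner header carries the application IPs."""
        inner = IP(src=src_app_ip, dst=dst_app_ip) / TCP(dport=5001) / Raw(payload)
        outer = IP(dst=path_id_ip, proto=4) / inner      # protocol 4 = IP-in-IP
        return outer

    # On a path change (e.g., a failure), only the outer dst -- the path ID -- is swapped;
    # the inner connection is untouched.
    pkt = encapsulate("172.16.1.9", "10.0.1.2", "10.0.3.5")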
We note that if VXLAN [42] or NVGRE [32] is introduced for tenant network virtualization, the XPath IP header needs to be the outer IP header and we would need 3 IP headers, which looks awkward. In the future, we may
consider more efficient and consolidated packet format.
For example, we may put path ID in the outer NVGRE
IP header and the physical IP in NVGRE GRE Key field.
Once the packet reaches the destination, the host OS then
switches the physical IP and path ID.
In our implementation, the XPath software on end

4.2 Failure handling

As introduced in §2.2, when a link fails, the devices on


the failed link will notify the XPath manager. In our implementation, the communication channel for such notification is out-of-band. Such an out-of-band control network and controller are available in existing production DCNs [44].
The path IDs for a destination server are distributed using a query-response based model. After the XPath manager obtains the updated link status, it may remove the
affected paths or add the recovered paths, and respond to
any later query with the updated paths.
For proof-of-concept experiments, we implemented a
failure detection method with TCP connections on the
servers. In our XPath daemon, we check the TCP sequence numbers and switch the path ID once we detect
that the TCP has retransmitted a data packet after a TCP
timeout. The motivation is that the TCP connection is
experiencing bad performance on the current path (either failed or seriously congested) and the XPath driver
has other alternative paths ready for use. We note that
this TCP based approach is sub-optimal and there are
faster failure detection mechanisms such as BFD [15] or
F10 [27] that can detect failures in 30μs, which XPath
can leverage to perform fast rerouting (combining XPath
with these advanced failure detection schemes is our future work). A key benefit of XPath is that it does not require route re-convergence and is loop-free during failure
handling. This is because XPath pre-installs the backup
paths and there is no need to do table re-computation unless all backup paths are down.
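A minimal sketch (ours, not the daemon's actual code) of the TCP-timeout-triggered path switch described above; the cached path IDs are assumed to come from path ID resolution:

    class FlowPathState:
        """Per-flow state kept by the daemon: which cached path ID the flow currently uses."""
        def __init__(self, cached_path_ids):
            self.path_ids = list(cached_path_ids)   # alternatives from the XPath manager
            self.current = 0
            self.highest_seq_seen = -1

        def on_outgoing_segment(self, seq):
            # Simplified: a segment whose sequence number we have already seen is treated
            # as a retransmission after a timeout, so the current path is considered bad.
            if seq <= self.highest_seq_seen:
                self.switch_path()
            else:
                self.highest_seq_seen = seq

        def switch_path(self):
            if len(self.path_ids) > 1:
                self.current = (self.current + 1) % len(self.path_ids)
            return self.path_ids[self.current]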

Figure 8: Fattree(6) testbed with 54 servers: 18 ToR, 18 Agg, and 9 Core switches. Each ToR switch connects 3 servers (not drawn).

T4 to T18. Each source server initiates 30 TCP connections in parallel, and each destination server hosts two TCP connections. The total link capacity from T1 is 3x1G = 3G, shared by 90 TCP connections.

Given that the 90 TCP connections randomly share the 3 uplinks from T1, the load should be balanced overall. At around 40 seconds, we disconnect one link (T1 to A1). We use the TCP sequence based method developed in §4.2 for automatic failure detection and recovery in this experiment. We then resume the link at around 80 seconds to check whether the load is still balanced. We log the goodput (observed by the application) and show the results for three connections versus time in Fig. 10. Since we find that the throughput of all 90 TCP connections is very similar, we just show the throughput of one TCP connection for each source server.
We observe that all the TCP connections can share the
links fairly with and without failure. When the link fails,
the TCP connections traversing the failed link (T1 to A1)
quickly migrate to the healthy links (T1 to A2 and A3).
When the failed link recovers, it can be reused on a new
path ID resolution after the timeout of the local cache. In
our experiment, we set the cache timeout value as 1 second. However, one can change this parameter to achieve
satisfactory recovery time for resumed links. We also run
experiments for other traffic patterns, e.g., ToR-to-ToR
and All-to-ToR, and link failures at different locations,
and find that XPath works as expected in all cases.

Figure 9: The CDF of path ID resolution time.

Figure 10: TCP goodput of three connections versus time in three phases: no failure, in failure, and recovered.

4.3 Testbed setup and basic experiments

Testbed setup: We built a testbed with 54 servers connected by a Fattree(6) network (as shown in Fig. 8) using
commodity Pronto Broadcom 48-port Gigabit Ethernet
switches. On the testbed, there are 18 ToR, 18 Agg, and
9 Core switches. Each switch has 6 GigE ports. We
achieve these 45 virtual 6-port GigE switches by partitioning the physical switches. Each ToR connects 3
servers; and the OS of each server is Windows Server
2008 R2 Enterprise 64-bit version. We deployed XPath
on this testbed for experimentation.
IP table configuration: On our testbed, we consider 2754 explicit paths between ToRs (25758 paths between end hosts). After running the two-step compression algorithm, the number of routing entries for the switch IP tables is as follows: ToR: 31-33, Agg: 48, and Core: 6. Note that since the Fattree topology is symmetric, the numbers of routing entries after our heuristic are almost the same for the switches at the same layer, which confirms our hypothesis in §3.2.2 that equivalent nodes are likely to have similar numbers of entries.
Path ID resolution time: We measure the path ID resolution time at the XPath daemon on end servers: from the
time when the query message is generated to the time the
response from the XPath manager is received. We repeat
the experiment 4000 times and depict the CDF in Fig. 9.
We observe that the 99-th percentile latency is 4ms. The
path ID resolution is performed for the first packet to a
destination server that is not found in the cache, or upon cache timeout. A further optimization is to perform path ID
resolution in parallel with DNS queries.

5 XPath Applications

To showcase XPath's utility, we use it for explicit path


support in four applications. The key is that, built on
XPath, applications can freely choose which path to use
without worrying about how to set up the path and the
time cost or overhead of setting up the path. In this regard, XPath emerges as an interface for applications to
use explicit paths conveniently, but does not make any
choice on behalf of them.

XPath routing with and without failure: In this experiment, we show basic routing of XPath, with and without link failures. We establish 90 TCP connections from
the 3 servers under ToR T1 to the 45 servers under ToRs

Figure 11: XPath utility case #1: we leverage XPath to make necessary bandwidth easier to implement for provisioned IOPS. (a) Path P1: T1-A1-T3; P2: T1-A2-T3; P3: T1-A3-T3. (b) Throughput and completion time of XPath and ECMP. (c) Average IOPS: XPath 15274, ECMP 4547.

5.1 XPath for provisioned IOPS

In cloud services, there is an increasing need for provisioned IOPS. For example, Amazon EBS enforces provisioned IOPS for instances to ensure that disk resources can be accessed with high and consistent I/O performance whenever needed [25]. To enforce such provisioned IOPS, the network should first provide the necessary bandwidth for the instances [9]. In this experiment, we show that XPath can be easily leveraged to use an explicit path with the necessary bandwidth.

Figure 12: XPath utility case #2: we leverage XPath to assist zUpdate [26] to accomplish DCN update with zero loss. (a) Remaining bandwidth on P1, P2, P3 is 300, 100, 100 Mbps. (b) Time t1: move f3 from P2 to P3; t2: move f1 from P1 to P2; t3: move f1 from P2 to P1; t4: move f3 from P3 to P2.

5.2 XPath for network updating

In production data centers, DCN update occurs frequently [26]. It can be triggered by the operators, applications and various networking failures. zUpdate [26]
is an application that aims to perform congestion-free
network-wide traffic migration during DCN updates with
zero loss and zero human effort. In order to achieve its
goal, zUpdate requires explicit routing path control over
the underlying DCNs. In this experiment, we show how
XPath assists zUpdate to accomplish DCN update and
use a switch firmware upgrade example to show how traffic migration is conducted with XPath.
In Fig. 12(a), initially we assume 4 flows (f1 , f2 , f3
and f4 ) on three paths (P1 , P2 and P3 ). Then we move f1
away from switch A1 to do a firmware upgrade for switch
A1. However, neither P2 nor P3 has enough spare bandwidth to accommodate f1 at this point in time. Therefore
we need to move f3 from P2 to P3 in advance. Finally,
after the completion of firmware upgrade, we move all
the flows back to original paths. We leverage XPath to
implement the whole movement.
In Fig. 12(b), we depict the link utilization dynamics.
At time t1 , when f3 is moved from P2 to P3 , the link utilization of P2 drops from 0.6 to 0.4 and the link utilization of P3 increases from 0.7 to 0.9. At time t2 , when f1
is moved from P1 to P2 , the link utilization of P1 drops
from 0.5 to 0 and the link utilization of P2 increases from
0.4 to 0.9. The figure also shows the changes of the link
utilization at time t3 and t4 when moving f3 back to P2
and f1 back to P1 . It is easy to see that with the help of
XPath, P1 , P2 and P3 see no congestion and DCN update
proceeds smoothly without loss.

As shown in Fig. 11(a), we use background UDP flows to saturate the ToR-Agg links and leave the remaining bandwidth on the 3 paths (P1, P2 and P3) between X-Y as 300Mbps, 100Mbps, and 100Mbps respectively. Suppose there is a request for provisioned IOPS that requires 500Mbps of necessary bandwidth (the provisioned IOPS is about 15000 and the chunk size is 4KB). We now leverage XPath and ECMP to write 15GB of data (4 million
chunks) through 30 flows from X to Y, and measure the
achieved IOPS respectively. The storage we used for the
experiment is Kingston V+200 120G SSD, and the I/O
operations on the storage are sequential read and sequential write.
From Fig. 11(c), it can be seen that using ECMP we
cannot provide the necessary bandwidth between X-Y
for the provisioned IOPS although the physical capacity is there. Thus, the actual achieved IOPS is only 4547,
and the write under ECMP takes much longer time than
that under XPath as shown in Fig. 11(c). This is because ECMP performs random hashing and cannot specify the explicit path to use, hence it cannot accurately
make use of the remaining bandwidth on each of the
multiple paths for end-to-end bandwidth provisioning. In
contrast, XPath can be easily leveraged to provide the required bandwidth due to its explicit path control. With
XPath, we explicitly control how to use the three paths
and accurately provide 500Mbps necessary bandwidth,
achieving 15274 IOPS.
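The 500Mbps figure follows directly from the provisioned rate and chunk size; a quick check (ours, ignoring protocol overhead):

    iops, chunk_bytes = 15000, 4 * 1024           # provisioned IOPS and 4KB chunks
    required_mbps = iops * chunk_bytes * 8 / 1e6  # payload bandwidth needed
    print(required_mbps)                          # ~491.5 Mbps, i.e., roughly the 500Mbps provisioned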

Figure 14: XPath utility case #4: we leverage XPath to select non-conflict paths to speed up many-to-many data shuffle.

9 servers in one pod send data to 9 servers in the other


pod. We varied the data volume from 40G to over 400G.
We compared XPath with ECMP.
In Fig. 14, it can be seen that by using XPath for data shuffle, we can perform considerably better than randomized ECMP hash-based routing. More specifically, XPath reduces the shuffle time by over 3x for most of the experiments. The reason is that XPath's explicit path IDs can be easily leveraged to arrange non-interfering paths for shuffling; thus the network bisection bandwidth is fully utilized for speedup.

5.3 Virtual network enforcement with XPath

In cloud computing, virtual data center (VDC) abstraction with bandwidth guarantees is an appealing model
due to its performance predictability in shared environments [7, 19, 45]. In this experiment, we show XPath
can be applied to enforce virtual networks with bandwidth guarantees. We assume a simple SecondNet-based
VDC model with 4 virtual links, and the bandwidth requirements on them are 50Mbps, 200Mbps, 250Mbps
and 400Mbps respectively as shown in Fig. 13(a). We
then leverage XPath's explicit path control to embed this
VDC into the physical topology.
In Fig. 13(b), we show that XPath can easily be employed to use the explicit paths in the physical topology
with enough bandwidth to embed the virtual links. In
Fig. 13(c), we measure the actual bandwidth for each
virtual link and show that the desired bandwidth is accurately enforced. However, we found that ECMP cannot
be used to accurately enable this because ECMP cannot
control paths explicitly.

Figure 13: XPath utility case #3: we leverage XPath to accurately enforce VDC with bandwidth guarantees.

6 Related Work

The key to XPath is explicit path control. We note


that many other approaches such as source routing [36],
MPLS [35], OpenFlow [29] and the like, can also enable
explicit path control. However, each of them has its own
limitation.
OpenFlow [29] has been used in many recent proposals (e.g., [5, 8, 21, 22, 26]) to enable explicit path
control. OpenFlow can establish fine-grained explicit
routing path by installing flow entries in the switches
via the OpenFlow controller. But in current practice,
there are still challenges such as small flow table size
and dynamic flow entries setup that need to be solved.
For example, the on-chip OpenFlow forwarding rules in
commodity switches are limited to a small number, typically 1-4K. To handle this limitation, recent solutions,
e.g. [22], dynamically change, based on traffic demand,
the set of live paths available in the network at different
times through dynamic flow table configurations, which
could potentially introduce non-trivial implementation
overhead and performance degradation. XPath addresses
such challenge by pre-installing all desired paths into IP
LPM tables. In this sense, XPath complements existing OpenFlow-based solutions in terms of explicit path
control, and in the meanwhile, the OpenFlow framework
may still be able to be used as a way for XPath to preconfigure the switches and handle failures.


Source routing is usually implemented in software and slow paths, and is not supported in the hardware of data center switches, which typically only support destination IP based routing. Compared to source routing, XPath is readily deployable without waiting for new hardware capability; moreover, XPath's header length is fixed, whereas it is variable for source routing with different path lengths.
With MPLS, paths can also be explicitly set up before data transmission using MPLS labels. However, XPath differs from MPLS in the following aspects. First, because MPLS labels only have local significance, MPLS requires a dynamic Label Distribution Protocol (LDP) for label assignment; in contrast, XPath path IDs are unique, and we do not need such a signaling protocol. Second, MPLS is based on exact matching (EM), and thus MPLS labels cannot be aggregated, whereas XPath is based on longest prefix matching (LPM), which enables more efficient routing table compression. Furthermore, MPLS is typically used only for traffic engineering in core networks rather than for application-level or flow-level path control. In addition, it is reported [6, 22] that the number of tunnels that existing MPLS routers can support is limited.
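To see why LPM helps here, consider a toy illustration (ours; the paper's actual two-step compression is more involved): two path IDs, written as IPv4 addresses, that differ only in their last bit can be served by a single /31 prefix entry, whereas exact-match labels would always cost one entry each.

import ipaddress

def aggregate_pair(ip_a, ip_b):
    """If two path IDs (expressed as IPv4 addresses) differ only in their
    lowest bit, return the single /31 prefix covering both; otherwise None."""
    a = int(ipaddress.IPv4Address(ip_a))
    b = int(ipaddress.IPv4Address(ip_b))
    if a ^ b == 1:                       # IDs differ only in the last bit
        return ipaddress.ip_network((min(a, b), 31))
    return None

# Two hypothetical path IDs that exit a switch via the same next hop:
print(aggregate_pair("10.0.1.2", "10.0.1.3"))  # 10.0.1.2/31 -> one LPM entry
# Exact matching (as with MPLS labels) always needs one entry per label,
# and IDs that differ in more than the last bit cannot be merged this way:
print(aggregate_pair("10.0.1.2", "10.0.2.3"))  # None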
SPAIN [30] builds a loop-free tree per VLAN and utilizes multiple paths across VLANs between two nodes,
which increases the bisection bandwidth over the traditional Ethernet STP. However, SPAIN does not scale well
because each host requires an Ethernet table entry per
VLAN. Further, its network scale and path diversity are
also restricted by the number of VLANs supported by
Ethernet switches, e.g., 4096.
PAST [40] implements per-address spanning tree routing for data center networks using the MAC table. PAST supports more spanning trees than SPAIN, but it does not support multiple paths between two servers, because a destination has only one tree. This is determined by the MAC table size and its exact matching on flat MAC addresses.
Both SPAIN and PAST are L2 technologies. Relative to them, XPath builds on L3 and harnesses the fast-growing IP LPM tables of commodity switches. One reason we choose IP instead of MAC is that it allows prefix aggregation. It is worth noting that our XPath framework subsumes both SPAIN and PAST: XPath can express SPAIN's VLANs or PAST's spanning trees using CPF, and it can also arrange paths using DPF and perform path ID encoding and prefix aggregation for scalability.
Finally, there are various DCN routing schemes that come with specific topologies, such as those introduced in Fattree [4], PortLand [31], BCube [18], VL2 [17], ALIAS [43], and so on. For example, PortLand [31] leverages the Fattree topology to assign hierarchical Pseudo MACs to end hosts, while VL2 [17] exploits a folded Clos network to allocate location-specific IPs to ToRs. These topology-aware addressing schemes generally benefit prefix aggregation and can lead to very small routing tables; however, they do not enable explicit path control and still rely on ECMP [31] or Valiant Load Balancing (VLB) [17] to spread traffic over multiple paths. Relative to them, XPath enables explicit path control for general DCN topologies.

7  Conclusion

XPath is motivated by the need for explicit path control in DCN applications. At its very core, XPath uses a path ID to identify an end-to-end path, and pre-installs all the desired path IDs between any s-d pair into the IP LPM tables of commodity switches using a two-step compression algorithm. Through extensive evaluation and implementation, we show that XPath is scalable and easy to implement with existing commodity switches. Finally, we used testbed experiments to show that XPath can directly benefit many popular DCN applications.

Acknowledgements
This work was supported by the Hong Kong RGC ECS
26200014 and China 973 Program under Grant No.
2014CB340303. Chang Lan and Hongze Zhao were interns with Microsoft Research Asia when they worked
on this project. We would like to thank our shepherd
George Porter and the anonymous NSDI reviewers for
their feedback and suggestions.

References
[1] Arista 7050QX. http://www.aristanetworks.com/media/system/pdf/Datasheets/7050QX-32_Datasheet.pdf.
[2] H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly, "Symbiotic Routing in Future Data Centers," in SIGCOMM, 2010.
[3] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber, "HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks," in SC, 2009.
[4] M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable, Commodity Data Center Network Architecture," in ACM SIGCOMM, 2008.
[5] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, "Hedera: Dynamic Flow Scheduling for Data Center Networks," in NSDI, 2010.
[6] D. Applegate and M. Thorup, "Load optimal MPLS routing with N + M labels," in INFOCOM, 2003.
[7] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron, "Towards Predictable Datacenter Networks," in SIGCOMM, 2011.
[8] T. Benson, A. Anand, A. Akella, and M. Zhang, "MicroTE: Fine Grained Traffic Engineering for Data Centers," in CoNEXT, 2010.
[9] I/O Characteristics. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-io-characteristics.html.
[10] K. Chen, C. Guo, H. Wu, J. Yuan, Z. Feng, Y. Chen, S. Lu, and W. Wu, "Generic and Automatic Address Configuration for Data Centers," in SIGCOMM, 2010.
[11] K. Chen, A. Singla, A. Singh, K. Ramachandran, L. Xu, Y. Zhang, X. Wen, and Y. Chen, "OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility," in NSDI, 2012.
[12] M. Chowdhury, M. Zaharia, J. Ma, M. Jordan, and I. Stoica, "Managing Data Transfers in Computer Clusters with Orchestra," in SIGCOMM, 2011.
[13] Cisco, "Data center: Load balancing data center services," 2004.
[14] P. T. Darga, K. A. Sakallah, and I. L. Markov, "Faster Symmetry Discovery using Sparsity of Symmetries," in 45th DAC, 2008.
[15] Bidirectional Forwarding Detection. http://www.cisco.com/c/en/us/td/docs/ios/12_0s/feature/guide/fs_bfd.html.
[16] R. Draves, C. King, S. Venkatachary, and B. Zill, "Constructing optimal IP routing tables," in INFOCOM, 1999.
[17] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," in ACM SIGCOMM, 2009.
[18] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, "BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers," in SIGCOMM, 2009.
[19] C. Guo, G. Lu, H. Wang, S. Yang, C. Kong, P. Sun, W. Wu, and Y. Zhang, "SecondNet: A Data Center Network Virtualization Architecture with Bandwidth Guarantees," in CoNEXT, 2010.
[20] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, "DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers," in SIGCOMM, 2008.
[21] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown, "ElasticTree: Saving Energy in Data Center Networks," in NSDI, 2010.
[22] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer, "Achieving High Utilization with Software-Driven WAN," in ACM SIGCOMM, 2013.
[23] C. Hopps, "Analysis of an Equal-Cost Multi-Path Algorithm," RFC 2992, 2000.
[24] Broadcom Strata XGS Trident II. http://www.broadcom.com.
[25] Provisioned I/O-EBS. https://aws.amazon.com/ebs/details/.
[26] H. Liu, X. Wu, M. Zhang, L. Yuan, R. Wattenhofer, and D. Maltz, "zUpdate: Updating Data Center Networks with Zero Loss," in ACM SIGCOMM, 2013.
[27] V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson, "F10: A Fault-Tolerant Engineered Network," in NSDI, 2013.
[28] B. D. McKay, "Practical graph isomorphism," in Congressus Numerantium, 1981.
[29] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, "OpenFlow: Enabling Innovation in Campus Networks," ACM CCR, 2008.
[30] J. Mudigonda, P. Yalagandula, and J. Mogul, "SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary Topologies," in NSDI, 2010.
[31] R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric," in SIGCOMM, 2009.
[32] NVGRE. http://en.wikipedia.org/wiki/NVGRE.
[33] Y. Peng, K. Chen, G. Wang, W. Bai, Z. Ma, and L. Gu, "HadoopWatch: A First Step Towards Comprehensive Traffic Forecasting in Cloud Computing," in INFOCOM, 2014.
[34] Announcing Provisioned IOPS. http://aws.amazon.com/about-aws/whats-new/2012/07/31/announcing-provisioned-iops-for-amazon-ebs/.
[35] E. Rosen, A. Viswanathan, and R. Callon, "Multiprotocol Label Switching Architecture," RFC 3031, 2001.
[36] Source Routing. http://en.wikipedia.org/wiki/Source_routing.
[37] Boolean Satisfiability Problem. http://en.wikipedia.org/wiki/Boolean_satisfiability_problem.
[38] J.-Y. Shin, B. Wong, and E. G. Sirer, "Small-World Datacenters," in ACM SoCC, 2011.
[39] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey, "Jellyfish: Networking Data Centers Randomly," in NSDI, 2012.
[40] B. Stephens, A. Cox, W. Felter, C. Dixon, and J. Carter, "PAST: Scalable Ethernet for Data Centers," in CoNEXT, 2012.
[41] Graph Vertex Coloring. http://en.wikipedia.org/wiki/Graph_coloring.
[42] VXLAN. http://en.wikipedia.org/wiki/Virtual_Extensible_LAN.
[43] M. Walraed-Sullivan, R. N. Mysore, M. Tewari, Y. Zhang, K. Marzullo, and A. Vahdat, "ALIAS: Scalable, Decentralized Label Assignment for Data Centers," in SoCC, 2011.
[44] X. Wu, D. Turner, G. Chen, D. Maltz, X. Yang, L. Yuan, and M. Zhang, "NetPilot: Automating Datacenter Network Failure Mitigation," in SIGCOMM, 2012.
[45] D. Xie, N. Ding, Y. C. Hu, and R. Kompella, "The Only Constant is Change: Incorporating Time-Varying Network Reservations in Data Centers," in SIGCOMM, 2012.
