Dynamic Distributed Algorithm for Computing Multiple Next-Hops on a Tree - ICNP 2013

Dynamic Distributed Algorithm for Computing
Multiple Next-Hops on a Tree

Haijun Geng, Xingang Shi, Xia Yin, Zhiliang Wang
Department of Computer Science & Technology, Tsinghua University
Institute for Network Sciences and Cyberspace, Tsinghua University
Tsinghua National Laboratory for Information Science and Technology (TNList)
Email: {genghj, yxia, wzl}@csnet1.cs.tsinghua.edu.cn, shixg@cernet.edu.cn
Abstract: High reliability is always pursued by network designers. Multipath routing can provide multiple paths for transmission and failover, and is considered effective in improving network reliability. However, existing multipath routing algorithms focus on finding as many paths as possible, rather than on limiting their computation or communication overhead. We propose a dynamic distributed multipath algorithm (DMPA) that helps a router in a link-state network find multiple next-hops for each destination. A router runs the algorithm locally and independently: only a single shortest path tree (SPT) needs to be constructed, and no messages other than the basic link states are disseminated. DMPA maintains the SPT and dynamically adjusts it in response to network state changes, so the sets of next-hops can be updated incrementally and efficiently. At the same time, DMPA guarantees that the induced forwarding paths are loop-free, a property underpinned by a partial order on the routers. We evaluate DMPA and compare it with several recent multipath algorithms, using a set of real, inferred and synthetic topologies. The results show that DMPA provides good reliability and fast recovery with very low overhead.
I. INTRODUCTION
With the rapid development of the Internet, more and more services and applications are widely deployed, which pose stringent requirements on its effectiveness and reliability. Traditional routing algorithms are mostly concerned with finding a shortest path towards the destination, and cannot provide good connectivity under frequent network failures [1]. This highlights the need for mechanisms with fast and efficient recovery capabilities. Towards this goal, multipath routing [2]–[8] has been proposed to use multiple alternative paths for data transmission. It not only improves a network's resilience to route failures, but also provides other benefits such as better load balance, higher throughput, and enhanced security.
Since the performance of routing and forwarding is critical for the Internet, multipath algorithms must be highly efficient so as not to become a bottleneck. Existing approaches often focus on finding as many paths as possible, but make little effort to reduce their computation or communication overhead. For example, they either build multiple shortest path trees [2], [7], [8], or exchange excessive messages between neighbors [3], so the induced cost is particularly high for high-degree nodes.¹ The only exception we are aware of is the TBFH algorithm proposed by P. Mérindol et al. [9], which has the lowest time complexity among these multipath algorithms.
978-1-4799-1270-4/13/$31.00 ©2013 IEEE
On the other hand, consider a link-state network where each router can learn the link states of the whole network. When a link changes its state or weight, all existing algorithms recompute all the next-hops for each destination from scratch, wasting router resources and delaying route convergence. We believe such full recomputation is unnecessary, and that its cost can be significantly reduced.
We propose a tree-based distributed multipath algorithm, DMPA, for link-state networks. On each node, only a single shortest path tree (SPT) needs to be computed, locally and independently, without disseminating any information other than the usual link states. A next-hop selection rule is designed such that when the tree is fully built, a set of next-hops is derived for each destination in the network, and any forwarding path induced by the distributed computation is loop-free, which is guaranteed by a partial order of the routers along the path. In addition, the next-hops can be updated incrementally in response to any link-state change, instead of being computed from scratch. As far as we know, DMPA is the first dynamic distributed multipath algorithm that can produce multiple loop-free paths, and it is much more efficient than existing multipath algorithms, as our evaluations will demonstrate.
The rest of the paper is organized as follows. Section II introduces the background and related work. Section III presents the details of DMPA and proves its loop-free property, while Section IV evaluates its performance and compares it with several recent algorithms. Finally, Section V concludes the paper.
II. BACKGROUND
Nowadays, Internet failures have become routine events rather than exceptions [6]. Many solutions have been proposed to handle this problem at different levels, from physical-level approaches such as optical routing protection, to application-level schemes such as remote server backup. Since network connectivity is a core service provided by the routing framework, it is natural to design routing algorithms that are more resilient to failures. Multipath routing computes multiple paths between source-destination pairs, and provides not only redundant backups, but also other features such as load balance and aggregated bandwidth.
¹ In this paper, we use "router" and "node" interchangeably.
There have been many multipath routing solutions. Equal-cost multi-path (ECMP) [10] allows packets to be routed along multiple paths of equal cost, which network operators can tune deliberately. However, in certain cases it is simply impossible to achieve equal costs no matter what link weights are used [11], and ECMP does not offer good reliability. Source-selectable deflection [2] deflects packets based on the shortest path costs from a router and its neighbors to each destination. It proposes three rules with increasing path diversity as well as computation complexity, and its overhead also increases proportionally to the degree of a node.
Multi-topology routing [7], [8] pre-computes routes based on backup topologies tailored for specific failures, either by removing the corresponding edges or by increasing their associated costs. Thus it enables each router to store several valid paths to each destination. Path splicing [3] is an enhancement to multi-topology routing. It creates a set of slices for the network based on random link-weight perturbations, and end systems can control which slices the routers should use by embedding control bits in packet headers. In [12], multiple instances of a link-state routing protocol are used to provide multiple choices, where each link is associated with a vector of weights tuned by end systems. The complexity of these algorithms is proportional to the number of alternative configurations they employ.
Discount Shortest Path Algorithm (DSPA) [13] computes
K-shortest paths and takes into account both path quantity and
path independence. However, computing the K-shortest paths
is still much more computationally intensive than finding a
single shortest path.
Since the performance of routing and forwarding is critical to the Internet, efficient route computation methods, in particular dynamic algorithms for shortest path tree computation, have been extensively studied in the literature [14]–[17]. These algorithms only need to incrementally update their data structures when the network state changes, and are thus much faster than static algorithms, which recompute from scratch each time. However, they only consider a single path for each destination. Recently, TBFH [9] was proposed to accelerate multipath computation based on the next-hop rule presented in [18], but it is still a static algorithm, and we are not aware of any dynamic algorithm for efficient multipath computation, which is the focus of this paper.
III. ALGORITHM FOR COMPUTING MULTIPLE NEXT-HOPS
A. Notations and Basic Idea

Before formally describing our tree-based multipath algorithm, we first define some notations, which are also summarized in Table I. A network is modeled as an undirected graph G = (V, E), where V and E denote the set of nodes (routers) and the set of edges (links) in the network, respectively. Each node v ∈ V has a unique Router-ID R(v), while each edge between nodes u and v has a positive link cost L(u, v).²
Each node independently computes its next-hops for all destinations, so in the rest of the paper, our algorithm is described with respect to a particular node c that performs such a computation. This node builds a shortest path tree T_c rooted at itself, containing all the nodes in the network as potential destinations. We denote the cost of the path from c to v in T_c by C_c(v), the children of v in T_c by H_c(v), the parent of v in T_c by P_c(v), and the descendants of v in T_c by D_c(v), with v itself included.
² L(u, v) = ∞ if u and v are not neighbors.
TABLE I: Notations
G = (V, E)    Undirected graph with nodes and edges
L(u, v)       Direct link cost between node u and node v
T_c           Shortest path tree rooted at node c
C_c(v)        Cost from node c to node v in T_c
H_c(v)        Children of node v in T_c
P_c(v)        Parent of node v in T_c
D_c(v)        Descendants of node v (itself included) in T_c
N_c(v)        Next-hop set computed by node c for destination node v
B_c(v)        Best next-hop computed by node c for destination node v
The objective of c is to compute a candidate set of next-hops N_c(v) for each destination v (v ≠ c), so that when a packet destined to v arrives at c, c can select a next-hop from N_c(v) and forward the packet to it. In particular, we use B_c(v) to represent the best/default candidate, which lies along the shortest path from c to v, and save it as the first entry in N_c(v) (i.e., B_c(v) = N_c(v)[0]). Since T_c is a shortest path tree, C_c(v) is the lowest cost from c to v in the network, leading to the following lemma.
Lemma 1. (The Best Next-Hop Rule)
$$
B_c(v) =
\begin{cases}
v, & P_c(v) = c \\
B_c(P_c(v)), & P_c(v) \neq c
\end{cases}
\qquad (1)
$$
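As a minimal Python sketch of equation (1) (our own illustration, not the authors' code), assume the tree is stored as a hypothetical dict `parent` mapping each node v ≠ c to P_c(v). The first function climbs parent pointers; the second applies the rule once per node in the order nodes join the SPT, which is how a Dijkstra-style construction can evaluate it in constant time per node.

```python
# Sketch of equation (1). `parent` maps each node v != root to P_c(v);
# this dict-based representation is our assumption, not the paper's.

def best_next_hop(parent: dict, root: str, v: str) -> str:
    """Climb from v toward the root; the last node before the root is B_c(v)."""
    while parent[v] != root:
        v = parent[v]          # B_c(v) = B_c(P_c(v)) when P_c(v) != c
    return v


def best_next_hops_in_order(parent: dict, root: str, order: list) -> dict:
    """Same rule, applied once per node in the order nodes join the SPT."""
    best = {}
    for v in order:            # a parent always joins before its children
        p = parent[v]
        best[v] = v if p == root else best[p]
    return best


# Hypothetical tree rooted at a: a -> b -> e -> j and a -> c.
parent = {"b": "a", "c": "a", "e": "b", "j": "e"}
print(best_next_hop(parent, "a", "j"))   # b
print(best_next_hops_in_order(parent, "a", ["b", "c", "e", "j"]))
```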
Equation (1) in Lemma 1 means that the best next-hop B_c(v) for a destination v is c's direct child along the path from c to v in T_c. A shortest path routing algorithm, such as Open Shortest Path First (OSPF) [10], [19], computes a single next-hop B_c(v) by applying equation (1) at each step when a new node v is added to the SPT. To compute a set N_c(v) of next-hops for v, we start with a simple rule called the downstream criterion (DC) [20], rephrased as follows:
Theorem 1. For packets destined to a destination v, node c (c ≠ v) can forward them to any of its neighboring nodes x as long as C_x(v) < C_c(v), and there will be no forwarding loop in the induced forwarding path.
This theorem holds because C_x(v) and C_c(v) are respectively the lowest costs from x to v and from c to v in the network, so the cost to the destination strictly decreases at every forwarding step. It is easy to verify that forwarding packets to the best next-hop also satisfies the DC rule, as stated in the following lemma.
Lemma 2. C_{B_c(v)}(v) < C_c(v).
The DC rule is the basis of many loop-free multipath routing algorithms [2], [9], [18], which differ in how they find neighboring nodes that satisfy this rule.³ One of our contributions lies in the following rule, which is slightly stricter than the DC rule:
³ Their ways of finding such neighbors are also the root cause of their high complexity.
Definition 1. Given any two nodes u and v in the shortest path tree T_c, if
$$
C_c(u) - C_c(B_c(u)) + L(u, v) < C_c(v), \qquad (2)
$$
we say u can contribute (its best next-hop) to v.

Lemma 3. (The Next-Hop Contribution Rule) If u can contribute to v, then C_{B_c(u)}(v) < C_c(v), i.e., the best next-hop of u is a viable next-hop for v as suggested by the DC rule.
Proof: C_{B_c(u)}(v) is the cost from B_c(u) to v in the SPT T_{B_c(u)}, and is also the lowest cost from B_c(u) to v in the network. Since B_c(u) lies on the tree path from c to u, C_c(u) - C_c(B_c(u)) is the cost of the tree path from B_c(u) to u, so C_c(u) - C_c(B_c(u)) + L(u, v) is the cost of a path from B_c(u) via u to v. This cost cannot be smaller than C_{B_c(u)}(v), hence
$$
C_{B_c(u)}(v) \le C_c(u) - C_c(B_c(u)) + L(u, v) < C_c(v).
$$
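To make equation (2) concrete, the Python sketch below evaluates it on a small hypothetical SPT of our own (root a, tree edges a-b, a-c, a-d and b-e). The dicts L, C and B and the function can_contribute are illustrative names, not part of DMPA's specification. The last line checks the bound used in the proof of Lemma 3 for u = e and v = c.

```python
INF = float("inf")

# Hypothetical SPT state at root a (our own example): link costs L(u, v),
# tree costs C(v) and best next-hops B(v). Tree edges: a-b, a-c, a-d, b-e.
L = {
    "a": {"b": 2, "c": 6, "d": 6},
    "b": {"a": 2, "e": 2},
    "c": {"a": 6, "e": 3},
    "d": {"a": 6, "e": 3},
    "e": {"b": 2, "c": 3, "d": 3},
}
C = {"a": 0, "b": 2, "c": 6, "d": 6, "e": 4}
B = {"b": "b", "c": "c", "d": "d", "e": "b"}


def can_contribute(u: str, v: str) -> bool:
    """Equation (2): C(u) - C(B(u)) + L(u, v) < C(v); non-neighbors have L = inf."""
    return C[u] - C[B[u]] + L[u].get(v, INF) < C[v]


print(can_contribute("e", "c"))   # True: 4 - 2 + 3 = 5 < 6
print(can_contribute("e", "b"))   # False: 4 - 2 + 2 = 4 is not < 2
print(can_contribute("c", "d"))   # False: c and d are not neighbors
# Lemma 3 in action: the path B(e) = b -> e -> c costs 2 + 3 = 5 < C(c) = 6.
print(L["b"]["e"] + L["e"]["c"] < C["c"])   # True
```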
The merit of the next-hop contribution rule is that it is very easy to check whether equation (2) is satisfied for any two nodes u and v in the SPT, since all terms in equation (2) have already been computed. So at each step when a new node v is added to the SPT, we can simply check whether any other node u added earlier can contribute to it, and add B_c(u) to N_c(v) if so. Similarly, if v can contribute to u, we simply add B_c(v) to N_c(u). In particular, we only need to perform this test if u and v are neighbors in the network, since otherwise L(u, v) = ∞ and equation (2) cannot be satisfied. With this rule, we can compute N_c(v) for any node v faster than other multipath algorithms, without introducing loops:
Theorem 2. For packets destined to a destination v, if any node c (c ≠ v) forwards them to any node in its next-hop set N_c(v) = {B_c(v)} ∪ {B_c(u) | u can contribute to v}, there will be no loop in the resulting forwarding path.
This can be proved simply by combining Theorem 1, Lemma 2, and Lemma 3.
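Theorem 2 can also be sanity-checked by brute force. The following Python sketch rests on our own assumptions: a hypothetical five-node topology, plain Dijkstra with heap-order tie-breaking, and contributions collected in a second pass over the finished tree rather than during its construction. Every router computes its own next-hop sets, and then, for each destination, a depth-first search looks for a cycle among all possible forwarding choices.

```python
import heapq

INF = float("inf")

# Hypothetical five-node topology (our own example); graph[u][v] = L(u, v).
graph = {
    "a": {"b": 2, "c": 6, "d": 6},
    "b": {"a": 2, "e": 2},
    "c": {"a": 6, "e": 3},
    "d": {"a": 6, "e": 3},
    "e": {"b": 2, "c": 3, "d": 3},
}


def spt(root):
    """Plain Dijkstra from root; returns tree costs C and parents P."""
    C, P = {root: 0}, {root: None}
    heap, done = [(0, root)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in graph[u].items():
            if d + w < C.get(v, INF):
                C[v], P[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return C, P


def next_hop_sets(root):
    """N(v) = {B(v)} plus {B(u) | u can contribute to v}, as in Theorem 2."""
    C, P = spt(root)
    B = {}
    for v in sorted(C, key=C.get):       # positive costs: parents come first
        if v != root:
            B[v] = v if P[v] == root else B[P[v]]        # equation (1)
    N = {v: [B[v]] for v in B}
    for u in B:
        for v, w in graph[u].items():
            if v in B and B[u] not in N[v] and C[u] - C[B[u]] + w < C[v]:
                N[v].append(B[u])                        # equation (2)
    return N


def has_loop(t, tables):
    """DFS over every possible next-hop choice toward t, looking for a cycle."""
    state = {}                           # 1 = on the DFS stack, 2 = finished

    def dfs(x):
        state[x] = 1
        for y in ([] if x == t else tables[x][t]):
            if state.get(y) == 1 or (y not in state and dfs(y)):
                return True
        state[x] = 2
        return False

    return any(x not in state and dfs(x) for x in graph)


tables = {x: next_hop_sets(x) for x in graph}   # every router runs it locally
print(tables["a"])
print(any(has_loop(t, tables) for t in graph))  # False
```

The search is exhaustive only because the example is tiny; its purpose is to show the strictly decreasing cost argument at work, not to replace the proof.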
B. Algorithm Specification
Since our distributed algorithm is specific to a computing node c, in the rest of the paper we omit the subscript c in the notations when it is clear from the context.
Our DMPA algorithm builds the shortest path tree in a way similar to Dijkstra's algorithm [21]. We assign a visited attribute to each node, which is true if and only if the node is already in the tree. DMPA maintains a priority queue Q(v, p, d) to select the proper node to be added to the tree, where p and d denote the tentative parent and cost of node v, respectively. The function Enqueue(Q, <v, p, d>) pushes a node into Q. If v already exists in Q, p and d are updated only when the new tentative cost is smaller, or when the cost is the same but the new tentative parent has a smaller router ID. The function ExtractMin(Q) returns (and deletes from Q) the node with the smallest tentative cost in Q, using the router ID as a tie breaker. In DMPA, when a node v is extracted, its tentative cost equals the smallest cost from c to v, so it can be added to the shortest path tree, and v.visited is changed from false to true.
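The queue semantics described above can be sketched in Python with the standard heapq module. This is an illustration under stated assumptions: router IDs are the node names, TentativeQueue and its methods are hypothetical names, and instead of a true decrease-key we push a new heap entry and skip outdated ones when popping (lazy deletion), which a Fibonacci heap implementation would not need.

```python
import heapq


class TentativeQueue:
    """Sketch of Q(v, p, d) with the Enqueue/ExtractMin rules described above.

    Assumption: router IDs are the node names (strings), and heap entries are
    updated lazily, i.e. outdated entries are skipped when popped.
    """

    def __init__(self):
        self._heap = []    # (tentative cost, router ID of v)
        self._best = {}    # v -> ((cost, parent ID), parent) for queued nodes

    def __bool__(self):    # "while Q is not empty"
        return bool(self._best)

    def enqueue(self, v, p, d):
        key = (d, "" if p is None else p)   # the root c has no parent
        if v not in self._best or key < self._best[v][0]:
            # new node, smaller cost, or same cost with a smaller parent ID
            self._best[v] = (key, p)
            heapq.heappush(self._heap, (d, v))

    def extract_min(self):
        while self._heap:
            d, v = heapq.heappop(self._heap)   # router ID breaks cost ties
            entry = self._best.get(v)
            if entry is not None and entry[0][0] == d:
                del self._best[v]
                return v, entry[1], d
        raise IndexError("extract_min on an empty queue")


q = TentativeQueue()
q.enqueue("c", "a", 6)
q.enqueue("d", "a", 6)
q.enqueue("e", "d", 5)
q.enqueue("e", "c", 5)    # same cost, smaller parent ID: parent becomes c
q.enqueue("c", "e", 7)    # larger cost: ignored
while q:
    print(q.extract_min())
```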
DMPA also uses two hash tables to make its incremental operation efficient. The first one, h, remembers whether nodes can contribute to each other as of the latest computation, so that h[u, v] = true if and only if u can contribute to v and B(u) ≠ B(v).⁴ This avoids repeated evaluations of equation (2). Since different nodes may have the same best next-hop, another hash table, hn, records how many nodes contribute their best next-hop b to N(v), the next-hop set of v. This reference count is denoted hn[v, b]. When a link-state update changes the contribution relation between u and v under equation (2), the reference count is updated, but the next-hop set N(v) is modified only when necessary; these operations are abstracted as Add(b, N(v)) and Del(b, N(v)).
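A minimal Python sketch of the reference-counted next-hop sets follows. The class NextHopTable and its method names are our own; we also assume that the best next-hop B(v) is inserted through the same add operation, so that it becomes N(v)[0].

```python
from collections import defaultdict


class NextHopTable:
    """Sketch of N(v) with reference counts hn[v, b] (names are ours).

    Del(b, N(v)) is spelled `delete` because `del` is a Python keyword.
    """

    def __init__(self):
        self.N = defaultdict(list)   # N(v): next-hops of destination v
        self.hn = defaultdict(int)   # hn[v, b]: how many nodes contribute b to N(v)

    def add(self, b, v):
        """Add(b, N(v)): count the reference; insert b on its first reference."""
        self.hn[v, b] += 1
        if self.hn[v, b] == 1:
            self.N[v].append(b)

    def delete(self, b, v):
        """Del(b, N(v)): drop the reference; remove b when none are left."""
        self.hn[v, b] -= 1
        if self.hn[v, b] == 0:
            del self.hn[v, b]
            self.N[v].remove(b)


t = NextHopTable()
t.add("c", "v")      # assumption: B(v) = c is inserted first, so N(v)[0] = B(v)
t.add("b", "v")      # node u1 contributes b
t.add("b", "v")      # node u2 contributes the same b
t.delete("b", "v")   # u1 stops contributing: b must stay
print(t.N["v"])      # ['c', 'b']
t.delete("b", "v")   # u2 stops too: b is removed
print(t.N["v"])      # ['c']
```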
DMPA has two versions: a static one for a full computation on a given topology, and a dynamic one for an incremental update, assuming the full computation has already been done. We use DMPA-f (Algorithm 1) and DMPA-i to distinguish them, and use a variable dynamic to identify the case when necessary. DMPA-i further handles two cases, corresponding to a link cost increase (Algorithm 3) or decrease (Algorithm 4), respectively. All of them use a common procedure, ComputeNextHopSets (Algorithm 2), to update the set of next-hops for each destination.
Algorithm 1 DMPA-f
1: dynamic ← false
2: for v ∈ V do
3:     C(v) ← ∞, H(v) ← ∅, P(v) ← nil, N(v) ← ∅
4:     v.visited ← false
5: Enqueue(Q, <c, nil, 0>)
6: ComputeNextHopSets
1) Static Version: To build the SPT and compute the next-hop sets from scratch, DMPA-f first puts the computing node c into the priority queue Q, and then goes through several iterations. In each iteration, a node v with the smallest tentative cost is popped out of Q by the ExtractMin function and added to the tree (lines 8-11).⁵ The best next-hop for v is then updated according to equation (1) (lines 14-17). For each neighbor u of v in the network, if the path from c via v to u leads to a smaller cost for u than previously known, the algorithm updates Q using the Enqueue function (lines 19-21). In this way, more nodes are included in Q and will be selected for addition to the tree later. Finally, it checks whether u and v can contribute to each other according to equation (2), and updates their next-hop sets if necessary (lines 24-35).
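Putting the pieces together, the static pass of DMPA-f can be sketched in Python as follows. This is our simplified reading of Algorithms 1 and 2, not the authors' implementation: the dynamic branches (lines 12-13 and 22-23) are omitted, the queue uses lazy deletion with the router ID (node name) as the cost tie-breaker, and the parent-ID tie-break of Enqueue is left out. The example topology is hypothetical.

```python
import heapq
from collections import defaultdict

INF = float("inf")


def dmpa_f(graph, c):
    """Static pass modeled on Algorithms 1-2 (dynamic branches omitted).

    graph[u][v] = L(u, v) on an undirected graph. Router IDs are node names,
    which break cost ties in the heap; the parent-ID tie-break is omitted.
    """
    C = {v: INF for v in graph}
    P = {v: None for v in graph}
    visited = {v: False for v in graph}
    B, N = {}, defaultdict(list)
    h, hn = {}, defaultdict(int)          # contribution flags, reference counts

    def add(b, v):                        # Add(b, N(v)) with reference counting
        hn[v, b] += 1
        if hn[v, b] == 1:
            N[v].append(b)

    tentative = {c: (0, None)}            # queued nodes: v -> (d, p)
    heap = [(0, c)]
    while heap:
        d, v = heapq.heappop(heap)        # ExtractMin
        if visited[v] or tentative[v][0] != d:
            continue                      # outdated heap entry
        p = tentative.pop(v)[1]
        P[v], C[v], visited[v] = p, d, True            # lines 9-11
        if v != c:
            B[v] = v if p == c else B[p]               # lines 14-17, eq. (1)
            add(B[v], v)                               # B(v) = N(v)[0]
        for u, w in graph[v].items():
            nd = d + w
            if not visited[u] and nd < tentative.get(u, (INF,))[0]:
                tentative[u] = (nd, v)                 # Enqueue, lines 19-21
                heapq.heappush(heap, (nd, u))
            if visited[u] and c not in (u, v):         # lines 24-35
                if B[u] == B[v]:
                    h[u, v] = h[v, u] = False
                    continue
                h[u, v] = C[u] - C[B[u]] + w < C[v]    # eq. (2), u -> v
                if h[u, v]:
                    add(B[u], v)
                h[v, u] = C[v] - C[B[v]] + w < C[u]    # eq. (2), v -> u
                if h[v, u]:
                    add(B[v], u)
    return C, B, dict(N)


graph = {"a": {"b": 2, "c": 6, "d": 6}, "b": {"a": 2, "e": 2},
         "c": {"a": 6, "e": 3}, "d": {"a": 6, "e": 3},
         "e": {"b": 2, "c": 3, "d": 3}}
C, B, N = dmpa_f(graph, "a")
print(N)   # {'b': ['b'], 'e': ['b', 'c', 'd'], 'c': ['c', 'b'], 'd': ['d', 'b']}
```

Each edge is tested for contribution once, when its second endpoint joins the tree, so no second pass over the tree is needed.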
Algorithm 2 ComputeNextHopSets
7: while Q is not empty do
8:     <v, p, d> ← ExtractMin(Q)
9:     H(P(v)) ← H(P(v)) \ {v}, H(p) ← H(p) ∪ {v}
10:    P(v) ← p, C(v) ← d
11:    v.visited ← true
12:    if dynamic = true then
13:        b ← B(v), N(v) ← ∅
14:    if P(v) = c then
15:        B(v) ← v
16:    else
17:        B(v) ← B(p)
18:    for each neighbor u of v do
19:        newdist ← C(v) + L(v, u)
20:        if (u.visited = false) ∨ (newdist < C(u)) then
21:            Enqueue(Q, <u, v, newdist>)
22:        if (dynamic = true) ∧ (h[v, u] = true) then
23:            Del(b, N(u))
24:        if u.visited = true then
25:            if B(u) = B(v) then
26:                h[u, v] ← false, h[v, u] ← false
27:            else
28:                if u can contribute to v then
29:                    Add(B(u), N(v)), h[u, v] ← true
30:                else
31:                    h[u, v] ← false
32:                if v can contribute to u then
33:                    Add(B(v), N(u)), h[v, u] ← true
34:                else
35:                    h[v, u] ← false
⁴ If B(u) = B(v), then N(v) already includes B(u), and there is no need to check whether equation (2) is satisfied (lines 25-26).
⁵ In the dynamic version, v is actually attached to a new parent in the tree.
2) Dynamic Version When Link Cost Increases: The dynamic cases are trickier. Suppose a link ℓ = <s, e>, where s and e are its two ends, increases its cost by inc. The link ℓ may either lie outside the previously constructed shortest path tree or lie in it. Algorithm 3 gives the detailed procedure.
In the former case, the tree structure is not affected, and neither is the cost C(v) of any node v. So according to the rules for constructing next-hop sets, only N(s) and N(e) may change, due to the new L(s, e) in equation (2). Consider whether s contributes its best next-hop B(s) to N(e). If s contributes (or does not contribute) to e both before and after the change, N(e) is not affected. Otherwise, since the link cost increases, it is impossible that s can contribute to e after the change but not before, so only when s can no longer contribute to e does DMPA-i update N(e), using the Del function, which implements reference counting as mentioned before. Similarly, N(s) is updated if e can no longer contribute to s (lines 38-42 in Algorithm 3).
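The non-tree case of a cost increase only touches the two endpoints. The Python sketch below mirrors lines 37-42 of Algorithm 3 on hypothetical state of our own (a five-node network rooted at a, with the non-tree link <c, e> raised from 3 to 5); the function name and the dict-based h, N and hn are illustrative assumptions.

```python
from collections import defaultdict


def increase_non_tree_link(s, e, inc, L, C, B, h, N, hn):
    """Lines 37-42 of Algorithm 3, for a link <s, e> outside the SPT.

    Assumes (as line 38 requires) that <s, e> is not a tree link and that
    neither end is the root; C and B are therefore left untouched.
    """
    L[s][e] = L[e][s] = L[s][e] + inc               # line 37

    for u, v in ((s, e), (e, s)):                   # lines 39-42
        still = C[u] - C[B[u]] + L[u][v] < C[v]     # equation (2), new L(u, v)
        if h.get((u, v)) and not still:
            h[u, v] = False
            hn[v, B[u]] -= 1                        # Del(B(u), N(v))
            if hn[v, B[u]] == 0:
                N[v].remove(B[u])


# State produced by a full computation on a hypothetical topology rooted at a.
L = {"a": {"b": 2, "c": 6, "d": 6}, "b": {"a": 2, "e": 2},
     "c": {"a": 6, "e": 3}, "d": {"a": 6, "e": 3},
     "e": {"b": 2, "c": 3, "d": 3}}
C = {"a": 0, "b": 2, "c": 6, "d": 6, "e": 4}
B = {"b": "b", "c": "c", "d": "d", "e": "b"}
N = {"b": ["b"], "c": ["c", "b"], "d": ["d", "b"], "e": ["b", "c", "d"]}
hn = defaultdict(int, {(v, b): 1 for v, hops in N.items() for b in hops})
h = {("e", "c"): True, ("c", "e"): True, ("e", "d"): True, ("d", "e"): True}

increase_non_tree_link("c", "e", 2, L, C, B, h, N, hn)   # L(c, e): 3 -> 5
print(N["c"], N["e"])   # ['c'] ['b', 'd']
```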
In the latter case, ℓ lies in the previously constructed SPT; without loss of generality, assume s is the parent of e in the SPT. Since ℓ increases its cost, only the descendants of e in the tree might be affected, while all other nodes keep their costs unchanged. So we detach from the tree all the descendants of e (including e itself), denoted by R (lines 43-46). Then we reinsert them into the shortest path tree in a way similar to the static version. Note that descendants that can immediately get a path through the remaining unaffected nodes, with a cost no worse than their old cost plus inc, must be put into Q, while the other descendants can be handled later (lines 47-52).⁶ Finally, ComputeNextHopSets is called to rebuild the remaining tree and update the next-hop sets therein.
However, ComputeNextHopSets works slightly differently in the dynamic case than in the static one. For any node v in R, its old best next-hop B(v) has to be saved for later use, and its next-hop set N(v) has to be cleared (lines 12-13). Since node v may have contributed its old best next-hop b to a neighboring node u according to equation (2), we first remove b from N(u) using the Del function, which implements reference counting (lines 22-23), and later re-check whether u and v can contribute to each other and update their next-hop sets accordingly (lines 24-35).
⁶ Alternatively, we could simply put all descendant nodes in R into Q, but that would make Q larger and degrade performance.
Algorithm 3 DMPA-i(link ℓ = <s, e>, increment inc)
36: dynamic ← true
37: L(s, e) ← L(s, e) + inc
38: if (ℓ ∉ T) ∧ (s ≠ c) ∧ (e ≠ c) then
39:     if (h[s, e] = true) ∧ (s cannot contribute to e) then
40:         Del(B(s), N(e)), h[s, e] ← false
41:     if (h[e, s] = true) ∧ (e cannot contribute to s) then
42:         Del(B(e), N(s)), h[e, s] ← false
43: if ℓ ∈ T then    // assuming s is the parent of e in T
44:     R ← D(e)
45:     for v ∈ R do
46:         v.visited ← false
47:     for v ∈ R do
48:         for each neighbor u of v do
49:             if u.visited = true then
50:                 newdist ← C(u) + L(u, v)
51:                 if newdist ≤ C(v) + inc then
52:                     Enqueue(Q, <v, u, newdist>)
53:     ComputeNextHopSets
Algorithm 4 DMPA-i(link ℓ = <s, e>, decrement dec)
54: dynamic ← true
55: L(s, e) ← L(s, e) - dec    // assuming C(e) ≥ C(s)
56: if C(e) < C(s) + L(s, e) then
57:     if (h[s, e] = false) ∧ (s can contribute to e) then
58:         Add(B(s), N(e)), h[s, e] ← true
59:     if (h[e, s] = false) ∧ (e can contribute to s) then
60:         Add(B(e), N(s)), h[e, s] ← true
61: else
62:     newdist ← C(s) + L(s, e)
63:     Enqueue(Q, <e, s, newdist>)
64:     ComputeNextHopSets
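For the tree-link case, the detach-and-seed step (lines 44-52 of Algorithm 3) can be sketched in Python as below. The function detach_and_seed and its dict arguments are hypothetical; it only computes R and the entries that would be pushed into Q, leaving the rebuild to ComputeNextHopSets. In the example, the tree link <a, b> of a hypothetical five-node network rises from 2 to 5, so R = {b, e}; only b is seeded, while e is left to be reached through b later, which is the saving the filter on line 51 provides.

```python
INF = float("inf")


def detach_and_seed(e, inc, H, graph, C, visited):
    """Lines 44-52 of Algorithm 3 for a tree link <s, e> (s = parent of e).

    H[v] lists the children of v in the old SPT, graph[u][v] already holds the
    increased L(s, e), and C still holds the old costs. Returns R = D(e) and the
    seed entries {v: (parent, newdist)} that would be pushed into Q.
    """
    R, stack = [], [e]
    while stack:                                  # line 44: R <- D(e)
        v = stack.pop()
        R.append(v)
        stack.extend(H.get(v, []))
    for v in R:                                   # lines 45-46
        visited[v] = False
    seeds = {}
    for v in R:                                   # lines 47-52
        for u, w in graph[v].items():
            if visited[u]:
                newdist = C[u] + w
                best = seeds.get(v, (None, INF))[1]
                if newdist <= C[v] + inc and newdist < best:
                    seeds[v] = (u, newdist)       # Enqueue keeps the cheapest
    return R, seeds


graph = {"a": {"b": 2, "c": 6, "d": 6}, "b": {"a": 2, "e": 2},
         "c": {"a": 6, "e": 3}, "d": {"a": 6, "e": 3},
         "e": {"b": 2, "c": 3, "d": 3}}
graph["a"]["b"] = graph["b"]["a"] = 2 + 3     # tree link <a, b>: 2 -> 5
C = {"a": 0, "b": 2, "c": 6, "d": 6, "e": 4}  # old costs
H = {"a": ["b", "c", "d"], "b": ["e"]}        # old children lists
visited = {v: True for v in graph}

R, seeds = detach_and_seed("b", 3, H, graph, C, visited)
print(R, seeds)   # ['b', 'e'] {'b': ('a', 5)}
```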
3) Dynamic Version When Link Cost Decreases: Algorithm 4 handles the case when a link ℓ = <s, e> decreases its cost by dec. Without loss of generality, we assume that the cost of e is no smaller than that of s before the change, i.e., C(e) ≥ C(s), so neither s's position in the SPT nor its cost will change.
If C(e) < C(s) + L(s, e), that is, e cannot get a smaller cost from this change, the tree structure, as well as the costs of the nodes, remains unchanged. Then only N(s) and N(e) may change, due to the new L(s, e) in equation (2). If s contributes (or does not contribute) to e both before and after the change, N(e) is not affected. Otherwise, since it is impossible that s can contribute to e before the change but not after, only when s starts to contribute to e due to the change does DMPA-i update N(e), using the Add function (with reference counting). Similarly, N(s) is updated if e starts to contribute to s. These cases are handled by lines 56-60 in Algorithm 4, and are exactly the opposite of the case when a link cost increases (lines 38-42 in Algorithm 3).
Otherwise, e can get a new (smaller) cost, so it is pushed into the priority queue Q, and all affected nodes get their costs, parents and next-hop sets updated by ComputeNextHopSets (lines 61-64).
C. Examples
1) Static Version: Fig. 1 illustrates how a full computation (DMPA-f) is carried out step by step from scratch on node a. For each step, we show the constructed SPT and the next-hop sets that are updated in that step.
In steps (2), (3) and (4), nodes b, c and d select the root a as their parent, so B(b) = b, B(c) = c and B(d) = d according to the best next-hop rule. Each best next-hop is also added to the corresponding next-hop set.
When node e is added to the tree with b as its parent in step (5), we get B(e) = B(b) = b. The next-hop contribution rule is then applied to test whether e can contribute to any of its neighbors (and vice versa). Since C(e) - C(B(e)) + L(e, c) = 4 < C(c), e can contribute to c, so B(e) = b is added to N(c). Similarly, c can contribute to e, and B(c) = c is added to N(e). Since d and e can also contribute to each other, B(d) = d is added to N(e) and B(e) = b to N(d).
[Fig. 1: Step by step construction of the SPT rooted at node a and the next-hop sets. Steps (2)-(4): B(b) = b, N(b) = {b}; B(c) = c, N(c) = {c}; B(d) = d, N(d) = {d}. Step (5): B(e) = b, N(e) = {b, c, d}, N(c) = {c, b}, N(d) = {d, b}.]
2) Dynamic Version: We next show how the next-hops can be incrementally updated by DMPA-i in response to link cost increases, using a more complex example depicted in Fig. 2. The shortest path trees are composed of the solid arrows (from parent to child), and the number inside each circle (node) represents the shortest distance from the source node a to that node. The dotted lines represent the other direct links, while a link that increases its cost is distinguished from unchanged links by a thick (solid or dotted) line.
Fig. 2(a) shows the SPT constructed before any change, together with the computed next-hop sets and the contribution status (denoted by h for simplicity) that will be affected later. For example, since e can contribute to j, B(e) = b is added to N(j), and h[e, j] = true.
Fig. 2(b) shows how DMPA-i works when the link <l, m> increases its weight from 9 to 11, and <e, j> from 7 to 9. First, consider <l, m>. Since <l, m> is not in the old SPT, no node changes its cost, and only l and m need to consider updating their next-hop sets. However, h[m, l] = h[l, m] = false, which means they do not contribute to each other before the change, so they will not after the increase either,⁷ and their next-hop sets remain unchanged.
Now consider link <e, j>, which is not in the old SPT either. The difference from the previous case is that h[e, j] = h[j, e] = true, so we have to check whether e and j can still contribute to each other after the increase, using equation (2). Since C(e) - C(B(e)) + L(e, j) = 15 > C(j), e cannot contribute to j anymore, so B(e) = b has to be removed from N(j), and h[e, j] is set to false.
Finally, consider <c, g> changing its cost from 4 to 20; Fig. 2(c) shows the new SPT after this change. Because link <c, g> lies in the old SPT in Fig. 2(a), lines 43-53 in Algorithm 3 are executed. All old descendants of g, i.e., g, j, k, l, m and n in D(g), re-compute their next-hops from scratch, while their neighboring nodes not in D(g), i.e., e, f and i (marked with a thick circle in the figure), only incrementally update their next-hops in the ComputeNextHopSets function.
[Fig. 2: Incremental Update using DMPA-i (Algorithm 3) when link cost increases. (a) SPT computed before any change; (b) <e, j>, <l, m> increase their costs; (c) <c, g> increases its cost.]
⁷ This property is guaranteed to hold, so there is no need to check it with equation (2), as suggested by lines 38-42 in Algorithm 3.
D. Multiple Link Cost Changes
The algorithms presented above only handle a single link cost change, but they can be extended to handle multiple link cost changes in a batch. For clarity, we do not present the detailed algorithms, but only give a brief explanation.
For a batch of link cost increases, lines 37-42 are executed for each changed link that is not in the old SPT. Then lines 44-52 are executed to find the nodes affected by the remaining increases and push them into the priority queue Q. Finally, ComputeNextHopSets is called to update their next-hop sets in a batch.
Similarly, for a batch of link cost decreases, lines 55-60 are executed for each changed link that does not change the tree structure. Then lines 62-63 are executed for the other cost changes, and finally ComputeNextHopSets is called.
E. Algorithm Complexity
In this section, we discuss the time complexity of DMPA. We first analyze the ComputeNextHopSets function, since most of the work is done there.
A priority queue is used to maintain the nodes that need to update their positions (parents) and costs, and ComputeNextHopSets keeps adding nodes to, extracting nodes from, and updating nodes in that queue. Let D denote the maximum node degree, N_n the number of nodes that need to change their positions or costs when a link cost changes, and N_e the number of edges that may cause any node in the queue to change its cost (this is implemented by the decrease-key operation of the priority queue). Assume that the time needed by Enqueue to enqueue a node is T_e, the time needed by ExtractMin to extract the node with the minimum cost is T_x, and the time needed by Enqueue (decrease-key) to update a node already in the queue is T_k. Since each of the N_n nodes has to be enqueued and extracted exactly once, and each of the N_e edges can cause at most one decrease-key operation, the total queue operation time over all executions of ComputeNextHopSets is at most O(N_n T_e + N_n T_x + N_e T_k). Besides queue manipulations, some operations, including modifying the next-hop sets and updating the hash tables, are called at most twice for each of the N_e edges, while others are called at most once for each of the N_n nodes to set their attributes. Each of them can be completed in constant (or constant amortized) time, so they cost O(N_n + N_e) in total. Hence the total time for all executions of ComputeNextHopSets is still O(N_n T_e + N_n T_x + N_e T_k).
Since there are at most N_n nodes in the queue, T_e = O(1), T_x = O(lg N_n) and T_k = O(1) (the last two amortized) when the queue is implemented as a Fibonacci heap [22], and the total time is at most O(N_n lg N_n + N_e).
Now consider the static algorithm (Algorithm 1). The initialization part costs at most O(|V|) time in a network of |V| nodes and |E| edges. Since N_n = |V| and N_e = |E|, Algorithm 1 costs at most O(|V|) + O(N_n lg N_n + N_e) = O(|V| lg |V| + |E|), which is similar to the complexity of a full shortest path computation.
For a dynamic computation based on Algorithm 3 or 4, at most N_n nodes and N_e ≤ D·N_n edges of these nodes are involved. The operations executed outside ComputeNextHopSets are called at most twice for each such edge, so the time needed for these parts is at most O(D·N_n). Combining this with the queue manipulation time O(N_n lg N_n + N_e), we derive the time complexity of DMPA-i as O(N_n lg N_n + D·N_n).
IV. PERFORMANCE EVALUATION
In this section, we present our evaluation methods as well as results on both real and synthetic topologies.
We compare the results achieved by DMPA against ECMP, the basic routing deflection scheme in [2] (denoted by Rule1), and TBFH [9]. Rule1 implements the DC rule (Theorem 1) and requires an SPT computation for each neighbor,⁸ while DMPA and TBFH use stricter rules than Rule1.
We have implemented DMPA, ECMP and Rule1 in C, and performed experiments on Linux with an Intel i5 1.7 GHz CPU and 4 GB of memory. For TBFH, we use the numerical results provided in the original paper [9] where appropriate.⁹
⁸ Other schemes in [2] are even more time-consuming.
For each comparison, we use the real topology of Abilene (a US-based research and education network with 11 nodes and 14 edges), and four ISP topologies inferred from measurement results by Rocketfuel [23]: Sprint (52 nodes, 84 edges), Exodus (79 nodes, 294 edges), Telstra (104 nodes, 302 edges), and Tiscali (161 nodes, 656 edges). Due to space limitations, in most cases we only present the results for Sprint and Exodus; the results for Telstra and Tiscali are similar. We also use BRITE [24] to generate synthetic topologies covering a wide range of topology sizes and node degree distributions, using the parameters listed in Table II.
TABLE II: Parameters for BRITE Topology Generation
Parameter        Value
Model            Waxman
N                20-1000
HS               1000
LS               100
m                2-40
NodePlacement    Random
GrowthType       Incremental
alpha            0.15
beta             0.2
BWDist           Constant
BwMin            10.0
BwMax            1024.0
⁹ We are still in the process of porting the code of TBFH and testing its performance under the same settings as in this paper. However, we believe that directly using the results in [9] is meaningful, because [9] also compares TBFH with ECMP and Rule1, and the results for ECMP and Rule1 there closely match ours.
[Fig. 3: Computing Time vs Topology Size and Average Node Degree. Curves: ECMP, Rule1, DMPA-f and DMPA-i; y-axis: Computing Time (us). (a) Average Node Degree = 4, x-axis: Topology Size; (b) Topology Size = 200, x-axis: Average Degree of the Topology; (c) Topology Size = 1000, x-axis: Average Degree of the Topology.]
A. Computing Time
First, we compare the computational efficiency of the different algorithms. For each network topology, we let one link change its cost at a time and execute each algorithm on each node. This procedure is repeated for a randomly selected 30% of the links, and the final result for each algorithm is its computation time averaged over all nodes and all selected link changes. The results on the real and inferred topologies are listed in Table III, while the results on the synthetic topologies are shown in Fig. 3. Note that DMPA-i dynamically updates its next-hops, while all other algorithms, including DMPA-f (the static version of our algorithm), have to recompute from scratch.
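The measurement procedure above can be sketched as a small timing harness. This is an illustration under our own assumptions, not the authors' test code. The kind of cost change (multiplying the link cost by a factor), the function name average_time_per_change, and the graph format are hypothetical. The harness only fits algorithms that recompute from scratch; DMPA-i would additionally need the changed link and its old cost.

```python
import random
import time


def average_time_per_change(graph, compute, fraction=0.3, factor=2, seed=0):
    """Average run time of compute(graph, node) per node and per link-cost change.

    graph is {node: {neighbor: cost}} (undirected, symmetric). For a random
    `fraction` of the links, one link at a time has its cost multiplied by
    `factor` (an assumed change), every node reruns `compute`, and the link
    cost is restored afterwards.
    """
    rng = random.Random(seed)
    links = sorted({tuple(sorted((u, v))) for u in graph for v in graph[u]})
    selected = rng.sample(links, max(1, int(fraction * len(links))))
    total, runs = 0.0, 0
    for u, v in selected:
        old = graph[u][v]
        graph[u][v] = graph[v][u] = old * factor  # one link changes its cost
        for node in graph:
            start = time.perf_counter()
            compute(graph, node)
            total += time.perf_counter() - start
            runs += 1
        graph[u][v] = graph[v][u] = old  # restore before the next change
    return total / runs
```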
Our dynamic algorithm DMPA-i has a clear advantage in all cases, running nearly an order of magnitude faster than all the other algorithms. For example, DMPA-i takes on average only 0.65 µs to handle a link cost change in the Sprint network, while the fastest of the others, ECMP, takes 7.51 µs.
Fig. 3(a) shows how the computing time increases with the topology size, using synthetic topologies with an average node degree of four, while Fig. 3(b) and 3(c) show the computing time for larger average node degrees, on synthetic topologies of 200 and 1000 nodes, respectively. The speed of DMPA-f is comparable to that of ECMP in all cases. This is expected, since DMPA-f constructs only a single SPT and computes multiple next-hops based on this tree. On average, DMPA-f is 20% slower than ECMP, while the results in [9] show that TBFH is around 50% slower than ECMP, suggesting that DMPA-f needs about 20% less time than TBFH.¹⁰ Rule1 runs much slower than both, especially when the average node degree is large, since it constructs a tree for each neighbor. Our dynamic DMPA-i runs nearly an order of magnitude faster than all of them, since in most cases only a small portion of the tree needs to be adjusted. Hence, in the face of topology changes, DMPA-i consumes much less of the computing resources, which are already scarce on today's core routers.
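The comparison between DMPA-f and TBFH above is indirect, with ECMP as the common reference. Writing T for the average computing time of each algorithm, and assuming (as footnote 10 suggests) that the ECMP times measured here and in [9] are comparable, the estimate works out as:

```latex
T_{\text{DMPA-f}} \approx 1.2\,T_{\text{ECMP}}, \qquad
T_{\text{TBFH}} \approx 1.5\,T_{\text{ECMP}}
\;\Longrightarrow\;
\frac{T_{\text{DMPA-f}}}{T_{\text{TBFH}}} \approx \frac{1.2}{1.5} = 0.8
```

DMPA-f therefore needs roughly 20% less time than TBFH. Measured against DMPA-f's own time, TBFH is about 25% slower (1.5/1.2 = 1.25).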
¹⁰ The time taken by ECMP in these figures is similar to that reported in [9], so it can be used as a reference point.
TABLE III: Computation Time for Real Topologies (computing time in µs)
Network   Type       ECMP    Rule1    DMPA-f   DMPA-i
Abilene   Real       6.82    7.27     6.82     0.32
Sprint    Measured   7.51    7.96     7.55     0.65
Exodus    Measured   44.36   128.29   61.23    7.12
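The speedup of DMPA-i can be read directly off Table III. The short script below is our own illustration; the dictionary layout and the function name are hypothetical. For each row, it divides the time of the fastest algorithm that recomputes from scratch by the time of DMPA-i:

```python
# Average computing times from Table III, in microseconds.
TABLE_III = {
    "Abilene": {"ECMP": 6.82, "Rule1": 7.27, "DMPA-f": 6.82, "DMPA-i": 0.32},
    "Sprint": {"ECMP": 7.51, "Rule1": 7.96, "DMPA-f": 7.55, "DMPA-i": 0.65},
    "Exodus": {"ECMP": 44.36, "Rule1": 128.29, "DMPA-f": 61.23, "DMPA-i": 7.12},
}


def speedup_of_dmpa_i(row):
    """Time of the fastest from-scratch algorithm divided by DMPA-i's time."""
    fastest_other = min(t for name, t in row.items() if name != "DMPA-i")
    return fastest_other / row["DMPA-i"]


for net, row in TABLE_III.items():
    print(f"{net}: {speedup_of_dmpa_i(row):.1f}x")
```

On these three topologies this gives ratios of about 21x (Abilene), 12x (Sprint) and 6x (Exodus), with ECMP as the fastest competitor in every row.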
B. Reliability
One primary motivation of multipath routing is to provide redundant and diverse paths, so that when a link fails, a new path avoiding this link can quickly be found, improving network reliability. To demonstrate this capability, we define the disconnect fraction as the ratio of the number of disconnected source-destination pairs to the number of all source-destination pairs, when each link fails independently with probability p. Here, for a given routing algorithm, a pair is connected if there exists a forwarding path from the source to the destination over the surviving links, built from any of the next-hops computed by this algorithm. A smaller disconnect fraction means better reliability.
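The definition can be turned into a Monte Carlo estimate. The sketch below is our illustration, not the evaluation code. In each sample, links fail independently with probability p. A pair counts as connected if the destination is reachable from the source by following next-hops over surviving links. The result is averaged over `samples` failure draws. The function name and the data layout are hypothetical.

```python
import random


def disconnect_fraction(links, next_hops, p, samples=100, seed=0):
    """Monte Carlo estimate of the disconnect fraction.

    links: iterable of (u, v) undirected links.
    next_hops[s][d]: next-hops computed by some algorithm at s towards d.
    """
    rng = random.Random(seed)
    links = [frozenset(link) for link in links]
    nodes = list(next_hops)
    ratio_sum = 0.0
    for _ in range(samples):
        alive = {l for l in links if rng.random() >= p}  # each link fails w.p. p
        disconnected = total = 0
        for s in nodes:
            for d in nodes:
                if s == d:
                    continue
                total += 1
                # Depth-first search over next-hop edges whose links survived.
                seen, stack = {s}, [s]
                while stack and d not in seen:
                    u = stack.pop()
                    for n in next_hops[u].get(d, ()):
                        if n not in seen and frozenset((u, n)) in alive:
                            seen.add(n)
                            stack.append(n)
                disconnected += d not in seen
        ratio_sum += disconnected / total
    return ratio_sum / samples
```

Because any next-hop may be used, this measures whether a usable path exists at all. The fast-recovery experiment in Section C measures whether one is actually found.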
Fig. 4 shows the disconnect fraction achieved by each algorithm on Abilene, Sprint and Exodus. Since the static and dynamic versions of DMPA produce exactly the same results, we do not distinguish them here. As the link failure probability p increases from 0.01 to 0.1, the reliability of ECMP decreases very quickly. DMPA achieves a slightly larger (but very close) disconnect fraction than Rule1, since it uses a slightly stricter rule than Rule1. For example, when p = 0.1, the disconnect fractions of ECMP, Rule1 and DMPA in Exodus are 91.66%, 35.23% and 30.23%, respectively. As shown in [9], TBFH also achieves a slightly worse (but comparable) result than Rule1, since it also uses a stricter rule.¹¹
The disconnect fraction results on synthetic topologies of different sizes are shown in Fig. 5. The curves follow a trend similar to those in Fig. 4: ECMP's reliability decreases much faster on larger topologies, while Rule1 and DMPA find many more redundant paths and provide better reliability.
¹¹ In [9], a metric called coverage is used to measure path diversity. It is defined as the ratio of the number of s-d pairs whose source has at least one valid alternate next-hop to the number of all s-d pairs. Although coverage does not guarantee a valid path, we believe it is reasonable to conjecture that two algorithms that achieve similar coverage also achieve similar disconnect fractions.
Fig. 4: Reliability on Abilene, Sprint and Exodus (panels: (a) Abilene, (b) Sprint, (c) Exodus; x-axis: probability of link failure; y-axis: disconnect fraction; curves: ECMP, Rule1, DMPA)
Fig. 5: Reliability for Different Topology Sizes (panels: (a) p=0.01, (b) p=0.05, (c) p=0.1, all with average node degree = 4; x-axis: topology size; y-axis: disconnect fraction; curves: ECMP, Rule1, DMPA)
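Footnote 11 relates the disconnect fraction to the coverage metric of [9]. Below is a minimal sketch of that metric. It assumes that an "alternate" next-hop is any valid next-hop other than the source's primary shortest-path next-hop; this is our reading, not necessarily the exact definition in [9]. The function and argument names are hypothetical.

```python
def coverage(next_hops, primary):
    """Share of (s, d) pairs whose source has at least one valid next-hop
    other than its primary (shortest-path) next-hop.

    next_hops[s][d]: set of valid next-hops of s towards d.
    primary[s][d]:   the primary next-hop of s towards d.
    """
    pairs = [(s, d) for s in next_hops for d in next_hops[s]]
    covered = sum(1 for s, d in pairs if next_hops[s][d] - {primary[s][d]})
    return covered / len(pairs)
```

Unlike the disconnect fraction, coverage only inspects the source's own next-hop set and never follows the path. This is why the footnote treats the link between the two metrics as a conjecture.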
We also investigate how reliability changes as the average node degree of a topology increases, as shown in Fig. 6, using synthetic topologies with 200 nodes. When nodes have more neighbors, the disconnect fraction decreases. Rule1 and DMPA still provide comparable performance, and both are much better than ECMP.
These results suggest that our dynamic algorithm can provide reliability close to that of the best algorithm compared (Rule1), while running nearly an order of magnitude faster than all the other algorithms.
C. Fast Recovery
The disconnect fraction results in the previous section show the theoretically best reliability a multipath algorithm can achieve. In practice, however, even if a path to the destination exists when some links fail, it may not be effectively used, since each router selects its next-hop independently, not necessarily along a path that reaches the destination. In this section, we therefore use a simple forwarding scheme to test how well these algorithms work in practice.
Assume that each node has computed its next-hop set for a given topology. When some links fail, we let the nodes enter a recovery mode, in which each node randomly chooses one next-hop from its next-hop set. Such a scheme can be easily implemented by having the end system embed a control bit in the packet, without explicitly coordinating all routers. We call the process of forwarding packets along such a randomly built path, until either the destination is reached or no next-hop is available, a trial. A source-destination pair is considered connected when the destination can be reached within five trials; the corresponding disconnect fraction results are shown in Fig. 7 and Fig. 8. Due to space limitations, results on synthetic topologies with different average node degrees are not included, but all results are very similar to those in the previous section. This indicates that a simple forwarding scheme can effectively let DMPA achieve reasonable reliability.
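A trial and the five-trial connectivity test can be sketched as below. The assumptions are ours, not the text's: a node picks only among next-hops whose outgoing link is up, and a hop limit guards against loops in arbitrary next-hop sets (DMPA's own sets are loop-free). The names trial and recovered are hypothetical, and the control-bit mechanism is not modeled.

```python
import random


def trial(s, d, next_hops, alive, rng, max_hops=64):
    """One trial: every node on the way picks a random next-hop whose link is up."""
    u = s
    for _ in range(max_hops):
        if u == d:
            return True
        usable = [n for n in sorted(next_hops[u].get(d, ()))
                  if frozenset((u, n)) in alive]
        if not usable:
            return False  # no next-hop available: this trial fails
        u = rng.choice(usable)
    return u == d  # hop limit reached


def recovered(s, d, next_hops, alive, trials=5, seed=0):
    """(s, d) counts as connected if any of `trials` independent trials reaches d."""
    rng = random.Random(seed)
    return any(trial(s, d, next_hops, alive, rng) for _ in range(trials))
```

Sorting the candidate next-hops before rng.choice only makes runs reproducible for a fixed seed; it does not bias the choice.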
D. Partial Deployment
DMPA is compatible with today's link-state routing protocols, such as OSPF and IS-IS [25], so it can be partially deployed on only a portion of the nodes. However, the choice of nodes on which to start the deployment affects the resulting network reliability. We test three simple strategies, namely, selecting the highest-degree nodes, the lowest-degree nodes, or random nodes. In Fig. 9, we present the disconnect fraction achieved on the Exodus topology. It can be seen that starting with the high-degree nodes is the most cost-effective: a deployment on only 20% of the nodes already reduces the disconnect fraction clearly. Results on other topologies give similar indications, and are omitted due to space limitations. This is reasonable, since high-degree nodes may find more next-hops to the destinations.
Fig. 6: Reliability for Different Average Node Degrees (panels: (a) p=0.01, (b) p=0.05, (c) p=0.1, all with topology size = 200; x-axis: average node degree; y-axis: disconnect fraction; curves: ECMP, Rule1, DMPA)
Fig. 7: Recovery Results on Abilene, Sprint and Exodus (panels: (a) Abilene, (b) Sprint, (c) Exodus; x-axis: probability of link failure; y-axis: disconnect fraction; curves: ECMP, Rule1, DMPA)
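The three deployment strategies of Section D can be sketched as follows. This is our illustration, not the authors' code. The function names are hypothetical, degree is taken as the number of neighbors, and ties are broken by node name. We also assume that nodes not running DMPA keep their ordinary ECMP next-hop sets; this fits ECMP serving as the baseline curve in Fig. 9, but the text does not state it.

```python
import random


def select_deployment(graph, fraction, strategy, seed=0):
    """Pick the nodes that run DMPA: strategy is "high", "low" or "random"."""
    k = round(fraction * len(graph))
    nodes = sorted(graph)  # deterministic tie-breaking by node name
    if strategy == "high":
        ranked = sorted(nodes, key=lambda n: len(graph[n]), reverse=True)
    elif strategy == "low":
        ranked = sorted(nodes, key=lambda n: len(graph[n]))
    elif strategy == "random":
        ranked = random.Random(seed).sample(nodes, len(nodes))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return set(ranked[:k])


def mixed_next_hops(deployed, dmpa_hops, ecmp_hops):
    """Deployed nodes use DMPA next-hop sets; the others keep ECMP (assumption)."""
    return {n: (dmpa_hops if n in deployed else ecmp_hops)[n] for n in ecmp_hops}
```

The mixed next-hop table can then be fed to any of the reliability estimates above to compare the strategies.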
V. CONCLUSION
In this paper, we propose a shortest-path-tree-based multipath routing algorithm called DMPA. We carefully define the next-hop contribution rule for computing multiple next-hops, and prove that no loop is introduced when this distributed algorithm is executed independently on each router. DMPA not only avoids the overhead of computing multiple shortest path trees, but also dynamically handles link state changes, so that the next-hops can be incrementally updated rather than recomputed from scratch. As a result, it runs much faster than the other multipath algorithms and consumes few of the computing resources that are scarce on today's routers. DMPA effectively increases network reliability, can help fast recovery with a simple forwarding scheme, and can be partially deployed in the network. We believe DMPA provides a basic mechanism on which the network can be made more efficient and reliable.
VI. ACKNOWLEDGMENT
We are grateful to the anonymous ICNP reviewers for their insightful comments. This work is supported by the National Basic Research Program of China (973 Program) under Grant No. 2009CB320502 and the National Natural Science Foundation of China (Grant No. 61272446).
Fig. 8: Recovery Results for Different Topology Sizes (panels: (a) p=0.01, (b) p=0.05, (c) p=0.1, all with average node degree = 4; x-axis: topology size; y-axis: disconnect fraction; curves: ECMP, Rule1, DMPA)
Fig. 9: Incremental Deployment for the Exodus Topology (panels: (a) low node degree, (b) random, (c) high node degree; x-axis: probability of link failure; y-axis: disconnect fraction; curves: ECMP, DMPA(10%), DMPA(20%), DMPA(50%), DMPA(100%))
REFERENCES
[1] G. Iannaccone, C.-N. Chuah, R. Mortier, S. Bhattacharyya, and C. Diot, "Analysis of link failures in an IP backbone," in Proc. of the Internet Measurement Workshop. ACM, 2002, pp. 237-242.
[2] X. Yang and D. Wetherall, "Source selectable path diversity via routing deflections," in SIGCOMM. ACM, 2006, pp. 159-170.
[3] M. Motiwala, M. Elmore, N. Feamster, and S. Vempala, "Path splicing," in SIGCOMM, 2008, pp. 27-38.
[4] J. Chen, S. Chan, and V. Li, "Multipath routing for video delivery over bandwidth-limited networks," IEEE Journal on Selected Areas in Communications, vol. 22, no. 10, pp. 1920-1932, 2004.
[5] S. Vutukury and J. Garcia-Luna-Aceves, "MPATH: a loop-free multipath routing algorithm," Microprocessors and Microsystems, vol. 24, no. 6, pp. 319-327, 2000.
[6] J. He and J. Rexford, "Toward internet-wide multipath routing," IEEE Network, vol. 22, no. 2, pp. 16-21, 2008.
[7] S. Gjessing, "Implementation of two resilience mechanisms using multi topology routing and stub routers," in Telecommunications, 2006. AICT-ICIW'06. IEEE, 2006, pp. 29-29.
[8] G. Apostolopoulos, "Using multiple topologies for IP-only protection against network failures: A routing performance perspective," ICS-FORTH, Greece, Tech. Rep., 2006.
[9] P. Merindol, P. Francois, O. Bonaventure, S. Cateloin, and J. J. Pansiot, "An efficient algorithm to enable path diversity in link state routing networks," Comput. Netw., vol. 55, no. 5, pp. 1132-1149, Apr. 2011.
[10] J. Moy, "RFC 2328: OSPF version 2," Internet Society (ISOC), 1998.
[11] G. Lee and J. Choi, "A survey of multipath routing for traffic engineering," Information and Communications University, Korea, 2002.
[12] D. Andersen, H. Balakrishnan, F. Kaashoek, and R. Morris, "Resilient overlay networks," ACM SIGCOMM Computer Communication Review, vol. 32, no. 1, pp. 66-66, 2002.
[13] H. Palakurthi, "Study of multipath routing for QoS provisioning," EECS, 2001.
[14] P. M. Spira and A. Pan, "On finding and updating spanning trees and shortest paths," SIAM Journal on Computing, vol. 4, no. 3, pp. 375-380, 1975.
[15] P. G. Franciosa, D. Frigioni, and R. Giaccio, "Semi-dynamic shortest paths and breadth-first search in digraphs," in STACS 97. Springer, 1997, pp. 33-46.
[16] P. Narvaez, K. Siu, and H. Tzeng, "New dynamic algorithms for shortest path tree computation," IEEE/ACM Transactions on Networking (TON), vol. 8, no. 6, pp. 734-746, 2000.
[17] P. Narvaez, K. Siu, and H. Tzeng, "New dynamic SPT algorithm based on a ball-and-string model," IEEE/ACM Transactions on Networking (TON), vol. 9, no. 6, pp. 706-718, 2001.
[18] P. Narvaez, K.-Y. Siu, and H.-Y. Tzeng, "Efficient algorithms for multi-path link-state routing," in ISCOM'99, 1999.
[19] J. T. Moy, OSPF: Anatomy of an Internet Routing Protocol. Addison-Wesley Professional, 1998.
[20] A. INCITS, "ISO/IEC 10589, Information technology - Telecommunications and information exchange between systems - Intermediate System to Intermediate System intra-domain routeing information exchange protocol for use in conjunction with the protocol for providing the connectionless-mode network service (ISO 8473)," 2002.
[21] E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, vol. 1, no. 1, pp. 269-271, 1959.
[22] M. L. Fredman and R. Tarjan, "Fibonacci heaps and their uses in improved network optimization algorithms," 25th Annual Symposium on Foundations of Computer Science (IEEE), pp. 338-346, 1984.
[23] N. Spring, R. Mahajan, and D. Wetherall, "Measuring ISP topologies with Rocketfuel," ACM SIGCOMM Computer Communication Review, vol. 32, no. 4, pp. 133-145, 2002.
[24] A. Medina, A. Lakhina, I. Matta, and J. Byers, "BRITE: An approach to universal topology generation," in Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2001. Proceedings. Ninth International Symposium on. IEEE, 2001, pp. 346-353.
[25] R. Perlman, "A comparison between two routing protocols: OSPF and IS-IS," IEEE Network, vol. 5, no. 5, pp. 18-24, 1991.
