Vinay Pai, Kapil Kumar, Karthik Tamilmani, Vinay Sambamurthy, Alexander E. Mohr
Department of Computer Science
Stony Brook University
{vinay,kkumar,tamilman,vsmurthy,amohr}@cs.stonybrook.edu
2.2 Bullet

Bullet [11] is another high-bandwidth data dissemination method. It aims to provide nodes with a steady flow of data at a high rate. A Bullet network consists of a tree with a mesh overlaid on top of it.

The data stream is divided into blocks, which are further divided into packets. Nodes transmit a disjoint subset of the packets to each of their children. An algorithm called RanSub [10] distributes random, orthogonal subsets of nodes every epoch to each node participating in the overlay. Nodes receive a subset of the data from their parents and recover the remainder by locating a set of disjoint peers using these random subsets.

2.3 Gossip-based Broadcast

Gossip protocols provide a scalable option for large-scale information dissemination. Pcast [2] is a two-phase protocol in which the exchange of periodic digests takes place independently of the data dissemination. Lpbcast [8] extends pcast in that it requires nodes to have only partial membership information.

2.4 BitTorrent

The BitTorrent [7] file sharing protocol creates an unstructured overlay mesh to distribute a file. Files are divided into discrete pieces. Peers that have a complete copy of the file are called seeds. Interested peers join this overlay to download pieces of the file. The protocol is pull-based in that peers must request a piece in order to download it. Peers may obtain pieces either directly from the seed or exchange pieces with other peers.

3 System Description

We built a request-response based high-bandwidth data dissemination protocol drawing upon gossip-based protocols and BitTorrent. The source node, called a seed, generates a series of new packets with monotonically increasing sequence numbers. If desired, one could easily have multiple seeds scattered throughout the network; in this paper we assume that there is only one seed in the system. We could also support many-to-many multicast applications by replacing the sequence number with a (stream-id, sequence #) tuple. However, for the applications we describe in this paper, a single sender and an integer sequence number suffice.

Every peer connects to a set of nodes that we call its neighbors. Peers only maintain state about their neighbors; the main piece of information they maintain is a list of the packets that each neighbor has. When a peer receives a packet, it sends a NOTIFY message to its neighbors. The seed does not download packets, but it sends out NOTIFY messages whenever it generates new packets.

Every peer maintains a window of interest, which is the range of sequence numbers that the peer is interested in acquiring at the current time. It also maintains, and informs its neighbors about, a window of availability, which is the range of packets that it is willing to upload to its neighbors. The window of availability will typically be larger than the window of interest.

For every neighbor, a peer creates a list of desired packets, i.e. a list of packets that the peer wants and that are in the neighbor's window of availability. It then applies some strategy to pick one or more packets from the list and requests them via a REQUEST message. Currently, we simply pick packets at random, but more intelligent strategies may yield further improvements (see Section 5.2).

A peer keeps track of which packets it has requested from each neighbor and ensures that it does not request the same packet from multiple neighbors. It also limits the number of outstanding requests with a given neighbor, to ensure that requests are spread out over all neighbors. Nodes keep track of requests from their neighbors and send the corresponding packets as bandwidth allows.

The algorithms that nodes use to manipulate their windows and to decide when to pass data up to the application layer are determined by the specific requirements of the end application. For example, if the application does not require strict ordering, data may be passed up as soon as it is received. If, on the other hand, order must be preserved, data is passed up as soon as a contiguous block is available.

For the experiments outlined in this paper, we built our graph by having every node repeatedly connect to a randomly picked node from the set of known hosts until it was connected to a specified minimum number of neighbors. Our system does not rely on any specific topology, however; we could also use other membership protocols like those in BitTorrent [7] or Gnutella [1].

For the remainder of this paper, we assume that the application is similar to live streaming. The seed generates new packets at a constant rate that we refer to as the stream rate. Nodes maintain a window of interest of a constant size and slide it forward at a rate equal to the stream rate. If a packet has not been received by the time it "falls off" the trailing edge of the window, the node considers that packet lost and no longer tries to acquire it.

During our initial investigations, we observed that some packets were never requested from the seed until several seconds after they were generated. As a result, those packets would not propagate to all the nodes in time, resulting in packet loss. This is an artifact of picking pieces to request at random and independently from each neighbor, which can leave some pieces unrequested even when that neighbor is the seed.

We fixed this problem with an algorithm called Request Overriding. The seed maintains a list of packets that have never been uploaded. If the list is not empty and the seed receives a request for a packet that is not on the list, the seed ignores the sequence number requested, sends the oldest packet on the list instead, and deletes that packet from the list. This algorithm ensures that at least one copy of every packet is uploaded quickly, and that the seed does not spend its upload bandwidth on packets that could be obtained from other peers unless it has spare bandwidth available.
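The per-neighbor bookkeeping and the seed's Request Overriding rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names, and the outstanding-request cap of 4, are our own assumptions.

```python
import random

# Illustrative cap on in-flight requests per neighbor; the paper does not
# give a specific value, so this number is an assumption.
MAX_OUTSTANDING = 4


class Peer:
    """Request-side bookkeeping of a Chainsaw peer (sketch)."""

    def __init__(self, window_of_interest):
        self.window_of_interest = set(window_of_interest)  # seqnos still wanted
        self.have = set()          # packets already received
        self.neighbor_has = {}     # neighbor -> seqnos learned via NOTIFY
        self.requested = {}        # neighbor -> outstanding REQUESTed seqnos

    def on_notify(self, neighbor, seqno):
        self.neighbor_has.setdefault(neighbor, set()).add(seqno)

    def pick_requests(self, neighbor):
        """Build the desired-packet list for one neighbor, then pick at random."""
        outstanding = self.requested.setdefault(neighbor, set())
        # Never request a packet that is already outstanding at any neighbor.
        pending = set().union(*self.requested.values())
        desired = (self.neighbor_has.get(neighbor, set())
                   & self.window_of_interest) - self.have - pending
        picks = []
        while desired and len(outstanding) < MAX_OUTSTANDING:
            seqno = random.choice(tuple(desired))   # random picking strategy
            desired.remove(seqno)
            outstanding.add(seqno)
            picks.append(seqno)                     # sent as REQUEST messages
        return picks


class Seed:
    """The seed's Request Overriding rule (sketch)."""

    def __init__(self):
        self.never_uploaded = []   # packets never uploaded, oldest first

    def on_generate(self, seqno):
        self.never_uploaded.append(seqno)

    def on_request(self, seqno):
        """Return the sequence number actually uploaded for a request."""
        if self.never_uploaded and seqno not in self.never_uploaded:
            # Override: send the oldest never-uploaded packet instead.
            return self.never_uploaded.pop(0)
        if seqno in self.never_uploaded:
            self.never_uploaded.remove(seqno)
        return seqno
```

Under this sketch, a request for a packet the seed has already uploaded is answered with the oldest never-uploaded packet, so every packet's first copy leaves the seed quickly.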
Figure 1: The seed's upload rate and the average upload and download rate for all other nodes.

Figure 2: A plot of the highest sequence number of contiguous data downloaded by a typical node as a function of time. The diagonal line on top (dashed) represents the new pieces generated by the seed, while the bottom line (dotted) represents the trailing edge of the node's buffer.

In most cases, it is better to have the seed push out new packets quickly, but there are situations when Request Overriding is undesirable. For example, a packet may be very old and in danger of being lost. Therefore, REQUEST packets could carry a bit that tells the seed to disable Request Overriding. We have not yet implemented this bit in our simulator or prototype.
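One possible encoding of such a flag, offered purely as a hypothetical illustration (the paper leaves the bit unimplemented, and this wire format is our own invention, not part of the protocol):

```python
import struct

NO_OVERRIDE = 0x01  # low flag bit: ask the seed not to apply Request Overriding

def pack_request(seqno, no_override=False):
    """Encode a hypothetical REQUEST: 4-byte sequence number + 1 flags byte."""
    flags = NO_OVERRIDE if no_override else 0
    return struct.pack("!IB", seqno, flags)

def unpack_request(data):
    """Decode the hypothetical REQUEST format back into (seqno, no_override)."""
    seqno, flags = struct.unpack("!IB", data)
    return seqno, bool(flags & NO_OVERRIDE)
```

A single spare bit suffices because the seed only needs a yes/no signal per request; the rest of the flags byte stays free for future use.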
4 Experimental Results

We built a discrete-time simulator to evaluate our system and run experiments on large networks. Using it, we were able to simulate 10,000-node networks. We also built a working prototype.
Figure 4: The bold line shows the behavior of a new node joining at 50 sec, contrasted with a node that has been in the system since the start of the experiment.

Figure 6: Observed bandwidth trends when 50% of the nodes are simultaneously failed at 50 seconds.

4.3 Resilience to Catastrophic Failure

We believe that Chainsaw is resilient to node failure because all a node has to do to recover from the failure of a neighbor is to redirect packet requests from that neighbor to a different one. We simulated a catastrophic event by simultaneously failing half the nodes in the network at the 50-second mark (Figures 6 and 7).
Figure 7: Progress of a non-failing node when 50% of the nodes in the network simultaneously fail at 50 seconds. All effects of the catastrophic event are eliminated within 5 seconds.

Figure 8: Useful and duplicate data rates for Chainsaw, Bullet and SplitStream as a function of time from our PlanetLab experiment. 50% of the nodes in the system were killed at the 180-second mark.
Each node maintained a list of known peers in addition to its neighbor list. When a neighbor disappears, the node picks a replacement randomly from the known-peers list and repeats this process until it has a sufficient number of neighbors. We expect such a mechanism to be robust, even with high rates of churn.

4.4 PlanetLab: Bullet and SplitStream

In order to compare Chainsaw against Bullet [11] and SplitStream [3], we used the Macedon [13] prototyping tool, developed by the authors of Bullet. Macedon allows one to specify the high-level behavior of a system while letting it take care of the implementation details. The Macedon distribution already includes implementations of Bullet and SplitStream, so we implemented our protocol in their framework to allow a fair comparison between these systems.

We conducted our experiments on the PlanetLab [6] test-bed, using 174 nodes with good connectivity and a large memory capacity. For each of the three protocols, we deployed the application, allowed time for it to build the network, and then streamed 600 kbits/sec (75 kB/sec) over it for 360 sec. Halfway into the streaming, at the 180-second mark, we killed off half the nodes to simulate catastrophic failure.

Figure 8 shows the average download rate achieved by the non-failing nodes before and after the event. Initially, both Chainsaw and Bullet achieved the target bandwidth of 75 kB/sec. However, after the nodes failed, Bullet's bandwidth dropped by 30% to 53 kB/sec and it took 14 seconds to recover, while Chainsaw continued to deliver data at 75 kB/sec with no interruption. SplitStream delivered 65 kB/sec initially, but the bandwidth dropped to 13 kB/sec after the failure event.

In SplitStream, every node is an interior node in one of the trees, so it is possible for a node with insufficient upload bandwidth to become a bottleneck. When a large number of nodes fail, every tree is likely to lose a number of interior nodes, resulting in a severe reduction in bandwidth. Macedon is still a work in progress and its authors have not fully implemented SplitStream's recovery mechanisms. Once implemented, we expect SplitStream's bandwidth to return to its original level a few seconds after a failure, once the trees are repaired. Therefore, we ignore SplitStream's packet loss and focus on comparing Chainsaw to Bullet for now.

The packet loss rates for both Chainsaw and Bullet were unaffected by the catastrophic failure. With Chainsaw, 73 of the 76 non-failing nodes had no packet loss at all. One of the nodes had a consistent loss rate of nearly 60% throughout the experiment, whereas two others had brief bursts of packet loss over intervals spanning a few seconds. With Bullet, every node consistently suffered some packet loss. The overall packet loss for various nodes varied from 0.88% to 3.64%, with a mean of 1.30%.

With Chainsaw, nodes did receive a small number of duplicate packets due to spurious timeouts. However, the duplicate data rate rarely exceeded 1%. With Bullet, on the other hand, nodes consistently received 5-10% duplicate data, resulting in wasted bandwidth.

We think that the improved behavior that Chainsaw exhibits is primarily due to its design assumption that, in the common case, most of a peer's neighbors will eventually receive most packets. When combined with the direct exchange of "have" information, Chainsaw is able to locate and request packets it does not yet have within a few RTTs, whereas Bullet's propagation of such information is divided into epochs spanning multiple seconds and depends on the RanSub tree remaining relatively intact. As a result, Chainsaw has near-zero packet loss, minimal duplicates and low delay.

5 Future Work

In our experiments we used symmetric links, so that aggregate upload bandwidth was sufficient for every node to receive the broadcast at the streaming rate. If large numbers of nodes have upload capacities less than the streaming rate, as might be the case with ADSL or cable modem users, users might experience packet loss. Further work is needed to allocate bandwidth when insufficient capacity exists. Also, we have not demonstrated experimentally that Chainsaw performs well under high rates of churn, although we expect that with its pure mesh architecture, churn will not be a significant problem.

5.1 Incentives

So far, we have assumed that nodes are cooperative, in that they willingly satisfy their neighbors' requests. However, studies [1, 15] have shown that large fractions of nodes in peer-to-peer networks can be leeches, i.e. they try to benefit from the system without contributing. Chainsaw is very similar in design to our unstructured file-transfer system SWIFT [16]. Therefore, we believe that we can adapt SWIFT's pairwise currency system to ensure that nodes that do not contribute are the ones penalized when the total demand for bandwidth exceeds the total supply.

5.2 Packet Picking Strategy

Currently, nodes use a purely random strategy to decide which packets to request from their neighbors. We find that this strategy works well in general, but there are pathological cases where problems occur. For example, a node will give the same importance to a packet that is in danger of being delayed beyond its deadline as to one that has just entered its window of interest. As a result, it may pick the new packet instead of the old one, resulting in packet loss.

We may be able to eliminate these pathological cases and improve system performance by picking packets to request more intelligently. Possibilities include taking into account the rarity of a packet in the system, the age of the packet, and its importance to the application. Some applications may assign greater importance to some parts of the stream than others. For example, lost metadata packets may be far more difficult to recover from than lost data packets.

6 Conclusion

We built a pull-based peer-to-peer streaming network on top of an unstructured topology. Through simulations, we demonstrated that our system was capable of disseminating data at a high rate to a large number of peers with no packet loss and extremely low duplicate data rates. We also showed that a new node could start downloading and begin playback within a fraction of a second after joining the network, making it highly suitable for applications like on-demand media streaming. Finally, we showed that our system is robust to catastrophic failure: a vast majority of the nodes were able to download data with no packet loss even when half the nodes in the system failed simultaneously.

So far we have only investigated behavior in a cooperative environment. However, Chainsaw is very similar in its design to the SWIFT [16] incentive-based file-trading network. Therefore, we believe that we will be able to adapt SWIFT's economic incentive model to streaming, allowing our system to work well in non-cooperative environments.

Acknowledgements

We would like to thank Dejan Kostić and Charles Killian for helping us out with MACEDON.

References

[1] E. Adar and B. A. Huberman. Free Riding on Gnutella. First Monday, 5(10), Oct 2000.

[2] K. P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal multicast. ACM Trans. Comput. Syst., 1999.

[3] M. Castro, P. Druschel, A. Kermarrec, A. Nandi, A. Rowstron, and A. Singh. SplitStream: High-Bandwidth Multicast in Cooperative Environments. In SOSP, 2003.

[4] M. Castro, P. Druschel, A. Kermarrec, and A. Rowstron. SCRIBE: A large-scale and decentralized application-level multicast infrastructure. IEEE JSAC, 2002.

[5] Y. Chu, S. G. Rao, and H. Zhang. A case for end system multicast. In Measurement and Modeling of Computer Systems, 2000.

[6] B. Chun, D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, and M. Bowman. PlanetLab: an overlay testbed for broad-coverage services. SIGCOMM Computer Communication Review, 2003.

[7] B. Cohen. BitTorrent, 2001. http://www.bitconjurer.org/BitTorrent/.

[8] P. Eugster, R. Guerraoui, S. B. Handurukande, P. Kouznetsov, and A. Kermarrec. Lightweight probabilistic broadcast. ACM Trans. Comput. Syst., 2003.

[9] J. Jannotti, D. K. Gifford, K. L. Johnson, M. F. Kaashoek, and J. O'Toole, Jr. Overcast: Reliable multicasting with an overlay network. In OSDI, 2000.

[10] D. Kostić, A. Rodriguez, J. Albrecht, A. Bhirud, and A. Vahdat. Using random subsets to build scalable network services. In USENIX USITS, 2003.

[11] D. Kostić, A. Rodriguez, J. Albrecht, and A. Vahdat. Bullet: high bandwidth data dissemination using an overlay mesh. In SOSP, 2003.

[12] S. Ratnasamy, M. Handley, R. M. Karp, and S. Shenker. Application-level multicast using content-addressable networks. In Workshop on Networked Group Communication, 2001.

[13] A. Rodriguez, C. Killian, S. Bhat, D. Kostić, and A. Vahdat. MACEDON: Methodology for Automatically Creating, Evaluating, and Designing Overlay Networks. In NSDI, 2004.

[14] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms, 2001.

[15] S. Saroiu, P. K. Gummadi, and S. D. Gribble. A measurement study of peer-to-peer file sharing systems. In Proceedings of Multimedia Computing and Networking, 2002.

[16] K. Tamilmani, V. Pai, and A. E. Mohr. SWIFT: A system with incentives for trading. In Second Workshop on the Economics of Peer-to-Peer Systems, 2004.