Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan
ACM SIGCOMM 2010 conference
We focus on soft real-time applications supporting:
Web search
Retail
Advertising
Recommendation systems
Reducing network latency allows application developers to invest more cycles in the algorithms that improve relevance and end user experience.
For example, in data center networks, round trip times (RTTs) can be less than 250 µs in the absence of queuing.
Partition/Aggregate
[Figure: the partition/aggregate application structure. A top-level aggregator (TLA) with a strict deadline (SLA) of 250 ms fans a request out to mid-level aggregators (MLAs, deadline = 50 ms), which send iterative queries to worker nodes (deadline = 10 ms). Answers are aggregated back up the tree; a missed deadline means a worker's answer is left out of the final result. The example request collects Picasso quotes, e.g. "Art is a lie that makes us realize truth."]
Data Collection
The measurements reveal that 99.91% of traffic in our data center is TCP traffic.
Our key learning from these measurements is that to meet the requirements of such a diverse mix of short and long flows, switch buffer occupancies need to be persistently low, while maintaining high throughput for the long flows.
Workloads
Partition/Aggregate query traffic: 2 KB to 20 KB in size; delay-sensitive
Short messages: delay-sensitive
Background large flows: throughput-sensitive
Switches
Most commodity switches in clusters are shared-memory switches that aim to exploit statistical multiplexing gain through a logically common packet buffer available to all switch ports.
Packets arriving on an interface are stored in a high-speed multi-ported memory shared by all the interfaces.
Memory from the shared pool is dynamically allocated to a packet by an MMU, which attempts to give each interface as much memory as it needs while preventing unfairness by dynamically adjusting the maximum amount of memory any one interface can take.
Building large multi-ported memories is very expensive, so most cheap switches are shallow buffered, with packet buffer being the scarcest resource.
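As a rough illustration of this dynamic allocation, here is a toy Python model of a shared-buffer MMU with a dynamic per-port cap. The class name, the cap rule (`alpha * free`), and all numbers are assumptions for illustration, not the behavior of any particular switch:

```python
class SharedBufferMMU:
    """Toy model of a shared-memory switch MMU (illustrative only)."""

    def __init__(self, total_bytes, alpha=0.5):
        self.total = total_bytes
        self.alpha = alpha  # how aggressively one port may claim free memory
        self.used = {}      # port -> bytes currently buffered

    def free(self):
        return self.total - sum(self.used.values())

    def admit(self, port, pkt_bytes):
        """Accept a packet only if the port stays under its dynamic cap.

        The cap shrinks as free memory shrinks, so a single congested
        port cannot starve the others."""
        cap = self.alpha * self.free()
        if self.used.get(port, 0) + pkt_bytes <= cap:
            self.used[port] = self.used.get(port, 0) + pkt_bytes
            return True
        return False  # cap exceeded: the packet is dropped

    def drain(self, port, pkt_bytes):
        """A packet departs the port, returning memory to the shared pool."""
        self.used[port] = max(0, self.used.get(port, 0) - pkt_bytes)
```

A burst on one port is cut off once its dynamic cap is reached, while a lightly loaded port can still buffer packets from the remaining free pool.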
Impairments
Shared-memory switches suffer three impairments:
1. Incast
2. Queue buildup
3. Buffer pressure
Incast
[Figure: four workers (Worker 1-4) respond to an aggregator through a single switch port; a dropped response triggers a TCP timeout with RTOmin = 300 ms.]
If many flows converge on the same interface of a switch over a short period of time, the packets may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses.
Queue Buildup
[Figure: a long flow (Sender 1) and a short flow (Sender 2) share one switch port toward the receiver; the long flow's standing queue delays the short flow's packets.]
Buffer Pressure
[Figure: a long flow's queue on one port consumes the shared packet buffer, leaving less memory for short flows arriving on other ports.]
Requirements
1. High burst tolerance
2. Low latency (short flows, queries)
3. High throughput (large file transfers)
Deep buffers: queuing delays increase latency.
Shallow buffers: bad for bursts and throughput.
Reduced RTOmin (SIGCOMM '09): doesn't help latency.
AQM (RED): the averaged queue length doesn't react fast enough for incast.
Objective: low queue occupancy and high throughput. The answer: DCTCP.
Some drawbacks
Lock-out: a few flows occupy the queue exclusively, preventing packets from other flows from entering.
Full queues: congestion is signaled only when the queue is full, so the queue remains in a (nearly) full state for long periods.
[Figure: the ECN control loop between two hosts and a switch. The sender sets the ECT codepoint in the IP header; a congested switch rewrites ECT to CE; the receiver echoes the congestion signal back in ACKs via the TCP ECN-Echo flag; the sender reduces its window and sets CWR to acknowledge the echo.]
TCP vs. DCTCP
DCTCP reacts in proportion to the extent of congestion, not merely its presence:
ECN marks 1011110111 (most packets marked): TCP cuts the window by 50%; DCTCP cuts by roughly 40%.
ECN marks 0000000001 (one packet marked): TCP still cuts by 50%; DCTCP cuts by only 5%.
Switch side: mark the CE codepoint when the instantaneous queue length exceeds the threshold K; don't mark below K.
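The switch-side rule fits in a few lines of Python (a hypothetical helper, not actual switch firmware; the threshold and the queue trace are made-up numbers):

```python
def maybe_mark_ce(queue_pkts, K, ecn_capable=True):
    """DCTCP switch rule: set the CE codepoint when the *instantaneous*
    queue length exceeds K. One parameter, no averaging (unlike RED)."""
    return ecn_capable and queue_pkts > K

# A made-up queue-length trace (packets), with threshold K = 30:
trace = [10, 25, 35, 50, 20]
marks = [maybe_mark_ce(q, K=30) for q in trace]
print(marks)  # → [False, False, True, True, False]
```

Because marking keys off the instantaneous queue rather than an average, the signal reaches senders quickly enough to matter for incast bursts.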
Sender side:
Maintain a running estimate of the fraction of marked packets, α.
In each RTT, update α ← (1 − g)·α + g·F, where F is the fraction of packets that were marked in the last window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α.
On receiving marks, cut the window in proportion to α: cwnd ← cwnd·(1 − α/2).
DCTCP in Action
[Figure: queue length (KBytes) over time at the bottleneck switch; DCTCP holds the queue near K, while TCP's queue oscillates in a large sawtooth.]
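The sender-side update can be sketched as follows. This is a simplified model (one update per RTT, window counted in packets, no slow start or timeouts); g = 1/16 matches a common default but is an assumption here:

```python
class DctcpSender:
    """Minimal sketch of DCTCP's sender-side congestion estimator."""

    def __init__(self, cwnd=10.0, g=1 / 16):
        self.cwnd = cwnd
        self.alpha = 0.0  # running estimate of the fraction of marked packets
        self.g = g        # EWMA weight for new samples, 0 < g < 1

    def on_rtt(self, acked, marked):
        """Process one window of ACKs: `marked` of `acked` carried ECN-Echo."""
        F = marked / acked if acked else 0.0  # fraction marked this window
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked:
            # React in proportion to the *extent* of congestion:
            # heavy marking cuts near 50%, light marking cuts only slightly.
            self.cwnd *= 1 - self.alpha / 2
        else:
            self.cwnd += 1  # standard additive increase

sender = DctcpSender()
sender.on_rtt(acked=10, marked=0)   # no marks: additive increase
sender.on_rtt(acked=10, marked=10)  # fully marked RTT: alpha rises to g
print(sender.alpha, sender.cwnd)    # → 0.0625 10.65625
```

Note how a single fully marked RTT only moves α to g = 0.0625, so the window is trimmed by about 3%, not halved; sustained marking is needed before the cut approaches TCP's 50%.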
Why it Works
1. High burst tolerance: large buffer headroom → bursts fit; aggressive marking → sources react before packets are dropped.
2. Low latency: small buffer occupancies → low queuing delay.
3. High throughput: ECN averaging → smooth rate adjustments, low variance in cwnd.
Analysis
We want to analyze the behavior of DCTCP for N infinitely long-lived flows with identical round trip times (RTT), sharing a single bottleneck link of capacity C. We assume flows are synchronized.
[Figure: the window size of a single flow follows a sawtooth over time, oscillating between (W* + 1)(1 − α/2) and W* + 1.]
Analysis
The queue size at time t is given by
Q(t) = N·W(t) − C·RTT    (3)
where W(t) is the window size of a single source.
Analysis
Let W* = (C·RTT + K)/N.
This is the critical window size at which the queue size reaches K, and the switch starts marking packets with the CE codepoint. During the RTT it takes for the sender to react to these marks, its window size increases by one more packet, reaching W* + 1.
Hence, for small α, the fraction of marked packets satisfies α ≈ √(2/W*). We can now compute the amplitude of the oscillations.
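To make the approximation concrete, a quick numeric check (all numbers are illustrative assumptions, not taken from the paper):

```python
import math

# Assumed setup: N = 2 synchronized flows, a bandwidth-delay product of
# 100 packets, and a marking threshold K of 20 packets.
N, bdp_pkts, K = 2, 100, 20

W_star = (bdp_pkts + K) / N    # critical window at which the queue hits K
alpha = math.sqrt(2 / W_star)  # small-alpha approximation from the analysis
print(W_star, round(alpha, 3))  # → 60.0 0.183
```

In this setting each flow cuts its window by only α/2 ≈ 9% per sawtooth cycle, versus TCP's 50% halving, which is why DCTCP's queue oscillations stay small.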
Analysis
The amplitude of oscillation in the window size of a single flow, D, is given by:
D = (W* + 1)·α/2
Analysis
How do we set the DCTCP parameters?
Marking threshold (K): the minimum value of the queue occupancy in the sawtooth is given by:
Q_min = N·(W* + 1)(1 − α/2) − C·RTT
Choose K so that this minimum is larger than zero, i.e., the queue does not underflow. This results in:
K > C·RTT/7
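As a sanity check of this guideline, here is the bound evaluated for one assumed configuration (a 10 Gbps link, 100 µs RTT, 1500-byte packets; illustrative numbers, not a recommendation):

```python
C_bps = 10e9      # assumed link capacity: 10 Gbps
rtt_s = 100e-6    # assumed round trip time: 100 microseconds
pkt_bytes = 1500  # assumed packet size

bdp_pkts = C_bps / 8 * rtt_s / pkt_bytes  # bandwidth-delay product, in packets
K_min = bdp_pkts / 7                      # K > C*RTT/7 avoids queue underflow
print(round(bdp_pkts, 1), round(K_min, 1))  # → 83.3 11.9
```

So even at 10 Gbps, a marking threshold of only a dozen or so packets keeps the link busy in this model, which is why DCTCP can hold queue occupancy so low.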
Conclusions
DCTCP satisfies all our requirements for data center packet transport:
Handles bursts well
Keeps queuing delays low
Achieves high throughput
Features:
A very simple change to TCP and a single switch parameter, K.
Based on ECN mechanisms already available in commodity switches.