Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan
ACM SIGCOMM 2010 conference
We focus on soft real-time applications supporting:
Web search
Retail
Advertising
Recommendation systems
Reducing network latency allows application developers to invest more cycles in the algorithms that improve relevance and end user experience.
For example, in data center networks, round trip times (RTTs) can be less than 250 µs in the absence of queuing.
Partition/Aggregate
[Figure: the partition/aggregate application structure. A top-level aggregator (TLA) with a strict deadline (SLA) of 250 ms fans a request out to mid-level aggregators (MLAs, deadline = 50 ms), which send iterative queries to worker nodes (deadline = 10 ms). Answers are aggregated back up the tree; a missed deadline means a worker's answer is left out of the final result. The example request collects Picasso quotes, e.g. "Art is a lie that makes us realize truth."]
Data Collection
The measurements reveal that 99.91% of traffic in our data center is TCP traffic.
Our key learning from these measurements is that to meet the requirements of such a diverse mix of short and long flows, switch buffer occupancies need to be persistently low, while maintaining high throughput for the long flows.
Workloads
Partition/Aggregate query traffic: 2 KB to 20 KB in size; delay-sensitive
Short messages: delay-sensitive
Background large flows: throughput-sensitive
Switches
Most commodity switches in clusters are shared-memory switches that aim to exploit statistical multiplexing gain through a logically common packet buffer available to all switch ports.
Packets arriving on an interface are stored in a high-speed multi-ported memory shared by all the interfaces.
Memory from the shared pool is dynamically allocated to a packet by an MMU, which attempts to give each interface as much memory as it needs while preventing unfairness by dynamically adjusting the maximum amount of memory any one interface can take.
Building large multi-ported memories is very expensive, so most cheap switches are shallow buffered, with packet buffer being the scarcest resource.
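As a rough illustration of this dynamic allocation, here is a toy Python model of a shared-buffer MMU with a dynamic per-port cap. The class name, the cap rule (`alpha * free`), and all numbers are assumptions for illustration, not the behavior of any particular switch:

```python
class SharedBufferMMU:
    """Toy model of a shared-memory switch MMU (illustrative only)."""

    def __init__(self, total_bytes, alpha=0.5):
        self.total = total_bytes
        self.alpha = alpha  # how aggressively one port may claim free memory
        self.used = {}      # port -> bytes currently buffered

    def free(self):
        return self.total - sum(self.used.values())

    def admit(self, port, pkt_bytes):
        """Accept a packet only if the port stays under its dynamic cap.

        The cap shrinks as free memory shrinks, so a single congested
        port cannot starve the others."""
        cap = self.alpha * self.free()
        if self.used.get(port, 0) + pkt_bytes <= cap:
            self.used[port] = self.used.get(port, 0) + pkt_bytes
            return True
        return False  # cap exceeded: the packet is dropped

    def drain(self, port, pkt_bytes):
        """A packet departs the port, returning memory to the shared pool."""
        self.used[port] = max(0, self.used.get(port, 0) - pkt_bytes)
```

A burst on one port is cut off once its dynamic cap is reached, while a lightly loaded port can still buffer packets from the remaining free pool.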
Impairments
Shared-memory switches suffer three impairments:
1. Incast
2. Queue buildup
3. Buffer pressure
Incast
[Figure: four workers (Worker 1-4) respond to an aggregator through a single switch port; a dropped response triggers a TCP timeout with RTOmin = 300 ms.]
If many flows converge on the same interface of a switch over a short period of time, the packets may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses.
Queue Buildup
[Figure: a long flow (Sender 1) and a short flow (Sender 2) share one switch port toward the receiver; the long flow's standing queue delays the short flow's packets.]
Buffer Pressure
[Figure: a long flow's queue on one port consumes the shared packet buffer, leaving less memory for short flows arriving on other ports.]
Requirements
1. High burst tolerance
2. Low latency (short flows, queries)
3. High throughput (large file transfers)
Deep buffers: queuing delays increase latency.
Shallow buffers: bad for bursts and throughput.
Reduced RTOmin (SIGCOMM '09): doesn't help latency.
AQM (RED): the averaged queue length doesn't react fast enough for incast.
Objective: low queue occupancy and high throughput. The answer: DCTCP.
Some drawbacks
Lock-out: a few flows occupy the queue exclusively, preventing packets from other flows from entering.
Full queues: congestion is signaled only when the queue is full, so the queue remains in a (nearly) full state for long periods.
[Figure: the ECN control loop between two hosts and a switch. The sender sets the ECT codepoint in the IP header; a congested switch rewrites ECT to CE; the receiver echoes the congestion signal back in ACKs via the TCP ECN-Echo flag; the sender reduces its window and sets CWR to acknowledge the echo.]
TCP vs. DCTCP
DCTCP reacts in proportion to the extent of congestion, not merely its presence:
ECN marks 1011110111 (most packets marked): TCP cuts the window by 50%; DCTCP cuts by roughly 40%.
ECN marks 0000000001 (one packet marked): TCP still cuts by 50%; DCTCP cuts by only 5%.
Switch side: mark the CE codepoint when the instantaneous queue length exceeds the threshold K; don't mark below K.
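The switch-side rule fits in a few lines of Python (a hypothetical helper, not actual switch firmware; the threshold and the queue trace are made-up numbers):

```python
def maybe_mark_ce(queue_pkts, K, ecn_capable=True):
    """DCTCP switch rule: set the CE codepoint when the *instantaneous*
    queue length exceeds K. One parameter, no averaging (unlike RED)."""
    return ecn_capable and queue_pkts > K

# A made-up queue-length trace (packets), with threshold K = 30:
trace = [10, 25, 35, 50, 20]
marks = [maybe_mark_ce(q, K=30) for q in trace]
print(marks)  # → [False, False, True, True, False]
```

Because marking keys off the instantaneous queue rather than an average, the signal reaches senders quickly enough to matter for incast bursts.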
Sender side:
Maintain a running estimate of the fraction of marked packets, α.
In each RTT, update α ← (1 − g)·α + g·F, where F is the fraction of packets that were marked in the last window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α.
On receiving marks, cut the window in proportion to α: cwnd ← cwnd·(1 − α/2).
DCTCP in Action
[Figure: queue length (KBytes) over time at the bottleneck switch; DCTCP holds the queue near K, while TCP's queue oscillates in a large sawtooth.]
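The sender-side update can be sketched as follows. This is a simplified model (one update per RTT, window counted in packets, no slow start or timeouts); g = 1/16 matches a common default but is an assumption here:

```python
class DctcpSender:
    """Minimal sketch of DCTCP's sender-side congestion estimator."""

    def __init__(self, cwnd=10.0, g=1 / 16):
        self.cwnd = cwnd
        self.alpha = 0.0  # running estimate of the fraction of marked packets
        self.g = g        # EWMA weight for new samples, 0 < g < 1

    def on_rtt(self, acked, marked):
        """Process one window of ACKs: `marked` of `acked` carried ECN-Echo."""
        F = marked / acked if acked else 0.0  # fraction marked this window
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked:
            # React in proportion to the *extent* of congestion:
            # heavy marking cuts near 50%, light marking cuts only slightly.
            self.cwnd *= 1 - self.alpha / 2
        else:
            self.cwnd += 1  # standard additive increase

sender = DctcpSender()
sender.on_rtt(acked=10, marked=0)   # no marks: additive increase
sender.on_rtt(acked=10, marked=10)  # fully marked RTT: alpha rises to g
print(sender.alpha, sender.cwnd)    # → 0.0625 10.65625
```

Note how a single fully marked RTT only moves α to g = 0.0625, so the window is trimmed by about 3%, not halved; sustained marking is needed before the cut approaches TCP's 50%.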
Why it Works
1. High burst tolerance: large buffer headroom → bursts fit; aggressive marking → sources react before packets are dropped.
2. Low latency: small buffer occupancies → low queuing delay.
3. High throughput: ECN averaging → smooth rate adjustments, low variance in cwnd.
Analysis
We want to analyze the behavior of DCTCP for N infinitely long-lived flows with identical round trip times (RTT), sharing a single bottleneck link of capacity C. We assume flows are synchronized.
[Figure: the window size of a single flow follows a sawtooth over time, oscillating between (W* + 1)(1 − α/2) and W* + 1.]
Analysis
The queue size at time t is given by
Q(t) = N·W(t) − C·RTT    (3)
where W(t) is the window size of a single source.
Analysis
Let W* = (C·RTT + K)/N.
This is the critical window size at which the queue size reaches K, and the switch starts marking packets with the CE codepoint. During the RTT it takes for the sender to react to these marks, its window size increases by one more packet, reaching W* + 1.
Hence, for small α, the fraction of marked packets satisfies α ≈ √(2/W*). We can now compute the amplitude of the oscillations.
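To make the approximation concrete, a quick numeric check (all numbers are illustrative assumptions, not taken from the paper):

```python
import math

# Assumed setup: N = 2 synchronized flows, a bandwidth-delay product of
# 100 packets, and a marking threshold K of 20 packets.
N, bdp_pkts, K = 2, 100, 20

W_star = (bdp_pkts + K) / N    # critical window at which the queue hits K
alpha = math.sqrt(2 / W_star)  # small-alpha approximation from the analysis
print(W_star, round(alpha, 3))  # → 60.0 0.183
```

In this setting each flow cuts its window by only α/2 ≈ 9% per sawtooth cycle, versus TCP's 50% halving, which is why DCTCP's queue oscillations stay small.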
Analysis
The amplitude of oscillation in the window size of a single flow, D, is given by:
D = (W* + 1)·α/2
Analysis
How do we set the DCTCP parameters?
Marking threshold (K): the minimum value of the queue occupancy in the sawtooth is given by:
Q_min = N·(W* + 1)(1 − α/2) − C·RTT
Choose K so that this minimum is larger than zero, i.e., the queue does not underflow. This results in:
K > C·RTT/7
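As a sanity check of this guideline, here is the bound evaluated for one assumed configuration (a 10 Gbps link, 100 µs RTT, 1500-byte packets; illustrative numbers, not a recommendation):

```python
C_bps = 10e9      # assumed link capacity: 10 Gbps
rtt_s = 100e-6    # assumed round trip time: 100 microseconds
pkt_bytes = 1500  # assumed packet size

bdp_pkts = C_bps / 8 * rtt_s / pkt_bytes  # bandwidth-delay product, in packets
K_min = bdp_pkts / 7                      # K > C*RTT/7 avoids queue underflow
print(round(bdp_pkts, 1), round(K_min, 1))  # → 83.3 11.9
```

So even at 10 Gbps, a marking threshold of only a dozen or so packets keeps the link busy in this model, which is why DCTCP can hold queue occupancy so low.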
Conclusions
DCTCP satisfies all our requirements for data center packet transport:
Handles bursts well
Keeps queuing delays low
Achieves high throughput
Features:
A very simple change to TCP and a single switch parameter, K.
Based on ECN mechanisms already available in commodity switches.