Transport Layer
- Congestion and TCP -
Kyunghan Lee
Networked Computing Lab (NXC Lab)
Department of Electrical and Computer Engineering Seoul National University
https://nxc.snu.ac.kr
kyunghanlee@snu.ac.kr
Principles of Congestion Control
Congestion:
□ informally: “too many sources sending too much data too fast for network to handle”
□ different from flow control!
□ manifestations:
§ lost packets (buffer overflow at routers)
§ long delays (queueing in router buffers)
□ One of the most important problems!
Causes/costs of congestion: scenario 1
§ two senders, two receivers
§ one router, infinite buffers
§ output link capacity: R
§ no retransmission
maximum per-connection throughput: R/2
unlimited shared output link buffers
Host A
original data: 𝜆 !"
Host B
throughput: 𝜆 #$%
R/2
R/2 R/2
del ay
large delays as arrival rate, 𝜆 !" , approaches capacity
𝜆 !"
throughput: 𝜆 #$%
𝜆 !"
R
Causes/costs of congestion: scenario 2
§ one router, finite buffers
§ sender retransmission of timed-out packet
• application-layer input = application-layer output: 𝜆 !" = 𝜆 #$%
• transport-layer input includes retransmissions : 𝜆 !" & ≥ 𝜆 #$%
finite shared output link buffers Host A
𝜆 !" ∶ original data
Host B
𝜆 !" & : original data, plus
retransmitted data
𝜆 #$%
Causes/costs of congestion: scenario 2
finite shared output link buffers Host A
𝜆 !" ∶ original data
Host B
𝜆 !" & : original data, plus
retransmitted data
𝜆 #$%
Idealization: perfect knowledge
§ sender sends only when router buffers available
free buffer space!
R/2
𝜆 !" R/2
throughput: 𝜆 #$%
Idealization: known loss
§ packets can be lost, dropped at router due to full buffers
§ sender only resends if packet known to be lost
Causes/costs of congestion: scenario 2
Host A
𝜆 !" ∶ original data
Host B
𝜆 !" # : original data, plus
retransmitted data
𝜆 $%&
no buffer space!
R/2
R/2
when sending at R/2, some packets are retransmissions.
𝜆 !" #
𝜆 $%&
Reality: duplicates
§ packets can be lost, dropped at router due to full buffers
§ sender times out prematurely, sending two copies, both of which are delivered frequently
Causes/costs of congestion: scenario 2
R/2
R/2
when sending at R/2, some packets are retransmission
including duplicated that are delivered already!
𝜆 !" #
𝜆 $%&
“ costs” of congestion:
§ more work (retransmission) for given “goodput”
§ unneeded retransmissions: link carries multiple copies of pkt
• decreasing goodput
Causes/costs of congestion: scenario 3
§ four senders (with timeout/retransmit)
§ multihop paths
C/2
𝜆 !" # C/2
𝜆 $%&
another “cost” of congestion:
§ when packet dropped, any transmission capacity over multiple links used for that packet is wasted!
𝜆
!"𝜆
#!"𝜆
$%&Congestion Collapse
□ Congestion control
§ Regulate the packet injection rate at the source
§ Without flow control & congestion control, the network can run into “congestion collapse”è no useful throughput
Offered Load Time Delay
Onset of Congestion
Offered Load
Throughput Congestion
Deadlock Mode (congestion collapse)
TCP History
1975 1980 1985 1990
1981 TCP & IP RFC 793 & 791
1983 BSD Unix 4.2 supports TCP/IP
1984
Nagle’s algorithm to reduce overhead of small packets;
predicts congestion collapse
1987
Karn’s algorithm to better estimate
round-trip time
1986 Congestion
collapse 1
stobserved
1990 4.3BSD Reno
fast recovery delayed ACK’s 1975
Three-way handshake Ray Tomlinson
In SIGCOMM 75 1988
Van Jacobson’s algorithms
slow start,
congestion avoidance, fast retransmit (all implemented in
4.3BSD Tahoe) SIGCOMM 88 1974
TCP described by
Vint Cerf, Bob Kahn
in IEEE Trans Comm
TCP History
1993 1994 1996
1994 ECN Explicit Congestion Notification
(Floyd) 1993
TCP Vegas real congestion
avoidance (Brakmo et. al.)
1994 T/TCP
Transaction TCP (Braden)
1996 New Reno modified fast
recovery SACK TCP Selective Ack (Floyd et. al.) 1996
Improving TCP startup
(Hoe)
1996 FACK TCP Forward Ack extension to SACK
(Mathis et. al.)
Drawing source:
Prof. Paul D. Amer
2008
2008
TCP CUBIC
Linux Default TCP
(Ha et. al.)
TCP History
□ TCP Tahoe
§ In TCP Tahoe, when a loss occurs, fast retransmit is sent, half of the current
CWND is saved as a Slow Start Threshold (SSThresh) and slow start begins again from its initial CWND. Once the CWND reaches the SSThresh, TCP changes to congestion avoidance algorithm
□ TCP Reno
§ TCP Reno implements an algorithm called fast recovery. A fast retransmit is sent, half of the current CWND is saved as SSThresh and as new CWND, thus
skipping slow start and going directly to Congestion Avoidance algorithm
□ TCP New Reno
§ TCP New Reno, defined by RFC 6582, improves retransmission during the fast- recovery phase of TCP Reno. During fast recovery, for every duplicate ACK that is returned to TCP New Reno, a new unsent packet from the end of the
congestion window is sent, to keep the transmit window full. For every ACK that
makes partial progress in the sequence space, the sender assumes that the ACK
points to a new hole, and the next packet beyond the ACKed sequence number
is sent.
TCP congestion control:
Additive Increase Multiplicative Decrease (AIMD)
§ approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
cw nd :TC P send er co ng es tio n w in do w s ize
AIMD saw tooth behavior: probing for bandwidth
additively increase window size …
…. until loss occurs (then cut window in half)
time
TCP congestion control
§ sender limits transmission:
§ cwnd is dynamic, function of perceived network congestion
TCP sending rate:
§ roughly: send cwnd bytes, wait RTT for ACKS, then send more bytes
last byte
ACKed sent, not- yet ACKed (“in-flight”)
last byte sent cwnd
LastByteSent-LastByteAcked ≤ cwnd
sender sequence number space
rate ~~ cwnd
RTT bytes/sec
Congestion control in TCP
□ CWND and Ack
§ “ACK clocking”: the sender transmits a data packet
upon an ACK reception
□ Control the size of the sliding window
§ Controlling the window size to match with the receiver buffer (à flow control)
§ Controlling the window size to determine the TX rate (à congestion control)
Ref: http://www.soi.wide.ad.jp/class/20020032/
TCP slow start
§ when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
Host A
one segment
RT T
Host B
time
two segments
four segments
initialize: cwnd = 1
for (each segment ACKed) cwnd ++
until (loss event OR
cwnd > ssthresh)
Slow Start algorithm
TCP: detecting, reacting to loss
§ loss indicated by timeout:
• cwnd set to 1 MSS;
• window then grows exponentially (as in slow start) to threshold, then grows linearly
§ loss indicated by 3 duplicate ACKs: TCP RENO
• dup ACKs indicate network capable of delivering some segments
• cwnd is cut in half window then grows linearly
§ TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate
acks)
TCP congestion avoidance (CA)
/* slow start is over */
if (cwnd > ssthresh) {
for (each segment ACKed) cwnd += 1/cwnd
until (loss event) } if (loss event) {
ssthresh = cwnd / 2 cwnd = 1
Perform slow start }
Congestion Avoidance (TCP-Tahoe)
□ On each new ack, increase cwnd by 1/cwnd packets
□ The window size grows “ linearly ” with time during
congestion avoidance: 1 MSS per RTT
Fast Retransmit
□ Fast Retransmit occurs when multiple (≥ 3) dupacks come back
§ Usually, Fast Recovery follows Fast Retransmit (explained later)
§ In contrast, on a timeout, Slow Start follows
□ Fast Retransmit occurs
§ When a packet is lost but subsequent packets get through
• In contrast, a timeout occurs, when no more packets are getting across
§ “ACK clocking” may be still there, if only one packet is lost
§ We may NOT need Slow Start after Fast Retransmit
è Fast Recovery.
Fast Recovery
□ Algorithm: after Fast Retransmit, do…
§ ssthresh = min( cwnd , receiver’s advertised window)/2
§ Retransmit the missing segment (Fast Retransmit)
§ cwnd = ssthresh + [number of received dupacks]
§ Transmit new packets if allowed by the new value of cwnd
§ Fast recovery lasts until a non-duplicate ACK is received
§ When a new ack comes: cwnd = ssthresh , and enter congestion avoidance
0 2 4 6 8 10 12
0 2 4 6 8 10 12 14
Time (round trips)
After Fast Retransmit and Fast Recovery, the window size is reduced in half.
no slow start
Fast recovery
Wi nd ow s ize
Fast Recovery - Example
8
15 15-28
ack 29 window
3 dupack-14
Actions after dupacks for lost pkt 14:
1. On 3rd dupack 14, enter fast recovery 2. Set ssthresh = cwnd = 15/2 = 7
3. Retransmit 14
4. Receipt of 4th dupack sets W=11 5. By 14th dupack, W = 21, send 29,…
6. After ack 29, exit fast recovery 7. Set cwnd = 7
7. Continue with Congestion Avoidance
4th dupack-14
14 7-14
21 7
14th dupack -14
29-34 11
7
...
ack 8-14
seqnum
A Trace of TCP Sequence Number over Time
0 2 4 6 8 10 12 14
0 1 2 3 4 5 6 7 8
Congestion W indow size
Time (round trips)
Slow start
Congestion avoidance
SS-threshold
pkt tx
ack reception
Duplicated
acks
TCP Congestion Control: FSM
timeout ssthresh = cwnd/2 cwnd = 1 MSS dupACKcount = 0
retransmit missing segment Λ
cwnd ≥ ssthresh
congestion avoidance
cwnd = cwnd + MSS/cwnd dupACKcount = 0
transmit new segment(s), as allowed new ACK
dupACKcount++
duplicate ACK
fast recovery
cwnd = cwnd + MSS
transmit new segment(s), as allowed duplicate ACK
ssthresh= cwnd/2 cwnd = ssthresh + 3 retransmit missing segment dupACKcount == 3
timeout
ssthresh = cwnd/2 cwnd = 1
dupACKcount = 0
retransmit missing segment
ssthresh= cwnd/2 cwnd = ssthresh + 3 retransmit missing segment dupACKcount == 3
cwnd = ssthresh dupACKcount = 0 new ACK
slow start
timeout ssthresh = cwnd/2 cwnd = 1 MSS dupACKcount = 0 retransmit missing segment
cwnd = cwnd+MSS dupACKcount = 0
transmit new segment(s), as allowed new ACK
dupACKcount++
duplicate ACK
Λ cwnd = 1 MSS ssthresh = 64 KB dupACKcount = 0