Transport Layer

(1)

Transport Layer

- Congestion and TCP -

Kyunghan Lee

Networked Computing Lab (NXC Lab)

Department of Electrical and Computer Engineering Seoul National University

https://nxc.snu.ac.kr

kyunghanlee@snu.ac.kr

(2)

Principles of Congestion Control

Congestion:

□ informally: “too many sources sending too much data too fast for network to handle”

□ different from flow control!

□ manifestations:

§ lost packets (buffer overflow at routers)

§ long delays (queueing in router buffers)

□ One of the most important problems!

(3)

Causes/costs of congestion: scenario 1

§ two senders, two receivers

§ one router, infinite buffers

§ output link capacity: R

§ no retransmission

maximum per-connection throughput: R/2

unlimited shared output link buffers

Host A

original data: 𝜆 !"

Host B

throughput: 𝜆 #$%

R/2

R/2 R/2

del ay

large delays as arrival rate, 𝜆 !" , approaches capacity

𝜆 _!"

throughput: 𝜆 #$%

𝜆 _!"

R

(4)

Causes/costs of congestion: scenario 2

§ one router, finite buffers

§ sender retransmission of timed-out packet

• application-layer input = application-layer output: 𝜆 !" = 𝜆 _#$%

• transport-layer input includes retransmissions : 𝜆 !" & ≥ 𝜆 _#$%

finite shared output link buffers Host A

𝜆 _!" ∶ original data

Host B

𝜆 _!" ^& : original data, plus

retransmitted data

𝜆 _#$%

(5)

Causes/costs of congestion: scenario 2

finite shared output link buffers Host A

𝜆 _!" ∶ original data

Host B

𝜆 _!" ^& : original data, plus

retransmitted data

𝜆 _#$%

Idealization: perfect knowledge

§ sender sends only when router buffers available

free buffer space!

R/2

𝜆 _!" R/2

throughput: 𝜆 #$%

(6)

Idealization: known loss

§ packets can be lost, dropped at router due to full buffers

§ sender only resends if packet known to be lost

Causes/costs of congestion: scenario 2

Host A

𝜆 _!" ∶ original data

Host B

𝜆 _!" ^# : original data, plus

retransmitted data

𝜆 _$%&

no buffer space!

R/2

when sending at R/2, some packets are retransmissions.

𝜆 _!" ^#

𝜆 _$%&

(7)

Reality: duplicates

§ packets can be lost, dropped at router due to full buffers

§ sender times out prematurely, sending two copies, both of which are delivered frequently

Causes/costs of congestion: scenario 2

R/2

when sending at R/2, some packets are retransmission

including duplicated that are delivered already!

𝜆 _!" ^#

𝜆 _$%&

“ costs” of congestion:

§ more work (retransmission) for given “goodput”

§ unneeded retransmissions: link carries multiple copies of pkt

• decreasing goodput

(8)

Causes/costs of congestion: scenario 3

§ four senders (with timeout/retransmit)

§ multihop paths

C/2

𝜆 _!" ^# C/2

𝜆 _$%&

another “cost” of congestion:

§ when packet dropped, any transmission capacity over multiple links used for that packet is wasted!

𝜆

_!"

𝜆

^#_!"

𝜆

_$%&

(9)

Congestion Collapse

□ Congestion control

§ Regulate the packet injection rate at the source

§ Without flow control & congestion control, the network can run into “congestion collapse”è no useful throughput

Offered Load Time Delay

Onset of Congestion

Offered Load

Throughput Congestion

Deadlock Mode (congestion collapse)

(10)

TCP History

1975 1980 1985 1990

1981 TCP & IP RFC 793 & 791

1983 BSD Unix 4.2 supports TCP/IP

1984

Nagle’s algorithm to reduce overhead of small packets;

predicts congestion collapse

1987

Karn’s algorithm to better estimate

round-trip time

1986 Congestion

collapse 1

^st

observed

1990 4.3BSD Reno

fast recovery delayed ACK’s 1975

Three-way handshake Ray Tomlinson

In SIGCOMM 75 1988

Van Jacobson’s algorithms

slow start,

congestion avoidance, fast retransmit (all implemented in

4.3BSD Tahoe) SIGCOMM 88 1974

TCP described by

Vint Cerf, Bob Kahn

in IEEE Trans Comm

(11)

TCP History

1993 1994 1996

1994 ECN Explicit Congestion Notification

(Floyd) 1993

TCP Vegas real congestion

avoidance (Brakmo et. al.)

1994 T/TCP

Transaction TCP (Braden)

1996 New Reno modified fast

recovery SACK TCP Selective Ack (Floyd et. al.) 1996

Improving TCP startup

(Hoe)

1996 FACK TCP Forward Ack extension to SACK

(Mathis et. al.)

Drawing source:

Prof. Paul D. Amer

2008

TCP CUBIC

Linux Default TCP

(Ha et. al.)

(12)

TCP History

□ TCP Tahoe

§ In TCP Tahoe, when a loss occurs, fast retransmit is sent, half of the current

CWND is saved as a Slow Start Threshold (SSThresh) and slow start begins again from its initial CWND. Once the CWND reaches the SSThresh, TCP changes to congestion avoidance algorithm

□ TCP Reno

§ TCP Reno implements an algorithm called fast recovery. A fast retransmit is sent, half of the current CWND is saved as SSThresh and as new CWND, thus

skipping slow start and going directly to Congestion Avoidance algorithm

□ TCP New Reno

§ TCP New Reno, defined by RFC 6582, improves retransmission during the fast- recovery phase of TCP Reno. During fast recovery, for every duplicate ACK that is returned to TCP New Reno, a new unsent packet from the end of the

congestion window is sent, to keep the transmit window full. For every ACK that

makes partial progress in the sequence space, the sender assumes that the ACK

points to a new hole, and the next packet beyond the ACKed sequence number

is sent.

(13)

TCP congestion control:

Additive Increase Multiplicative Decrease (AIMD)

§ approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs

• additive increase: increase cwnd by 1 MSS every RTT until loss detected

• multiplicative decrease: cut cwnd in half after loss

cw nd :TC P send er co ng es tio n w in do w s ize

AIMD saw tooth behavior: probing for bandwidth

additively increase window size …

…. until loss occurs (then cut window in half)

time

(14)

TCP congestion control

§ sender limits transmission:

§ _cwnd is dynamic, function of perceived network congestion

TCP sending rate:

§ roughly: send cwnd bytes, wait RTT for ACKS, then send more bytes

last byte

ACKed sent, not- yet ACKed (“in-flight”)

last byte sent cwnd

LastByteSent-LastByteAcked ≤ _cwnd

sender sequence number space

rate ~~ cwnd

RTT bytes/sec

(15)

Congestion control in TCP

□ CWND and Ack

§ “ACK clocking”: the sender transmits a data packet

upon an ACK reception

□ Control the size of the sliding window

§ Controlling the window size to match with the receiver buffer (à flow control)

§ Controlling the window size to determine the TX rate (à congestion control)

Ref: http://www.soi.wide.ad.jp/class/20020032/

(16)

TCP slow start

§ when connection begins, increase rate exponentially until first loss event:

• initially cwnd = 1 MSS

• double cwnd every RTT

• done by incrementing cwnd for every ACK received

Host A

one segment

RT T

Host B

time

two segments

four segments

initialize: cwnd = 1

for (each segment ACKed) cwnd ++

until (loss event OR

cwnd > ssthresh)

Slow Start algorithm

(17)

TCP: detecting, reacting to loss

§ loss indicated by timeout:

• cwnd set to 1 MSS;

• window then grows exponentially (as in slow start) to threshold, then grows linearly

§ loss indicated by 3 duplicate ACKs: ^{TCP RENO}

• dup ACKs indicate network capable of delivering some segments

• cwnd is cut in half window then grows linearly

§ TCP Tahoe always sets ^cwnd to 1 (timeout or 3 duplicate

acks)

(18)

TCP congestion avoidance (CA)

/* slow start is over */

if (cwnd > ssthresh) {

for (each segment ACKed) cwnd += 1/cwnd

until (loss event) } if (loss event) {

ssthresh = cwnd / 2 cwnd = 1

Perform slow start }

Congestion Avoidance (TCP-Tahoe)

□ On each new ack, increase cwnd by 1/cwnd packets

□ The window size grows “ linearly ” with time during

congestion avoidance: 1 MSS per RTT

(19)

Fast Retransmit

□ Fast Retransmit occurs when multiple (≥ 3) dupacks come back

§ Usually, Fast Recovery follows Fast Retransmit (explained later)

§ In contrast, on a timeout, Slow Start follows

□ Fast Retransmit occurs

§ When a packet is lost but subsequent packets get through

• In contrast, a timeout occurs, when no more packets are getting across

§ “ACK clocking” may be still there, if only one packet is lost

§ We may NOT need Slow Start after Fast Retransmit

è Fast Recovery.

(20)

Fast Recovery

□ Algorithm: after Fast Retransmit, do…

§ _ssthresh = min( cwnd , receiver’s advertised window)/2

§ Retransmit the missing segment (Fast Retransmit)

§ _cwnd = ssthresh + [number of received dupacks]

§ Transmit new packets if allowed by the new value of cwnd

§ Fast recovery lasts until a non-duplicate ACK is received

§ When a new ack comes: cwnd = ssthresh , and enter congestion avoidance

0 2 4 6 8 10 12

0 2 4 6 8 10 12 14

Time (round trips)

After Fast Retransmit and Fast Recovery, the window size is reduced in half.

no slow start

Fast recovery

Wi nd ow s ize

(21)

Fast Recovery - Example

8 15 15-28

ack 29 window

3 dupack-14

Actions after dupacks for lost pkt 14:

1. On 3rd dupack 14, enter fast recovery 2. Set ssthresh = cwnd = 15/2 = 7

3. Retransmit 14

4. Receipt of 4th dupack sets W=11 5. By 14th dupack, W = 21, send 29,…

6. After ack 29, exit fast recovery 7. Set cwnd = 7

7. Continue with Congestion Avoidance

4th dupack-14

14 7-14

21 7

14th dupack -14

29-34 11

7 ...

ack 8-14

seqnum

(22)

A Trace of TCP Sequence Number over Time

0 2 4 6 8 10 12 14

0 1 2 3 4 5 6 7 8

Congestion W indow size

Time (round trips)

Slow start

Congestion avoidance

SS-threshold

pkt tx

ack reception

Duplicated

acks

(23)

TCP Congestion Control: FSM

timeout ssthresh = cwnd/2 cwnd = 1 MSS dupACKcount = 0

retransmit missing segment Λ

cwnd ≥ ssthresh

congestion avoidance

cwnd = cwnd + MSS/cwnd dupACKcount = 0

transmit new segment(s), as allowed new ACK

dupACKcount++

duplicate ACK

fast recovery

cwnd = cwnd + MSS

transmit new segment(s), as allowed duplicate ACK

ssthresh= cwnd/2 cwnd = ssthresh + 3 retransmit missing segment dupACKcount == 3

timeout

ssthresh = cwnd/2 cwnd = 1

dupACKcount = 0

retransmit missing segment

ssthresh= cwnd/2 cwnd = ssthresh + 3 retransmit missing segment dupACKcount == 3

cwnd = ssthresh dupACKcount = 0 new ACK

slow start

timeout ssthresh = cwnd/2 cwnd = 1 MSS dupACKcount = 0 retransmit missing segment

cwnd = cwnd+MSS dupACKcount = 0

transmit new segment(s), as allowed new ACK

dupACKcount++

duplicate ACK

Λ cwnd = 1 MSS ssthresh = 64 KB dupACKcount = 0

New ACK!

(24)

TCP throughput

□ avg. TCP thruput as function of window size, RTT?

§ ignore slow start, assume always data to send

□ W: window size (measured in bytes) where loss occurs

§ avg. window size (# in-flight bytes) is ¾ W (in RENO)

W W/2

avg. TCP throughput = 34 W

RTT ^bytes/sec

(25)

TCP over “long, fat pipes”

□ example: 1500-byte segments, 100ms RTT, want 10 Gbps throughput

□ requires W = 83,333 in-flight segments

□ throughput in terms of segment loss probability, L [Mathis 1997]:

è to achieve 10 Gbps throughput, we need a loss rate of L = 2⋅10 ^-10 – a very small loss rate!

□ TCP for high-speed links needed TCP throughput = 1.22 . MSS

RTT L

(26)

TCP CUBIC (as a high-speed TCP)

Packet loss event

W

_max

Time

• 𝐜𝐰𝐧𝐝 = 𝐶 𝑡 − 𝐾 ^' + 𝑊 _()*

• 𝐾 = ⁽ 𝑊 _()* 𝛽/𝐶

• Typically, 𝛽 = 0.7, 𝐶 = 0.4

Steady state behavior

Max Probing

(27)

TCP Fairness

fairness goal: if K TCP sessions share same

bottleneck link of bandwidth R, each should have average rate of R/K

TCP connection 1

bottleneck

router capacity R

TCP connection 2

(28)

Why does TCP achieve fairness?

two competing sessions:

§ additive increase gives slope of 1, as throughout increases

§ multiplicative decrease decreases throughput proportionally

R

equal bandwidth share

Connection 1 throughput

Co nnec tio n 2 thr oug hp ut

congestion avoidance: additive increase

loss: decrease window by factor of 2

congestion avoidance: additive increase

loss: decrease window by factor of 2

(29)

Fairness (more)

Fairness and UDP

§ multimedia apps often do not use TCP

• do not want rate

throttled by congestion control

§ UDP in most cases

• sends audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections

§ application can open multiple parallel

connections between two hosts

§ web browsers do this

§ e.g., link of rate R with 9 existing connections:

• new app asks for 1 TCP, gets rate R/10

• new app asks for 9 TCPs, gets

R/2

(30)

NUM for Fairness and Efficiency

□ NUM: Network Utility Maximization

□ Objectives (𝑈 # )

§ Proportional fairness: 𝑈 + 𝑥 ₊ = log 𝑥 ₊ (e.g., FAST TCP, TCP Vegas)

§ Max efficiency: 𝑈 + 𝑥 ₊ = 𝑥 ₊

□ Proportional fairness:

§ a feasible allocation solution 𝑥 ^∗ is proportional fair if for any

feasible 𝑥, the eq. above is satisfied (i.e., the sum of proportional changes will be non-positive).

{% max _! :∀#} &

#

𝑈 _# (𝑥 _# )

subject to: ∑ _{+:.∈0 +} 𝑥 ₊ ≤ 𝑐 _. , ∀𝑙

Utility maximization

Maximize =

subject to: and A necessary condition:

= , , is optimal only if for any unless = 0

𝑐 _@

) !

𝑥 _! − 𝑥 _! ^∗

𝑥 _! ^∗ ≤ 0

❖ Most loss-based TCPs (e.g., RENO, CUBIC) achieve RTT-fairness (i.e., 𝑥

#

∝ 1/𝑅𝑇𝑇

_#

)

(31)

Active Queue Management (AQM) Cross-Layer

□ Marking mechanism that is currently available on Cisco routers (not widely turned on though!)

□ RED (Random Early Detection or Random Early Drop):

§ Drops incoming packets with a probability that is a function of average recent buffer occupancy

§ Sources learn about congestion before router buffer is full, and slow down before facing multiple packet losses

RED WRED

(32)

Explicit Congestion Notification (ECN) Cross-Layer

Network-assisted congestion control:

§ two bits in IP header (ToS field) marked by network router to indicate congestion

§ congestion indication carried to receiving host

§ receiver (seeing congestion indication in IP datagram) ) sets ECE (ECN-Echo) bit on receiver-to-sender ACK segment to notify sender of congestion

Transport Layer

Transport Layer

- Congestion and TCP -

Kyunghan Lee

Networked Computing Lab (NXC Lab)

Department of Electrical and Computer Engineering Seoul National University

https://nxc.snu.ac.kr

kyunghanlee@snu.ac.kr

Principles of Congestion Control

Congestion:

□ informally: “too many sources sending too much data too fast for network to handle”

□ different from flow control!

□ manifestations:

§ lost packets (buffer overflow at routers)

§ long delays (queueing in router buffers)

□ One of the most important problems!

Causes/costs of congestion: scenario 1

§ two senders, two receivers

§ one router, infinite buffers

§ output link capacity: R

§ no retransmission

maximum per-connection throughput: R/2

unlimited shared output link buffers

original data: 𝜆 !"

throughput: 𝜆 #$%

R/2

R/2 R/2

del ay

large delays as arrival rate, 𝜆 !" , approaches capacity

𝜆 !"

throughput: 𝜆 #$%

𝜆 !"

R

Causes/costs of congestion: scenario 2

§ one router, finite buffers

§ sender retransmission of timed-out packet

• application-layer input = application-layer output: 𝜆 !" = 𝜆 #$%

• transport-layer input includes retransmissions : 𝜆 !" & ≥ 𝜆 #$%

finite shared output link buffers Host A

𝜆 !" ∶ original data

Host B

𝜆 !" & : original data, plus

retransmitted data

𝜆 #$%

Causes/costs of congestion: scenario 2

finite shared output link buffers Host A

𝜆 !" ∶ original data

Host B

𝜆 !" & : original data, plus

retransmitted data

𝜆 #$%

Idealization: perfect knowledge

§ sender sends only when router buffers available

free buffer space!

R/2

𝜆 !" R/2

throughput: 𝜆 #$%

Idealization: known loss

§ packets can be lost, dropped at router due to full buffers

§ sender only resends if packet known to be lost

Causes/costs of congestion: scenario 2

Host A

𝜆 !" ∶ original data

Host B

𝜆 !" # : original data, plus

retransmitted data

𝜆 $%&

no buffer space!

R/2

R/2

when sending at R/2, some packets are retransmissions.

𝜆 !" #

𝜆 $%&

Reality: duplicates

§ packets can be lost, dropped at router due to full buffers

§ sender times out prematurely, sending two copies, both of which are delivered frequently

Causes/costs of congestion: scenario 2

R/2

R/2

when sending at R/2, some packets are retransmission

𝜆 _!"

𝜆 _!"

• application-layer input = application-layer output: 𝜆 !" = 𝜆 _#$%

• transport-layer input includes retransmissions : 𝜆 !" & ≥ 𝜆 _#$%

𝜆 _!" ∶ original data

𝜆 _!" ^& : original data, plus

𝜆 _#$%

𝜆 _!" ∶ original data

𝜆 _!" ^& : original data, plus

𝜆 _#$%

𝜆 _!" R/2

𝜆 _!" ∶ original data

𝜆 _!" ^# : original data, plus

𝜆 _$%&

𝜆 _!" ^#

𝜆 _$%&

𝜆 _!" ^#

𝜆 _$%&

𝜆 _!" ^# C/2

𝜆 _$%&