[Solved] TCP and Windows optimization, and a possible fix for lag

I don’t know how these fixes will behave on this server,
but on the ruof they helped me a lot...​

Techniques that increase the responsiveness of the game and, in some cases, eliminate lags:​


These steps are applicable and tested on Windows 7.

If the registry key mentioned in step 5 is missing, do the following:
Open Start – Control Panel – Programs and Features – (on the left) Turn Windows features on or off.
Find the entry Microsoft Message Queuing (MSMQ) Server, check it along with all of its sub-components in the expanded list. Reboot, open the registry, and the key we need should now be there.

Another option is to change a registry value:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile
Value name: NetworkThrottlingIndex (create it if it does not exist)
Type: DWORD

The value is the number of non-multimedia packets processed per millisecond; the default is 10. You can try increasing the number, or simply set the hexadecimal value FFFFFFFF, which disables traffic throttling completely.
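
If you prefer scripting it, here is a minimal Python sketch using the standard winreg module; it assumes you want the FFFFFFFF value described above and that you run it from an elevated (administrator) session:

import winreg

# Key path and value from the tweak above; requires administrator rights.
path = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile"
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, path, 0, winreg.KEY_SET_VALUE) as key:
    # 0xFFFFFFFF disables network traffic throttling entirely, per the description above.
    winreg.SetValueEx(key, "NetworkThrottlingIndex", 0, winreg.REG_DWORD, 0xFFFFFFFF)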

Extra options:

These parameters can also optimize network exchange for our case. When choosing their values I relied on personal experience rather than taking the various online advice at face value. I am temporarily on 3G Internet, where ping is not great to begin with, especially in the evenings, and the settings listed below helped me. However, there is a risk that one of these parameters may make your ping slightly worse, which is why I call them additional and optional.

Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

  • SackOpts
    Selective acknowledgments (SACK): only the lost segments are retransmitted instead of the whole window. Helps a lot against lag as long as the client isn't buggy.
    Recommended value: 1 (one).
    To disable: 0
  • EnablePMTUDiscovery
    Automatically discovers the maximum size of a transmitted data block (Path MTU).
    Recommended value: 1 (one).
    To disable: 0
  • EnablePMTUBHDetect
    Enables the black-hole router detection algorithm. I have seen advice to set this parameter to 0, but I noticed no effect on ping either way, and everyone needs a reliable connection =)
    Recommended value: 1 (one).
    To disable: 0
  • DisableTaskOffload
    Offloads TCP checksum calculation from the CPU to the network adapter, freeing up processor time.
    Recommended value: 0 (zero) - offloading enabled.
    To disable: 1
    Drawback: if you see connection failures, disable offloading.
  • DefaultTTL
    Defines the maximum time an IP packet can stay on the network if it cannot reach the destination host. In practice it limits the number of routers a packet can pass through before being dropped (if the packet is lost anyway, why wait for it?).
    Recommended decimal value: 64
    To disable: delete the value
Key: HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\Psched
  • NonBestEffortLimit
    Disables channel bandwidth reservation for QoS.
    Recommended value: 0 (zero).
To avoid editing these extra options in the registry by hand, you can use ready-made reg files to enable or disable these features.
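
Alternatively, here is a minimal Python sketch that applies the recommended values from the list above in one go (run it elevated; adjust the values to taste):

import winreg

# Recommended values from the list above, grouped by registry key.
values = {
    r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters": {
        "SackOpts": 1,
        "EnablePMTUDiscovery": 1,
        "EnablePMTUBHDetect": 1,
        "DisableTaskOffload": 0,
        "DefaultTTL": 64,
    },
    r"SOFTWARE\Policies\Microsoft\Windows\Psched": {
        "NonBestEffortLimit": 0,
    },
}
for path, entries in values.items():
    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, path, 0, winreg.KEY_SET_VALUE) as key:
        for name, value in entries.items():
            winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)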

Network tweaks:


Starting with this OS version, additional network parameters are available that may be useful to us. These tweaks are commands that already contain the recommended settings. To apply them, run the command line (cmd) as an administrator. To see the current settings, use the command netsh int tcp show global. A consolidated script with all of the commands appears after the list below.

So, the commands:

  • netsh int tcp set global rss=enabled
    Receive-Side Scaling (RSS): spreads processing of an incoming stream across several processors; without RSS, TCP/IP always runs on a single processor, even on a multiprocessor PC.
  • netsh int tcp set global netdma=enabled
    Exchanges data between the network card and RAM without CPU involvement (NetDMA).
    Possible values: enabled / disabled
  • netsh int tcp set global dca=enabled
    Direct Cache Access (DCA) for NetDMA 2.0.
    Possible values: enabled / disabled
  • netsh interface tcp set heuristics wsh=enabled
    Window scaling heuristics (WSH): automatic TCP window sizing. In theory it overrides the next parameter, but leave it on so that later you can painlessly enable or disable things without straying too far from the goal.
    Possible values: enabled / disabled
  • netsh int tcp set global autotuninglevel=highlyrestricted
    Auto-tunes the TCP receive window size without deviating too much from the default value.
    Possible values: disabled / highlyrestricted / restricted / normal / experimental
  • netsh int tcp set global timestamps=enabled
    TCP timestamps; together with the Auto-Tuning Level setting above, they help pick the optimal receive window size.
    Possible values: enabled / disabled
  • netsh int tcp set global ecncapability=enabled
    ECN is a mechanism that lets routers signal network congestion. It is designed to reduce retransmissions: the sender can automatically lower its transfer rate to prevent data loss. The description speaks for itself; this one is for reliability.
    Possible values: enabled / disabled
  • netsh int tcp set global congestionprovider=none
    CTCP (Compound TCP, the Add-On Congestion Control Provider) increases the transmission rate while managing window size and throughput. All the guides I came across advised setting this parameter to ctcp. In practice, however, things turned out to be more complicated: for me it only made the lags longer, even though it is supposedly designed to eliminate packet loss (and the like). So, based on experience, I still recommend none. On a network with a more reliable connection, CTCP may give you a profit.
    Possible values: none / ctcp / default
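
And the promised consolidated script: a minimal Python sketch that simply shells out to netsh with the recommended values from the list above (run from an elevated prompt):

import subprocess

# Recommended netsh settings from the list above; needs an administrator prompt.
global_settings = [
    "rss=enabled",
    "netdma=enabled",
    "dca=enabled",
    "autotuninglevel=highlyrestricted",
    "timestamps=enabled",
    "ecncapability=enabled",
    "congestionprovider=none",
]
for setting in global_settings:
    subprocess.run(["netsh", "int", "tcp", "set", "global", setting], check=True)
# The window scaling heuristics live under a different subcommand:
subprocess.run(["netsh", "interface", "tcp", "set", "heuristics", "wsh=enabled"], check=True)
# Print the resulting configuration:
subprocess.run(["netsh", "int", "tcp", "show", "global"], check=True)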

Disable the Teredo network protocol (for those who do not use IPv6).


Teredo is a feature that constantly checks connections and packets for whether they belong to an IPv6 network, loading the network card and clogging our data feed. Disabling Teredo can speed up your network and Internet, as follows:
Launch the command line (Start > Run > cmd) and enter the commands one by one.
netsh
interface
teredo
set state disabled

To bring Teredo back, enter the same commands, except that the last one should be set state default. (The whole sequence can also be run as a single command: netsh interface teredo set state disabled)

Switch between windows.

I don't know about you, but I ran into a problem when switching windows of a running client: when changing the active window, the system would either drop to the desktop or not switch the window at all. Luckily, I found a solution! The problem was hiding in the Aero interface of the standard window switcher. A small fix changes the switcher to the classic Win XP style. Archive link below...

There are two files in the archive: one applies the fix, the other undoes the changes in case the fix did not help you.

  • Translation

In the first part, we looked at the TCP "triple handshake" and some technologies: TCP Fast Open, flow and congestion control, and window scaling. In the second part we will learn what TCP Slow Start is, how to optimize the data transfer rate and increase the initial window, and also put together all the recommendations for optimizing the TCP/IP stack.

Slow-Start

Despite the presence of flow control in TCP, network congestion collapse was a real problem in the mid-80s. The problem was that while flow control prevented the sender from drowning the recipient in data, there was no mechanism to prevent either side from drowning the network itself. Neither the sender nor the recipient knows the channel width at the start of the connection, so they need some mechanism to adapt the speed to changing network conditions.

For example, suppose you are at home, downloading a large video from a remote server that has saturated your entire downlink to ensure maximum speed. Then another user in your home starts downloading a massive software update. The channel available for the video suddenly shrinks, and the server sending the video must change its sending rate. If it continues at the same speed, the data will simply pile up at some intermediate gateway and packets will be dropped, which means inefficient use of the network.

In 1988, Van Jacobson and Michael J. Karels developed several algorithms to combat this problem: slow start, congestion avoidance, fast retransmission, and fast recovery. They soon became a mandatory part of the TCP specification. It is believed that thanks to these algorithms, it was possible to avoid the global problems with the Internet in the late 80s/early 90s, when traffic grew exponentially.

To understand how slow start works, let's return to the example of a client in New York trying to download a file from a server in London. First, a triple handshake is performed, during which the parties exchange their receive window values ​​in ACK packets. When the last ACK packet has left the network, data exchange can begin.

The only way to estimate the channel width between client and server is to measure it while the data is being exchanged, and that's exactly what slow start does. First, the server initializes a new congestion window variable (cwnd) for the TCP connection and sets its value conservatively to the system value (initcwnd on Linux).

The value of the cwnd variable is not exchanged between the client and server; it is a local variable on the London server. Next, a new rule is introduced: the maximum amount of data "in transit" (not yet confirmed via ACK) between the server and client must be the smaller of rwnd and cwnd. But how can the server and client "agree" on the optimal sizes of their congestion windows? After all, network conditions change constantly, and ideally the algorithm should work without the need to tune each TCP connection.

Solution: Start transmission at a slow rate and increase the window as packets are acknowledged. This is a slow start.

The initial value of cwnd was initially set to 1 network segment. RFC 2581 changed this to 4 segments, and then RFC 6928 changed this to 10 segments.

Thus, the server can send up to 10 network segments to the client, after which it must stop sending and wait for confirmation. Then, for each ACK received, the server can increase its cwnd value by 1 segment. That is, for every packet confirmed via ACK, two new packets can be sent. This means that the server and client quickly "take up" the available channel.

Fig. 1. Congestion control and congestion avoidance.

How does slow start affect browser app development? Since every TCP connection must go through a slow start phase, we cannot immediately use the entire available channel. It all starts with a small congestion window that gradually grows. Thus, the time it takes to reach a given bit rate is a function of the round-trip delay and the initial value of the congestion window.

Time to reach a cwnd of size N:

Time = RTT × ceil( log2( N / initial cwnd ) )


To get a feel for what this will be like in practice, let's make the following assumptions:
  • Client and server receive windows: 65,535 bytes (64 KB)
  • Initial congestion window value: 10 segments
  • Round-trip delay between London and New York: 56 milliseconds
Despite the 64 KB receive window, the throughput of a TCP connection is initially limited by the congestion window. To reach the 64 KB limit, the congestion window must grow to 45 segments, which will take 168 milliseconds.
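
As a sanity check, here is that formula with the numbers above in a few lines of Python (assuming the usual 1,460-byte segment payload, which makes 45 segments roughly a 64 KB window):

import math

rtt_ms = 56            # London <-> New York round trip
initial_cwnd = 10      # segments (RFC 6928)
segment_bytes = 1460   # typical TCP payload per segment
target = math.ceil(65535 / segment_bytes)   # 45 segments ~= the 64 KB window

# cwnd roughly doubles every round trip during slow start
rounds = math.ceil(math.log2(target / initial_cwnd))
print(rounds * rtt_ms)  # -> 168 (milliseconds)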


The fact that the client and server may be able to exchange megabits per second between themselves makes no difference to the slow start.


Fig. 2. Growth of the congestion window.

To reduce the time it takes to reach the maximum congestion window, you can reduce the time it takes packets to travel round-trip - that is, locate the server geographically closer to the client.

A slow start has little effect on downloading large files or streaming video, since the client and server reach their maximum congestion window within a few tens or hundreds of milliseconds, and that startup cost is amortized over the lifetime of a single long TCP connection.

However, for many HTTP requests, when the target file is relatively small, the transfer may end before the maximum congestion window is reached. That is, the performance of web applications is often limited by the round-trip delay between the server and the client.

Slow-Start Restart (SSR)

In addition to throttling the transfer rate on new connections, TCP also provides a slow start restart mechanism that resets the congestion window if the connection has not been used for a specified period of time. The logic here is that network conditions may have changed while the connection was idle, and to avoid congestion, the window value is reset to a safe value.

Not surprisingly, SSR can have a serious impact on the performance of long-lived TCP connections that may be temporarily idle, for example due to user inactivity. Therefore, it is better to disable SSR on the server to improve the performance of long-lived connections. On Linux, you can check the SSR status and disable it with the following commands:

$> sysctl net.ipv4.tcp_slow_start_after_idle
$> sysctl -w net.ipv4.tcp_slow_start_after_idle=0

To demonstrate the impact of a slow start on a small file transfer, let's imagine that a client in New York requested a 64 KB file from a server in London over a new TCP connection with the following parameters:
  • Round-trip delay: 56 milliseconds
  • Client and server throughput: 5 Mbps
  • Client and server receive window: 65,535 bytes
  • Congestion window initial value: 10 segments (10 x 1460 bytes = ~14 KB)
  • Server processing time to generate response: 40 milliseconds
  • Packets are not lost, ACK for each packet, GET request fits into 1 segment


Fig. 3. Downloading the file over a new TCP connection.
  • 0 ms: client starts TCP handshake with SYN packet
  • 28 ms: server sends SYN-ACK and sets its rwnd size
  • 56 ms: the client confirms the SYN-ACK, sets its rwnd size and immediately sends an HTTP GET request
  • 84 ms: server receives HTTP request
  • 124 ms: The server finishes creating the 64 KB response and sends 10 TCP segments before waiting for an ACK (initial cwnd value is 10)
  • 152 ms: client receives 10 TCP segments and responds with ACK to each
  • 180 ms: server increments cwnd for each ACK received and sends 20 TCP segments
  • 208 ms: client receives 20 TCP segments and responds with ACK to each
  • 236 ms: server increments cwnd for each ACK received and sends 15 remaining TCP segments
  • 264 ms: client receives 15 TCP segments and responds with ACK to each
It takes 264 milliseconds to transfer a 64 KB file over a new TCP connection. Now let's imagine that the client reuses the same connection and makes the same request again.
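
The same timeline as a toy Python model, under the assumptions above (28 ms one-way delay, 40 ms server processing, one ACK per segment so cwnd doubles each round trip):

# Toy model of the new-connection download above; all times in milliseconds.
one_way = 28          # half of the 56 ms round trip
processing = 40       # server response generation time
segments = 45         # 64 KB response at ~1460 bytes per segment
cwnd = 10             # initial congestion window, in segments

t = 2 * one_way       # handshake: SYN (0), SYN-ACK (28), ACK + GET sent (56)
t += one_way + processing   # GET arrives (84), response ready (124)
while segments > 0:
    batch = min(cwnd, segments)
    segments -= batch
    t += one_way      # the batch arrives at the client
    if segments > 0:
        t += one_way  # ACKs return to the server
        cwnd *= 2     # one new segment per ACK => cwnd doubles per round trip
print(t)              # -> 264 (milliseconds)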


Fig. 4. Downloading a file over an existing TCP connection.

  • 0 ms: client sends HTTP request
  • 28 ms: server receives HTTP request
  • 68 ms: The server generates a 64 KB response, but the cwnd value is already larger than the 45 segments required to send this file. Therefore, the server sends all segments at once
  • 96 ms: client receives all 45 segments and responds with ACK to each
The same request, made over the same connection but without the overhead of a handshake and a slow-start ramp-up, now completes in 96 milliseconds: a 275% performance improvement!

In both cases, the fact that the client and server were using a 5 Mbps channel had no effect on the file download time. Only congestion window sizes and network latency were the limiting factors. Interestingly, the performance difference between new and existing TCP connections will increase as network latency increases.

Once you understand the latency issues when creating new connections, you will be tempted to use optimization techniques such as keepalive, pipelining, and multiplexing.

Increasing the initial value of the TCP congestion window

This is the easiest way to improve performance for all users or applications that use TCP. Many operating systems already use the new value of 10 in their updates. On Linux, 10 has been the default congestion window value since kernel version 2.6.39.

Congestion Avoidance

It is important to understand that TCP uses packet loss as a feedback mechanism that helps regulate performance. Slow start creates a connection with a conservative congestion window and doubles the amount of data in flight on each round trip until it exceeds the receiver's receive window, the system's ssthresh threshold, or until packets begin to be lost, at which point the congestion avoidance algorithm kicks in.

Congestion avoidance is based on the assumption that packet loss is an indicator of network congestion. Somewhere along the path of the packets, packets have accumulated on the link or router, and this means that the congestion window needs to be reduced to prevent further traffic from clogging the network.

Once the congestion window has been reduced, a separate algorithm is applied to determine how the window should be further increased. Sooner or later another packet loss will occur, and the process will repeat. If you've ever seen a saw-tooth graph of traffic flowing through a TCP connection, that's because congestion control and avoidance algorithms adjust the congestion window according to packet losses on the network.

It is worth noting that improving these algorithms is an active area of both scientific research and commercial product development. There are variants that work better on certain types of networks or for transferring certain types of files, and so on. Depending on which platform you are running on, you use one of many options: TCP Tahoe and Reno (the original implementations), TCP Vegas, TCP New Reno, TCP BIC, TCP CUBIC (the default on Linux), Compound TCP (the default on Windows), and many others. Regardless of the specific implementation, the effects of these algorithms on web application performance are similar.

Proportional Rate Reduction for TCP

Determining the best way to recover from a packet loss is not a trivial task. If you react too aggressively to this, then random packet loss will have an unduly negative impact on your connection speed. If you don't respond quickly enough, it will likely cause further packet loss.

TCP originally used an Additive Increase, Multiplicative Decrease (AIMD) algorithm: when a packet is lost, the congestion window is halved, and then the window grows by a fixed amount with each round trip. In many cases AIMD proved to be an overly conservative algorithm, so new ones were developed.

Proportional Rate Reduction (PRR) is a new algorithm, described in RFC 6937, that aims for faster recovery after a packet loss. According to measurements at Google, where the algorithm was developed, it provides an average 3-10% reduction in network latency on connections with packet loss. PRR is enabled by default in Linux 3.2 and higher.

Bandwidth-Delay Product (BDP)

The built-in congestion control mechanisms in TCP have an important consequence: the optimal window values for the receiver and the sender must vary according to the round-trip delay and the target data rate. Recall that the maximum number of unacknowledged packets "in transit" is defined as the smaller of the receive and congestion windows (rwnd and cwnd). If the sender exceeds the maximum number of unacknowledged packets, it must stop transmitting and wait for the receiver to acknowledge some of them before it can start transmitting again. How long should it wait? That is determined by the round-trip delay.

BDP determines the maximum amount of data that can be "in transit"

If a sender must frequently stop and wait for ACKs for previously sent packets, this creates a gap in the data flow that limits the maximum speed of the connection. To avoid the problem, the windows must be large enough that data can keep being sent while ACKs for earlier packets are still in flight. Then the maximum transfer speed is achievable, with no gaps. Accordingly, the optimal window size depends on the round-trip delay.


Fig. 5. A gap in transmission due to small window values.

How large should the receive and congestion windows be? Let's look at an example: let cwnd and rwnd equal 16 KB, and the round-trip delay equal 100 ms. Then:

16 KB = 16,384 bytes × 8 = 131,072 bits
131,072 bits / 0.1 s = 1,310,720 bits/s ≈ 1.31 Mbit/s

It turns out that no matter the channel width between the sender and the recipient, such a connection will never exceed 1.31 Mbit/s. To achieve a greater speed, you need to either increase the window size or reduce the round-trip delay.

In a similar way, we can calculate the optimal window value given the round-trip delay and the required channel width. Let's assume the time stays the same (100 ms), the sender's channel width is 10 Mbit/s, and the recipient is on a high-speed 100 Mbit/s channel. Assuming the network has no problems on intermediate sections, we get for the sender:

10,000,000 bits/s × 0.1 s = 1,000,000 bits = 125,000 bytes ≈ 122.1 KB

The window size must be at least 122.1 KB to fully occupy the 10 Mbit/s channel. Remember that the maximum receive window size in TCP is 64 KB unless window scaling (RFC 1323) is enabled. Another reason to double-check your settings!
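
Both calculations, plus the LAN case from the section below, in a few lines of Python:

# Bandwidth-delay product arithmetic from the examples above.
def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Highest rate a window of this size can sustain at a given RTT."""
    return window_bytes * 8 / rtt_s

def required_window_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Window needed to keep a channel of this width busy at a given RTT."""
    return bandwidth_bps * rtt_s / 8

print(max_throughput_bps(16 * 1024, 0.100) / 1e6)   # -> ~1.31 (Mbit/s)
print(required_window_bytes(10e6, 0.100) / 1024)    # -> ~122.1 (KB)
print(required_window_bytes(1e9, 0.001) / 1024)     # -> ~122.1 KB for 1 Gbit/s at 1 ms RTT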

The good news is that window size negotiation is done automatically in the network stack. The bad news is that sometimes this can be a limiting factor. If you've ever wondered why your connection transmits at a speed that is only a small fraction of the available bandwidth, it's most likely due to small window sizes.

BDP in high-speed local networks

Round-trip delay can also be a bottleneck in local networks. To achieve 1 Gbps with 1 ms round-trip latency, you must have a congestion window of at least 122 KB. The calculations are similar to those shown above.

Head-of-line blocking (HOL blocking)

Although TCP is a popular protocol, it is not the only one, and not always the most suitable one for each specific case. Features such as in-order delivery are not always necessary and can sometimes add to the delay.

Each TCP packet contains a unique sequence number, and the data must arrive in order. If one of the packets is lost, all subsequent packets are held in the recipient's TCP buffer until the lost packet is retransmitted and reaches the recipient. Because this happens in the TCP layer, the application does not "see" these retransmissions or the queue of packets in the buffer; it simply waits, and all it "sees" is the resulting delay when reading data from the socket. This effect is known as head-of-line blocking.

Head-of-line blocking frees applications from having to deal with packet reordering, which simplifies the code. The price, however, is unpredictable delays in packet arrival, which hurts application performance.


Fig. 6. Head-of-line blocking.

Some applications may not need guaranteed or in-order delivery. If each packet is a separate message, in-order delivery is unnecessary; and if each new message overwrites the previous ones, guaranteed delivery is unnecessary too. But TCP has no configuration for such cases: all packets are delivered in order, and if one is lost, it is retransmitted. Applications for which latency is critical can use an alternative transport, such as UDP.
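
For illustration, a minimal UDP sender sketch in Python; the host, port, and payload here are made-up placeholders. Each sendto() is an independent datagram, so nothing queues up behind a lost packet, but nothing guarantees delivery or ordering either:

import socket

# Hypothetical endpoint; UDP gives no delivery or ordering guarantees.
HOST, PORT = "127.0.0.1", 9999

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for seq in range(5):
    # Each datagram stands alone; a lost one never blocks those after it.
    payload = f"state update {seq}".encode()
    sock.sendto(payload, (HOST, PORT))
sock.close()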

Packet loss is normal

Packet loss is even necessary to provide better TCP performance. A lost packet acts as a feedback mechanism that allows the receiver and sender to change the sending rate to avoid network congestion and minimize latency.

Some applications can "cope" with packet loss: for example, to play audio, video, or to transfer state in a game, guaranteed delivery or in-order delivery is not necessary. Therefore, WebRTC uses UDP as its main transport.

If a packet is lost while playing audio, the audio codec can simply insert a small gap into the playback and continue to process incoming packets. If the gap is small, the user may not notice it, but waiting for a lost packet to arrive can cause a noticeable delay in playback, which will be much worse for the user.

Likewise, if a game communicates its states, then there is no point in waiting for a packet describing the state at time T-1 if we already have information about the state at time T.

Optimization for TCP

TCP is an adaptive protocol designed to get the most out of a network. Optimizing for TCP requires understanding how TCP responds to network conditions. Applications may need their own Quality of Service (QoS) method to ensure a consistent experience for users.

The application requirements and numerous features of TCP algorithms make their interconnection and optimization in this area a huge field for study. In this article, we have only touched on some of the factors that affect TCP performance. Additional mechanisms such as selective acknowledgments (SACKs), delayed acknowledgments, fast retransmission, and many others complicate understanding and optimizing TCP sessions.

While the specific details of each algorithm and feedback mechanism will continue to change, the key principles and their implications will remain:

  • The TCP triple handshake incurs significant latency;
  • TCP slow start applies to every new connection;
  • TCP's flow control and congestion mechanisms regulate the throughput of all connections;
  • TCP throughput is controlled through the congestion window size.
As a result, the speed at which a TCP connection can transfer data on modern high-speed networks is often limited by round-trip delay. While link widths continue to increase, latency is limited by the speed of light, and in many cases it is latency, not link width, that bottlenecks TCP.

Setting up server configuration

Instead of worrying about tuning each individual TCP parameter, it is better to start by updating the operating system to its latest version. Best practices for working with TCP continue to evolve, and most of these changes are already available in the latest versions of the OS.

“Update the OS on the server” seems like trivial advice. But in practice, many servers are configured for a specific kernel version, and system administrators may be against updates. Yes, updating has its risks, but in terms of TCP performance, it will most likely be the most effective action.

After updating the OS, you need to configure the server according to best practices (a short Linux sketch follows the list):

  • Increase the initial value of the congestion window: this will allow more data to be transferred in the first exchange and significantly accelerates the growth of the congestion window
  • Disable slow start: disabling slow start after a period of connection inactivity will improve the performance of long-lived TCP connections
  • Enable window scaling: this will increase the maximum receive window and speed up connections where latency is high
  • Enable TCP Fast Open: this will make it possible to send data in the initial SYN packet. This is a new algorithm, both the client and server must support it. Explore whether your application could benefit from it.
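A hedged sketch of those knobs on Linux, in Python; the sysctl names are standard on recent kernels, but verify them for your distribution before applying:

import subprocess

# Assumed: a reasonably recent Linux kernel with these standard sysctl names.
settings = {
    "net.ipv4.tcp_slow_start_after_idle": "0",  # disable slow-start restart (SSR)
    "net.ipv4.tcp_window_scaling": "1",         # enable window scaling (RFC 1323)
    "net.ipv4.tcp_fastopen": "3",               # TCP Fast Open for client and server
}
for name, value in settings.items():
    subprocess.run(["sysctl", "-w", f"{name}={value}"], check=True)
# The initial congestion window is set per route (e.g. `ip route change ... initcwnd 10`).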
You may also need to configure other TCP parameters. Refer to the material