Tuning linux network stack - with examples - Part 2

In Tuning linux network stack - with examples - Part 1, I discussed how tuning certain kernel-level settings can affect the capacity of a networking application. In this post I discuss settings that can affect throughput on established socket connections.

As in the previous post, I have prepared a set of example client/server applications in both Java and Rust to check how networking behavior changes after a given network setting is tuned. The source code, and instructions for running each pair of client and server applications, are available at network-stack-tuning-examples. Some of the examples use Netty (for Java) and Tokio (for Rust) because they need to handle thousands of active connections, and creating a separate thread per connection does not scale. All commands presented in this post change network settings only temporarily, until the host/VM is restarted.

Let’s start with the window scaling feature of the TCP protocol and how it can affect throughput in terms of data transfer rate. TCP uses a windowing mechanism that regulates how much data a sender can send to a receiver over the network. When transferring a large amount of data, the sender sends up to one window of data and then waits for an acknowledgement from the receiver before sending more. This keeps resource usage on both the sender and receiver side under control, and lost packets can be retransmitted.

In the original design of the TCP protocol, the window size is limited to a maximum of 65535 bytes, so a sender cannot have more than 64K bytes of data outstanding until an acknowledgement is received from the receiver. When both network bandwidth and latency (round-trip time) are high, this 64K-byte upper limit results in poor utilization of network bandwidth for a single channel. For example, if latency is 30 ms and we can send only 64K bytes at a time, we can achieve a data transfer rate of at most about 17.48 Mbps, even if the available network bandwidth is 1 Gbps or 100 Gbps.
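The 17.48 Mbps figure follows from dividing one window by one round trip. Here is a minimal Python sketch of that arithmetic; the function name is my own, and it assumes exactly one full window is delivered per round trip, ignoring protocol overhead.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds


# 65535-byte window with a 30 ms round-trip time
mbps = max_throughput_bps(65535, 0.030) / 1_000_000
print(f"{mbps:.2f} Mbps")
```

The result does not depend on the link speed at all, which is why a faster link does not help while the window stays capped.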

The TCP window scaling feature is an extension to the original TCP protocol that allows a much larger window size, which means more data (greater than 64K bytes) can be sent by the sender before it receives any acknowledgement from the receiver. In modern Linux distributions, window scaling is usually enabled by default and is controlled by the net.ipv4.tcp_window_scaling setting. The TCP window size depends on the read/write TCP network buffer sizes of the established socket connection, which can be tuned with the following settings -

net.core.rmem_default
net.core.rmem_max
net.ipv4.tcp_rmem
net.core.wmem_default
net.core.wmem_max
net.ipv4.tcp_wmem

More details about these settings can be found at tcp: Linux Manual Pages. How to tune them is explained a bit later in this post. When window scaling is disabled, values higher than 64K bytes for these parameters are still used for network buffer allocation, but the maximum window size over the network is capped at 64K bytes. When window scaling is enabled, we can take advantage of a larger window size by setting these parameters to values higher than 64K bytes.
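To get a feel for how window scaling reaches beyond 64K bytes: the TCP window scale option (RFC 7323) left-shifts the 16-bit window field by a shift count of at most 14. The sketch below only illustrates that arithmetic. The function name is my own, and it is not how the Linux kernel picks its scale factor, which it derives from the buffer limits when the connection is set up.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit window field
MAX_WINDOW_SHIFT = 14        # largest shift count allowed by RFC 7323


def min_window_shift(window_bytes: int) -> int:
    """Smallest shift count whose scaled window can cover window_bytes."""
    for shift in range(MAX_WINDOW_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window is larger than the protocol maximum")


for size in (65535, 4 * 1024 * 1024, 32 * 1024 * 1024):
    print(f"{size} bytes -> shift {min_window_shift(size)}")
```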

To enable window scaling feature -

sudo sysctl -w net.ipv4.tcp_window_scaling=1

To disable window scaling feature -

sudo sysctl -w net.ipv4.tcp_window_scaling=0

Let’s check how the TCP window scaling feature and network buffer size tuning affect throughput. In this example we use single-threaded client and server apps. Once a connection is established, the server app repeatedly sends data to the client in 1 MB chunks. While the server is sending, the per-second data transfer bitrate is logged to the console, and at the end the average bitrate for the whole run is shown. For this test we use the client and server apps from the Transfer Rate Test section.

To configure various buffer sizes on the client and server hosts, we use the following command template -

sudo sysctl -w net.core.rmem_default="65535"
sudo sysctl -w net.core.wmem_default="65535"
sudo sysctl -w net.core.rmem_max="<BUFFER_SIZE>"
sudo sysctl -w net.core.wmem_max="<BUFFER_SIZE>"
sudo sysctl -w net.ipv4.tcp_rmem="4096 65535 <BUFFER_SIZE>"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65535 <BUFFER_SIZE>"

<BUFFER_SIZE> in the above command template should be replaced with the buffer size (in bytes) for the scenario being tested.
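When trying several buffer sizes, it can be convenient to generate the commands instead of editing the template by hand. This is a small hypothetical helper, not part of the example repository, that fills in the `sysctl -w` lines of the template for a given buffer size. It only builds the command strings and does not execute anything.

```python
def sysctl_commands(buffer_size: int, default_size: int = 65535, min_size: int = 4096) -> list:
    """Render the `sysctl -w` lines of the template for one buffer size (in bytes)."""
    triple = f"{min_size} {default_size} {buffer_size}"
    settings = {
        "net.core.rmem_default": str(default_size),
        "net.core.wmem_default": str(default_size),
        "net.core.rmem_max": str(buffer_size),
        "net.core.wmem_max": str(buffer_size),
        "net.ipv4.tcp_rmem": triple,
        "net.ipv4.tcp_wmem": triple,
    }
    return [f'sudo sysctl -w {key}="{value}"' for key, value in settings.items()]


# Example: commands for a 4 MB (4194304 bytes) buffer size
for command in sysctl_commands(4 * 1024 * 1024):
    print(command)
```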

In my case, latency between the client and server hosts is 196 ms. Here are the summary statistics after running the client/server apps with different combinations of buffer sizes -

In this example I picked some arbitrary buffer sizes to demonstrate how changing them affects utilization of network bandwidth for a single channel. In a real-world scenario, the Bandwidth-Delay Product (BDP) can be calculated to identify the optimal buffer size. The BDP is the number of bits the sender must have in flight to keep the network channel full. It is calculated as follows -

BDP (bits) = bandwidth (bits/second) * latency (seconds)

While calculating the optimal window size, it is important to note that -
- Network data communication considers values in bits and 1 KBit = 1000 Bits
- TCP network buffer settings consider values in bytes and 1 KByte = 1024 Bytes

Let’s consider a 1 Gbps network with 30 ms latency (round-trip time): the BDP is 30 MBits. This means the window size should be at least 3.75 MBytes (3,750,000 bytes, or roughly 3.58 MBytes when counting 1 KByte = 1024 Bytes), and the sender should be able to send that much data without waiting for any acknowledgement from the receiver in order to fully utilize the bandwidth for a single channel. Of course, the calculated BDP is only a theoretical optimum that suggests how big the window should be to use the full capacity of the channel and achieve maximum throughput. In the real world, networks are subject to TCP protocol overhead, packet loss, network jitter and so on, and even the receiver’s processing speed for received packets can affect overall sender throughput.
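The calculation above can be written as a short Python sketch. The function names are my own, and latency is passed in milliseconds simply to keep the arithmetic exact.

```python
def bdp_bits(bandwidth_bps: int, rtt_ms: int) -> float:
    """Bandwidth-Delay Product in bits: bandwidth (bits/second) * latency (seconds)."""
    return bandwidth_bps * rtt_ms / 1000


def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> float:
    """The same BDP expressed in bytes, the unit used by the TCP buffer settings."""
    return bdp_bits(bandwidth_bps, rtt_ms) / 8


# 1 Gbps link with a 30 ms round-trip time
size = bdp_bytes(1_000_000_000, 30)
print(f"BDP: {bdp_bits(1_000_000_000, 30) / 1_000_000:.0f} Mbit")
print(f"Window: {size:.0f} bytes ({size / (1024 * 1024):.2f} MB with 1 KB = 1024 bytes)")
```

Rounding the result up to a convenient value such as 4194304 (4 MB) gives a candidate for <BUFFER_SIZE>.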

Each allocated TCP buffer consumes system memory. The net.ipv4.tcp_mem setting defines bounds on how much memory can be allocated for TCP buffers, TCP socket options and other related metadata. These bounds are measured in units of the system page size. We can use the following command to check the current values -

$ sysctl net.ipv4.tcp_mem
net.ipv4.tcp_mem = 190791 254389 381582

The following command can be used to determine the current system page size (in bytes) -

$ getconf PAGESIZE
4096
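Since net.ipv4.tcp_mem is expressed in pages, it helps to convert it to bytes before comparing it with buffer sizes. Here is a minimal sketch using the two outputs shown above. The three values are the low, pressure and high thresholds described in the tcp manual page, and the function name is my own.

```python
def tcp_mem_bytes(tcp_mem: str, page_size: int) -> dict:
    """Convert the three tcp_mem thresholds from pages to bytes."""
    low, pressure, high = (int(value) for value in tcp_mem.split())
    return {
        "low": low * page_size,
        "pressure": pressure * page_size,
        "high": high * page_size,
    }


# Values taken from the sysctl and getconf outputs above
bounds = tcp_mem_bytes("190791 254389 381582", 4096)
for name, value in bounds.items():
    print(f"{name}: {value} bytes ({value / 1024 ** 3:.2f} GB)")
```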

To check current utilization of system memory for TCP buffers and other related metadata, we can use -

$ cat /proc/net/sockstat
sockets: used 165
TCP: inuse 5 orphan 0 tw 0 alloc 6 mem 1054
UDP: inuse 3 mem 1
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

This command outputs several network stack statistics; the mem value on the TCP line is the memory currently used for TCP buffers and other related metadata, in units of the system page size.
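To turn that mem value into bytes, it can be multiplied by the page size. The sketch below parses the sample output shown above, embedded as a string so that it runs anywhere. On a real host the text would come from /proc/net/sockstat instead, and the function name is my own.

```python
SAMPLE_SOCKSTAT = """\
sockets: used 165
TCP: inuse 5 orphan 0 tw 0 alloc 6 mem 1054
UDP: inuse 3 mem 1
"""


def tcp_mem_pages(sockstat_text: str) -> int:
    """Return the `mem` value (in pages) from the TCP line of sockstat output."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]  # alternating name/value pairs
            stats = dict(zip(fields[0::2], (int(v) for v in fields[1::2])))
            return stats["mem"]
    raise ValueError("no TCP line found")


pages = tcp_mem_pages(SAMPLE_SOCKSTAT)
print(f"{pages} pages = {pages * 4096} bytes")  # assumes a 4096-byte page size
```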

To examine buffer size usage on the server side, let’s look at another example in which the client app creates 10K active connections but does not read any data from them. The server app tries to write as much data as possible, in 64 KB chunks, on all established connections and so fills the write buffers of each connection. Since the client app never reads the received data, the server app eventually becomes idle: the write buffers fill up quickly, and no more data can be written to them until the client app consumes some. At the end, the server app also prints stats on how much data was written from the application’s perspective. For this test we use the client and server apps from the Max Send Test section. To make the statistics easy to check, the example server app runs in 3 stages -

  1. The server app is up and listening for incoming connections. Connections are accepted, but writing to them does not start until the second stage. The server app remains in this stage for the duration specified by conn_wait (default=15s)
  2. The server app writes (in parallel) to all socket connections accepted in the previous stage until their write buffers are full. The server app remains in this stage for the duration specified by write_wait (default=15s)
  3. The server app stays idle, which lets us monitor overall system memory usage for TCP and per-socket network buffer usage with other tools. The server app remains in this stage for the duration specified by close_wait (default=30s)

Once the server app is in stage 3, we can monitor TCP memory utilization and other statistics. To check memory utilization at the TCP socket level, we can execute the ss -tm command, which produces output like the following -

The output of the previous command shows memory utilization stats for each TCP socket connection. More details about each value in the output can be found at ss: Linux Manual Pages. Reading through this large output and digesting it is hard, especially when thousands of socket connections are established. Alternatively, we can use the example utility app mentioned in the Check Server TCP Memory section, which internally executes the same ss -tm command but produces more human-readable output. This app shows individual socket-level memory utilization sorted by the wmem_queued field and, at the end, summary statistics based on that field. Here are the summary statistics produced by the Check Server TCP Memory app when running the Max Send Test client/server apps with the server in stage 3 (with 4 MB buffer size and a 1 GB upper bound for net.ipv4.tcp_mem) -

min: 4180 (0.004 mb)
max: 338736 (0.323 mb)
average: 105136.6496 (0.1 mb)
sum: 1051366496 (1002.661 mb)
25th percentile: 61768 (0.059 mb)
50th percentile: 95744 (0.091 mb)
75th percentile: 134256 (0.128 mb)
85th percentile: 168232 (0.16 mb)
95th percentile: 237832 (0.227 mb)
99th percentile: 333576 (0.318 mb)

As we can see from the output, buffer sizes are not distributed evenly across all sockets; the kernel attempts to adjust the allocated buffer sizes based on the net.ipv4.tcp_mem bounds and the memory currently used for TCP buffers and other related metadata. Some sockets get a send buffer of 150 KB or more, while others get 60 KB or less. Smaller allocated buffers result in a smaller window.
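For readers who want to build a similar summary themselves, here is a rough Python sketch. It is not the Check Server TCP Memory app. It assumes the skmem:(...) layout documented in the ss manual page, where the w field is wmem_queued, and it uses the nearest-rank percentile method, which may differ from what the example app does. The sample lines are hand-written for illustration and are not captured output.

```python
import math
import re

# Hand-written sample in the skmem:(...) layout of `ss -tm`; values are illustrative only.
SAMPLE_SS_OUTPUT = """\
skmem:(r0,rb131072,t0,tb4194304,f0,w134256,o0,bl0,d0)
skmem:(r0,rb131072,t0,tb4194304,f0,w61768,o0,bl0,d0)
skmem:(r0,rb131072,t0,tb4194304,f0,w168232,o0,bl0,d0)
skmem:(r0,rb131072,t0,tb4194304,f0,w95744,o0,bl0,d0)
"""

WMEM_QUEUED = re.compile(r"skmem:\([^)]*,w(\d+),")


def wmem_queued_values(ss_output: str) -> list:
    """Extract the wmem_queued (w) value of every socket, sorted ascending."""
    return sorted(int(match) for match in WMEM_QUEUED.findall(ss_output))


def percentile(sorted_values: list, pct: float) -> int:
    """Nearest-rank percentile of an ascending list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


values = wmem_queued_values(SAMPLE_SS_OUTPUT)
print("min:", values[0], "max:", values[-1])
print("average:", sum(values) / len(values), "sum:", sum(values))
for pct in (25, 50, 75, 99):
    print(f"{pct}th percentile:", percentile(values, pct))
```

On a real host, the input string could come from running ss -tm, for example through subprocess.run.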

Let’s see how raising the upper bound of net.ipv4.tcp_mem to 4 GB changes the distribution of buffer sizes over the same number (10K) of established socket connections -

min: 215004 (0.205 mb)
max: 338112 (0.322 mb)
average: 298133.2868 (0.284 mb)
sum: 2981332868 (2843.221 mb)
25th percentile: 236184 (0.225 mb)
50th percentile: 321748 (0.307 mb)
75th percentile: 333576 (0.318 mb)
85th percentile: 335256 (0.32 mb)
95th percentile: 335256 (0.32 mb)
99th percentile: 338112 (0.322 mb)

When more memory is configured for net.ipv4.tcp_mem, buffers are allocated with that in mind and more sockets are able to get larger buffers. As we have seen, if a socket gets a large send/receive buffer, the connection’s window size is adjusted accordingly for the respective read/write operation, which results in better throughput as long as the overall network bandwidth limit is not reached.
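As a rough sanity check on the two runs, we can convert a memory bound into pages and see what an even split across 10K connections would give each socket. This sketch assumes a 4096-byte page size and that 1 GB means 1024^3 bytes; the function names are my own.

```python
PAGE_SIZE = 4096   # assumed system page size in bytes
GB = 1024 ** 3     # assumption: 1 GB = 1024^3 bytes


def bytes_to_pages(limit_bytes: int) -> int:
    """Express a memory bound in pages, the unit used by net.ipv4.tcp_mem."""
    return limit_bytes // PAGE_SIZE


def even_share(limit_bytes: int, connections: int) -> float:
    """Bytes each socket would get if the bound were split evenly."""
    return limit_bytes / connections


for limit in (1 * GB, 4 * GB):
    print(f"{bytes_to_pages(limit)} pages, {even_share(limit, 10_000):.0f} bytes per socket")
```

With a 1 GB bound the even share is about 107374 bytes, close to the observed average of 105136.6496. With 4 GB the even share is about 429497 bytes, well above the observed average of 298133.2868, and the observed sum (2843.221 mb) is below the bound, which suggests the memory bound was no longer the limiting factor in that run.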

Tuning TCP buffer sizes and memory bounds is the tricky part of Linux network stack tuning. We can start with the operating system defaults and make adjustments according to the bottlenecks and limits we identify. Here are a few scenarios that can help in making decisions -

  • For public-facing servers whose clients have varying latencies and bandwidths, there are no fixed values from which to calculate an optimal BDP, and medium-sized buffers such as 2-8 MB are usually appropriate, depending on available system memory and the expected number of concurrent clients. Network monitoring tools can then be used to measure latency statistics of the connected clients, and buffer sizes can be adjusted over time based on that
  • If packet loss is frequent in the network, then buffer sizes can be kept lower to avoid frequent cycles of retransmission for the lost packets
  • For server-to-server communication where both bandwidth and latency are high, we can use large buffers of 12-32 MB or more (depending on the use case). This can result in higher memory usage for TCP buffers, so the upper bounds on overall TCP memory usage should be tuned so that other memory-intensive applications running on the same host still get enough memory and do not crash
  • When clients and servers are part of the same network, or latency is lower than 2-6 ms, TCP buffer sizes can be kept very low (e.g. less than 200K). Depending on the use case, window scaling can also be disabled and max buffer sizes kept at 64K, without losing network throughput as long as the BDP of the link stays below 64K bytes. This minimizes TCP buffer memory overhead and leaves more system memory available for other purposes

Parth Mistry

Enterprise Application and BigData engineering enthusiast. Interested in highly efficient low-cost app design and development with Java/Scala and Rust 🦀