Tuning the Linux network stack - with examples - Part 1
By default, the Linux network stack (the network-related kernel settings) is configured for general-purpose networking requirements. This default configuration is usually adequate for typical usage such as running desktop applications and some kinds of server applications where the full potential of the available network capacity does not need to be utilized. In this post we will see why these defaults are sometimes not sufficient.
To understand the limitations of the default configuration, and to see how tuning a specific setting alters behavior (and potentially fixes a problem), I will go step by step with examples. I will be using a set of client and server applications written specifically for demonstration purposes. These examples are developed in Java and Rust with equivalent code, to demonstrate that the effect of tuning kernel-level network settings is similar whether an application runs on a runtime like the JVM or runs natively, close to the system. You can refer to the source code and instructions on how to run a particular set of client and server applications at network-stack-tuning-examples.
The client and server applications will run in separate virtual machines on the same network in the cloud, with the following configuration -
Host : Ubuntu 21.10 (4 Cores, 16 GB RAM)
Java : 17.0.3 (Amazon Corretto JDK)
Rust : 1.60.0
Let’s start by checking how many simultaneous persistent TCP connections can be established between a client and a server. We will keep creating persistent connections from the client to the server (in sequence) and see how many active TCP connections the system can handle. Either side of the application will stop processing new connections when any exception occurs. The log will show how many connections were created successfully (that log is printed every 2 seconds from a separate thread). For this test we will be using the client and server apps from the Capacity Test section.
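The core of the capacity test can be sketched as follows. This is a minimal, self-contained Java sketch (not the repository code): it stands in for the separate client and server VMs by running both ends in one process, and stops at a fixed target instead of looping until failure.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class CapacitySketch {
    // Opens up to `target` persistent connections against an in-process server
    // and returns how many succeeded before the first failure.
    static int openPersistentConnections(int target) throws IOException {
        ServerSocket server = new ServerSocket(0); // in-process stand-in for the server VM
        Thread acceptor = new Thread(() -> {
            try {
                while (true) server.accept(); // keep accepted connections open (persistent)
            } catch (IOException ignored) { }
        });
        acceptor.setDaemon(true);
        acceptor.start();

        List<Socket> open = new ArrayList<>();
        try {
            for (int i = 0; i < target; i++) {
                open.add(new Socket("127.0.0.1", server.getLocalPort()));
            }
        } catch (IOException e) {
            // The real test hits this with BindException / "Too many open files"
            System.out.println("failed after " + open.size() + " connections: " + e.getMessage());
        } finally {
            for (Socket s : open) s.close();
            server.close();
        }
        return open.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("opened " + openPersistentConnections(100) + " connections");
    }
}
```

The real apps differ only in scale: client and server run on separate hosts, the loop has no fixed target, and a reporting thread prints the running count every 2 seconds.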
When running the Java application, the client is able to create roughly 28232 connections and then fails with this error -
java.net.BindException: Cannot assign requested address
When running the Rust application, the server fails after accepting 1020 connections with this error -
Too many open files (os error 24)
So with the default (OS-specific) configuration, the Java client and server applications are able to establish 28232 connections, while the Rust server fails after accepting 1020 connections. Here the number of successfully accepted connections is far less than what the Java server handled, and this time the error is due to the open file limit. Since we are running both the Java and Rust server apps on the same machine (one after another), shouldn’t the same open file limit apply to the Java application as well?
We can check the open file limit of a process by its process-id using
cat /proc/<pid>/limits. Let’s see whether it differs between the two applications.
For Java app (omitted irrelevant lines from the output) -
Limit Soft Limit Hard Limit Units
Max open files 1048576 1048576 files
For Rust app (omitted irrelevant lines from the output) -
Limit Soft Limit Hard Limit Units
Max open files 1024 1048576 files
As we can see, while the Java app is running, the Soft Limit is implicitly raised to the same value as the Hard Limit, whereas for the Rust app the limits remain at the OS defaults. Running an application on a runtime like the JVM has its own quirks: on start-up the JVM internally updates the Soft Limit for max open files to the maximum allowed Hard Limit. This may lead to the false assumption that the Java server can handle far more connections than the Rust server, which, as we have seen, is not the case.
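We can observe this from inside the JVM itself. The following small Java sketch (Linux-only; the class name is my own) reads the process's own limits file rather than looking it up by pid - when run on the JVM, the soft limit column should already match the hard limit.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OpenFileLimit {
    // Returns the "Max open files" row of this process's own limits file,
    // i.e. /proc/self/limits instead of /proc/<pid>/limits.
    static String maxOpenFilesRow() throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/self/limits"))) {
            if (line.startsWith("Max open files")) return line;
        }
        return "";
    }

    public static void main(String[] args) throws IOException {
        // On the JVM, soft and hard limit columns are expected to be equal here.
        System.out.println(maxOpenFilesRow());
    }
}
```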
Let’s try to manually set the Soft Limit (on the server host) using the command -
ulimit -n 1048576
This change applies only to the current SSH session. We re-run the Rust server and client apps. This time there were no errors on the server side, but the client failed after making 1021 connections with a similar open-file-limit error. So we re-run the apps again after increasing the Soft Limit on the client host as well.
Now the Rust client fails after making 28232 connections with the error
Cannot assign requested address (os error 99). This behavior is now similar to the Java client. At this point both the Java and Rust client applications are failing not due to open file limits but due to the error
Cannot assign requested address. This error appears on the host on which our client apps are running, when that host runs out of ephemeral ports for client-side binding.
I won’t discuss ephemeral ports in depth, but in simple terms - when a TCP connection is established from the client side, a random free port (from the allowed port range) is bound to the client IP. This binding represents the client end of the TCP channel. Each TCP connection uses one ephemeral port, and the allowed port range is limited. So when all ephemeral ports are in use, we can’t make any more outbound connections until some connection is closed and its ephemeral port is freed for newer outbound connections.
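We can see the kernel picking an ephemeral port with a few lines of Java. This sketch (class name is my own) opens an outbound connection to a local listener and reports the source port the kernel chose for the client end:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class EphemeralPortDemo {
    // Connects to a local listener and returns the kernel-assigned source port.
    static int ephemeralPort() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            // We never called bind() on the client socket; the kernel picked
            // this free port from the net.ipv4.ip_local_port_range setting.
            return client.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("ephemeral port: " + ephemeralPort());
    }
}
```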
Ephemeral ports are needed only on the client side, from where we are making outbound connections. On the server side, the same port on which the server is listening for incoming connections is used across all inbound connections.
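This can be verified directly: in the sketch below (my own illustration, not repository code), two accepted connections report the same server-side local port, and the kernel tells them apart only by the clients' ephemeral ports.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ServerPortDemo {
    // Returns true if two accepted connections share the server's listening
    // port while being distinguished by different client (ephemeral) ports.
    static boolean sharedListeningPort() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket c1 = new Socket("127.0.0.1", server.getLocalPort());
             Socket a1 = server.accept();
             Socket c2 = new Socket("127.0.0.1", server.getLocalPort());
             Socket a2 = server.accept()) {
            return a1.getLocalPort() == a2.getLocalPort()  // same listening port
                && a1.getPort() != a2.getPort();           // different ephemeral ports
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("one listening port serves all inbound connections: "
                + sharedListeningPort());
    }
}
```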
So the next question could be -
Can we adjust some settings so that we can make 50K active connections from client to server? This may lead to another question -
Why should we make 50K active connections from client to server?
The reasons can be many. For example -
1) Load-testing our backend servers with 50K simultaneous requests per second (maybe with keep-alive type HTTP requests). In this case the load-testing client should be able to make 50K simultaneous connections to the server
2) The client is a load-balancer kind of application, acting as a proxy for 100s of backend servers hosting different services
We can determine the current ephemeral port range using the command -
sysctl net.ipv4.ip_local_port_range
On my virtual machine it results in the following output -
net.ipv4.ip_local_port_range = 32768 60999
This range specifies which ports can be used as ephemeral ports, and it is inclusive of both the start and end of the range. With that range (60999 - 32768 + 1 = 28232), we can make at most 28232 simultaneous outbound connections from the host. This limit can easily be increased by executing -
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
This change lasts until the virtual machine is restarted. With that setting, we can make about 64512 outbound connections from the host. For our example, this command needs to be executed on the client host. If the server is a load-balancer kind of application and is hitting the limits of the ephemeral port range, the same setting can be applied on the server end too. We can verify that the ephemeral port range has increased by re-running the same Java and Rust applications.
Now our applications are able to make about 64K outbound connections from the client host. However, we need to be aware that any port in the range 1024-65535 can now be used as an ephemeral port. If we are running some server on the same host which listens on port 8080, then that server’s start-up could occasionally fail: port 8080 may already be in use as the ephemeral port of some outbound client connection at the moment the server tries to bind its listening port. That server can’t start until port 8080 is released.
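The conflict is easy to reproduce in-process. In this Java sketch (my own illustration; on Linux the bind is expected to fail while the outbound connection holds the port), we grab the ephemeral port of a live outbound connection and then try to start a listener on that same port, as a server start-up would:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ListenOnBusyPort {
    // Tries to start a listener on the given port, like a server start-up would.
    static boolean canListenOn(int port) {
        try (ServerSocket s = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            return false; // bind failed - the port is already occupied
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            int ephemeral = client.getLocalPort();
            // While the outbound connection occupies this port, a server
            // trying to listen on the same port typically fails to bind.
            System.out.println("can listen on port " + ephemeral + ": "
                    + canListenOn(ephemeral));
        }
    }
}
```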
This inconvenience can easily be avoided by specifying a list of ports or port ranges which should be reserved and must not be used as ephemeral ports. That can be done using -
sudo sysctl -w net.ipv4.ip_local_reserved_ports="8080,10080-10089"
This change lasts until the virtual machine is restarted. Once ports are reserved, they won’t be used as ephemeral ports for outbound connections. In the previous example we were able to make 64512 outbound connections; now, after reserving 11 ports, we can make 64501.
We can re-run the example applications (both Java and Rust) and, once all possible outbound connections are made, verify that none of the reserved ports is being used as an ephemeral port, using the following command -
sudo netstat -tupn | grep ":<port>"
Can we increase this limit further so that we can make 100K outbound connections from a single host? Well, as per the TCP protocol, the port range has an upper bound of 65535, and the first 1024 ports are usually kept reserved for typical Linux services. That leaves us 64512 ephemeral ports, and that should be considered the limit - no matter how many CPU cores or how much memory you have. If we want to make more outbound connections, we should generate that traffic from a separate virtual machine.
This ~64K limit is not the overall limit of the client host for making outbound connections - it applies per target host/port combination. So if another server app instance is running on a different port of the current server host, or on any port of a completely different host, then we could make another ~64K outbound connections to that host/port combination from the current client host. If we have 100 instances of the server application running on different host/port combinations, then we could make ~6M outbound connections from a single client host (as long as the open file limit permits it).
Let’s look at another set of examples where we have a client application with 4 threads. Each thread runs an infinite loop that establishes a socket connection to the server and immediately closes it; whether each attempt succeeded is tracked and printed to the console every 2 seconds by a task running in yet another thread, and if there are any errors, the unique error messages are listed too. The server keeps accepting socket connections and hands each accepted connection to a thread pool of 4 threads, where it waits until the client closes the connection and then immediately closes the server end. On the client side there will be no more than 10 active connections at a time. The server end of a connection is closed after a little delay, but throughout the runs there will usually not be more than 100 active connections. We will now be using the client and server apps from the New Connection Test section.
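The shape of this test can be sketched as follows. Again, this is my own minimal in-process sketch, not the repository code: it runs a fixed number of attempts per thread instead of an infinite loop, and the server here closes its end immediately rather than waiting for the client as the real apps do.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicLong;

public class NewConnectionSketch {
    // Runs `threads` workers that each open and immediately close `perThread`
    // connections; returns how many attempts succeeded.
    static long churn(int threads, int perThread) throws Exception {
        ServerSocket server = new ServerSocket(0); // in-process stand-in for the server VM
        Thread acceptor = new Thread(() -> {
            try {
                while (true) server.accept().close();
            } catch (IOException ignored) { }
        });
        acceptor.setDaemon(true);
        acceptor.start();

        AtomicLong ok = new AtomicLong();
        Runnable worker = () -> {
            for (int i = 0; i < perThread; i++) {
                // Each attempt consumes one ephemeral port; after close, the
                // closing side's socket lingers in TIME_WAIT for ~60 seconds.
                try (Socket s = new Socket("127.0.0.1", server.getLocalPort())) {
                    ok.incrementAndGet();
                } catch (IOException ignored) { }
            }
        };
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) { ts[i] = new Thread(worker); ts[i].start(); }
        for (Thread t : ts) t.join();
        server.close();
        return ok.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("successful attempts: " + churn(4, 100));
    }
}
```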
When running the application, the client makes a lot of short-lived connections which are immediately closed. Given the nature of the application, we might assume there will be no failures, since neither the client nor the server has to handle a large number of active connections. However, a few seconds after the client app starts, we will begin to notice errors on the client side. The errors persist for some duration, then disappear and all connection requests succeed again, and this cycle keeps repeating. If you don’t see any errors after a few seconds, try reducing the ephemeral port range from 64K back to 28K and re-run the apps.
Again, for this client app, the cause of the error is
Cannot assign requested address. But wait - we don’t have a lot of active connections open on the client side, so plenty of ephemeral ports must be available. Then why is the client app not able to establish an outbound connection? While the client app is running, we can check TCP connection statuses using one of the following methods -
Using the netstat command (large output omitted) -
sudo netstat -tupn
The above command prints a lot of details, and we can see that a lot of connections are in TIME_WAIT status. Alternatively, we can use the ss command -
ss -s
This gives a short, readable summary, and it also shows that there are a lot of TCP connections in timewait status. A connection in TIME_WAIT status appears on the host which initiated the connection close. So if the server closes a connection first, TIME_WAIT status will appear for the closed connection on the server host. To put this scenario on the client side, our client app closes the connection immediately, while the server waits for the client to close the connection before closing its own end. I won’t go into much more detail about TIME_WAIT connections, but in simple terms, for our scenario where the client initiates the close - the connection on the client side will remain in TIME_WAIT status for about 60 seconds, so that any duplicate or late-arriving packets are handled gracefully. That means once a connection is closed by the client (in our example), the ephemeral port used by that connection remains occupied for this duration, and newer connections can’t use that port during this period.
This results in a lot of connections in TIME_WAIT status. And since our short-lived connections are created faster than the rate at which ephemeral ports become free again, the host eventually runs out of available ephemeral ports, producing the same kind of error we saw in the earlier example of this post.
There is no direct way of reducing the duration for which a connection remains in TIME_WAIT status. Some articles suggest adjusting
net.ipv4.tcp_fin_timeout according to the requirement - but that setting has nothing to do with connections in TIME_WAIT status or the duration for which a connection remains in TIME_WAIT. However, there is a way to enable reuse of locally bound sockets which are in TIME_WAIT status -
sudo sysctl -w net.ipv4.tcp_timestamps=1
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
This change lasts until the virtual machine is restarted. For our example, these commands need to be executed on the client host. After setting those parameters, when we re-run our client and server apps (both Java and Rust), no client request fails.
net.ipv4.tcp_timestamps is usually enabled by default in recent Linux distributions. However,
net.ipv4.tcp_tw_reuse needs to be enabled with caution, and only after proper testing has been conducted in a non-production environment. For internal, non-production applications like load-testing tools that hit backend services with lots of short-lived connections, it should be fine to enable
net.ipv4.tcp_tw_reuse on the host where the load-testing application is running. But generally it is not safe to blindly enable this reuse setting without a deep understanding of the use case and its implications.
In this post we saw how tuning some specific network settings alters the capacity of a host for making outbound connections. All commands presented change system behavior temporarily and will be reset on restart of the virtual machine. To make those settings permanent, please refer to distribution-specific documentation. Also, the statistics presented in this post depend on the environment where the applications run and may differ across Linux distributions and hardware sizes.