Shaving off milliseconds in AWS
In high-speed, low-latency networks, every millisecond counts. This article takes a look at jumbo frames and their effect not only on throughput, but also on latency.
Hot path application
Our work is highly specialized, focusing on optimizing network performance to ensure exchanges can handle the maximum number of requests without delay. To give you an idea, our median response time on certain endpoints is approximately 3-4ms. Enhancing performance beyond this level is extremely challenging. While our hot path applications are written in Go and we might achieve slight improvements by transitioning to Rust, we don’t see that as a viable option: such a small potential improvement does not justify the work and effort needed to achieve it.
Further optimization
At this point, we start looking into other kinds of improvements we can squeeze out of our infrastructure: CPU- and network-optimized instances, placement groups, cluster placement groups, and so on.
Arguably, you should not put your whole infrastructure in a latency-optimized placement group, since that brings additional risk: the instances may end up on the same rack in the data center. If something happened to that rack, a large portion of the infrastructure would be affected, not just a few instances. That is not an acceptable risk for us.
What we can actually do, without affecting overall risk, is improve general network performance inside the VPC by using jumbo frames. But jumbo frames are meant to improve throughput when dealing with large transfers, not latency! Arguably that is true, but let’s have a look at the approach.
Hint: handling a large number of small packets carries more overhead than handling a small number of large packets.
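To get a feel for the difference, here is a rough back-of-the-envelope estimate (a sketch, not a measurement; it assumes 40 bytes of IPv4 and TCP headers per segment and ignores the handshake and ACKs):

import math

# Rough estimate of how many TCP segments a 500 kB payload needs at each MTU.
PAYLOAD = 500 * 1024   # 512000 bytes, same size as the benchmark later on
HEADERS = 40           # assumed 20 bytes IPv4 + 20 bytes TCP, no TCP options

for mtu in (1500, 9001):
    mss = mtu - HEADERS                  # payload bytes that fit into one segment
    segments = math.ceil(PAYLOAD / mss)  # segments needed for the whole payload
    print(f"MTU {mtu}: ~{segments} segments")

# MTU 1500: ~351 segments
# MTU 9001: ~58 segments

Every one of those segments pays a roughly fixed per-packet cost in the kernel and on the NIC, so fewer segments means less total overhead.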
Jumbo frames on AWS
In general, AWS supports jumbo frames with an MTU of 9001. But how do we get to 9001? It seems a rather odd number; you would assume 9000 to be the sensible choice. The answer lies in the AWS hardware setup: the hardware actually supports an MTU of 9216, but AWS uses 215 bytes per frame to encapsulate all traffic inside a tunnel. So 9216 - 215 = 9001, and that is the portion left for us to use.
Jumbo frames are supported on current generation instances [1], but keep in mind that traffic is limited to a maximum MTU of 1500 in the following cases:
Traffic over an internet gateway
Traffic over an inter-region VPC peering connection
Traffic over VPN connections
Traffic outside of a given AWS Region
For our case, this is totally fine. We need jumbo frame support for cross-AZ and intra-AZ traffic.
Proof of Concept
To prove the theory that the overhead of processing a large number of packets affects overall latency, we will use the largest payload we can find on our hot path infrastructure: the orderbook call. Fully fledged order books for coin pairs are usually up to 500 kB in size, so we will take this as the data size to test with.
Enabling Jumbo frames
To enable jumbo frames on Amazon Linux 2023 at boot:
echo "MTU=9001" | sudo tee -a /etc/sysconfig/network-scripts/ifcfg-eth0
echo "request subnet-mask, broadcast-address, time-offset, routers, domain-name, domain-search, domain-name-servers, host-name, nis-domain, nis-servers, ntp-servers;" | sudo tee -a /etc/dhcp/dhclient.conf
Setting it manually for testing purposes:
ip link set dev ens5 mtu 9001
ip link show ens5
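If you want to check the MTU from code as well, for instance in a deployment health check, Linux exposes the value under /sys/class/net/<interface>/mtu. A minimal sketch, assuming the interface is named ens5 as on our instances:

from pathlib import Path

def interface_mtu(iface: str = "ens5") -> int:
    # The kernel exposes each interface's configured MTU as a plain number
    return int(Path(f"/sys/class/net/{iface}/mtu").read_text().strip())

if __name__ == "__main__":
    mtu = interface_mtu("ens5")
    print(f"ens5 MTU: {mtu}")
    if mtu < 9001:
        print("Jumbo frames are not enabled on this interface")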
Tracepath
Once we’ve set the MTU on the instances we wish to test, we need tracepath to verify that the path actually supports that MTU:
sudo yum install iputils
Testing whether the server-to-client path supports our desired MTU:
[root@mtu1 ec2-user]# tracepath 10.0.1.18
1?: [LOCALHOST] pmtu 9001
1: 10.0.1.18 0.166ms reached
1: 10.0.1.18 0.127ms reached
Resume: pmtu 9001 hops 1 back 2
[root@mtu1 ec2-user]#
Benchmark
References in code:
mtu1 - 10.0.1.89 - eu-west-1a - Server running the listening process that accepts the data
mtu2 - 10.0.2.55 - eu-west-1b - Server running the tcpdump script and the client that sends 500 kB of data
local - My own laptop, which triggers the benchmark automation remotely
I’ll be running server.py on the mtu1 server. It listens for a connection, accepts the data, prints some information about the transfer, and shuts down:
server.py:
import socket

# Server configuration
HOST = '10.0.1.89'  # Private IP of the mtu1 server
PORT = 65432        # Port to listen on (non-privileged ports are > 1023)

# Create a TCP/IP socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
    # Bind the socket to the address and port
    server_socket.bind((HOST, PORT))
    # Listen for incoming connections
    server_socket.listen(1)
    print(f"Server listening on {HOST}:{PORT}")
    # Wait for a connection
    conn, addr = server_socket.accept()
    with conn:
        print(f"Connected by {addr}")
        total_data = b''
        # Keep receiving data until the connection is closed
        while True:
            data = conn.recv(1000000)
            if not data:
                break
            total_data += data
            print(f"Received {len(data)} bytes")
        print(f"Total data received: {len(total_data)} bytes")
        print("Data received successfully.")
The client code runs on the mtu2 instance, which is in a different AZ than mtu1. It simply crafts a 500 kB string (512,000 bytes) and sends it to the mtu1 server. Nothing fancy.
client.py:
import socket

def start_client(host='10.0.1.89', port=65432):
    # Create a TCP/IP socket
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Connect to the server
    client_socket.connect((host, port))
    try:
        # Create a 500 kB message (512000 bytes)
        message = b'a' * 500 * 1024
        # Send the data
        client_socket.sendall(message)
        print(f"Sent {len(message)} bytes of data to the server.")
    finally:
        # Clean up the connection
        client_socket.close()

if __name__ == "__main__":
    start_client()
Alongside the client on the mtu2 server, we will run a tcpdump wrapper inside a script called analyze. This lets us capture the whole path, from the SYN packet being sent to the FIN packet being received, which marks the connection as closed. The script subtracts the two timestamps and reports the total number of microseconds needed for the connection and transfer to finish. We will use this to determine how changing the MTU affects latency.
tcpdump script:
#!/bin/bash
echo "Starting analysis"

# Get current MTU
MTU=$(ip addr show | grep ens5 | grep mtu | awk -F" " {'print $5'})
echo "MTU: $MTU"

# Capture for 4 seconds. Enough for one execution and for everything to be written down
timeout 4 tcpdump -i ens5 'port 65432' > dump.log 2>/dev/null

# Gather start and stop timestamps (microsecond part of the SYN and FIN lines)
START_T=$(cat dump.log | grep "\[S\]" | awk -F" " {'print $1'} | awk -F "." {'print $2'})
END_T=$(cat dump.log | grep "\[F\.\]" | awk -F" " {'print $1'} | awk -F "." {'print $2'})
NUM_PACKETS=$(wc -l dump.log | awk -F" " {'print $1'})

# Cast to a plain integer, because values like 092340 would otherwise be treated as octal
START=$(expr $START_T + 0)
END=$(expr $END_T + 0)
DIFF=$(($END - $START))
echo "Time: $(($END - $START)) μs"

# Graph section:
# If we have an argument, use it as the index, otherwise just put in a timestamp
if [ $# -eq 0 ]; then
    INDEX=$(date +%s)
else
    INDEX=$1
fi
TIMESTAMP=$(date +%s)

# Log to csv
echo "$INDEX,$MTU,$NUM_PACKETS,$DIFF" >> data.csv
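As a cross-check of the tcpdump numbers, the same transfer can also be timed from inside the client process. The following is a hypothetical variant of client.py, not part of the benchmark itself; it measures the time from connect until the server closes the connection, which will not match the capture microsecond for microsecond but should show the same trend:

import socket
import time

def timed_send(host='10.0.1.89', port=65432, size=500 * 1024):
    # Same payload as client.py, wrapped in a wall-clock measurement
    message = b'a' * size
    start = time.perf_counter()
    with socket.create_connection((host, port)) as sock:
        sock.sendall(message)
        # Half-close the write side so the server's recv() loop sees EOF...
        sock.shutdown(socket.SHUT_WR)
        # ...then wait for the server to close its side, so the measurement
        # covers the full exchange rather than just filling the send buffer
        sock.recv(1)
    elapsed_us = (time.perf_counter() - start) * 1_000_000
    print(f"Sent {len(message)} bytes in {elapsed_us:.0f} µs")

if __name__ == "__main__":
    timed_send()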
To put everything together, we will run an automation script on our own local computer that executes everything in the correct order. It’s all quick and dirty, since this is a PoC to show how things behave.
automation script:
#!/bin/bash
# Note:
# mtu1 = server
# mtu2 = client

# Stop if there's an error
set -eu -o pipefail

# We will test these MTUs
MTUS=("1500" "9001")

for MTU in "${MTUS[@]}"
do
    # Set MTU on both servers, then proceed with testing
    echo "Setting MTU to $MTU"
    ssh mtu1 sudo ip link set dev ens5 mtu $MTU
    ssh mtu2 sudo ip link set dev ens5 mtu $MTU
    # Run the test 40 times and gather results. They will be available in data.csv as per the analyze script.
    for NUM in {1..40}
    do
        echo "Iteration ${NUM}: ${MTU} MTU"
        ssh mtu1 sudo "python3 server.py" &
        sleep 1
        ssh mtu2 sudo "./analyze $NUM" &
        sleep 1
        ssh mtu2 sudo "python3 client.py"
        sleep 4
    done
done
I think 40 runs is good enough to get at least some insight into whether there is a difference between the two MTUs for a transfer over the network.
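To turn the 40 samples per MTU into something readable, data.csv can be aggregated with a small script. This is a minimal sketch (call it summarize.py; it is not part of the original setup), assuming the column order index, MTU, packet count, microseconds that the analyze script writes:

import csv
import statistics
from collections import defaultdict

# Group the measured transfer times (in microseconds) and packet counts by MTU
times = defaultdict(list)
packets = defaultdict(list)

with open("data.csv", newline="") as f:
    for index, mtu, num_packets, diff_us in csv.reader(f):
        times[mtu].append(int(diff_us))
        packets[mtu].append(int(num_packets))

for mtu in sorted(times, key=int):
    t = times[mtu]
    print(f"MTU {mtu}: median {statistics.median(t):.0f} µs, "
          f"mean {statistics.mean(t):.0f} µs, "
          f"min {min(t)} µs, max {max(t)} µs, "
          f"~{statistics.mean(packets[mtu]):.0f} packets per run")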
Benchmark results
We notice a rough improvement of about 500 microseconds in latency when sending 500 kB of data over the wire. The difference in the number of packets needed to send this data is noticeable: with a 9001 MTU the data was sent in ~88 packets logged by tcpdump, while with a 1500 MTU ~401 packets were needed. We also notice that cross-AZ latencies vary; they are not consistent and occasionally take 1 ms longer. If I ran this long enough, we would see spikes where cross-AZ latency jumps by much more than 1 ms, in the range of 4 ms or so. This is something we raised with AWS in the past, because this kind of fluctuation affects our matching engine performance when it runs cross-AZ, but unfortunately there was no follow-up on it.
Final thoughts
The grand total: for an orderbook API call that started at 3 ms, shaving off 0.5 ms is a 16.67% improvement, and if we count the upper end at 4 ms, it is a 12.5% improvement. Some may not think this is a lot, but when dealing with numbers this low, down in the microseconds realm, it does make a difference, especially for the very low effort needed to achieve it.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html#jumbo_frame_instances