A Brief Discussion on Maglev, Google's Soft Load Balancing Practice

It's been a while since I last blogged, so let me write up some simple paper-reading notes. This article is based on Maglev: A Fast and Reliable Software Network Load Balancer, a paper Google published in 2016 sharing the implementation of its internal software load balancing system, which has been running at scale since 2008. There are many interesting details in it, and I'll cover as much as I can.

Background#

The concept of load balancing is familiar to everyone, so I won't elaborate on it again. Now consider Google's scenario. From the very beginning, Google needed an LB fronting some of its flagship services, such as Google Search and Gmail, and given those services' enormous traffic, the LB had to offer very high throughput.

In this case, the traditional idea might be to directly use professional hardware load balancing; problems that can be solved with money are not considered issues (laughs). However, such a solution has significant problems:

[Figure: a traditional hardware load balancer deployment]

  1. The performance of a single point of hardware load balancing determines the number of requests the entire network can handle.
  2. There are flaws in HA. To ensure that the entire network cluster does not become paralyzed when a single point fails, we usually need a 1:1 redundancy.
  3. Lack of flexibility and programmability; there are no entry points for doing fancy operations.
  4. It's too expensive. So expensive that even Google can't afford it (runs away).

In such a situation, Google began to consider building its own SLB (Software Load Balancer) system. The benefits are also very obvious. For example, convenient scaling, reducing the required redundancy for HA from the previous 1:1 to N+1, and convenient customization, etc. The architecture evolved into the diagram below:

[Figure: the software load balancer architecture]

However, the challenges are also evident. First, sufficient performance is needed to ensure that the cluster has enough throughput. At the same time, connection tracking is required to ensure that packets from the same connection can be delivered to the same machine. It may also be necessary to ensure transparent failover capability.

These requirements combined lead us to today's topic: Maglev, the LB system that Google has been running at scale since 2008.

A Glimpse of Maglev#

Background Knowledge#

Before continuing to discuss Maglev, we need to understand how Google currently uses Maglev. Below is a simplified schematic diagram:

[Figure: how Maglev fits into Google's frontend serving path]

At the same time, we need to introduce a very important concept called VIP (Virtual IP Address). Those who have used Kubernetes will definitely be familiar with this concept. A VIP is not a physical IP bound to a network card. It can be seen as an abstraction of a group of backend endpoints. When you access this VIP, you are actually accessing the backend endpoints. Here’s a more understandable example: after creating a set of Pods in Kubernetes, to expose the services provided by the Pods, we usually create a Service to associate with the corresponding Pods. The Service typically has an IP, which is a VIP. When we access the Service's IP, it will usually randomly select one of the backend Pods to handle the request.

Now, back to Maglev. Let's walk through the whole flow. Maglev is associated with a VIP, which it announces to a group of routers (over BGP). When a user types https://www.google.com in the browser and hits enter, the browser performs DNS resolution, which is handled by Google's DNS servers. The DNS server selects a nearby cluster's VIP based on the user's region and returns it, and the browser then establishes a connection to that VIP.

When a router receives a packet for that VIP, it forwards the packet to one of the nodes in the Maglev cluster serving the VIP; all nodes carry equal weight, so the router (via ECMP) spreads packets evenly across them. When a Maglev node receives the packet, it encapsulates it with GRE (Generic Routing Encapsulation) and transmits it to the selected backend endpoint.

When the backend endpoint receives the packet, it processes the request. Once the response data is ready, the backend builds the reply itself, using the VIP as the source address, the user's IP as the destination address, and the response data as the payload. It then uses DSR (Direct Server Return) to send the packet back directly, bypassing Maglev, which avoids putting extra load on Maglev when responses are large. DSR is in fact widely used in L4 LB implementations (LVS's DR mode, for example); I'll write a blog post about it when I have time.
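To make the addressing concrete, here is a rough sketch with scapy of what the two packets look like. This is my illustration, not code from the paper, and all addresses and ports are made up:

from scapy.all import GRE, IP, TCP, Raw

CLIENT_IP, VIP = "203.0.113.7", "198.51.100.1"   # hypothetical addresses
MAGLEV_IP, BACKEND_IP = "10.0.0.2", "10.0.1.9"

# Inbound: the client's packet is addressed to the VIP...
inner = IP(src=CLIENT_IP, dst=VIP) / TCP(sport=51234, dport=443)
# ...and Maglev wraps it, unmodified, in GRE toward the chosen backend.
encapsulated = IP(src=MAGLEV_IP, dst=BACKEND_IP) / GRE() / inner

# Outbound (DSR): the backend replies to the client directly, with the VIP
# as the source address, so the response never passes through Maglev.
response = (IP(src=VIP, dst=CLIENT_IP)
            / TCP(sport=443, dport=51234)
            / Raw(b"HTTP/1.1 200 OK\r\n\r\n"))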

Maglev Configuration#

As mentioned earlier, Maglev receives VIP traffic from the routers and forwards it to the appropriate backend endpoints. Each Maglev machine consists of a Controller and a Forwarder, as shown in the architecture below:

[Figure: Maglev machine architecture, Controller and Forwarder]

Both the Controller and the Forwarder manage the relevant VIPs through a Configuration Object. The Configuration Object is actually another system (roughly, a service registry), and they communicate with it via RPC.

On each Maglev machine, the Controller periodically health-checks the Forwarder. Based on the result, it announces or withdraws all of the VIPs via BGP (either all succeed or all fail, which ensures system consistency). This guarantees that traffic coming from the routers only reaches healthy machines.
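A minimal sketch of that control loop might look like this; the forwarder and bgp_session interfaces are hypothetical stand-ins of my own, not Google's actual API:

import time
from typing import List

HEALTH_CHECK_INTERVAL_SECONDS = 2.0  # assumed interval, not from the paper

def controller_loop(forwarder, bgp_session, vips: List[str]) -> None:
    """Health-check the local Forwarder and announce or withdraw ALL VIPs
    as one atomic decision, never a partial subset."""
    announced = False
    while True:
        healthy = forwarder.is_healthy()    # hypothetical health probe
        if healthy and not announced:
            bgp_session.announce(vips)      # all-or-nothing announcement
            announced = True
        elif not healthy and announced:
            bgp_session.withdraw(vips)      # all-or-nothing withdrawal
            announced = False
        time.sleep(HEALTH_CHECK_INTERVAL_SECONDS)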

VIP traffic coming from the routers is handled by the Forwarder. Inside the Forwarder, each VIP is associated with one or more backend pools. Unless specially configured, the backends in Maglev are service endpoints; a backend pool can contain either the physical IPs of service endpoints or other backend pools. Each backend pool has health checks configured according to its specific needs, and packets are forwarded only to healthy backends. As mentioned, the same endpoint may appear in multiple backend pools, so the Forwarder deduplicates health checks by address to avoid extra overhead.

The Forwarder's Config Manager is responsible for pulling, parsing, and validating the relevant configurations from the Configuration Object. All configuration commits are atomic (either all succeed or all fail). Between a config being pushed and it taking effect, there is a very brief window during which the machines within a Maglev cluster may be out of sync; thanks to consistent hashing, though, most requests are still delivered successfully inside that window.
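That atomic commit can be approximated by fully parsing and validating the new config off to the side and then swapping a single reference. A toy sketch of my own, not Google's code:

import json
import threading

class ConfigManager:
    """Parse and validate a candidate config completely before swapping one
    reference, so readers always see either the old or the new config."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active: dict = {}

    def commit(self, raw: str) -> bool:
        try:
            candidate = json.loads(raw)      # parse the whole config first
            if "vips" not in candidate:      # minimal validation stand-in
                return False
        except ValueError:
            return False                     # reject: nothing takes effect
        with self._lock:
            self._active = candidate         # atomic swap of the reference
        return True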

Maglev Implementation#

Alright, after all this talk, let's look at some practical details of the entire Maglev system.

Overview#

As we all know (well, as mentioned earlier), the Forwarder is the part of Maglev that does the actual traffic forwarding. Let's illustrate its structure with a diagram:

[Figure: the Forwarder's packet processing pipeline]

The Forwarder receives packets directly from the NIC (Network Interface Card) and, once they are processed, puts them straight back onto the NIC for forwarding to the backends. Nothing in this path goes through the kernel (going through the kernel would in fact incur extra cost).

Packets retrieved from the NIC will first be processed by the Steering Module. During processing, the Steering Module will perform hash calculations based on the five-tuple (protocol, destination address, destination port, source address, source port). It will then transfer the packets to the corresponding Receiving Queue. Each Receiving Queue corresponds to a processing thread. The processing thread will filter out packets whose destination VIP does not match the locally registered VIP. It will then recalculate the five-tuple hash and look up the corresponding value in the Connection Tracking Table.
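As a small illustration of the steering step just described (my sketch; the hash function and queue count are arbitrary choices, not the paper's):

import hashlib
from typing import NamedTuple

NUM_QUEUES = 8  # assumed: one receiving queue per processing thread

class FiveTuple(NamedTuple):
    protocol: int
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

def steer(pkt: FiveTuple) -> int:
    """Hash the five-tuple to pick a receiving queue, so packets of the
    same connection always land on the same processing thread."""
    key = f"{pkt.protocol}|{pkt.src_ip}|{pkt.src_port}|{pkt.dst_ip}|{pkt.dst_port}"
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_QUEUES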

The Connection Tracking Table stores the backend chosen for each five-tuple hash seen before. On a hit, the cached backend is reused directly; on a miss, a new backend is selected for the packet and the key-value pair is added to the table. If no backend is available at that point, the packet is dropped. After the lookup, the packet is rewritten as described earlier and placed into the transmission queue. Finally, the muxing module sends the packets in the transmission queue out through the NIC directly.
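Backend selection then reduces to roughly the following. This is a sketch that assumes lookup_table is the consistent-hash table built later in this post, and it uses an unbounded dict where the real table is size-capped:

from typing import Dict, List, Optional

def select_backend(flow_hash: int,
                   conntrack: Dict[int, str],
                   lookup_table: List[str]) -> Optional[str]:
    """Reuse the tracked backend for a known flow; otherwise consult the
    consistent-hash lookup table and remember the choice."""
    if flow_hash in conntrack:
        return conntrack[flow_hash]
    if not lookup_table:
        return None                      # no backend available: drop the packet
    backend = lookup_table[flow_hash % len(lookup_table)]
    conntrack[flow_hash] = backend
    return backend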

Here's a question: why not use a common strategy like round-robin in the Steering Module? As everyone knows, the threads process at different speeds, so plain round-robin can reorder packets within a connection. Adding weights to compensate introduces new complexity, because thread processing speed changes dynamically. Connection tracking is another problem: for a long-lived connection we need every packet to reach the same thread (and ultimately the same backend), which round-robin would further complicate. That said, in special cases, such as when a receiving queue is full, hashing alone cannot cope, and round-robin is used as a fallback. This fallback is particularly effective when a flood of packets shares the same five-tuple, since hashing would pile them all onto a single queue.

Efficient Packet Processing#

As mentioned earlier, Maglev operates on raw packets directly. Given Google's enormous traffic, Maglev needs excellent forwarding performance; otherwise its throughput would fall short at this scale. How does Google achieve this? Answer: by operating on the network card directly.

We all know that in Linux network programming, copying packets between kernel space and user space is a very costly operation. For scenarios with extreme performance demands, such as L4 load balancing, people therefore tend to implement things inside the kernel to avoid copies across the user/kernel boundary; this is the idea behind tools like LVS. But at even larger traffic scales, the trip from the network card into the kernel and through a pile of kernel filters is itself very costly. As mentioned earlier, Maglev relies only on the five-tuple of a packet and does not care about sequence numbers or payload. So Google thought: I have a bold idea! Let's look at a diagram:

[Figure: the packet pool shared between the Forwarder and the NIC]

Google chose to program directly against the NIC (i.e., the network card). The Forwarder and the NIC share a region of memory that holds a circular pool of packet buffers. The steering module and the muxing module in the Forwarder each maintain three pointers into this pool, described in detail below.

First, the steering module maintains three pointers:

  1. received, pointing at packets newly placed into the pool by the NIC.
  2. reserved, pointing at empty buffers reserved from the pool, waiting for the NIC to fill them.
  3. processed, pointing past packets that have already been handed to processing threads.

The process is as follows: when the NIC receives a new packet, it writes it into the buffer at the received pointer and advances that pointer. When the steering module dispatches a packet to a processing thread, the processed pointer advances past it. Since the structure is circular, the buffers between processed and received hold packets that have been received but not yet dispatched, and the steering module keeps replenishing the ring by reserving unused buffers from the pool with the reserved pointer.

Correspondingly, the muxing module also maintains three pointers:

  1. sent, managing packets that have been sent.
  2. ready, managing packets that are ready and waiting to be sent.
  3. recycled, managing recycled packets.

The corresponding process is as follows: when a processing thread finishes rewriting a packet, the muxing module places it at the ready pointer, where it waits to be sent. When the NIC sends a packet, the sent pointer advances past it. Sent packets are finally returned to the shared pool, tracked by the recycled pointer.
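To make the pointer dance concrete, here is a toy model of the receive-side ring (my simplification: indices into a Python list instead of real NIC-shared memory, and no guard against reserving over unprocessed slots):

from typing import List, Optional

class ReceiveRing:
    """Toy model of the circular packet pool, steering side.
    Invariant (modulo wrap-around): processed <= received <= reserved."""
    def __init__(self, size: int) -> None:
        self.buffers: List[Optional[bytes]] = [None] * size
        self.size = size
        self.received = 0    # next slot the NIC will fill
        self.processed = 0   # next filled slot to hand to a thread
        self.reserved = 0    # one past the last empty buffer given to the NIC

    def reserve(self, count: int) -> None:
        # Steering module hands `count` empty buffers to the NIC.
        self.reserved = (self.reserved + count) % self.size

    def nic_receive(self, packet: bytes) -> None:
        # The NIC fills the buffer at `received`, then advances the pointer.
        assert self.received != self.reserved, "no reserved buffers left"
        self.buffers[self.received] = packet
        self.received = (self.received + 1) % self.size

    def dispatch(self) -> bytes:
        # Steering module hands the packet at `processed` to a thread.
        assert self.processed != self.received, "nothing to dispatch"
        packet = self.buffers[self.processed]
        self.processed = (self.processed + 1) % self.size
        return packet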

We can see that no data is copied anywhere in this process, which removes the latency that copying would introduce. One catch is that updating these shared pointers too frequently carries significant overhead of its own, so Google processes packets in batches, for example accumulating 3000 small packets and handling them in one go, which is quite a clever trick.

Additionally, some extra optimizations need to be made, such as ensuring that packet processing threads do not share data to avoid race conditions, and binding threads to specific CPU cores to ensure performance, etc.

Google's approach is very efficient, with an average processing time of only 300 ns ($10^{-9}$ s) per packet. As mentioned, packets are processed in batches, and the problem with batching is that when traffic is light, waiting for a batch to fill can take much longer than usual. Google therefore added a 50 μs ($10^{-6}$ s) timer to flush a pending batch, which means that in the worst case batching adds at most about 50 μs to packet processing time (honestly, it feels like Google is showing off: look how great our performance is, the only bottleneck left is the hardware (runs away)).

Backend Selection#

As mentioned earlier, the Forwarder will select a backend for the packets. For common TCP scenarios, it is crucial to forward packets with the same five-tuple to the same backend node. Google maintains a connection tracking table in Maglev to solve this problem. When a packet arrives, Maglev calculates its five-tuple hash and checks whether it exists in the table. If it does not exist, it selects a node as the backend and adds the record to the table. If it exists, it is reused directly.

This seems fine, right? Google: No, no, there are still problems!

First, consider a scenario: as mentioned earlier, Maglev sits behind a group of routers, and the routers provide no connection affinity, that is, they do not guarantee that packets of the same connection are sent to the same machine. So different packets of one connection may well land on different Maglev machines. And even if the routers did provide affinity, there is still the case where a machine restarts and its connection tracking table is wiped.

Another example: the memory available to the connection tracking table is necessarily bounded. Under very high traffic or abnormal situations like a SYN flood, once the table reaches its capacity threshold some entries inevitably get evicted, and the tracking information for a live connection may well be cleared. How do we keep connection tracking working in such situations?

Google's approach is to introduce consistent hashing.

Consistent Hashing: Maglev Hash#

The overall algorithm has many details, but I will only explain the general idea here; for specific details, you can refer to the original text.

First, we need to determine the length M of the lookup table after preprocessing. All keys will be hashed into this lookup table, and each element in the lookup table will be mapped to a node.

The calculation of the lookup table is divided into two steps:

  1. For each node, compute a preference value for every position in the lookup table (this is what the paper calls the permutation).
  2. Using these values, compute which node each position in the lookup table maps to (stored in the entry array; the entry array is what the paper calls the final lookup table).

The permutation is an N×M matrix, where each row corresponds to a node and each column to a position in the lookup table. To compute the permutation, two hash algorithms are chosen to derive two values, offset and skip, and each row is then filled in based on offset and skip, as follows:

  1. offset ← h1(name[i]) mod M
  2. skip ← h2(name[i]) mod (M − 1) + 1
  3. permutation[i][j] ← (offset + j × skip) mod M

Where i is the index of the node in the Node Table, and j is the index in the lookup table.
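Here is a small sketch of the permutation step; h1 and h2 are stand-ins built from salted SHA-256, my arbitrary choice rather than the paper's (the paper leaves the hash functions unspecified, but does require M to be prime):

import hashlib
from typing import List

def _hash(name: str, salt: str) -> int:
    # Stand-in hash; the paper does not prescribe h1/h2.
    return int.from_bytes(hashlib.sha256(f"{salt}:{name}".encode()).digest()[:8], "big")

def calculate_permutation(names: List[str], m: int) -> List[List[int]]:
    """Row i is node i's preference order over the M lookup-table positions.
    M must be prime so that every row is a full permutation of 0..M-1."""
    permutation: List[List[int]] = []
    for name in names:
        offset = _hash(name, "offset") % m
        skip = _hash(name, "skip") % (m - 1) + 1
        permutation.append([(offset + j * skip) % m for j in range(m)])
    return permutation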

After calculating the permutation, we can calculate the final lookup table, which is represented as a one-dimensional array:

[Figure: example permutations and the resulting lookup table]

Here’s a piece of code to calculate the lookup table based on the already computed permutation:

from typing import List

# Calculate the lookup table from the already-computed permutation
def calculate_lookup_table(n: int, m: int, permutation: List[List[int]]) -> List[int]:
    # result is the final lookup table, recording which node each slot maps to
    result: List[int] = [-1] * m
    # next_index[i] tracks how far we have scanned node i's preference row;
    # when the slot a node wants is already taken, we move on to its next choice.
    # Since each row contains every value from 0 to M-1, the scan always terminates.
    # Complexity is O(M log M) on average, up to O(M^2) in the worst case.
    next_index: List[int] = [0] * n
    filled = 0
    while True:
        for i in range(n):
            x = permutation[i][next_index[i]]
            # Skip over slots that are already occupied
            while result[x] != -1:
                next_index[i] += 1
                x = permutation[i][next_index[i]]
            result[x] = i
            next_index[i] += 1
            filled += 1
            # Stop once every slot in the table is filled
            if filled == m:
                return result

Looking at this loop, we can see that it is guaranteed to terminate, but the worst case is expensive, potentially reaching O(M^2). The paper suggests choosing M much larger than N ("To avoid this happening we always choose M such that M ≫ N.") to keep the average complexity at O(M log M).
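Putting the two steps together, assuming the hypothetical calculate_permutation and _hash helpers sketched above:

# A toy run: 3 backends and a prime table size of 7 for readability;
# a real deployment would use a much larger prime M with M >> N.
backends = ["backend-a", "backend-b", "backend-c"]
M = 7
perm = calculate_permutation(backends, M)
table = calculate_lookup_table(len(backends), M, perm)

# Route a flow: hash its five-tuple, then index into the lookup table.
flow_hash = _hash("tcp|203.0.113.7|51234|198.51.100.1|443", "flow")
print(backends[table[flow_hash % M]])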

How does Google's self-developed consistent hashing algorithm perform in Maglev? The paper also conducted tests:

[Figure: load balance comparison across lookup table sizes]

It can be seen that for different sizes of the lookup table, Maglev exhibits better balance.

To be honest, in my view, Maglev hash is essentially hashing with virtual nodes, and I honestly don't know why Google didn't use a more mature scheme like the one in Dynamo. Policy reasons, perhaps? (After all, Dynamo belongs to AWS (runs away).) By the way, Envoy also implements Maglev, see Evaluate other consistent hash LB algorithms; it even adds weights and is implemented quite well. Interested readers can check it out (runs away).

To be honest, there are still many details of Maglev Hash that I haven't covered, but I'm too lazy to write them all up; I'll save them for a later blog post analyzing consistent hashing. Flag++

Maglev Optimization#

We have already covered the basic principles of Maglev. However, if it is to be used as a load balancer on a large scale in production, many optimizations need to be made for the details. Since this involves many aspects, I will only briefly introduce a few here; I still recommend everyone to read the original text directly.

Handling Fragmented Packets#

Those familiar with networking know that when packets are transmitted over the IP protocol, the MTU size limit can force them to be fragmented, and fragments may not carry the full five-tuple. For example, if a packet is split into two fragments, the first carries both the L3 and L4 headers, while the second carries only the L3 header. And because of network conditions along the way, Maglev cannot even assume that the fragments of one packet arrive at the same machine, or in order.

This is a big problem, because packet fragmentation is a very common scenario. So how should Maglev handle it? First, let's pin down what is needed for all the data to be delivered successfully:

  1. Different fragments of the same packet must be processed by the same Maglev instance.
  2. The backend selection for different fragments of the same packet must be consistent.

Okay, let's see how Google solves this problem.

First, each Maglev instance has a special backend pool that contains all the Maglev instances in the cluster. When a fragment is received, Maglev computes a hash over the three-tuple (source address, destination address, protocol family) and uses it to choose a Maglev instance to forward the fragment to. Since all fragments of a packet share the same three-tuple, they converge on the same instance, which satisfies condition 1. GRE's recursion control is used to avoid infinite forwarding loops.

Now let's see how condition 2 is satisfied. Each Maglev instance maintains a special table that records the forwarding decision made for the first fragment of each packet. Continuing the earlier example, when the second fragment arrives, Maglev looks up the decision recorded for the first fragment: if it exists, the fragment is forwarded immediately; if not, the fragment is cached until the first fragment arrives or a timeout threshold is reached.
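A rough sketch of that fragment table; this is my illustration, with invented names and timeout, not the paper's data structures:

import time
from typing import Dict, List, Optional, Tuple

# Fragments of one packet share (src_ip, dst_ip, IP identification field).
FragKey = Tuple[str, str, int]
FRAGMENT_TIMEOUT_SECONDS = 1.0  # assumed value, not from the paper

first_fragment_backend: Dict[FragKey, str] = {}         # key -> chosen backend
pending: Dict[FragKey, List[Tuple[float, bytes]]] = {}  # fragments waiting

def handle_fragment(key: FragKey, is_first: bool, frag: bytes,
                    choose_backend) -> Optional[str]:
    if is_first:
        backend = choose_backend()     # the normal five-tuple path applies here
        first_fragment_backend[key] = backend
        pending.pop(key, None)         # buffered fragments would now be flushed
        return backend
    if key in first_fragment_backend:
        return first_fragment_backend[key]   # follow the first fragment
    # First fragment not seen yet: buffer until it shows up or times out.
    pending.setdefault(key, []).append((time.time(), frag))
    return None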

Monitoring and Debugging#

In reality, a system running this long couldn't possibly need debugging (crossed out) (laughs). Google designed auxiliary monitoring and debugging tools for this system to support daily development and iteration.

On the monitoring side, there is both black-box and white-box monitoring. For example, monitoring nodes distributed around the globe check the health of the VIPs, while a complete set of white-box monitoring watches both per-server metrics and Maglev's own internal metrics.

There are also debugging tools. For example, Google developed a packet tracer similar to X-Trace. It sends packets with specially crafted headers and payloads; when a Maglev receives such a packet, in addition to forwarding it as usual, it reports key information to a designated location.

This actually reflects one of the advantages of software load balancing over hardware load balancing; both debuggability and iterability are unmatched by hardware load balancing.

Conclusion#

It actually took me quite a while to read this paper, and many details are worth chewing over slowly, so once again I recommend finding the original text and reading it; it's very good. I'd also like to recommend an article by the Meituan technical team, who likewise drew on Maglev to build their own high-performance L4 load balancer. See MGW——Meituan's High-Performance Layer 4 Load Balancer.

Alright, let's end this article here. This should be the most time-consuming article I've written. However, thinking about the few articles I still have to write, my head is starting to hurt.

Gotta run!
