Dec 11, 2024
7 min read
Rust, Axum

Protecting Your API: Quick Start with Rate Limiting in the Rust Axum Framework

Rate limiting is a critical measure to protect systems from overload, especially when dealing with sudden traffic or malicious attacks. The Leaky Bucket and Token Bucket algorithms are two commonly used rate limiting strategies, each with its own characteristics and application scenarios. GCRA (Generic Cell Rate Algorithm) is an optimized version of the Leaky Bucket algorithm.

When integrating third-party APIs, it is common to encounter messages such as “Exceeded call limit” or “Exceeded call frequency limit,” or even receive a 429 status code.

In today’s era of widespread AI services, this situation is even more common. AI resources are expensive (mainly because of costly GPU hardware, which drives up the cost of concurrency), and rate limiting is the most effective way to maximize their utilization.

Leaky Bucket Algorithm

The basic idea of the Leaky Bucket algorithm is to treat requests as water. When a request arrives, this “water” is stored in a bucket with a fixed capacity. The bottom of the bucket has a small hole that leaks water (processes requests) at a constant rate. If the inflow of water exceeds the leak rate, the excess water will overflow, meaning the request is rejected. This method effectively smooths out traffic and ensures that the system’s input rate does not exceed the set maximum rate.
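To make the idea concrete, here is a minimal Leaky Bucket sketch (an illustration, not the implementation any library uses). It takes timestamps in seconds as plain numbers so the behavior is deterministic; `capacity` and `leak_rate` are the bucket size and drain rate:

```rust
/// A minimal leaky-bucket sketch using simulated time (seconds).
struct LeakyBucket {
    capacity: f64,  // maximum amount of "water" the bucket holds
    leak_rate: f64, // units drained per second
    water: f64,     // current water level
    last: f64,      // timestamp of the last update, in seconds
}

impl LeakyBucket {
    fn new(capacity: f64, leak_rate: f64) -> Self {
        Self { capacity, leak_rate, water: 0.0, last: 0.0 }
    }

    /// Returns true if a request arriving at time `now` fits in the bucket.
    fn allow(&mut self, now: f64) -> bool {
        // Drain water for the time elapsed since the last update.
        let leaked = (now - self.last) * self.leak_rate;
        self.water = (self.water - leaked).max(0.0);
        self.last = now;
        if self.water + 1.0 <= self.capacity {
            self.water += 1.0; // the request pours one unit in
            true
        } else {
            false // the bucket would overflow: reject
        }
    }
}
```

With `capacity = 2` and `leak_rate = 1`, two requests at the same instant are accepted, a third overflows, and one second later enough water has drained for another request to pass.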

What are its advantages?

  1. The Leaky Bucket algorithm strictly controls the data transmission rate, ensuring the system does not become overloaded due to sudden traffic.
  2. The algorithm logic is simple, making it easy to understand and implement.

What are its disadvantages?

  1. Due to its fixed leak rate, the Leaky Bucket algorithm performs poorly in handling sudden traffic, potentially leading to a large number of rejected requests.
  2. Even when there is no network congestion, the Leaky Bucket algorithm cannot fully utilize available bandwidth because it always processes requests at a preset rate.

Token Bucket Algorithm

The core idea of the Token Bucket algorithm is that the system adds tokens to the bucket at a constant rate. Each request consumes one token to be processed. If there are enough tokens in the bucket, the request can be processed immediately; otherwise, the request will be rejected or delayed. This mechanism allows for some degree of burst traffic, as long as there are enough tokens in the bucket, requests can pass quickly.
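A matching Token Bucket sketch, under the same simulated-time assumption, shows how accumulated tokens absorb a burst:

```rust
/// A minimal token-bucket sketch using simulated time (seconds).
struct TokenBucket {
    capacity: f64,  // maximum tokens the bucket can hold
    fill_rate: f64, // tokens added per second
    tokens: f64,    // tokens currently available
    last: f64,      // timestamp of the last refill, in seconds
}

impl TokenBucket {
    fn new(capacity: f64, fill_rate: f64) -> Self {
        // Start full so an initial burst of up to `capacity` requests passes.
        Self { capacity, fill_rate, tokens: capacity, last: 0.0 }
    }

    /// Returns true if a token is available for a request at time `now`.
    fn allow(&mut self, now: f64) -> bool {
        // Refill tokens for the elapsed time, capped at the capacity.
        self.tokens = (self.tokens + (now - self.last) * self.fill_rate)
            .min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0; // consume one token for this request
            true
        } else {
            false
        }
    }
}
```

With `capacity = 3` and `fill_rate = 1`, three simultaneous requests pass, the fourth is rejected, and one second later a refilled token lets the next one through.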

What are its advantages?

  1. The Token Bucket algorithm can handle requests exceeding the average rate within a certain time, improving the flexibility and response speed of the system.
  2. It better utilizes network resources by allowing tokens to accumulate during idle periods for use during peak times.

What are its disadvantages?

  1. Compared to the Leaky Bucket algorithm, it is more complex.
  2. If there are not enough tokens in the bucket, requests may be delayed.

Here is a comparison table:

| Feature | Leaky Bucket Algorithm | Token Bucket Algorithm |
| --- | --- | --- |
| Traffic Control | Fixed-rate outflow | Tokens generated at a fixed rate, allowing burst traffic |
| Burst Traffic Support | Not supported | Supported |
| Resource Utilization | Lower; the fixed rate may waste resources | Higher; better utilization of network resources |
| Application Scenario | Applications requiring strict traffic control, such as network bandwidth management | Applications that must absorb burst traffic |
| Implementation Complexity | Simple | Relatively complex |
| Delay Characteristics | Requests are either processed immediately or rejected | Requests are processed immediately if tokens are available; otherwise they may be delayed |

GCRA (Generic Cell Rate Algorithm)

The Leaky Bucket algorithm is rarely used directly due to its obvious drawbacks. GCRA is an improved version of the Leaky Bucket algorithm and has two equivalent descriptions: the Leaky Bucket mode and the Virtual Scheduling mode.

Leaky Bucket Mode

In this mode, GCRA defines a bucket with a finite capacity. The bucket drains at a rate of 1 unit per unit of time, and each confirmed request adds a fixed increment of water. If the water level X’ in the bucket is less than or equal to the limit τ when a request arrives, the request is confirmed; otherwise, it is rejected. This is similar to the traditional Leaky Bucket algorithm, but the tolerance parameter lets it better absorb burst traffic.

Virtual Scheduling Mode

In this mode, GCRA determines whether a request can be confirmed by comparing its Theoretical Arrival Time (TAT) with the actual arrival time t. The algorithm defines T, the expected interval between requests (equivalently, an expected rate of 1/T), and τ, the tolerance, i.e., how far ahead of the TAT a request may arrive. If t is less than TAT - τ, the request is considered too early and is rejected. When a request is confirmed, the algorithm sets the next TAT to max(t, TAT) + T. This mode is more elegant and easier to understand and implement.
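The virtual-scheduling rule fits in a few lines. This sketch again uses simulated timestamps; note how T = 2 and τ = 8 yield a burst of 5 requests followed by one request every 2 seconds:

```rust
/// GCRA in virtual-scheduling form, with simulated time (seconds).
struct Gcra {
    t_inc: f64, // T: the expected interval between requests
    tau: f64,   // τ: the tolerance (how early a request may arrive)
    tat: f64,   // Theoretical Arrival Time of the next conforming request
}

impl Gcra {
    fn new(t_inc: f64, tau: f64) -> Self {
        Self { t_inc, tau, tat: 0.0 }
    }

    /// Returns true if a request at time `t` conforms.
    fn allow(&mut self, t: f64) -> bool {
        if t < self.tat - self.tau {
            false // the request arrived too early: reject
        } else {
            // Conforming: push the theoretical arrival time forward by T.
            self.tat = self.tat.max(t) + self.t_inc;
            true
        }
    }
}
```

The τ = (burst - 1) × T relationship is why a GCRA limiter configured with a replenish interval and a burst size behaves like a token bucket while tracking only a single timestamp per key.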

Using the Axum Framework

Here, we use the GCRA-based tower_governor middleware. tower_governor is a tower middleware that rate limits endpoints by peer IP address, IP address headers, a global key, or custom keys. It is a tower adaptation of governor, a Rust implementation of GCRA.

Since Axum is built on top of tower, we can use tower_governor to implement interface rate limiting.

Rate Limiting Based on IP

Suppose I have a group of endpoints, and I want to limit each IP to a burst of 5 requests, replenished at a rate of one request every 2 seconds.

// Rate limit per IP: burst of 5, replenishing one request every 2 seconds
let governor_conf = Arc::new(
    GovernorConfigBuilder::default()
        .per_second(2)
        .key_extractor(SmartIpKeyExtractor)
        .burst_size(5)
        .finish()
        .unwrap(),
);
let rate_limit_layer = GovernorLayer {
    config: governor_conf,
};
// Limit the login route
let app = Router::new()
    .route("/auth", post(login).layer(rate_limit_layer));

tower_governor provides three basic extractors:

  1. PeerIpKeyExtractor: This is the default value, which uses the peer IP address of the request.
  2. SmartIpKeyExtractor: It looks up common IP identifier headers (x-forwarded-for, x-real-ip, forwarded) usually provided by reverse proxies in sequence and falls back to the peer IP address.
  3. GlobalKeyExtractor: Uses the same key for all incoming requests.

Rate Limiting Based on User Login Credentials

Rate limiting based on IP is only suitable for general scenarios. For high-cost, time-consuming scenarios, rate limiting based on user login credentials is the primary approach.

To implement rate limiting based on user login credentials, we just need to implement the KeyExtractor trait to create a custom extractor.

Assume we limit based on the user’s token, which is passed through the Authorization: Bearer {token} header. First, define a structure:

#[derive(Clone)]
pub struct UserTokenExtractor;

Implement the KeyExtractor trait:

impl KeyExtractor for UserTokenExtractor {
    type Key = String;

    fn extract<B>(&self, req: &Request<B>) -> Result<Self::Key, GovernorError> {
        req.headers()
            .get("Authorization")
            .and_then(|token| token.to_str().ok())
            .and_then(|token| token.strip_prefix("Bearer "))
            .map(|token| token.trim().to_owned())
            .ok_or(GovernorError::Other {
                code: StatusCode::UNAUTHORIZED,
                msg: Some("You don't have permission to access".to_string()),
                headers: None,
            })
    }

    fn key_name(&self, key: &Self::Key) -> Option<String> {
        Some(key.to_string())
    }

    fn name(&self) -> &'static str {
        "UserToken"
    }
}

The core of the implementation is the extract method, which pulls the authorization token out of the HTTP request. It works as follows:

  1. Retrieve the Authorization field from the request headers.
  2. Check if the field value is a valid string.
  3. Remove the Bearer prefix.
  4. Trim leading and trailing whitespace and return.
  5. If any step fails, return a GovernorError.
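Those steps can be exercised in isolation. The `parse_bearer` function below is a hypothetical helper (not part of tower_governor) that mirrors the same chain on a plain header string, returning None where the extractor would return a GovernorError:

```rust
/// Hypothetical helper mirroring the extractor's parsing steps:
/// take the Authorization header value, strip the "Bearer " prefix,
/// and trim surrounding whitespace.
fn parse_bearer(header: Option<&str>) -> Option<String> {
    header
        .and_then(|value| value.strip_prefix("Bearer "))
        .map(|token| token.trim().to_owned())
}
```

For example, `parse_bearer(Some("Bearer abc "))` yields `Some("abc".to_string())`, while a `Basic` credential or a missing header yields `None`.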

Usage

// Rate limit per user token: burst of 3, replenishing one request every 5 seconds
let token_rate_limit_layer = GovernorLayer {
    config: Arc::new(
        GovernorConfigBuilder::default()
            .per_second(5)
            .burst_size(3)
            .key_extractor(UserTokenExtractor)
            .use_headers()
            .finish()
            .unwrap(),
    ),
};
// Limit the request rate of the generation route
let app = Router::new()
    .route("/generate", post(generate).layer(token_rate_limit_layer));

When requests are made too quickly, the following error will appear:

Too Many Requests! Wait for 1s