When developing Go web applications, I used to favor Alpine as the deployment image because it is so small. Alpine uses musl as its C standard library implementation by default.
When developing in Rust, however, I didn't use Alpine images. My Rust projects were relatively complex, with dependencies that made cross-compiling to x86_64-unknown-linux-musl painful. For example, the openssl dependency is very common among Rust crates, and if a crate doesn't offer a rustls option, you have to set up an OpenSSL build environment for the musl target, which is quite troublesome. So I took the lazy route and used Ubuntu images instead.
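For example (a sketch using reqwest, which is not one of this post's dependencies), switching a crate from its OpenSSL-backed default to rustls usually comes down to a feature flag in Cargo.toml:

```toml
[dependencies]
# Turn off the default native-tls (OpenSSL) backend and opt into rustls,
# so no OpenSSL toolchain is needed for the musl target
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls"] }
```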
Previously, I had seen some Rust projects using Alpine images that would use third-party memory allocation libraries, such as jemalloc. I never quite understood this behavior until recently when I came across articles mentioning that musl’s default memory allocator has performance issues.
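For reference, wiring in jemalloc typically looks like this; a minimal sketch using the tikv-jemallocator crate (the projects I saw may have used a different binding):

```rust
// Cargo.toml: tikv-jemallocator = "0.6"
use tikv_jemallocator::Jemalloc;

// Route every heap allocation through jemalloc instead of musl's malloc
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // This Vec's buffer is now allocated by jemalloc
    let data = vec![0u8; 1024];
    println!("allocated {} bytes", data.len());
}
```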
Testing
To verify how severe this issue is, I also ran a test.
- Hardware: AMD Ryzen 9 7945HX3D
- System: Ubuntu 24.04
- Build image: ghcr.io/rust-cross/rust-musl-cross:x86_64-unknown-linux-musl
- Runtime image: gcr.io/iguazio/alpine:3.20
Here we compare musl’s default allocator with mimalloc.
The same test code is used for both allocators:
```rust
use std::time::Instant;

fn main() {
    println!("=== Memory Allocator Benchmark ===");

    // Test parameters
    let num_threads = std::thread::available_parallelism().map_or(8, |x| x.get());
    let iterations = 100_000;
    println!("Threads: {}, Iterations per thread: {}", num_threads, iterations);

    // Benchmark
    let start = Instant::now();
    let mut handles = vec![];
    for _ in 0..num_threads {
        let handle = std::thread::spawn(move || {
            let mut counter = 0;
            for _ in 0..iterations {
                // Dynamically allocate buffers of varying sizes
                let data = vec![1u8; counter % 1000 + 1];
                counter += usize::from(data.get(100).copied().unwrap_or(1));
            }
            counter
        });
        handles.push(handle);
    }

    let mut total_counter = 0;
    for handle in handles {
        total_counter += handle.join().unwrap();
    }

    let duration = start.elapsed();
    println!("Total counter: {}", total_counter);
    println!("Time elapsed: {:.2?}", duration);
    println!(
        "Throughput: {:.2} operations/sec",
        (num_threads * iterations) as f64 / duration.as_secs_f64()
    );
}
```
The only difference is that the mimalloc version adds this dependency to Cargo.toml:
```toml
[dependencies]
mimalloc = { version = "0.1.48", features = ["secure", "v3"] }
```
And in main.rs add:
```rust
// Use mimalloc as the global allocator; the cfg_attr means the attribute
// is applied only when the target C library is musl
#[cfg_attr(target_env = "musl", global_allocator)]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
```
The test results are as follows:
```text
=== Standard Allocator (musl default) ===
=== Memory Allocator Benchmark ===
Threads: 32, Iterations per thread: 100000
Total counter: 3200000
Time elapsed: 7.12s
Throughput: 449749.72 operations/sec

=== Mimalloc Allocator ===
=== Memory Allocator Benchmark (with mimalloc) ===
Threads: 32, Iterations per thread: 100000
Allocator: mimalloc with secure+v3 features
Total counter: 3200000
Time elapsed: 660.86ms
Throughput: 4842149.67 operations/sec
```
Performance Comparison Test Results
| Item | Standard Allocator (musl default) | Mimalloc Allocator (secure+v3 features) | Performance Improvement |
|---|---|---|---|
| Total Operations | 3,200,000 | 3,200,000 | Same |
| Time Elapsed | 7.12 seconds | 660.86 milliseconds (0.66 seconds) | About 10.8x |
| Throughput | 449,749.72 ops/sec | 4,842,149.67 ops/sec | About 10.8x |
Performance Improvement Analysis
Compared with the standard allocator, mimalloc (secure + v3) achieves:
- Speed: about 10.8x faster (7.12 s vs 0.66 s)
- Throughput: about 10.8x higher (4,842,149 vs 449,749 ops/sec)
The above test used mimalloc with the secure feature. When I removed the secure feature, the performance improvement was as follows:
```text
=== Standard Allocator (musl default) ===
=== Memory Allocator Benchmark ===
Threads: 32, Iterations per thread: 100000
Total counter: 3200000
Time elapsed: 7.05s
Throughput: 453651.14 operations/sec

=== Mimalloc Allocator ===
=== Memory Allocator Benchmark (with mimalloc) ===
Threads: 32, Iterations per thread: 100000
Allocator: mimalloc with v3 feature
Total counter: 3200000
Time elapsed: 19.62ms
Throughput: 163117709.56 operations/sec
```
The comparison results are as follows:
| Item | Standard Allocator (musl default) | Mimalloc Allocator (v3 feature only) | Performance Improvement |
|---|---|---|---|
| Total Operations | 3,200,000 | 3,200,000 | Same |
| Time Elapsed | 7.05 seconds | 19.62 milliseconds (0.01962 seconds) | About 359x |
| Throughput | 453,651.14 ops/sec | 163,117,709.56 ops/sec | About 360x |
This result is so extreme that I started to doubt my own test code. However, it seems plausible: others have run similar comparisons on 48-core systems and observed differences of up to 700x.
Therefore, if you deploy Rust projects built against musl, consider using a third-party memory allocator such as mimalloc.
Minimal Deployment Images
Alpine images are popular because they are very small. However, if the project is simple enough, you can try using an empty image (scratch). For example, the following approach:
```dockerfile
# Multi-stage build that produces a fully static binary
FROM ghcr.io/rust-cross/rust-musl-cross:x86_64-unknown-linux-musl AS builder

# Set working directory
WORKDIR /app

# Copy the Cargo configuration (points at a regional crates.io mirror for faster downloads)
COPY .cargo/config.toml /root/.cargo/config.toml

# Copy the manifest files first to take advantage of Docker layer caching
COPY Cargo.toml Cargo.lock ./

# Build with a dummy main.rs so dependencies are downloaded and compiled once
RUN mkdir src && \
    echo "fn main() {}" > src/main.rs && \
    cargo build --release --target x86_64-unknown-linux-musl && \
    rm -rf src

# Copy the real source code
COPY src/ ./src/

# Build the application (touch main.rs so cargo does not reuse the dummy build)
RUN touch src/main.rs && \
    cargo build --release --target x86_64-unknown-linux-musl

# Runtime stage - use the empty scratch image for the smallest possible size
FROM scratch AS runtime

# Copy the CA certificate bundle from the build image (needed for outbound TLS)
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/

# Copy the binary from the build stage
COPY --from=builder /app/target/x86_64-unknown-linux-musl/release/tinyserver /tinyserver

# Expose the service port
EXPOSE 3000

# Start the application
CMD ["/tinyserver"]
```
Using this approach, the image size is almost exactly the size of the program itself.
The prerequisite is that the project really is simple enough; otherwise you may run into unexpected runtime errors, since scratch ships with no shell, CA certificates, timezone data, or even a /tmp directory. When in doubt, an Alpine base image is the more reliable choice.
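If you do stick with scratch, a couple of extra COPY lines often cover the most common gaps. This is a hedged sketch, not part of the original Dockerfile: the zoneinfo copy assumes the application formats local times, and the passwd copy assumes an unprivileged user (here hypothetically named appuser) was created in the builder stage.

```dockerfile
# Optional additions to the scratch runtime stage; include only what the
# binary actually needs at runtime.

# Timezone database, for applications that format local times
COPY --from=builder /usr/share/zoneinfo /usr/share/zoneinfo

# Minimal passwd file so the container can drop root privileges
# (assumes an "appuser" entry was added to /etc/passwd in the builder stage)
COPY --from=builder /etc/passwd /etc/passwd
USER appuser
```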