Jul 03, 2025
6 min read
Rust, PyTorch, candle, ort

Is Rust Suitable for AI Model Inference?

With the advancement of AI technology, we’re seeing an increasing number of AI projects adopting the Rust programming language. Beyond model inference, Rust is also being used to build frameworks for AI systems, such as LLM workflow engines.

Back to the title: Is Rust suitable for model inference?

A few years ago, my answer would have been no. Back then, Rust lacked mature libraries for scientific computing (many were incomplete or missing entirely) and had limited support for GPU programming, so doing deep learning work in Rust was quite difficult.

What about now (2025)?

Yes, it can be used — but it depends on the deployment scenario. If you’re targeting edge devices, forget about it and stick with C++.

Currently, there are roughly three popular options in Rust for model inference: tch-rs (Rust bindings to libtorch), candle (Hugging Face's pure-Rust framework), and ort (a wrapper around ONNX Runtime).

If you were to implement model inference in Rust, how would you choose?

“Unfortunately,” I’ve personally used all three approaches — even reimplementing the same project using each of them separately.

tch-rs

tch-rs wraps libtorch, so we only need to focus on libtorch. libtorch is the C++ interface of PyTorch, and its API closely mirrors PyTorch’s Python API. Therefore, if your model was implemented in PyTorch, using libtorch for model inference has a very low learning curve and feels very smooth.
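To illustrate how closely the flow mirrors PyTorch, here is a minimal sketch of loading a TorchScript model with tch-rs and running a forward pass. The file name "model.pt" and the input shape are placeholders for illustration; it assumes the model was exported with torch.jit.trace or torch.jit.script on the Python side.

```rust
use tch::{CModule, Device, Kind, Tensor};

fn main() -> Result<(), tch::TchError> {
    // Pick a CUDA device if one is available, otherwise fall back to CPU.
    let device = Device::cuda_if_available();

    // Load the TorchScript module; mirrors torch.jit.load in Python.
    // "model.pt" is a placeholder path.
    let model = CModule::load_on_device("model.pt", device)?;

    // A dummy NCHW batch standing in for a preprocessed image.
    let input = Tensor::randn(&[1, 3, 224, 224], (Kind::Float, device));

    // forward_ts mirrors calling model(input) in PyTorch.
    let output = model.forward_ts(&[input])?;
    println!("output shape: {:?}", output.size());
    Ok(())
}
```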

However, PyTorch is not purely an inference framework — it’s primarily designed for training, with inference being just a small part. For inference-only projects, it’s simply too heavy.

For example, the libtorch 2.7.1 CUDA 12.8 version alone comes as a 3.5GB compressed package:

https://download.pytorch.org/libtorch/cu128/libtorch-cxx11-abi-shared-with-deps-2.7.1%2Bcu128.zip

Regarding the tch-rs project specifically: since libtorch is a C++ library, integrating it into a Rust project takes some care. You must configure the libtorch C++ library yourself, typically by pointing the LIBTORCH environment variable at a local installation whose version matches what tch-rs expects.

Another downside is that using tch-rs significantly slows down rust-analyzer during development, which negatively impacts the overall developer experience.

candle

candle is a minimalistic inference framework developed by Hugging Face, fully implemented in Rust. Rustaceans who haven’t dug into the details often reach for candle as their first choice for inference (and yes, I’m one of them).

The advantages of candle are significant. For instance, being a pure Rust implementation makes integration much cleaner compared to the other two options. It even enables model deployment on the WebAssembly (Wasm) platform, thanks to Rust’s excellent Wasm support.

Is it really that perfect?

candle seems to be built primarily for large language models (LLMs), so it performs well with Transformer models — a relatively small branch in the broader model ecosystem. Its support for traditional CNN and LSTM models isn’t great, likely due to declining interest in those architectures and fewer contributors working on them. This leads to missing or incomplete operators in the candle codebase.

candle uses the safetensors format for weights. safetensors is a secure tensor storage format proposed by Hugging Face — it contains only tensor data (like weights), no code logic, and doesn’t store model structure information.

This means you must define the model structure in Rust before loading the weights via code. The problem is that candle’s APIs aren’t always equivalent to PyTorch’s, making it harder to use. Worse yet, candle currently lacks comprehensive documentation or a complete mapping table between PyTorch and candle APIs.
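Here is a minimal sketch of that pattern: the model structure is written in Rust first, then the safetensors weights are loaded into it by name. The file name "model.safetensors", the layer sizes, and the weight prefixes ("fc1", "fc2") are assumptions for illustration and must match how the weights were actually saved.

```rust
use candle_core::{DType, Device, Tensor};
use candle_nn::{linear, Linear, Module, VarBuilder};

struct TinyMlp {
    fc1: Linear,
    fc2: Linear,
}

impl TinyMlp {
    fn load(vb: VarBuilder) -> candle_core::Result<Self> {
        // Each layer pulls its weights from the safetensors file by prefix.
        let fc1 = linear(784, 128, vb.pp("fc1"))?;
        let fc2 = linear(128, 10, vb.pp("fc2"))?;
        Ok(Self { fc1, fc2 })
    }

    fn forward(&self, xs: &Tensor) -> candle_core::Result<Tensor> {
        self.fc2.forward(&self.fc1.forward(xs)?.relu()?)
    }
}

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    // Memory-map the weights; the file holds only tensors, no model structure.
    let vb = unsafe {
        VarBuilder::from_mmaped_safetensors(&["model.safetensors"], DType::F32, &device)?
    };
    let model = TinyMlp::load(vb)?;
    let input = Tensor::randn(0f32, 1f32, (1, 784), &device)?;
    let logits = model.forward(&input)?;
    println!("output shape: {:?}", logits.shape());
    Ok(())
}
```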

Therefore, using candle as an inference framework heavily tests both the user’s programming skills and familiarity with both PyTorch and candle APIs.

Additionally, candle has limitations regarding hardware support.

Even among NVIDIA GPUs, not all are supported. From testing, Compute Capability should ideally be at least 8.0. Devices below this level may encounter errors like:

DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed")

It also does not support Jetson series devices because candle relies on nvidia-smi for device detection, and nvidia-smi is not available on Jetson platforms.

You can check Compute Capability values here:

https://developer.nvidia.com/cuda-gpus

Although candle-transformers implements many popular models, they still represent only a small portion of the entire model landscape.

ort

ort is a Rust wrapper for ONNX Runtime. Since common CNN networks usually require substantial preprocessing and postprocessing, you will typically want to pair ort with ndarray.

ONNX Runtime supports various hardware backends through providers, including cpu, cuda, tensorrt, and rocm, as well as Chinese hardware stacks such as Huawei CANN (cann) and Rockchip RKNPU (rknpu).

Of course, deploying with other providers can be challenging. Most commonly, people use cpu or cuda.
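As a rough sketch of what ort plus ndarray looks like with the CUDA provider, consider the snippet below. The ort v2 API has shifted across release candidates, so treat the exact builder and extraction calls as approximations for whichever rc you pin; it also assumes ort is built with its ndarray integration enabled, and "model.onnx" plus the "input"/"output" tensor names are placeholders.

```rust
use ndarray::Array4;
use ort::{inputs, CUDAExecutionProvider, Session};

fn main() -> ort::Result<()> {
    // Register CUDA first; ONNX Runtime falls back to CPU if the provider
    // cannot be initialized on this machine.
    let session = Session::builder()?
        .with_execution_providers([CUDAExecutionProvider::default().build()])?
        .commit_from_file("model.onnx")?;

    // Preprocessing lives in ndarray: a dummy NCHW batch standing in for a
    // normalized image.
    let input: Array4<f32> = Array4::zeros((1, 3, 224, 224));

    let outputs = session.run(inputs!["input" => input.view()]?)?;
    // Extraction helpers have been renamed between release candidates;
    // adjust this call to the version you use.
    let logits = outputs["output"].try_extract_tensor::<f32>()?;
    println!("output shape: {:?}", logits.shape());
    Ok(())
}
```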

Thanks to the popularity of the ONNX format, ort is generally more user-friendly than candle. However, the ort project’s release cadence is slow: it took nearly two years to move from v1 to v2, and the current version is still in release candidate (rc) status.

Nevertheless, I recommend using ort v2, as it provides a more robust build process and is more user-friendly than v1.

How to Choose

In 2025, most model projects are still trained using PyTorch. But I do not recommend using tch-rs — its overhead is just too high.

If you prefer to keep your project pure Rust, your model architecture is already available in candle-transformers, and you’re targeting newer hardware, then consider candle.

Otherwise, the best option is to go with ort.

One important fact is that many cloud servers offer GPUs with Compute Capability below 8.0, such as the NVIDIA T4 (7.5), Quadro-series cards, or older Tesla P40 and P4 (6.1). These remain cost-effective choices, but they are not well supported by candle.