Mar 18, 2025
5 min read
Rust, ONNX, onnxruntime

A New Choice for Efficient Inference in Rust: Model Deployment Practice Based on ONNX Runtime

This article shows how to deploy models with Rust and ONNX Runtime, using ResNet50 as an example. It walks through model export, loading, data preprocessing, inference, and post-processing, and then weighs the approach's advantages and disadvantages.

In addition to the tch-rs and candle frameworks, Rust can also deploy models with ONNX Runtime. ort is the Rust binding for ONNX Runtime:

[dependencies]
ort = "=2.0.0-rc.9"
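The walkthrough below also uses the image crate for decoding and resizing, and ndarray for tensor manipulation. An illustrative addition to [dependencies] (version numbers are assumptions; ndarray in particular should match the version that ort itself links against):

image = "0.25"
ndarray = "0.16"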

ONNX (Open Neural Network Exchange) is an open, training-framework-independent model format introduced jointly by Microsoft and Facebook. Its design goal is to let different deep learning frameworks (such as PyTorch, Google's TensorFlow, Baidu's PaddlePaddle, etc.) convert and share models with one another, enabling cross-platform model deployment and inference. Some niche inference frameworks even use ONNX as an intermediate format in order to support most models.

ONNX files are serialized with Protobuf and store the computational graph alongside the weights, so loading an ONNX model does not require defining the model structure in code.

The ONNX format itself also serves as a deployment format, and its runtime is ONNX Runtime.
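Because the graph is part of the file, a runtime can discover a model's inputs and outputs without any model definition in code. As a small illustration with ort (a sketch; the module path and metadata fields follow my reading of the ort 2.0 release candidates and may differ between versions):

use ort::session::Session;

fn print_model_io(path: &str) -> ort::Result<()> {
    let session = Session::builder()?.commit_from_file(path)?;
    // The input/output metadata comes from the ONNX graph itself.
    for input in &session.inputs {
        println!("input:  {}", input.name);
    }
    for output in &session.outputs {
        println!("output: {}", output.name);
    }
    Ok(())
}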

Exporting Models

Here we take PyTorch as an example: its torch.onnx.export API can export ONNX models directly.

Here we export a common CNN model - ResNet50:

import torch
from torchvision.models import resnet50

if __name__ == "__main__":
    model = resnet50(pretrained=True)
    model.eval()
    print(model)
    
    input_tensor = torch.rand((1, 3, 224, 224), dtype=torch.float32)
    torch.onnx.export(
        model,
        (input_tensor,),
        "resnet50.onnx",
        input_names=["input"],
    )

Model Inference

Before implementing it in Rust, let's look at how ResNet inference is implemented in PyTorch:

import torch
from torchvision.models import resnet50
from PIL import Image
from torchvision import transforms

# Load the model
model = resnet50(pretrained=True)
model.eval()

# Data preprocessing
input_image = Image.open(filename)  # filename: path to the image to classify
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)  # add a batch dimension

# Model inference
if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')

with torch.no_grad():
    output = model(input_batch)

# Post-processing
print(output[0])
print(torch.nn.functional.softmax(output[0], dim=0))

Model inference generally consists of four steps:

  1. Load the model
  2. Data preprocessing
  3. Model inference
  4. Post-processing
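In Rust with ort, those four steps map onto a main function shaped roughly like this. This is only a sketch: preprocess and postprocess are hypothetical helpers standing in for the code developed in the rest of the article.

fn main() -> ort::Result<()> {
    // 1. Load the model (next section).
    let session = Session::builder()?.commit_from_file("resnet50.onnx")?;
    // 2. Preprocess the image into a (1, 3, 224, 224) tensor (hypothetical helper).
    let input = preprocess("tests/cat.jpeg")?;
    // 3. Run inference.
    let outputs = session.run(inputs![input]?)?;
    // 4. Post-process: softmax + pick the most likely ImageNet class (hypothetical helper).
    postprocess(&outputs)?;
    Ok(())
}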

Loading the Model

let session = Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .with_execution_providers([
        // Providers are tried in registration order, so list CUDA before the CPU fallback.
        CUDAExecutionProvider::default().build(),
        CPUExecutionProvider::default().build(),
    ])?
    .with_intra_threads(4)?
    .commit_from_file("resnet50.onnx")?;

with_optimization_level is used to set the optimization level of the session, with different levels corresponding to different graph optimization strategies.

with_execution_providers sets the list of execution providers (EPs). ONNX Runtime abstracts hardware-accelerated execution of ONNX graphs behind different EPs. Common EPs include CPU, CUDA, TensorRT, and ROCm, and there is even support for NPUs such as Rockchip's RKNPU.

with_intra_threads sets the number of threads used for intra-op (within a single node) parallelism in the session. If ONNX Runtime was built with OpenMP, the thread count is controlled by the OMP_NUM_THREADS environment variable instead, and this setting has no effect. Internally the function calls SetIntraOpNumThreads.
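Since execution providers are tried in the order they are registered, the builder can be tailored per deployment target. For example, a CPU-only session with lighter graph optimization, using only the builder methods shown above (the thread count is an illustrative value):

let cpu_session = Session::builder()?
    // Basic graph optimizations only; useful when comparing outputs against PyTorch.
    .with_optimization_level(GraphOptimizationLevel::Level1)?
    // Register just the CPU provider.
    .with_execution_providers([CPUExecutionProvider::default().build()])?
    // Roughly one intra-op thread per physical core is a common starting point.
    .with_intra_threads(8)?
    .commit_from_file("resnet50.onnx")?;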

Preprocessing

From the Python implementation above, data preprocessing resizes the image to 224x224 (the input size ResNet50 expects; torchvision actually resizes to 256 and center-crops to 224, a refinement we return to below), then converts it to a tensor and normalizes it.

Take a photo of a cat (saved as tests/cat.jpeg in the project) as an example.

Load the image to be inferred and resize it to 224x224:

let image_buffer: ImageBuffer<Rgb<u8>, Vec<u8>> = image::open(
    Path::new(env!("CARGO_MANIFEST_DIR"))
        .join("tests")
        .join("cat.jpeg"),
)
.unwrap()
// resize() preserves the aspect ratio, so a non-square photo would not come out as
// exactly 224x224; resize_exact() forces the dimensions the model expects.
.resize_exact(224, 224, FilterType::Nearest)
.to_rgb8();
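As mentioned above, torchvision's pipeline resizes the shorter side to 256 and then center-crops 224x224, which usually matches the training preprocessing more closely. A sketch of reproducing that with the image crate (assuming the photo is at least 256 pixels on its shorter side):

use image::{imageops::FilterType, DynamicImage, GenericImageView};

/// Resize the shorter side to 256, then center-crop 224x224,
/// mirroring torchvision's Resize(256) + CenterCrop(224).
fn resize_and_center_crop(img: &DynamicImage) -> DynamicImage {
    let (w, h) = img.dimensions();
    // Scale factor that maps the shorter side to 256.
    let scale = 256.0 / w.min(h) as f32;
    let nw = (w as f32 * scale).round() as u32;
    let nh = (h as f32 * scale).round() as u32;
    let resized = img.resize_exact(nw, nh, FilterType::Triangle);
    // Take the centered 224x224 window.
    resized.crop_imm((nw - 224) / 2, (nh - 224) / 2, 224, 224)
}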

ort uses ndarray as the tensor processing format:

// Build an NCHW tensor of shape (batch=1, channels=3, height=224, width=224).
let mut array = ndarray::Array::from_shape_fn((1, 3, 224, 224), |(_, c, j, i)| {
    let pixel = image_buffer.get_pixel(i as u32, j as u32);
    let channels = pixel.channels();
    // range [0, 255] -> range [0, 1]
    (channels[c] as f32) / 255.0
});

Normalization:

// Normalize each channel with the ImageNet mean and standard deviation.
let mean = [0.485, 0.456, 0.406];
let std = [0.229, 0.224, 0.225];
for c in 0..3 {
    let mut channel_array = array.slice_mut(s![0, c, .., ..]);
    channel_array -= mean[c];
    channel_array /= std[c];
}

Convert to Tensor:

let input = Tensor::from_array(array)?;

Model Inference

The rest is simple: just call the session to perform inference.

let outputs = session.run(inputs![input]?)?;
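Since the model was exported with input_names=["input"], the input can also be bound by name; to my understanding, the inputs! macro in the ort 2.0 release candidates accepts name => value pairs as well:

let outputs = session.run(inputs!["input" => input]?)?;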

Looking at the structure of ResNet50, its last layer is a fully connected layer whose 1000 outputs correspond to the classes of the ImageNet dataset, so we need the list of ImageNet class labels:

The candle project ships an ImageNet class list that we can use directly:

https://github.com/huggingface/candle/blob/0b24f7f0a41d369942bfcadac3a3cf494167f8a6/candle-examples/src/imagenet.rs
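If you copy that file into your crate (for example as src/imagenet.rs), the labels become available as a constant array; the constant name below follows the candle source:

// main.rs
mod imagenet;
use imagenet::CLASSES; // pub const CLASSES: [&str; 1000]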

The output is therefore a 1000-dimensional vector of logits, which we convert to probabilities with softmax:

let mut probabilities: Vec<(usize, f32)> = outputs[0]
    .try_extract_tensor()?
    .softmax(ndarray::Axis(1))
    .iter()
    .copied()
    .enumerate()
    .collect();
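A note on the softmax call above: ndarray itself does not provide a softmax method, so this assumes an extension trait (either your own helper or one borrowed from example code). A minimal, numerically stable version over the class axis could look like this:

use ndarray::{Array2, ArrayView2, Axis};

/// Numerically stable softmax over the last axis of a (batch, classes) array.
fn softmax(logits: ArrayView2<f32>) -> Array2<f32> {
    let mut out = logits.to_owned();
    for mut row in out.axis_iter_mut(Axis(0)) {
        // Subtract the row maximum before exponentiating to avoid overflow.
        let max = row.fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        row.mapv_inplace(|v| (v - max).exp());
        let sum = row.sum();
        row.mapv_inplace(|v| v / sum);
    }
    out
}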

Each entry of probabilities then has the following form:

(
    281, // Classification index
    0.92174786, // Probability
)

Finally, sort by probability and take the highest probability:

probabilities.sort_unstable_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

// The first one is the result with the highest probability
dbg!(probabilities[0]);
// Get the label of the highest probability result
let label = CLASSES[probabilities[0].0];
dbg!(label); // Result

The printed results are as follows:

[src/main.rs:56:5] probabilities[0] = (
    281,
    0.92174786,
)
[src/main.rs:58:5] label = "tabby, tabby cat"
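Since the vector is already sorted by probability, reporting the top-5 predictions instead of just the best one is a small extension (using the same CLASSES array):

// Print the five most likely classes with their probabilities.
for &(index, prob) in probabilities.iter().take(5) {
    println!("{:>30}  {:.4}", CLASSES[index], prob);
}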

With that, we have completed ResNet50 image classification inference with ONNX Runtime in Rust.

Advantages and Disadvantages of ort / ONNX Runtime

Inference code with ONNX Runtime is relatively simple. Its biggest advantage is versatility: it supports almost all models and all common training frameworks. ONNX Runtime also has few dependencies, which makes integration very convenient.

However, exporting to ONNX can run into unsupported operators, and ONNX Runtime applies its own graph optimizations on top of the exported model. In addition, some models run slower with ONNX Runtime GPU inference than in their native framework; certain OCR models are well-known examples.