Sep 26, 2025
7 min read
Rust, YOLO, ONNX Runtime

Deploying YOLOv10 Object Detection Model with ONNX Runtime in Rust

A detailed introduction to deploying the YOLOv10 object detection model in a Rust environment with ONNX Runtime, covering key steps such as model export, preprocessing, inference, and post-processing.

Almost without our noticing, the YOLO series has evolved to YOLOv12. The ultralytics project has made the YOLO family shine. Other controversies aside, in engineering terms at least, ultralytics has made the greatest contribution.

YOLOv5, v8, and v11 are all iterations released by ultralytics.

The three most recent YOLO versions are v10, v11, and v12.

v10 was released by researchers from Tsinghua University. I think its biggest contribution is eliminating Non-Maximum Suppression (NMS). When YOLO is used in production, it is usually deployed on edge devices, where NMS is quite unfriendly: it can only run on the CPU, for example, and on edge devices its processing time becomes an unavoidable cost. v11 is not iterated from v10; it is based on v8, so it still needs NMS. v12 introduces an attention-centric architecture, a departure from the traditional CNN approach of earlier YOLO models, and its inference is slower.

So I chose YOLOv10.

Exporting the ONNX model

from ultralytics import YOLO
# Load a model
model = YOLO("yolov10s.pt")  # load a pretrained (or custom-trained) model
# Export the model
model.export(format="onnx")

Unlike YOLOv8, whose output shape is [1, 84, 8400], YOLOv10's output is [1, 300, 6]. This means YOLOv10 returns at most 300 detection boxes, and each box carries 6 channels: [x1, y1, x2, y2, score, class_id].

With the above information, the inference code is easy to implement. We use the ort crate (Rust bindings for ONNX Runtime) as the inference engine, with these dependencies in Cargo.toml:

ort = {version = "=2.0.0-rc.10", features = ["ndarray","coreml","cuda"]}
ndarray = "0.16.1"
image = "0.25.8"
imageproc = "0.25.0"
ab_glyph = "0.2.31"

The coreml feature enables Apple's Core ML framework on macOS, and cuda enables NVIDIA GPUs.

// (import paths follow ort 2.0.0-rc; adjust if your version differs)
use ort::execution_providers::CoreMLExecutionProvider;
use ort::session::{builder::GraphOptimizationLevel, Session};

let session = Session::builder()?
    .with_inter_threads(1)?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .with_execution_providers([CoreMLExecutionProvider::default().build()])?
    .commit_from_file(model_path)?;
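
If the same binary should run on several platforms, multiple execution providers can be registered in priority order; ort falls back to the next one (and finally to the CPU provider) if a provider is unavailable. A minimal sketch, assuming the cuda and coreml features from the dependency list above:

use ort::execution_providers::{CUDAExecutionProvider, CoreMLExecutionProvider};
use ort::session::Session;

let session = Session::builder()?
    .with_execution_providers([
        // Tried in order: CUDA first, then Core ML, then the CPU fallback
        CUDAExecutionProvider::default().build(),
        CoreMLExecutionProvider::default().build(),
    ])?
    .commit_from_file(model_path)?;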

First, preprocessing. It is basically the same as in other YOLO versions: resize to 640x640, then convert to (C, H, W) channel order:

    /// Preprocess image
    /// 
    /// Resize the input image to 640x640 and convert it to the tensor format required by the model.
    /// Pixel values will be normalized to the [0, 1] range.
    /// 
    /// Parameters:
    /// * `image`: Input dynamic image
    /// 
    /// Return value:
    /// A 4-dimensional tensor after preprocessing, shaped as (1, 3, 640, 640) wrapped in Result
    #[allow(clippy::type_complexity)]
    fn preprocess_image(
        &self,
        image: &DynamicImage,
    ) -> Result<ndarray::ArrayBase<ndarray::OwnedRepr<f32>, ndarray::Dim<[usize; 4]>>, Box<dyn Error>>
    {
        // Resize image to 640x640.
        // Note: resize_exact stretches the image without preserving aspect
        // ratio; the coordinate recovery in filter_detections below assumes
        // letterbox padding, so for non-square inputs prefer the letterbox
        // variant sketched after this function.
        let img = image.resize_exact(640, 640, FilterType::Nearest);
        // Create a zero-value tensor with shape (1, 3, 640, 640)
        let mut input = Array::zeros((1, 3, 640, 640));
        // Iterate through all pixels, normalize RGB values and store in tensor
        for pixel in img.pixels() {
            let x = pixel.0 as usize;
            let y = pixel.1 as usize;
            let [r, g, b, _] = pixel.2.0;
            input[[0, 0, y, x]] = (r as f32) / 255.;
            input[[0, 1, y, x]] = (g as f32) / 255.;
            input[[0, 2, y, x]] = (b as f32) / 255.;
        }

        Ok(input)
    }
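
Because resize_exact stretches the image, here is a minimal letterbox sketch (my own addition, not code from the repository) that matches the padding assumptions made in filter_detections below:

use image::{imageops::FilterType, DynamicImage, GenericImageView};
use ndarray::{Array, Array4};

/// Letterbox preprocessing: aspect-preserving resize plus centered padding.
/// Returns the tensor together with the scale and padding needed to map
/// detections back to original-image coordinates.
fn letterbox(image: &DynamicImage) -> (Array4<f32>, f32, u32, u32) {
    let (w, h) = (image.width(), image.height());
    let scale = (640.0 / w as f32).min(640.0 / h as f32);
    let (new_w, new_h) = ((w as f32 * scale) as u32, (h as f32 * scale) as u32);
    let resized = image.resize_exact(new_w, new_h, FilterType::Triangle);
    let (pad_x, pad_y) = ((640 - new_w) / 2, (640 - new_h) / 2);
    // Fill with 114/255, the gray the official YOLO letterbox uses
    let mut input = Array::from_elem((1, 3, 640, 640), 114.0 / 255.0);
    for pixel in resized.pixels() {
        let x = (pixel.0 + pad_x) as usize;
        let y = (pixel.1 + pad_y) as usize;
        let [r, g, b, _] = pixel.2.0;
        input[[0, 0, y, x]] = (r as f32) / 255.;
        input[[0, 1, y, x]] = (g as f32) / 255.;
        input[[0, 2, y, x]] = (b as f32) / 255.;
    }
    (input, scale, pad_x, pad_y)
}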

Inspecting the model in Netron shows that the input is named images and the output is named output0.
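
If you don't have Netron at hand, the same information can be read from the input/output metadata that ort exposes on the session. A quick sketch:

// Print the model's input and output names via ort's session metadata
for input in &session.inputs {
    println!("input: {}", input.name);
}
for output in &session.outputs {
    println!("output: {}", output.name);
}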

use ort::{inputs, session::SessionOutputs, value::TensorRef};

let array = self.preprocess_image(&img)?;
// Run inference
let outputs: SessionOutputs = self
    .session
    .run(inputs!["images" => TensorRef::from_array_view(&array)?])?;

Unlike YOLOv8, YOLOv10's output does not need to be transposed (see the output shape above):

// This is YOLOv10
let (_output_shape, output_data) = outputs["output0"].try_extract_tensor::<f32>()?;
// println!("Output tensor shape: {:?}", _output_shape);
let output_vec: Vec<f32> = output_data.to_vec();

// For comparison, YOLOv8 (with an older ort API) needed a transpose:
//let output = outputs
//            .get(0)
//            .unwrap()
//            .try_extract::<f32>()?
//            .view()
//            .t()
//            .into_owned();

Without NMS, post-processing is very simple; you could say it is truly end-to-end.

/// Filter detection results
pub fn filter_detections(
    results: &[f32],
    confidence_threshold: f32,
    img_width: u32,
    img_height: u32,
    orig_width: u32,
    orig_height: u32,
) -> Vec<Detection> {
    // YOLOv10 output format: [x1, y1, x2, y2, score, class_id]
    // Every 6 elements form a detection box.
    // (usize::is_multiple_of requires Rust 1.87+; on older toolchains use
    // `results.len() % 6 != 0` instead.)
    if !results.len().is_multiple_of(6) {
        eprintln!(
            "Warning: Model output length is not a multiple of 6, actual length: {}",
            results.len()
        );
    }

    let num_detections = results.len() / 6;
    // println!("Number of detection boxes: {}", num_detections);

    let mut detections = Vec::with_capacity(num_detections);

    // Calculate scaling and padding factors
    // (assumes letterbox preprocessing: aspect-preserving resize plus centered
    // padding; see the note in preprocess_image above)
    let scale = (img_width as f32 / orig_width as f32).min(img_height as f32 / orig_height as f32);
    let new_width = (orig_width as f32 * scale) as u32;
    let new_height = (orig_height as f32 * scale) as u32;
    let pad_x = (img_width - new_width) / 2;
    let pad_y = (img_height - new_height) / 2;

    for i in 0..num_detections {
        let base_index = i * 6;

        let left = results[base_index];
        let top = results[base_index + 1];
        let right = results[base_index + 2];
        let bottom = results[base_index + 3];
        let confidence = results[base_index + 4];
        let class_id = results[base_index + 5] as usize;

        // Print original values for debugging
        // println!("Detection box {}: left={}, top={}, right={}, bottom={}, confidence={}, class ID={}",
        //          i, left, top, right, bottom, confidence, class_id);

        // Check if confidence is valid
        if !(0.0..=1.0).contains(&confidence) {
            // println!("Skipping invalid confidence: {}", confidence);
            continue;
        }

        // Check if class ID is valid
        if class_id >= YOLOV10_CLASS_LABELS.len() {
            // println!("Skipping invalid class ID: {}", class_id);
            continue;
        }

        // Apply confidence threshold
        if confidence >= confidence_threshold {
            // Remove padding and scale to original image dimensions
            let left = (left - pad_x as f32) / scale;
            let top = (top - pad_y as f32) / scale;
            let right = (right - pad_x as f32) / scale;
            let bottom = (bottom - pad_y as f32) / scale;

            let x = left as u32;
            let y = top as u32;
            let width = (right - left) as u32;
            let height = (bottom - top) as u32;

            // Ensure coordinates are valid
            if width > 0 && height > 0 && x < orig_width && y < orig_height {
                detections.push(Detection {
                    confidence,
                    bbox: (x, y, width, height),
                    class_id,
                    class_name: YOLOV10_CLASS_LABELS[class_id].to_owned(),
                });
            } else {
                println!(
                    "Skipping invalid bounding box: ({}, {}) - Width: {}, Height: {}",
                    x, y, width, height
                );
            }
        }
    }

    // YOLOv10 is NMS-free, so no NMS step is needed here
    // nms(&mut detections, 0.5, 0.3);

    // println!("Final valid detection count: {}", detections.len());
    detections
}
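
For reference, here is a minimal sketch of the Detection struct that filter_detections returns (the field names follow their usage above; the definition in the repository may differ), plus a hypothetical call wiring the inference output into it:

#[derive(Debug, Clone)]
pub struct Detection {
    pub confidence: f32,
    /// (x, y, width, height) in original-image pixels
    pub bbox: (u32, u32, u32, u32),
    pub class_id: usize,
    pub class_name: String,
}

// Hypothetical wiring: `output_vec` comes from the inference step above;
// `orig_width` / `orig_height` are the dimensions saved before resizing.
let detections = filter_detections(&output_vec, 0.5, 640, 640, orig_width, orig_height);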

Drawing the detection results, we can see they are basically the same as the Python version's.

[res.jpg]
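
For completeness, a small sketch of how the boxes can be drawn with the imageproc crate from the dependency list (draw_detections is a hypothetical helper, not code from the repository):

use image::{Rgba, RgbaImage};
use imageproc::drawing::draw_hollow_rect_mut;
use imageproc::rect::Rect;

fn draw_detections(img: &mut RgbaImage, detections: &[Detection]) {
    for det in detections {
        let (x, y, w, h) = det.bbox;
        // Rect::of_size panics on zero sizes, so clamp to at least 1 pixel
        let rect = Rect::at(x as i32, y as i32).of_size(w.max(1), h.max(1));
        draw_hollow_rect_mut(img, rect, Rgba([255, 0, 0, 255]));
    }
}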

Finally

Note that although the ultralytics project is open source, its code and model weight files are released under the restrictive AGPL-3.0 license. That is why, famous as YOLOv8 is, few open source projects are willing to use it. For example, the well-known cvat project originally shipped YOLOv8-related plugins but later removed them. Likewise, in the candle project you'll find that its YOLOv8 implementation has nothing to do with ultralytics (the weight node names are completely different), because candle's YOLOv8 is a reimplementation based on the tinygrad project.

If you want to use ultralytics and its weights in commercial projects, make sure you understand this risk.

If you find this helpful, the complete code is available here: https://github.com/kingzcheung/yolov10