Error When Running the YOLOv10 Model on GPU with the Rust Candle Framework

I had previously implemented the YOLOv10 model from scratch and run it successfully on the CPU. With CUDA acceleration enabled, however, it failed with the following error:

DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument")

By following the stack trace, I eventually traced the failure to the topk function.

Specifically, YOLOv10 calls topk twice in v10postprocess.

The first topk is primarily used to select detection results with the highest confidence:

max_scores, index = torch.topk(max_scores, max_det, dim=-1)

This call ranks the prediction boxes by their maximum class score (max_scores).
It keeps only the max_det boxes with the highest scores.
That reduces the number of candidates passed to the later steps, which improves efficiency.

The second topk then runs over the flattened scores of those max_det selected boxes and again keeps the max_det highest entries, sorted by score:

scores, index = torch.topk(scores.flatten(1), max_det, dim=-1)
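
To make the two-stage selection concrete, here is a minimal sketch in plain Rust (standard library only, no tensors). The names topk_indices and two_stage_topk are hypothetical, and the sketch assumes a non-empty input and a row-major flattening, so that index / nc recovers the box and index % nc the class. It illustrates the idea rather than reproducing the actual v10postprocess code.

```rust
/// Indices of the `k` largest values, in descending order of value.
/// Ties keep their original order because `sort_by` is stable.
fn topk_indices(values: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    idx.sort_by(|&a, &b| values[b].total_cmp(&values[a]));
    idx.truncate(k);
    idx
}

/// Two-stage selection over per-box class scores (`scores[box][class]`).
/// Returns (box index, class index, score) triples, best first.
/// Assumes `scores` is non-empty and every row has the same length.
fn two_stage_topk(scores: &[Vec<f32>], max_det: usize) -> Vec<(usize, usize, f32)> {
    let nc = scores[0].len();

    // Stage 1: keep the max_det boxes whose best class score is highest.
    let max_scores: Vec<f32> = scores
        .iter()
        .map(|row| row.iter().copied().fold(f32::NEG_INFINITY, f32::max))
        .collect();
    let kept = topk_indices(&max_scores, max_det);

    // Stage 2: flatten the kept boxes' class scores (row-major)
    // and take the top max_det entries again.
    let flat: Vec<f32> = kept
        .iter()
        .flat_map(|&b| scores[b].iter().copied())
        .collect();
    topk_indices(&flat, max_det)
        .into_iter()
        .map(|i| (kept[i / nc], i % nc, flat[i]))
        .collect()
}

fn main() {
    let scores: Vec<Vec<f32>> = vec![
        vec![0.10, 0.20], // box 0
        vec![0.90, 0.80], // box 1
        vec![0.30, 0.70], // box 2
        vec![0.05, 0.15], // box 3
    ];
    for (b, c, s) in two_stage_topk(&scores, 2) {
        println!("box {b}, class {c}, score {s}");
    }
}
```

With the sample data in main, both results come from box 1 (classes 0 and 1), which shows that the second stage ranks (box, class) pairs rather than boxes.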

In the candle framework, the topk function is roughly implemented as follows:


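use candle_core::{Result, Tensor, D};

// Return type of both methods (fields inferred from how it is built below).
pub struct TopKOutput {
    pub values: Tensor,
    pub indices: Tensor,
}

pub trait TopKLastDimOp {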
    /// Note: this implements torch.topk with sorted=True.
    fn topk(&self, topk: usize) -> Result<TopKOutput>;

    /// Note: this implements torch.topk with sorted=False.
    fn topk_unsorted(&self, topk: usize) -> Result<TopKOutput>;
}

impl TopKLastDimOp for Tensor {
    fn topk(&self, topk: usize) -> Result<TopKOutput> {
        // Sorted descending
        let sorted_indices = self.arg_sort_last_dim(false)?;
        let topk_indices = sorted_indices.narrow(D::Minus1, 0, topk)?.contiguous()?;
        Ok(TopKOutput {
            values: self.gather(&topk_indices, D::Minus1)?,
            indices: topk_indices,
        })
    }

    fn topk_unsorted(&self, topk: usize) -> Result<TopKOutput> {
        // Sorted descending
        let sorted_indices_all = self.arg_sort_last_dim(false)?;
        let topk_indices_sorted = sorted_indices_all
            .narrow(D::Minus1, 0, topk)?
            .contiguous()?;
        let topk_values_sorted = self.gather(&topk_indices_sorted, D::Minus1)?;

        // Reorder the indices ascending
        let reorder_indices = topk_indices_sorted.arg_sort_last_dim(true)?;
        let topk_indices_unsorted = topk_indices_sorted.gather(&reorder_indices, D::Minus1)?;
        let topk_values_unsorted = topk_values_sorted.gather(&reorder_indices, D::Minus1)?;
        Ok(TopKOutput {
            values: topk_values_unsorted,
            indices: topk_indices_unsorted,
        })
    }
}
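
The two methods differ only in the order of their output. The following sketch mirrors the same argsort, narrow, gather steps on a plain slice, using only the standard library. The names topk_sorted and topk_by_position are hypothetical and not part of candle.

```rust
/// Top-k over a slice: argsort descending, keep the first k, gather.
/// Returns (values, indices) with the largest value first.
fn topk_sorted(values: &[f32], k: usize) -> (Vec<f32>, Vec<usize>) {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    // Argsort, descending by value.
    idx.sort_by(|&a, &b| values[b].total_cmp(&values[a]));
    // Keep the first k indices (the "narrow" step).
    idx.truncate(k);
    // Look up the selected values (the "gather" step).
    let vals: Vec<f32> = idx.iter().map(|&i| values[i]).collect();
    (vals, idx)
}

/// Same selection, but reordered by ascending index,
/// which is what the topk_unsorted method above does.
fn topk_by_position(values: &[f32], k: usize) -> (Vec<f32>, Vec<usize>) {
    let (_, mut idx) = topk_sorted(values, k);
    idx.sort_unstable();
    let vals: Vec<f32> = idx.iter().map(|&i| values[i]).collect();
    (vals, idx)
}

fn main() {
    let scores = [0.7_f32, 0.1, 0.9, 0.5];
    println!("{:?}", topk_sorted(&scores, 2));
    println!("{:?}", topk_by_position(&scores, 2));
}
```

For [0.7, 0.1, 0.9, 0.5] and k = 2, the first function returns the values largest first (indices 2, 0), while the second returns the same two entries in index order (0, 2).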

The topk logic itself is fine; the problem is the arg_sort_last_dim call it relies on. In the first topk, the max_scores tensor being sorted has shape [1, 8400], so each row holds 8400 elements. On CUDA, arg_sort_last_dim cannot handle a last dimension that long.

A quick test shows that the limit is much lower than that: on CUDA, arg_sort_last_dim already fails once the last dimension exceeds 1024 elements.

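use candle_core::{DType, Device, Tensor};

fn main() {
    // Falls back to the CPU when no CUDA device is available,
    // in which case the sort succeeds and the error is not reproduced.
    let device = Device::cuda_if_available(0).unwrap();
    // 1025 elements: one more than the 1024 that still sort on CUDA.
    let a = Tensor::zeros(1025, DType::F32, &device).unwrap();
    // On CUDA, this reports the error instead of the sorted indices.
    dbg!(&a.arg_sort_last_dim(true));
}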

This limitation of arg_sort_last_dim was in fact reported quite a while ago, but as far as I can tell it has not been fixed.

So for now, the problem remains unsolved.