3D Convolutional Layers in Rust Candle

Candle, the Rust machine-learning framework, has so far made little progress on 3D operators.

Common 3D operators such as BatchNorm3d, Conv3d, Dropout3d, MaxPool3d, and ConvTranspose3d are all unimplemented.

Without 3D operators, Candle cannot directly handle tasks that involve video data, such as video classification, video object detection, and vision-language models (VLMs). This is also why Candle has not included large vision models such as Qwen2_VL and GLM.

Recently, however, someone implemented the Conv3d layer of Qwen2_VL in Candle for one special case: a temporal kernel size fixed at 2.

Simulating 3D Convolution with 2D Convolution

The approach is roughly as follows: the 3D convolution is decomposed into two parallel 2D convolutions, one for each temporal slice of the kernel:

// Original 3D convolution weight shape: [output_channels, input_channels/groups, time=2, height, width]
// Weight shape example: [64, 32, 2, 3, 3]  ← time dimension is 2

// Decompose into two 2D convolution weights:
let w1 = ws.i((.., .., 0, .., ..))?;  // Shape: [64, 32, 3, 3] ← Convolution kernel for frame 1
let w2 = ws.i((.., .., 1, .., ..))?;  // Shape: [64, 32, 3, 3] ← Convolution kernel for frame 2

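The input tensor is split along its temporal dimension in the same way. Each frame is convolved with its own 2D kernel, and the two results are added. This reproduces the 3D convolution because a 3D convolution sums over the kernel's temporal index: with two input frames and a temporal kernel size of 2, that sum has exactly two terms and yields a single output time step.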
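Written out, the decomposition is the identity below. It is a sketch under stated assumptions: stride 1, no padding, no bias, and groups = 1. The index symbols are introduced here for illustration and do not appear in the original code; x is the input of shape [B, C, 2, H, W], w is the weight of shape [O, C, 2, K_h, K_w], and w_1, w_2 are the two kernel slices taken above.

```latex
\[
\begin{aligned}
y_{b,o,0,i,j}
  &= \sum_{c}\sum_{k=0}^{1}\sum_{p,q} w_{o,c,k,p,q}\; x_{b,c,k,\,i+p,\,j+q} \\
  &= \underbrace{\sum_{c}\sum_{p,q} w_{o,c,0,p,q}\; x_{b,c,0,\,i+p,\,j+q}}_{\text{2D convolution of frame 1 with } w_1}
   + \underbrace{\sum_{c}\sum_{p,q} w_{o,c,1,p,q}\; x_{b,c,1,\,i+p,\,j+q}}_{\text{2D convolution of frame 2 with } w_2}
\end{aligned}
\]
```

The output has only the time index 0 because the kernel covers both input frames at once.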

Forward propagation is roughly as follows:


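use candle_core::{IndexOp, Result, Tensor};
use candle_nn::{Conv2d, Module};

// Illustrative wrapper (the name is hypothetical). The two Conv2d layers hold
// the kernel slices w1 and w2. At most one of them may carry a bias,
// otherwise the bias would be added twice.
struct Conv3dTemporal2 {
    conv2d_1: Conv2d,
    conv2d_2: Conv2d,
}

impl Conv3dTemporal2 {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        // 1. Split the temporal dimension of the input [B, C, 2, H, W]
        let xs1 = xs.i((.., .., 0, .., ..))?; // Frame 1: [B, C, H, W]
        let xs2 = xs.i((.., .., 1, .., ..))?; // Frame 2: [B, C, H, W]

        // 2. Convolve each frame with its own 2D kernel
        let out1 = self.conv2d_1.forward(&xs1)?;
        let out2 = self.conv2d_2.forward(&xs2)?;

        // 3. Combine the results (element-wise addition)
        // 4. Restore the temporal dimension ([B, C, H, W] → [B, C, 1, H, W])
        (out1 + out2)?.unsqueeze(2)
    }
}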
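
The equivalence can be checked numerically without Candle. The following is a minimal plain-Rust sketch, not the Candle implementation: it compares a direct 3D convolution against the two-2D-convolution decomposition. The function names (window_dot, conv2d, conv3d_full_t, conv3d_as_two_conv2d) and the nested-Vec layout are assumptions made for illustration, and the sketch assumes stride 1, no padding, no bias, groups = 1, and a batch size of 1.

```rust
// Plain-Rust reference (no Candle). Assumptions: stride 1, no padding,
// no bias, groups = 1, batch size 1. All names are illustrative.

type Frame = Vec<Vec<f32>>; // [H][W]

/// Dot product of `kernel` with the window of `img` whose top-left corner is (i, j).
fn window_dot(img: &Frame, kernel: &Frame, i: usize, j: usize) -> f32 {
    let mut acc = 0.0;
    for (di, krow) in kernel.iter().enumerate() {
        for (dj, kv) in krow.iter().enumerate() {
            acc += img[i + di][j + dj] * kv;
        }
    }
    acc
}

/// 2D convolution. x: [C][H][W], w: [O][C][kh][kw] -> [O][H'][W']
fn conv2d(x: &[Frame], w: &[Vec<Frame>]) -> Vec<Frame> {
    let oh = x[0].len() - w[0][0].len() + 1;
    let ow = x[0][0].len() - w[0][0][0].len() + 1;
    let mut out = vec![vec![vec![0.0f32; ow]; oh]; w.len()];
    for (o, wo) in w.iter().enumerate() {
        for i in 0..oh {
            for j in 0..ow {
                // Sum the window products over the input channels.
                out[o][i][j] = x
                    .iter()
                    .zip(wo)
                    .map(|(xc, wc)| window_dot(xc, wc, i, j))
                    .sum::<f32>();
            }
        }
    }
    out
}

/// Direct 3D convolution for the case kernel_t == T (one output time step).
/// x: [C][T][H][W], w: [O][C][T][kh][kw] -> [O][H'][W']
fn conv3d_full_t(x: &[Vec<Frame>], w: &[Vec<Vec<Frame>>]) -> Vec<Frame> {
    assert_eq!(x[0].len(), w[0][0].len(), "kernel_t must equal T");
    let oh = x[0][0].len() - w[0][0][0].len() + 1;
    let ow = x[0][0][0].len() - w[0][0][0][0].len() + 1;
    let mut out = vec![vec![vec![0.0f32; ow]; oh]; w.len()];
    for (o, wo) in w.iter().enumerate() {
        for i in 0..oh {
            for j in 0..ow {
                for (xc, wc) in x.iter().zip(wo) {
                    // xc: [T][H][W], wc: [T][kh][kw]; sum over the time taps.
                    for (xt, wt) in xc.iter().zip(wc) {
                        out[o][i][j] += window_dot(xt, wt, i, j);
                    }
                }
            }
        }
    }
    out
}

/// The decomposition from the text: slice out frame 0 and frame 1, run one
/// 2D convolution on each, and add the results element-wise.
fn conv3d_as_two_conv2d(x: &[Vec<Frame>], w: &[Vec<Vec<Frame>>]) -> Vec<Frame> {
    // Take time index k out of x ([C][T][H][W]) and w ([O][C][T][kh][kw]).
    let x_at = |k: usize| -> Vec<Frame> { x.iter().map(|xc| xc[k].clone()).collect() };
    let w_at = |k: usize| -> Vec<Vec<Frame>> {
        w.iter()
            .map(|wo| wo.iter().map(|wc| wc[k].clone()).collect::<Vec<Frame>>())
            .collect()
    };
    let mut out = conv2d(&x_at(0), &w_at(0));
    let out2 = conv2d(&x_at(1), &w_at(1));
    for (fo, f2) in out.iter_mut().zip(&out2) {
        for (ro, r2) in fo.iter_mut().zip(f2) {
            for (v, v2) in ro.iter_mut().zip(r2) {
                *v += *v2;
            }
        }
    }
    out
}
```

Calling conv3d_full_t and conv3d_as_two_conv2d on the same integer-valued input and weights should give identical outputs. With arbitrary floating-point data, tiny rounding differences are possible because the two functions add the terms in a different order.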

Limitations

This implementation is inherently inflexible: it only handles the case where both the temporal kernel size and the number of input frames are 2. Support for 3D convolution operators in the Candle core library is progressing slowly, and no general solution is available yet.
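
For comparison, the same idea extends to other temporal kernel sizes in principle: each output time step is the sum of one 2D convolution per temporal tap. The following is a minimal plain-Rust sketch of that extension, not Candle code and not an existing Candle API. The function names, the time-major layout ([T][C][H][W] for the input, [kt][O][C][kh][kw] for the weights), and the restrictions (stride 1, no padding, no bias, groups = 1, batch size 1) are all assumptions chosen to keep the example short.

```rust
// Plain-Rust sketch (no Candle): a 3D convolution with any temporal kernel
// size, assembled from per-frame 2D convolutions.
// Assumptions: stride 1, no padding, no bias, groups = 1, batch size 1.
// All names and the time-major layout are illustrative.

type Frame = Vec<Vec<f32>>; // [H][W]

/// Dot product of `kernel` with the window of `img` whose top-left corner is (i, j).
fn window_dot(img: &Frame, kernel: &Frame, i: usize, j: usize) -> f32 {
    let mut acc = 0.0;
    for (di, krow) in kernel.iter().enumerate() {
        for (dj, kv) in krow.iter().enumerate() {
            acc += img[i + di][j + dj] * kv;
        }
    }
    acc
}

/// 2D convolution. x: [C][H][W], w: [O][C][kh][kw] -> [O][H'][W']
fn conv2d(x: &[Frame], w: &[Vec<Frame>]) -> Vec<Frame> {
    let oh = x[0].len() - w[0][0].len() + 1;
    let ow = x[0][0].len() - w[0][0][0].len() + 1;
    let mut out = vec![vec![vec![0.0f32; ow]; oh]; w.len()];
    for (o, wo) in w.iter().enumerate() {
        for i in 0..oh {
            for j in 0..ow {
                // Sum the window products over the input channels.
                out[o][i][j] = x
                    .iter()
                    .zip(wo)
                    .map(|(xc, wc)| window_dot(xc, wc, i, j))
                    .sum::<f32>();
            }
        }
    }
    out
}

/// x: [T][C][H][W], w: [kt][O][C][kh][kw] -> [T - kt + 1][O][H'][W']
fn conv3d_via_conv2d(x: &[Vec<Frame>], w: &[Vec<Vec<Frame>>]) -> Vec<Vec<Frame>> {
    let kt = w.len();
    assert!(kt >= 1 && x.len() >= kt, "need at least kt input frames");
    let mut out = Vec::with_capacity(x.len() - kt + 1);
    for t in 0..=(x.len() - kt) {
        // Output step t = sum over temporal taps k of conv2d(x[t + k], w[k]).
        let mut acc = conv2d(&x[t], &w[0]);
        for k in 1..kt {
            let part = conv2d(&x[t + k], &w[k]);
            for (fa, fp) in acc.iter_mut().zip(&part) {
                for (ra, rp) in fa.iter_mut().zip(fp) {
                    for (a, p) in ra.iter_mut().zip(rp) {
                        *a += *p;
                    }
                }
            }
        }
        out.push(acc);
    }
    out
}
```

With kt = 2 and exactly two input frames, this reduces to the two-convolution scheme described above. The sketch runs kt 2D convolutions per output step, so it illustrates the arithmetic and makes no claim about performance.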