Apr 01, 2025
5 min read
Rust,
Candle,
PyTorch,

Miscellaneous Features in Rust Candle

This article summarizes some miscellaneous features of the Rust Candle framework: a custom implementation of `masked_fill`, the broadcasting mechanism (`broadcast_add`), matrix multiplication (`matmul` vs. element-wise multiplication), and an alternative to the module container `ModuleList`. By comparing with PyTorch, it looks at how tensor operations and model loading differ in Candle and how to work around those differences.

These articles about Candle and PyTorch are really just a collection of my learning notes. Previously, I used PyTorch almost blindly, without paying attention to the details of its APIs; Candle, on the other hand, has almost no documentation.

masked_fill

masked_fill is a conditional fill operation: given a boolean mask, it replaces the selected positions in a tensor with a specified value. Candle has no official `masked_fill` API, but custom implementations of it can be found in the transformer models that ship with the framework.

PyTorch:

    import torch

    x = torch.tensor([[1.0, 0.0], [0.3, -0.4]])
    mask = x.to(torch.bool)
    c = x.masked_fill(mask, torch.finfo(x.dtype).min)
    print(c)
    # tensor([[-3.4028e+38,  0.0000e+00],
    #         [-3.4028e+38, -3.4028e+38]])

Candle:

    use candle_core::{Device, Result, Tensor};

    // Custom implementation: `where_cond` keeps `on_true` where `mask` is non-zero
    // and falls back to `on_false` everywhere else.
    fn masked_fill(on_false: &Tensor, mask: &Tensor, on_true: f32) -> Result<Tensor> {
        let shape = mask.shape();
        let on_true = Tensor::new(on_true, on_false.device())?.broadcast_as(shape.dims())?;
        let m = mask.where_cond(&on_true, on_false)?;
        Ok(m)
    }

    // Example usage
    let data = vec![1.0f32, 0.0, 0.3, -0.4];
    let x = Tensor::from_vec(data, (2, 2), &Device::Cpu)?;
    let mask = x.ne(0.0)?;

    let y = masked_fill(&x, &mask, f32::MIN)?;

    println!("y:{y}");
    // y:[[-3.4028e38,   0.0000e0],
    //    [-3.4028e38, -3.4028e38]]
    //    Tensor[[2, 2], f32]
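
In practice this helper most often shows up in attention code, where future positions are filled with a very negative value before the softmax. Below is a minimal sketch of that pattern, reusing the `masked_fill` function above; the sequence length, the all-ones score tensor, and the function name `causal_mask_example` are purely illustrative.

    use candle_core::{DType, Device, Result, Tensor};

    // Illustrative only: apply a causal (upper-triangular) mask to attention scores.
    fn causal_mask_example() -> Result<()> {
        let device = Device::Cpu;
        let seq_len = 4usize;

        // Dummy attention scores of shape (seq_len, seq_len).
        let scores = Tensor::ones((seq_len, seq_len), DType::F32, &device)?;

        // mask[i][j] = 1 where j > i, i.e. positions that should not be attended to.
        let mask_data: Vec<u8> = (0..seq_len)
            .flat_map(|i| (0..seq_len).map(move |j| u8::from(j > i)))
            .collect();
        let mask = Tensor::from_vec(mask_data, (seq_len, seq_len), &device)?;

        // Replace the masked positions with a very negative value before softmax.
        let masked = masked_fill(&scores, &mask, f32::MIN)?;
        println!("masked:{masked}");
        Ok(())
    }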

Broadcasting Mechanism

PyTorch’s broadcasting mechanism allows tensors of different shapes to perform element-wise operations (such as addition, subtraction, multiplication, division) as long as their shapes meet the following conditions:

  1. Comparing shapes from the trailing dimension backwards, the sizes in each dimension must either be equal or one of them must be 1.
  2. If the two tensors have different numbers of dimensions, the shorter shape is padded with 1s at the front until both have the same number of dimensions.

Suppose we have two tensors:

  • A has a shape of [1, 1, 64, 64]
  • B has a shape of [64, 64]

These two tensors can be directly added in PyTorch:

    a = torch.ones(1,1,64,64)
    b = torch.ones(64,64)
    print(a+b)

In Candle, however, tensors of different shapes cannot simply be combined with `a + b`: the standard binary operations do not broadcast automatically, so mismatched shapes produce an error. We need to use `broadcast_add` to achieve the same result.

    let device = Device::Cpu;
    let a = Tensor::ones((1,1,64,64), DType::F32, &device)?;
    let b = Tensor::ones((64,64), DType::F32, &device)?;
    // Addition
    let c  = a.broadcast_add(&b)?;
    println!("c::{c}");

Matrix Multiplication

In PyTorch, `a @ b` is equivalent to `torch.matmul(a, b)`.

So what is the difference between `a @ b` and `a * b`?

Take these two matrices as an example:

$$
a = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
\qquad
b = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
$$

`a * b` is element-wise multiplication. It requires that `a` and `b` have the same shape, and it multiplies corresponding elements one by one. That is,

$$
a * b = \begin{bmatrix} 1 \cdot 5 & 2 \cdot 6 \\ 3 \cdot 7 & 4 \cdot 8 \end{bmatrix}
= \begin{bmatrix} 5 & 12 \\ 21 & 32 \end{bmatrix}
$$

Matrix multiplication `a @ b`, by contrast, takes the dot product of each row of `a` with each column of `b`:

$$
a @ b = \begin{bmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{bmatrix}
= \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}
$$

PyTorch:

    a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
    b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
    print(a @ b)  # tensor([[19., 22.], [43., 50.]])
    print(a * b)  # tensor([[ 5., 12.], [21., 32.]])

Candle:

    let a_data = vec![1.0f32, 2.0, 3.0, 4.0];
    let b_data = vec![5.0f32, 6.0, 7.0, 8.0];
    let a = Tensor::from_vec(a_data, (2, 2), &Device::Cpu)?;
    let b = Tensor::from_vec(b_data, (2, 2), &Device::Cpu)?;

    // Matrix multiplication: [[19., 22.], [43., 50.]]
    let x = a.matmul(&b)?;
    println!("x:{x}");

    // Element-wise multiplication: [[ 5., 12.], [21., 32.]]
    let y = (a * b)?;
    println!("y:{y}");

ModuleList

ModuleList is also a container: it provides only a list container without any substantial functionality of its own, and Candle does not implement it. However, we sometimes encounter structures like this:

    (albert_layer_groups): ModuleList(
      (0): AlbertLayerGroup(
        (albert_layers): ModuleList(
          (0): AlbertLayer(

The `0` inside is also a key, which means we cannot simply represent it with a `Vec<...>`. A plain `Vec<...>` cannot produce that key name in the model structure, because we need to call `vb.pp("0")` in the builder, and the ModuleList itself also needs a key, such as `albert_layers` and `albert_layer_groups` above. Getting this wrong mostly produces errors like `cannot find tensor albert.encoder.albert_layer_groups.0.0.full_layer_layer_norm.weight`, i.e. the key path does not match.

    // #[derive(Debug, Clone)]
    struct AlbertLayerGroup {
        albert_layers: Vec<AlbertLayer>,
    }

My usual practice is to wrap the `Vec<AlbertLayer>` in a custom struct that maintains the list internally. Functionally this adds little, but it ensures that the weight keys line up correctly.

    #[derive(Debug, Clone)]
    struct AlbertLayers {
        layers: Vec<AlbertLayer>,
    }

    impl AlbertLayers {
        pub fn load(vb: VarBuilder, config: &Config) -> Result<Self> {
            let mut layers = vec![];
            for i in 0..config.inner_group_num {
                layers.push(AlbertLayer::load(vb.pp(i), config)?);
            }
            Ok(Self { layers })
        }
    }

AlbertLayerGroup then becomes:

    #[derive(Debug, Clone)]
    struct AlbertLayerGroup {
        albert_layers: AlbertLayers,
    }

    impl AlbertLayerGroup {
        fn load(vb: VarBuilder, config: &Config) -> Result<Self> {
            let albert_layers = AlbertLayers::load(vb.pp("albert_layers"), config)?;
            Ok(Self { albert_layers })
        }
    }

This approach is more verbose but clearer. It is also possible to do without the extra struct: the argument to `vb.pp(...)` can contain `.`, so we can still use `vb.pp("albert_layers.0")` to reach the weights.
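
For completeness, here is a sketch of that dotted-path variant: keep the plain `Vec<AlbertLayer>` inside `AlbertLayerGroup` and build the key with `format!` instead of introducing the wrapper struct. This is illustrative only and reuses the `AlbertLayer`, `Config`, and `inner_group_num` names from the example above.

    #[derive(Debug, Clone)]
    struct AlbertLayerGroup {
        albert_layers: Vec<AlbertLayer>,
    }

    impl AlbertLayerGroup {
        fn load(vb: VarBuilder, config: &Config) -> Result<Self> {
            let mut albert_layers = vec![];
            for i in 0..config.inner_group_num {
                // "albert_layers.0", "albert_layers.1", ... resolve to the same
                // weight keys as the nested-struct version above.
                let layer_vb = vb.pp(format!("albert_layers.{i}"));
                albert_layers.push(AlbertLayer::load(layer_vb, config)?);
            }
            Ok(Self { albert_layers })
        }
    }

Both variants resolve to the same key paths, so the choice is mostly a matter of readability.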