The Self Attention forward part that uses a 5D tensor can be rewritten with 4D tensors #2587
Unanswered
deeperlearner asked this question in Q&A
Replies: 1 comment
-
@deeperlearner I think I actually had it like that years ago in the first implementation but opted for unbind; next in line would be chunk, for clarity. At this point I don't feel it makes sense to change this: there are too many users, and it could subtly impact things. So for your case, the best solution would be to implement a module with the necessary change locally and patch it in after creating the model. That's pretty easy to do and would let you use timm weights and the model structure for training, then export to Ambarella devices for inference. Something like this (might need some tweaks, but the idea is sound):
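A minimal sketch of that approach, assuming timm's standard qkv linear layout and default options (no qk-norm, no dropout); the `Attention4D` and `patch_attention` names are illustrative, not timm API:

```python
import torch
import torch.nn as nn


class Attention4D(nn.Module):
    """Self-attention that avoids the 5D qkv tensor: q, k, v are sliced out of
    the channel dim, so every intermediate tensor stays 4D or lower."""

    def __init__(self, dim, num_heads=8, qkv_bias=False):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x)  # (B, N, 3 * C), still 3D
        # Slice q, k, v from the channel dim instead of reshaping to 5D + unbind.
        q = qkv[..., :C].reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = qkv[..., C:2 * C].reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = qkv[..., 2 * C:].reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)  # back to (B, N, C)
        return self.proj(x)


def patch_attention(model):
    """Swap timm Attention modules for the 4D variant, copying weights over."""
    replacements = []
    for parent in model.modules():
        for name, child in parent.named_children():
            if child.__class__.__name__ == 'Attention':
                replacements.append((parent, name, child))
    for parent, name, child in replacements:
        new_attn = Attention4D(
            child.qkv.in_features,
            num_heads=child.num_heads,
            qkv_bias=child.qkv.bias is not None,
        )
        # qkv/proj weights line up 1:1; keys without params (norms, dropouts) are skipped.
        new_attn.load_state_dict(child.state_dict(), strict=False)
        setattr(parent, name, new_attn)
    return model


# Usage (model name is just an example):
# import timm
# model = timm.create_model('vit_base_patch16_224', pretrained=True)
# model = patch_attention(model)   # patch before tracing / exporting for Ambarella
```

Note this sketch drops fused SDPA and qk-norm support, so check the patched model's outputs against the original module before deploying.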
-
pytorch-image-models/timm/layers/attention.py
Lines 75 to 76 in 019550e
I'm using the Ambarella conversion toolchain, which only supports 4D tensors. This part, which uses a 5D tensor, causes an error every time the self-attention forward is called. I'm just proposing a replacement here that uses only 4D tensors (see the sketch below). I know the 5D tensor version is more readable, but the 4D tensor version solved my problem.
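For illustration, a sketch of the equivalence (assuming the usual (B, N, 3*C) qkv projection layout; the exact code at the linked lines may differ slightly):

```python
import torch

B, N, num_heads, head_dim = 2, 16, 8, 64
C = num_heads * head_dim
qkv = torch.randn(B, N, 3 * C)  # stand-in for self.qkv(x)

# 5D route (as in the linked lines): reshape to (B, N, 3, heads, head_dim),
# permute to (3, B, heads, N, head_dim), then unbind the leading axis.
q5, k5, v5 = qkv.reshape(B, N, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4).unbind(0)

# 4D-only route: slice the channel dim, then reshape/transpose each piece.
q4 = qkv[..., :C].reshape(B, N, num_heads, head_dim).transpose(1, 2)
k4 = qkv[..., C:2 * C].reshape(B, N, num_heads, head_dim).transpose(1, 2)
v4 = qkv[..., 2 * C:].reshape(B, N, num_heads, head_dim).transpose(1, 2)

# Both routes yield identical (B, num_heads, N, head_dim) tensors.
assert torch.equal(q5, q4) and torch.equal(k5, k4) and torch.equal(v5, v4)
```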