torch.index_put(..., accumulate=True) currently fails for torch.bfloat16 under torch.compile because Triton's tl.atomic_add does not support BFloat16. For additional context, see pytorch/pytorch#97016.
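A minimal repro along these lines hits the failure as I understand it (shapes, names, and the index pattern are illustrative, not taken from the linked issue):

```python
import torch

def accumulate(out, idx, vals):
    # index_put with accumulate=True scatters `vals` into `out` at `idx`,
    # summing duplicates; under torch.compile this lowers to tl.atomic_add.
    return torch.index_put(out, (idx,), vals, accumulate=True)

out = torch.zeros(16, device="cuda", dtype=torch.bfloat16)
idx = torch.tensor([0, 3, 3, 7], device="cuda")
vals = torch.ones(4, device="cuda", dtype=torch.bfloat16)

accumulate(out, idx, vals)                 # eager reference
torch.compile(accumulate)(out, idx, vals)  # fails: no BFloat16 tl.atomic_add
```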
The PTX instruction atom.add.bf16 requires compute capability 9.0+. However, when atomicAdd is compiled in CUDA for compute capability 8.0+, the compiler generates a CAS loop instead. Would it be reasonable for Triton to do the same?
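For reference, a sketch of the CAS-loop pattern in question (the function name is mine; this illustrates the approach, not the compiler's actual output):

```cuda
#include <cuda_bf16.h>

// Illustrative CAS-loop emulation of a bfloat16 atomic add. 16-bit atomicCAS
// is available from compute capability 7.0, so this pattern works below sm_90
// where the atom.add.bf16 PTX instruction is unavailable.
__device__ __nv_bfloat16 atomic_add_bf16(__nv_bfloat16 *addr, __nv_bfloat16 val) {
    unsigned short int *addr_u16 = reinterpret_cast<unsigned short int *>(addr);
    unsigned short int old_bits = *addr_u16;
    unsigned short int assumed;
    do {
        assumed = old_bits;
        // Reinterpret the stored bits as bf16, add in fp32, round back to bf16.
        float sum = __bfloat162float(__ushort_as_bfloat16(assumed)) +
                    __bfloat162float(val);
        old_bits = atomicCAS(addr_u16, assumed,
                             __bfloat16_as_ushort(__float2bfloat16(sum)));
    } while (assumed != old_bits);  // retry if another thread won the race
    return __ushort_as_bfloat16(old_bits);
}
```

A 32-bit CAS on the containing aligned word would extend the same idea to architectures without 16-bit atomicCAS, at the cost of extra retries when the neighboring element is updated concurrently.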