Refactor attention and make attention mask an argument to the model #1776
…odel

**Status**
The PR is not landable yet but serves as an RFC. If people are okay with this design, this PR requires the following changes and verifications:
1. Change all models, including the experimental ones.
2. E2E loss verification (this has been done as a functional check, but loss verification is not done yet).
3. We should add a unit test for attention. But since we don't have GPU unit tests, this can be done in a separate PR.

**Summary**
This PR aims to refactor how TorchTitan builds the attention masks and passes them to the model. Before this PR, init_attention_masks() is called in Trainer but the masks are stored as a class variable of FlexAttentionWrapper(). We chose this shortcut to support the case where a single model requires multiple masks.

The previous design has several issues; one in particular is #1723.

Now that pytorch/pytorch#164111 proves that we can let PP split BlockMask, this PR performs the refactor to pass masks as an argument of model.forward().

The new design:
1. The model needs to provide `get_attention_masks()`, which accepts `create_mask_fn`, `batch`, and `eos_id`. If the attention op is SDPA, then this API should return None, as SDPA currently doesn't support varlen. But once it does, we may have to return some tuple of ints that represents the mask. Justification: attention logic is technically a part of the model, but it requires some information from the trainer/dataloader. So it's the model author's responsibility to provide an API that the trainer can call to get the masks.
2. `get_attention_masks()` will be called from the trainer and the resulting masks are passed to model.forward(). Justification: this will allow us to fix #1723 with pytorch/pytorch#164111 and this PR.
3. Provide a single AttentionOp instead of two. Justification: since the masking logic is moved outside, we don't need to do bookkeeping of masks in FlexAttentionWrapper. The logic is so simple that one AttentionOp makes things cleaner.

Note: we still have two very thin op wrappers that are used for CP. I keep these two for CP education purposes. But this can certainly be confusing for Titan's users. I'm open to merging them into AttentionOp. See the discussion in #1723.

ghstack-source-id: e869695
Pull-Request-resolved: #1776
Very nice refactor! Left many comments lol.
@wwwjn for sliding window attention, you could just create another mask_mod following the examples here.
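To make the sliding-window suggestion concrete, here is a minimal sketch of what such a mask_mod could look like. It assumes the FlexAttention convention that a mask_mod takes `(b, h, q_idx, kv_idx)` and returns whether the query may attend to the key. The names `get_sliding_window_mask_mod` and `materialize` are hypothetical and not part of TorchTitan. Plain Python integers stand in for tensors here; with real index tensors the `and` would need to become `&`.

```python
from typing import Callable

# Convention assumed here: (batch, head, q_idx, kv_idx) -> may q attend to kv?
MaskMod = Callable[[int, int, int, int], bool]


def get_sliding_window_mask_mod(window_size: int) -> MaskMod:
    """Hypothetical helper: causal attention limited to the last `window_size` keys."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")

    def sliding_window_mask_mod(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
        causal = kv_idx <= q_idx
        in_window = q_idx - kv_idx < window_size
        return causal and in_window

    return sliding_window_mask_mod


def materialize(mask_mod: MaskMod, seq_len: int) -> list[list[bool]]:
    """Dense mask for one (batch, head) pair; for inspection only."""
    return [[mask_mod(0, 0, q, kv) for kv in range(seq_len)] for q in range(seq_len)]


window_mask = materialize(get_sliding_window_mask_mod(window_size=2), seq_len=4)
```

Each row of `window_mask` is one query position; with a window of 2, a query sees itself and the one key before it.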
torchtitan/models/attention.py
Outdated
            return _FlexAttentionWrapper._flex_attn(*args, **kwargs)


    class _ScaledDotProductAttentionWrapper(torch.nn.Module):
would be good to add comments why we have such wrappers
I'll evaluate whether we can merge the two wrappers into AttentionOp. This seems to cause a lot of confusion. Even if we enable FlexCP in the future, people may still be confused.
torchtitan/models/attention.py
Outdated
    @functools.lru_cache(4)
    def create_block_mask_fn(*args, **kwargs):
        return _compiled_create_block_mask(*args, **kwargs)
Are you trying to handle the case where CP is not applied to every Attention module (which I would prefer to delay until people definitely need the complexity)? Otherwise I don't see why we need to do this for every iteration, or why we need a cache here.
No I'm not handling that case. I think this shouldn't be a closure. Let me change it.
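As background for the caching question, the sketch below shows how `functools.lru_cache` behaves in general: it keys on the call arguments, so it only helps when the same arguments recur. All function names here are hypothetical stand-ins, and nothing below calls the real `create_block_mask`. The second half illustrates one possible reason a cache may not pay off: if a fresh closure is built every iteration, each call gets a new key even though the mask logic is identical.

```python
import functools

calls = {"count": 0}


def build_mask(mask_name: str, seq_len: int) -> tuple:
    """Stand-in for an expensive mask construction (hypothetical)."""
    calls["count"] += 1
    return (mask_name, seq_len)


@functools.lru_cache(4)
def cached_build_mask(mask_name: str, seq_len: int) -> tuple:
    return build_mask(mask_name, seq_len)


first = cached_build_mask("causal", 8192)
second = cached_build_mask("causal", 8192)  # same key: served from the cache
third = cached_build_mask("causal", 4096)   # new key: recomputed
info = cached_build_mask.cache_info()


def make_mask_mod():
    """Builds a new function object on every call, as a per-step closure would."""
    def mask_mod(b, h, q_idx, kv_idx):
        return kv_idx <= q_idx
    return mask_mod


@functools.lru_cache(4)
def cached_by_mask_mod(mask_mod, seq_len: int) -> tuple:
    return (mask_mod, seq_len)


# Functions hash by identity, so two equivalent closures never share a key.
cached_by_mask_mod(make_mask_mod(), 8192)
cached_by_mask_mod(make_mask_mod(), 8192)
closure_info = cached_by_mask_mod.cache_info()
```

Here `info` reports one hit and two misses, while `closure_info` reports no hits at all.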
Looks like the biggest concerns of this PR
Very nice refactor.
I have a small suggestion about the get_attention_masks call in train.py.
As with everything called in train.py, it would be better if it were a bit more flexible.
Hi, unsure where the best place to ask this is but this seems like a relevant recent PR. I have two questions:
SDPA doesn't support this, so CP + SDPA doesn't support it either. The current plan is to wait until SDPA supports packed sequences.
Yes, it will support packed sequences. But the current implementation will use allgather only.
sounds good to me
    self.use_flex_attn = model_args.use_flex_attn
    if self.use_flex_attn:
        self.inner_attention = FlexAttentionWrapper()
another option is to call it self.kernel, as used by some internal
Sorry, could you elaborate on why it is unsupported even in the non-CP case? According to the pseudocode can't we pass in
When I said SDPA supports packed sequences, what I meant is once SDPA supports a varlen version.
All losses match except for llama4 irope. The reason is that the original llama4 irope implementation in TorchTitan is incorrect. More precisely, the fixed_block_size_mask_mod implementation is not correct. This PR also fixes it.
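For readers unfamiliar with fixed-block masking, here is a rough sketch of the general idea: a query may only attend to keys in the same fixed-size block, combined with another constraint such as causality. This is an illustration under assumptions and not the corrected implementation from this PR. The helper names (`get_fixed_block_mask_mod`, `and_mask_mods`, `causal_mask_mod`) are hypothetical, and Python integers stand in for index tensors.

```python
from typing import Callable

MaskMod = Callable[[int, int, int, int], bool]


def causal_mask_mod(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
    return kv_idx <= q_idx


def get_fixed_block_mask_mod(fixed_block_size: int) -> MaskMod:
    """Hypothetical helper: allow attention only within the same fixed-size block."""
    if fixed_block_size <= 0:
        raise ValueError("fixed_block_size must be positive")

    def same_block_mask_mod(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
        return q_idx // fixed_block_size == kv_idx // fixed_block_size

    return same_block_mask_mod


def and_mask_mods(*mask_mods: MaskMod) -> MaskMod:
    """Hypothetical combinator: a position is visible only if every mask_mod allows it."""
    def combined(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
        return all(m(b, h, q_idx, kv_idx) for m in mask_mods)

    return combined


blocked_causal = and_mask_mods(causal_mask_mod, get_fixed_block_mask_mod(2))
dense = [[blocked_causal(0, 0, q, kv) for kv in range(4)] for q in range(4)]
```

With a block size of 2 and four positions, `dense` shows two independent causal blocks: positions 0 and 1 never see positions 2 and 3, and vice versa.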
Stack from ghstack (oldest at bottom):
Status
The PR is not landable yet but serves as an RFC. If people are okay with this design, this PR requires the following changes and verifications:
1. Change all models, including the experimental ones.
2. E2E loss verification (this has been done as a functional check, but loss verification is not done yet).
3. We should add a unit test for attention. But since we don't have GPU unit tests, this can be done in a separate PR.
Summary
This PR aims to refactor how TorchTitan builds the attention masks and passes them to the model. Before this PR, init_attention_masks() is called in Trainer but the masks are stored as a class variable of FlexAttentionWrapper(). We chose this shortcut to support the case where a single model requires multiple masks.
The previous design has several issues; one in particular is #1723.
Now that pytorch/pytorch#164111 proves that we can let PP split BlockMask, this PR performs the refactor to pass masks as an argument of model.forward().
The new design:
1. The model needs to provide `get_attention_masks()`, which accepts `create_mask_fn`, `batch`, and `eos_id`. If the attention op is SDPA, then this API should return None, as SDPA currently doesn't support varlen. But once it does, we may have to return some tuple of ints that represents the mask.
Justification: attention logic is technically a part of the model, but it requires some information from the trainer/dataloader. So it's the model author's responsibility to provide an API that the trainer can call to get the masks.
2. `get_attention_masks()` will be called from the trainer and the resulting masks are passed to model.forward().
Justification: this will allow us to fix #1723 with pytorch/pytorch#164111 and this PR.
3. Provide a single AttentionOp instead of two.
Justification: since the masking logic is moved outside, we don't need to do bookkeeping of masks in FlexAttentionWrapper. The logic is so simple that one AttentionOp makes things cleaner.
Note: we still have two very thin op wrappers that are used for CP. I keep these two for CP education purposes. But this can certainly be confusing for Titan's users. I'm open to merging them into AttentionOp.
See the discussion in #1723.
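The flow described in the new design can be sketched in plain Python. This is a toy illustration under assumptions, not TorchTitan code: `ToyModel`, `create_dense_mask`, `document_ids`, `train_step`, and the `attention_masks` keyword are hypothetical, nested lists stand in for tensors and for `BlockMask`, and the toy `forward()` just counts visible keys per query so the effect of the mask is easy to see. Only the `get_attention_masks(create_mask_fn, batch, eos_id)` shape and the "return None for SDPA" rule come from the description above.

```python
from typing import Callable, Optional

MaskMod = Callable[[int, int, int, int], bool]
Batch = list[list[int]]


def document_ids(tokens: list[int], eos_id: int) -> list[int]:
    """Assign a document index to each token; an EOS token closes its own document."""
    ids, doc = [], 0
    for token in tokens:
        ids.append(doc)
        if token == eos_id:
            doc += 1
    return ids


def create_dense_mask(mask_mod: MaskMod, batch_size: int, seq_len: int) -> list:
    """Stand-in for a BlockMask factory: materializes a [batch][q][kv] boolean mask."""
    return [
        [[mask_mod(b, 0, q, kv) for kv in range(seq_len)] for q in range(seq_len)]
        for b in range(batch_size)
    ]


class ToyModel:
    def __init__(self, use_flex_attn: bool) -> None:
        self.use_flex_attn = use_flex_attn

    def get_attention_masks(self, create_mask_fn, batch: Batch, eos_id: int) -> Optional[list]:
        if not self.use_flex_attn:
            return None  # SDPA path: no varlen support, so there is no mask to pass
        doc_ids = [document_ids(row, eos_id) for row in batch]

        def document_causal(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
            return kv_idx <= q_idx and doc_ids[b][q_idx] == doc_ids[b][kv_idx]

        return create_mask_fn(document_causal, len(batch), len(batch[0]))

    def forward(self, batch: Batch, attention_masks: Optional[list] = None) -> list:
        """Toy forward: number of keys each query can see."""
        if attention_masks is None:
            return [[q + 1 for q in range(len(row))] for row in batch]  # plain causal
        return [[sum(q_row) for q_row in mask] for mask in attention_masks]


def train_step(model: ToyModel, batch: Batch, eos_id: int) -> list:
    # The trainer asks the model for masks, then passes them into forward().
    masks = model.get_attention_masks(create_dense_mask, batch, eos_id)
    return model.forward(batch, attention_masks=masks)


batch = [[5, 6, 0, 7, 8]]  # token 0 is the EOS separating two packed documents
flex_visible = train_step(ToyModel(use_flex_attn=True), batch, eos_id=0)
sdpa_visible = train_step(ToyModel(use_flex_attn=False), batch, eos_id=0)
```

Because the masks travel through `forward()` rather than living in class-level state, whatever sits between the trainer and the model can see them as ordinary arguments, which is the property the design relies on.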
Verification
llama3
llama3 flex
llama4
llama4 irope
deepseek
deepseek flex