Update nn.py #21250
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #21250      +/-   ##
==========================================
- Coverage   82.59%   82.56%   -0.04%
==========================================
  Files         564      564
  Lines       54556    54580      +24
  Branches     8479     8486       +7
==========================================
+ Hits        45062    45065       +3
- Misses       7405     7426      +21
  Partials     2089     2089
Thanks for the PR!
The gating logic is a little confusing to me; I left some comments. Thanks!
keras/src/backend/jax/nn.py (outdated)
)
is_tpu = jax.devices()[0].platform == "tpu"

# Determine flash attention compatibility
I am very confused by the logic here.
- Why is FA disabled if inputs are sharded?
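For context, a minimal sketch of what an "inputs are sharded" check could look like on the JAX backend. The helper name `_inputs_sharded` is hypothetical (not taken from the diff), and it assumes "sharded" means the array's data spans more than one device:

```python
# Hypothetical helper (not from the PR): one way to decide whether the
# attention inputs are sharded across devices on the JAX backend.
import jax.numpy as jnp


def _inputs_sharded(*arrays):
    """Return True if any input's data spans more than one device."""
    for x in arrays:
        sharding = getattr(x, "sharding", None)
        if sharding is not None and len(sharding.device_set) > 1:
            return True
    return False


# Example: an array on a single device is not considered sharded.
q = jnp.ones((2, 128, 8, 64))
print(_inputs_sharded(q))  # False on a single-device setup
```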
keras/src/backend/jax/nn.py (outdated)
    flash_attention = (
        not inputs_sharded or is_tpu
    ) and _can_use_flash_attention(query, key, value, bias)
elif flash_attention and inputs_sharded and not is_tpu:
This condition is weird.
If FA is enabled, the inputs are sharded, and we are not running on TPU, you disable FA? Why? Can you please explain?
Following this, you check whether we are running on TPU and FA is enabled; this will never be true if the inputs are sharded, so what's the point?
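To make the question concrete, here is a self-contained paraphrase of the gating as it reads from the excerpts. Only the two conditions come from the diff; the enclosing `flash_attention is None` auto-detect branch and the wrapper function are assumptions for illustration:

```python
# Paraphrase sketch of the gating shown in the excerpts above; not the
# actual Keras code. `can_use_fa` stands in for the result of
# `_can_use_flash_attention(query, key, value, bias)`.
import jax


def _resolve_flash_attention(flash_attention, inputs_sharded, can_use_fa):
    is_tpu = jax.devices()[0].platform == "tpu"
    if flash_attention is None:
        # Auto mode: FA is only considered when inputs are unsharded or on TPU.
        flash_attention = (not inputs_sharded or is_tpu) and can_use_fa
    elif flash_attention and inputs_sharded and not is_tpu:
        # An explicit FA request is dropped for sharded, non-TPU inputs;
        # this is the branch the review comment is asking about.
        flash_attention = False
    return flash_attention
```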
keras/src/backend/jax/nn.py (outdated)

# `dot_product_attention` is only available in jax>=0.4.31
# Process mask for Splash Attention
custom_mask = None
Let's verify that numerics remain consistent with this updated mask-handling code.
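A minimal sketch of the kind of numerics check being asked for, assuming `jax.nn.dot_product_attention` (jax>=0.4.31, as the diff notes) and a GPU that supports the cuDNN flash path; the Splash Attention mask path itself is TPU-specific, so this only illustrates the comparison approach. Shapes, dtype, and tolerances are illustrative:

```python
# Sketch only: compare the flash (cuDNN) path of jax.nn.dot_product_attention
# against its reference XLA path. Requires jax>=0.4.31 and, for the cuDNN
# path, a supported GPU.
import jax
import jax.numpy as jnp
import numpy as np

rng = jax.random.PRNGKey(0)
kq, kk, kv = jax.random.split(rng, 3)
q = jax.random.normal(kq, (2, 128, 8, 64), dtype=jnp.float16)
k = jax.random.normal(kk, (2, 128, 8, 64), dtype=jnp.float16)
v = jax.random.normal(kv, (2, 128, 8, 64), dtype=jnp.float16)

ref = jax.nn.dot_product_attention(q, k, v, is_causal=True, implementation="xla")
fa = jax.nn.dot_product_attention(q, k, v, is_causal=True, implementation="cudnn")

np.testing.assert_allclose(np.asarray(ref), np.asarray(fa), rtol=2e-2, atol=2e-2)
print("flash and XLA attention outputs match within tolerance")
```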
#21254
Added support for flash attention with sharding and fixed an issue when using flash attention on TPU.
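For completeness, a hedged usage sketch of the behavior this work touches from the Keras side, assuming the `flash_attention` argument exposed by recent Keras 3 releases of `keras.ops.dot_product_attention`; whether the flash kernel is actually used depends on the hardware and on backend gating like the logic discussed above, and unsupported setups may raise an error:

```python
# Hedged usage sketch (not from the PR): request flash attention through
# the Keras op on the JAX backend. Shapes and dtype are illustrative, and
# flash_attention=True may raise on hardware where flash attention is
# unsupported.
import os

os.environ["KERAS_BACKEND"] = "jax"  # must be set before importing keras

import numpy as np
import keras

q = np.random.normal(size=(2, 128, 8, 64)).astype("float16")
k = np.random.normal(size=(2, 128, 8, 64)).astype("float16")
v = np.random.normal(size=(2, 128, 8, 64)).astype("float16")

out = keras.ops.dot_product_attention(q, k, v, is_causal=True, flash_attention=True)
print(out.shape)  # (2, 128, 8, 64)
```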