
Conversation

njhill
Member

njhill commented Sep 27, 2024

Originally disabled by @WoosukKwon in eee8e47

@WoosukKwon
Collaborator

WoosukKwon commented Sep 30, 2024

Adding the flag increased the wheel size of vllm-flash-attn from 107 MB to 143 MB (not sure whether this is before compression).

@njhill
Member Author

njhill commented Sep 30, 2024

Thanks @WoosukKwon, I guess that's fairly significant.

The motivation here is that we currently depend on flash attention for certain features like multi-step scheduling, so it's a big performance downside for models that have these other head sizes (in particular, some of the IBM Granite models use head_dim 80).

@WoosukKwon
Collaborator

@njhill We can just add head size 80 to

HEAD_DIMENSIONS = [32, 64, 96, 128, 160, 192, 224, 256]

Would that work for you?
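
For reference, a minimal sketch of what that suggestion might look like, assuming HEAD_DIMENSIONS is the plain Python list of supported head sizes in the vllm-flash-attn build configuration quoted above (not a confirmed patch):

```python
# Hypothetical sketch: extend the supported head sizes with 80 so that models
# such as IBM Granite (head_dim 80) are covered by the prebuilt kernels.
# Assumes HEAD_DIMENSIONS is the build-time list quoted in the comment above.
HEAD_DIMENSIONS = [32, 64, 80, 96, 128, 160, 192, 224, 256]
```

The trade-off, as discussed earlier in the thread, is that each additional head dimension adds compiled kernel variants and therefore increases the wheel size.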

@njhill
Member Author

njhill commented Sep 30, 2024

@WoosukKwon yes, I think for our immediate needs that would be great, if you're sure that will be sufficient!

@njhill
Member Author

njhill commented Oct 2, 2024

@WoosukKwon I've opened #22 for this

@WoosukKwon
Collaborator

@njhill Since the idea above doesn't work, can you check how much this PR affects the vllm wheel size?
