metal : optimize FA vec for large sequences and BS <= 8 #15566
Merged
target #15541

- `kernel_flash_attn_ext_reduce`
- `kernel_mul_mv_ext_f32_f32_...` specialization (needed for some MoE models)
- `llama-batched-bench` total speed report (cont batched-bench : fix unified KV cache handling + pp timing #15562)

TODO
- `nkpsg`
Perf M2 Ultra
Parallel performance for up to 8 sequences is significantly improved. The longer the sequences, the larger the gain. The comparison is against `master`, so the numbers include the speed-up from #15541, which improves large-batch prompt processing for MoE models.
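
For readers unfamiliar with why a separate reduce step helps at long sequence lengths: a common approach (assumed here, since the description does not spell out what `kernel_flash_attn_ext_reduce` does) is to split the KV range of one query into chunks, compute a partial softmax-weighted sum per chunk, and then merge the partials in a second pass. The sketch below is plain C++ on the CPU, not the Metal kernel. `Partial`, `attend_chunk`, `reduce_partials` and `split_attention` are hypothetical names, and the logits are assumed to be finite and already scaled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Partial attention result for one query over one chunk of the KV range.
struct Partial {
    float m;              // maximum logit seen in the chunk
    float s;              // sum of exp(logit - m) over the chunk
    std::vector<float> o; // sum of exp(logit - m) * V[j], not yet normalized
};

// Online-softmax pass over KV positions [begin, end) (hypothetical helper).
// logits[j] is the scaled q.k_j score, v[j] is the matching value row.
Partial attend_chunk(const std::vector<float> & logits,
                     const std::vector<std::vector<float>> & v,
                     std::size_t begin, std::size_t end) {
    assert(begin < end && end <= logits.size());
    const std::size_t d = v[begin].size();
    Partial p{-INFINITY, 0.0f, std::vector<float>(d, 0.0f)};
    for (std::size_t j = begin; j < end; ++j) {
        const float m_new = std::max(p.m, logits[j]);
        const float scale = std::exp(p.m - m_new);     // rescale old accumulators
        const float w     = std::exp(logits[j] - m_new);
        p.s = p.s * scale + w;
        for (std::size_t i = 0; i < d; ++i) {
            p.o[i] = p.o[i] * scale + w * v[j][i];
        }
        p.m = m_new;
    }
    return p;
}

// Merge per-chunk partials into the final normalized attention output.
std::vector<float> reduce_partials(const std::vector<Partial> & parts) {
    assert(!parts.empty());
    float m_all = parts[0].m;
    for (const Partial & p : parts) m_all = std::max(m_all, p.m);

    std::vector<float> out(parts[0].o.size(), 0.0f);
    float s_all = 0.0f;
    for (const Partial & p : parts) {
        const float scale = std::exp(p.m - m_all);     // bring chunk to the global max
        s_all += p.s * scale;
        for (std::size_t i = 0; i < out.size(); ++i) out[i] += p.o[i] * scale;
    }
    for (float & x : out) x /= s_all;
    return out;
}

// Split the KV range into n_chunks pieces, process each, then reduce.
std::vector<float> split_attention(const std::vector<float> & logits,
                                   const std::vector<std::vector<float>> & v,
                                   std::size_t n_chunks) {
    assert(n_chunks > 0 && !logits.empty() && logits.size() == v.size());
    const std::size_t n     = logits.size();
    const std::size_t chunk = (n + n_chunks - 1) / n_chunks;
    std::vector<Partial> parts;
    for (std::size_t b = 0; b < n; b += chunk) {
        parts.push_back(attend_chunk(logits, v, b, std::min(b + chunk, n)));
    }
    return reduce_partials(parts);
}
```

In this sketch the chunks are independent, so on a GPU each could be handled by its own threadgroup. That extra parallelism matters most when there are few queries and a long KV range, which would be consistent with the gains reported above for small batch sizes and long sequences.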