@ggerganov ggerganov commented Aug 25, 2025

target #15541

  • For large sequences, run multiple workgroups of the FA vec kernel and reduce the final results with a follow-up kernel (kernel_flash_attn_ext_reduce)
  • Avoid using more than half of the max shared memory - this significantly improves Gemma performance (likely because it has HS=256, which puts a lot of pressure on the GPU)
  • Add kernel_mul_mv_ext_f32_f32_... specialization (needed for some MoE models)
  • Fix the llama-batched-bench total speed report (cont batched-bench : fix unified KV cache handling + pp timing #15562)
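The multi-workgroup split in the first bullet is the classic split-K style flash-attention reduction: each workgroup processes a slice of the KV sequence and emits a partial output together with its running max and exp-sum, and the follow-up reduce kernel merges them with a numerically stable rescale. A minimal Python sketch of that merge step (the function name and the `(m, s, o)` layout are illustrative, not the actual Metal kernel interface):

```python
import math

def reduce_partials(partials):
    """Merge per-workgroup partial attention results.

    Each partial is (m, s, o): the max score seen by that workgroup,
    the sum of exp(score - m) over its KV slice, and the
    exp-weighted partial output vector (a list of floats).
    """
    m_all = max(m for m, _, _ in partials)   # global max over all slices
    s_all = 0.0
    o_all = [0.0] * len(partials[0][2])
    for m, s, o in partials:
        w = math.exp(m - m_all)              # rescale slice to the global max
        s_all += s * w
        for i, v in enumerate(o):
            o_all[i] += v * w
    return [v / s_all for v in o_all]        # final softmax normalization
```

Because each partial carries its own local max, the rescale keeps the merge numerically stable no matter how the KV sequence is split across workgroups.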

TODO

  • Allow to adjust nkpsg
  • Add comments
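On the first TODO item: nkpsg presumably controls how many KV positions each workgroup covers, which in turn fixes how many partial results the reduce kernel has to merge. A hypothetical tuning heuristic for picking the workgroup count (the name, default, and cap are illustrative, not values from the PR):

```python
def n_fa_workgroups(n_kv, nkpsg=256, max_wg=32):
    """Split n_kv KV-cache positions into chunks of ~nkpsg each,
    capped at max_wg workgroups. Short sequences keep a single
    workgroup so the reduce pass can be skipped entirely."""
    if n_kv <= nkpsg:
        return 1
    return min(max_wg, -(-n_kv // nkpsg))  # ceiling division
```

Making nkpsg adjustable would let this split be tuned per head size and GPU, trading parallelism against reduction overhead.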

Perf M2 Ultra

Parallel performance with up to 8 sequences is significantly improved; the longer the sequence, the higher the gain. This is a comparison against master - the PR includes the speed-up from #15541, which improves large-batch prompt processing for MoE.

scripts/compare-commits.sh master gg/metal-fa-vec-opt-2 llama-bench -m ./models/qwen3-30b-a3b-coder/ggml-model-q4_0.gguf -m ./models/gpt-oss-20b/ggml-model-mxfp4.gguf -m models/gpt-oss-120b/ggml-model-mxfp4.gguf -m ./models/qwen3-4b-instruct-2507/ggml-model-q8_0.gguf -m ./models/gemma-3-1b-it/ggml-model-q4_0.gguf -m models/gemma-3-4b/ggml-model-q4_0.gguf -m models/deepseek-v2-lite-chat/ggml-model-q4_k.gguf -fa 1 -t 1 -ub 2048 -d 0,512,1024,2048,4096,8192,16384,32768 -n 128 -r 1 -p 0
| Model | Test | t/s master | t/s gg/metal-fa-vec-opt-2 | Speedup |
|-------|------|------------|---------------------------|---------|
| deepseek2 16B Q4_K_M | tg128 | 116.16 | 116.50 | 1.00 |
| deepseek2 16B Q4_K_M | tg128@d512 | 107.84 | 109.84 | 1.02 |
| deepseek2 16B Q4_K_M | tg128@d1024 | 100.70 | 104.67 | 1.04 |
| deepseek2 16B Q4_K_M | tg128@d2048 | 92.85 | 106.26 | 1.14 |
| deepseek2 16B Q4_K_M | tg128@d4096 | 80.13 | 97.56 | 1.22 |
| deepseek2 16B Q4_K_M | tg128@d8192 | 63.33 | 83.86 | 1.32 |
| deepseek2 16B Q4_K_M | tg128@d16384 | 44.12 | 66.05 | 1.50 |
| deepseek2 16B Q4_K_M | tg128@d32768 | 27.40 | 46.16 | 1.69 |
| gemma3 1B Q4_0 | tg128 | 209.10 | 209.63 | 1.00 |
| gemma3 1B Q4_0 | tg128@d512 | 188.61 | 183.54 | 0.97 |
| gemma3 1B Q4_0 | tg128@d1024 | 185.13 | 201.75 | 1.09 |
| gemma3 1B Q4_0 | tg128@d2048 | 179.45 | 200.37 | 1.12 |
| gemma3 1B Q4_0 | tg128@d4096 | 170.35 | 197.92 | 1.16 |
| gemma3 1B Q4_0 | tg128@d8192 | 155.22 | 193.96 | 1.25 |
| gemma3 1B Q4_0 | tg128@d16384 | 130.38 | 186.47 | 1.43 |
| gemma3 1B Q4_0 | tg128@d32768 | 98.65 | 172.59 | 1.75 |
| gemma3 4B Q4_0 | tg128 | 124.28 | 124.31 | 1.00 |
| gemma3 4B Q4_0 | tg128@d512 | 113.91 | 112.14 | 0.98 |
| gemma3 4B Q4_0 | tg128@d1024 | 105.78 | 116.80 | 1.10 |
| gemma3 4B Q4_0 | tg128@d2048 | 103.59 | 116.77 | 1.13 |
| gemma3 4B Q4_0 | tg128@d4096 | 99.46 | 115.01 | 1.16 |
| gemma3 4B Q4_0 | tg128@d8192 | 92.45 | 112.76 | 1.22 |
| gemma3 4B Q4_0 | tg128@d16384 | 81.00 | 108.57 | 1.34 |
| gemma3 4B Q4_0 | tg128@d32768 | 64.09 | 101.14 | 1.58 |
| gpt-oss 120B MXFP4 MoE | tg128 | 80.10 | 80.16 | 1.00 |
| gpt-oss 120B MXFP4 MoE | tg128@d512 | 75.01 | 79.04 | 1.05 |
| gpt-oss 120B MXFP4 MoE | tg128@d1024 | 75.02 | 78.28 | 1.04 |
| gpt-oss 120B MXFP4 MoE | tg128@d2048 | 72.28 | 74.44 | 1.03 |
| gpt-oss 120B MXFP4 MoE | tg128@d4096 | 69.73 | 73.28 | 1.05 |
| gpt-oss 120B MXFP4 MoE | tg128@d8192 | 67.01 | 70.38 | 1.05 |
| gpt-oss 120B MXFP4 MoE | tg128@d16384 | 59.98 | 65.34 | 1.09 |
| gpt-oss 120B MXFP4 MoE | tg128@d32768 | 50.94 | 57.36 | 1.13 |
| gpt-oss 20B MXFP4 MoE | tg128 | 116.39 | 116.84 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128@d512 | 114.74 | 114.89 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128@d1024 | 112.47 | 113.89 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg128@d2048 | 110.12 | 108.47 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg128@d4096 | 104.85 | 106.40 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg128@d8192 | 98.95 | 102.07 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg128@d16384 | 92.30 | 95.11 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg128@d32768 | 76.43 | 83.01 | 1.09 |
| qwen3 4B Q8_0 | tg128 | 100.56 | 100.18 | 1.00 |
| qwen3 4B Q8_0 | tg128@d512 | 96.24 | 96.25 | 1.00 |
| qwen3 4B Q8_0 | tg128@d1024 | 92.75 | 92.73 | 1.00 |
| qwen3 4B Q8_0 | tg128@d2048 | 86.92 | 92.25 | 1.06 |
| qwen3 4B Q8_0 | tg128@d4096 | 77.07 | 87.32 | 1.13 |
| qwen3 4B Q8_0 | tg128@d8192 | 62.90 | 78.21 | 1.24 |
| qwen3 4B Q8_0 | tg128@d16384 | 46.08 | 64.68 | 1.40 |
| qwen3 4B Q8_0 | tg128@d32768 | 30.06 | 48.09 | 1.60 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 89.09 | 88.96 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | tg128@d512 | 83.39 | 84.58 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128@d1024 | 79.82 | 81.10 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | tg128@d2048 | 73.54 | 80.89 | 1.10 |
| qwen3moe 30B.A3B Q4_0 | tg128@d4096 | 64.85 | 78.36 | 1.21 |
| qwen3moe 30B.A3B Q4_0 | tg128@d8192 | 52.71 | 69.64 | 1.32 |
| qwen3moe 30B.A3B Q4_0 | tg128@d16384 | 38.37 | 56.77 | 1.48 |
| qwen3moe 30B.A3B Q4_0 | tg128@d32768 | 24.48 | 42.07 | 1.72 |
gpt-oss-120b master

main: n_kv_max = 264192, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 1, n_gpu_layers = -1, n_threads = 1, n_threads_batch = 1
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|
|     0 |     32 |    1 |     32 |    0.000 |     0.00 |    0.402 |    79.69 |    0.402 |
|     0 |     32 |    2 |     64 |    0.000 |     0.00 |    0.569 |   112.47 |    0.569 |
|     0 |     32 |    4 |    128 |    0.000 |     0.00 |    0.870 |   147.19 |    0.870 |
|     0 |     32 |    8 |    256 |    0.000 |     0.00 |    1.638 |   156.25 |    1.639 |
|  1024 |     32 |    1 |   1056 |    0.980 |  1044.79 |    0.431 |    74.29 |    1.411 |
|  1024 |     32 |    2 |   2112 |    0.892 |  1147.39 |    0.609 |   105.04 |    1.502 |
|  1024 |     32 |    4 |   4224 |    0.893 |  1146.54 |    0.927 |   138.06 |    1.820 |
|  1024 |     32 |    8 |   8448 |    0.887 |  1154.32 |    1.768 |   144.77 |    2.655 |
|  2048 |     32 |    1 |   2080 |    1.802 |  1136.42 |    0.418 |    76.61 |    2.220 |
|  2048 |     32 |    2 |   4160 |    1.670 |  1226.42 |    0.608 |   105.35 |    2.277 |
|  2048 |     32 |    4 |   8320 |    1.672 |  1224.55 |    0.939 |   136.32 |    2.611 |
|  2048 |     32 |    8 |  16640 |    1.732 |  1182.58 |    1.781 |   143.76 |    3.513 |
|  4096 |     32 |    1 |   4128 |    3.510 |  1166.82 |    0.432 |    74.08 |    3.942 |
|  4096 |     32 |    2 |   8256 |    3.484 |  1175.76 |    0.630 |   101.59 |    4.114 |
|  4096 |     32 |    4 |  16512 |    3.483 |  1176.09 |    0.981 |   130.50 |    4.464 |
|  4096 |     32 |    8 |  33024 |    3.486 |  1175.15 |    1.864 |   137.34 |    5.349 |
|  8192 |     32 |    1 |   8224 |    7.556 |  1084.23 |    0.456 |    70.12 |    8.012 |
|  8192 |     32 |    2 |  16448 |    7.525 |  1088.61 |    0.684 |    93.62 |    8.209 |
|  8192 |     32 |    4 |  32896 |    7.536 |  1087.10 |    1.085 |   118.00 |    8.620 |
|  8192 |     32 |    8 |  65792 |    7.550 |  1084.98 |    2.077 |   123.27 |    9.627 |
| 16384 |     32 |    1 |  16416 |   17.332 |   945.32 |    0.506 |    63.30 |   17.837 |
| 16384 |     32 |    2 |  32832 |   17.333 |   945.26 |    0.788 |    81.24 |   18.121 |
| 16384 |     32 |    4 |  65664 |   17.365 |   943.49 |    1.329 |    96.31 |   18.694 |
| 16384 |     32 |    8 | 131328 |   17.367 |   943.39 |    2.493 |   102.68 |   19.860 |
| 32768 |     32 |    1 |  32800 |   43.822 |   747.75 |    0.604 |    52.95 |   44.426 |
| 32768 |     32 |    2 |  65600 |   43.802 |   748.10 |    1.007 |    63.57 |   44.808 |
| 32768 |     32 |    4 | 131200 |   43.853 |   747.23 |    1.782 |    71.84 |   45.634 |
| 32768 |     32 |    8 | 262400 |   43.824 |   747.71 |    3.416 |    74.94 |   47.240 |

gpt-oss-120b PR

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|
|     0 |     32 |    1 |     32 |    0.000 |     0.00 |    0.404 |    79.24 |    0.404 |
|     0 |     32 |    2 |     64 |    0.000 |     0.00 |    0.574 |   111.56 |    0.574 |
|     0 |     32 |    4 |    128 |    0.000 |     0.00 |    0.873 |   146.59 |    0.873 |
|     0 |     32 |    8 |    256 |    0.000 |     0.00 |    1.423 |   179.85 |    1.424 |
|  1024 |     32 |    1 |   1056 |    0.726 |  1410.42 |    0.412 |    77.76 |    1.138 |
|  1024 |     32 |    2 |   2112 |    0.705 |  1452.52 |    0.597 |   107.14 |    1.302 |
|  1024 |     32 |    4 |   4224 |    0.706 |  1449.94 |    0.919 |   139.29 |    1.625 |
|  1024 |     32 |    8 |   8448 |    0.700 |  1462.11 |    1.508 |   169.80 |    2.208 |
|  2048 |     32 |    1 |   2080 |    1.288 |  1589.73 |    0.434 |    73.71 |    1.722 |
|  2048 |     32 |    2 |   4160 |    1.284 |  1595.11 |    0.629 |   101.82 |    1.912 |
|  2048 |     32 |    4 |   8320 |    1.282 |  1597.98 |    0.976 |   131.10 |    2.258 |
|  2048 |     32 |    8 |  16640 |    1.281 |  1598.60 |    1.622 |   157.86 |    2.903 |
|  4096 |     32 |    1 |   4128 |    2.708 |  1512.32 |    0.440 |    72.66 |    3.149 |
|  4096 |     32 |    2 |   8256 |    2.706 |  1513.77 |    0.644 |    99.37 |    3.350 |
|  4096 |     32 |    4 |  16512 |    2.707 |  1512.85 |    1.005 |   127.32 |    3.713 |
|  4096 |     32 |    8 |  33024 |    2.706 |  1513.43 |    1.682 |   152.16 |    4.389 |
|  8192 |     32 |    1 |   8224 |    5.975 |  1371.06 |    0.458 |    69.94 |    6.432 |
|  8192 |     32 |    2 |  16448 |    5.972 |  1371.63 |    0.674 |    95.02 |    6.646 |
|  8192 |     32 |    4 |  32896 |    5.972 |  1371.64 |    1.062 |   120.57 |    7.034 |
|  8192 |     32 |    8 |  65792 |    5.971 |  1371.89 |    1.792 |   142.88 |    7.763 |
| 16384 |     32 |    1 |  16416 |   14.195 |  1154.25 |    0.492 |    65.10 |   14.686 |
| 16384 |     32 |    2 |  32832 |   14.192 |  1154.45 |    0.735 |    87.08 |   14.927 |
| 16384 |     32 |    4 |  65664 |   14.192 |  1154.47 |    1.181 |   108.39 |   15.373 |
| 16384 |     32 |    8 | 131328 |   14.198 |  1153.95 |    2.015 |   127.04 |   16.213 |
| 32768 |     32 |    1 |  32800 |   37.492 |   873.99 |    0.560 |    57.10 |   38.053 |
| 32768 |     32 |    2 |  65600 |   37.522 |   873.29 |    0.861 |    74.33 |   38.383 |
| 32768 |     32 |    4 | 131200 |   37.491 |   874.02 |    1.416 |    90.40 |   38.907 |
| 32768 |     32 |    8 | 262400 |   37.535 |   873.01 |    2.486 |   102.99 |   40.020 |

---

qwen3 30b a3b master

main: n_kv_max = 264192, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 1, n_gpu_layers = -1, n_threads = 1, n_threads_batch = 1
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|
|     0 |     32 |    1 |     32 |    0.000 |     0.00 |    0.423 |    75.59 |    0.424 |
|     0 |     32 |    2 |     64 |    0.000 |     0.00 |    0.538 |   118.94 |    0.538 |
|     0 |     32 |    4 |    128 |    0.000 |     0.00 |    0.725 |   176.49 |    0.725 |
|     0 |     32 |    8 |    256 |    0.000 |     0.00 |    1.044 |   245.28 |    1.044 |
|  1024 |     32 |    1 |   1056 |    0.899 |  1139.40 |    0.455 |    70.25 |    1.354 |
|  1024 |     32 |    2 |   2112 |    0.715 |  1432.46 |    0.583 |   109.80 |    1.298 |
|  1024 |     32 |    4 |   4224 |    0.715 |  1431.71 |    0.824 |   155.37 |    1.539 |
|  1024 |     32 |    8 |   8448 |    0.718 |  1425.56 |    1.179 |   217.08 |    1.898 |
|  2048 |     32 |    1 |   2080 |    1.720 |  1190.55 |    0.488 |    65.61 |    2.208 |
|  2048 |     32 |    2 |   4160 |    1.448 |  1414.43 |    0.600 |   106.74 |    2.047 |
|  2048 |     32 |    4 |   8320 |    1.445 |  1416.90 |    0.884 |   144.80 |    2.329 |
|  2048 |     32 |    8 |  16640 |    1.443 |  1419.59 |    1.308 |   195.67 |    2.751 |
|  4096 |     32 |    1 |   4128 |    3.238 |  1264.88 |    0.550 |    58.17 |    3.788 |
|  4096 |     32 |    2 |   8256 |    3.180 |  1288.08 |    0.668 |    95.86 |    3.848 |
|  4096 |     32 |    4 |  16512 |    3.181 |  1287.63 |    1.005 |   127.32 |    4.186 |
|  4096 |     32 |    8 |  33024 |    3.258 |  1257.25 |    1.538 |   166.50 |    4.795 |
|  8192 |     32 |    1 |   8224 |    7.652 |  1070.55 |    0.671 |    47.66 |    8.323 |
|  8192 |     32 |    2 |  16448 |    7.591 |  1079.11 |    0.800 |    80.00 |    8.391 |
|  8192 |     32 |    4 |  32896 |    7.613 |  1076.02 |    1.300 |    98.49 |    8.913 |
|  8192 |     32 |    8 |  65792 |    7.668 |  1068.33 |    2.119 |   120.81 |    9.787 |
| 16384 |     32 |    1 |  16416 |   20.417 |   802.47 |    0.907 |    35.27 |   21.324 |
| 16384 |     32 |    2 |  32832 |   20.413 |   802.64 |    1.062 |    60.25 |   21.475 |
| 16384 |     32 |    4 |  65664 |   20.454 |   801.02 |    1.843 |    69.44 |   22.297 |
| 16384 |     32 |    8 | 131328 |   20.431 |   801.92 |    3.212 |    79.71 |   23.643 |
| 32768 |     32 |    1 |  32800 |   61.786 |   530.35 |    1.422 |    22.50 |   63.208 |
| 32768 |     32 |    2 |  65600 |   61.811 |   530.13 |    1.619 |    39.53 |   63.431 |
| 32768 |     32 |    4 | 131200 |   61.759 |   530.57 |    2.858 |    44.79 |   64.617 |
| 32768 |     32 |    8 | 262400 |   61.786 |   530.34 |    5.426 |    47.18 |   67.212 |


qwen3 30b a3b PR

main: n_kv_max = 264192, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 1, n_gpu_layers = -1, n_threads = 1, n_threads_batch = 1
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|
|     0 |     32 |    1 |     32 |    0.000 |     0.00 |    0.425 |    75.36 |    0.425 |
|     0 |     32 |    2 |     64 |    0.000 |     0.00 |    0.529 |   120.88 |    0.530 |
|     0 |     32 |    4 |    128 |    0.000 |     0.00 |    0.718 |   178.36 |    0.718 |
|     0 |     32 |    8 |    256 |    0.000 |     0.00 |    0.956 |   267.65 |    0.957 |
|  1024 |     32 |    1 |   1056 |    0.472 |  2168.84 |    0.458 |    69.87 |    0.930 |
|  1024 |     32 |    2 |   2112 |    0.449 |  2279.58 |    0.561 |   113.99 |    1.011 |
|  1024 |     32 |    4 |   4224 |    0.451 |  2269.84 |    0.791 |   161.74 |    1.243 |
|  1024 |     32 |    8 |   8448 |    0.449 |  2280.20 |    1.077 |   237.59 |    1.527 |
|  2048 |     32 |    1 |   2080 |    0.907 |  2259.22 |    0.456 |    70.24 |    1.362 |
|  2048 |     32 |    2 |   4160 |    0.904 |  2264.99 |    0.582 |   110.04 |    1.486 |
|  2048 |     32 |    4 |   8320 |    0.902 |  2269.60 |    0.824 |   155.39 |    1.726 |
|  2048 |     32 |    8 |  16640 |    0.902 |  2269.57 |    1.148 |   223.09 |    2.050 |
|  4096 |     32 |    1 |   4128 |    2.115 |  1936.44 |    0.478 |    67.00 |    2.593 |
|  4096 |     32 |    2 |   8256 |    2.111 |  1940.60 |    0.620 |   103.28 |    2.730 |
|  4096 |     32 |    4 |  16512 |    2.111 |  1940.11 |    0.892 |   143.49 |    3.003 |
|  4096 |     32 |    8 |  33024 |    2.111 |  1940.15 |    1.285 |   199.15 |    3.397 |
|  8192 |     32 |    1 |   8224 |    5.468 |  1498.08 |    0.528 |    60.55 |    5.997 |
|  8192 |     32 |    2 |  16448 |    5.464 |  1499.21 |    0.696 |    92.00 |    6.160 |
|  8192 |     32 |    4 |  32896 |    5.462 |  1499.84 |    1.031 |   124.19 |    6.493 |
|  8192 |     32 |    8 |  65792 |    5.469 |  1497.95 |    1.551 |   165.02 |    7.020 |
| 16384 |     32 |    1 |  16416 |   16.038 |  1021.58 |    0.627 |    51.01 |   16.665 |
| 16384 |     32 |    2 |  32832 |   16.041 |  1021.41 |    0.850 |    75.29 |   16.891 |
| 16384 |     32 |    4 |  65664 |   16.039 |  1021.53 |    1.306 |    97.99 |   17.345 |
| 16384 |     32 |    8 | 131328 |   16.037 |  1021.66 |    2.110 |   121.32 |   18.147 |
| 32768 |     32 |    1 |  32800 |   52.989 |   618.39 |    0.826 |    38.76 |   53.814 |
| 32768 |     32 |    2 |  65600 |   52.974 |   618.56 |    1.200 |    53.34 |   54.174 |
| 32768 |     32 |    4 | 131200 |   52.977 |   618.54 |    1.935 |    66.15 |   54.912 |
| 32768 |     32 |    8 | 262400 |   52.935 |   619.02 |    3.395 |    75.41 |   56.330 |

@github-actions github-actions bot added the examples, ggml (changes relating to the ggml tensor library for machine learning), and Apple Metal labels Aug 25, 2025
Base automatically changed from gg/metal-mmid-opt to master August 26, 2025 09:46
@ggerganov ggerganov force-pushed the gg/metal-fa-vec-opt-2 branch from aed06a9 to 6d0b222 Compare August 26, 2025 09:47
@ggerganov ggerganov merged commit b3964c1 into master Aug 26, 2025
55 of 56 checks passed
@ggerganov ggerganov deleted the gg/metal-fa-vec-opt-2 branch August 26, 2025 11:22
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 27, 2025
* metal : optmize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci