@ggerganov ggerganov commented Aug 25, 2025

target #15541

  • For large sequences, run multiple workgroups of the FA vec kernel and reduce the final results with a follow-up kernel (kernel_flash_attn_ext_reduce)
  • Avoid using more than half of the max shared memory - this significantly improves Gemma performance (likely because it has HS=256, which puts a lot of pressure on the GPU)
  • Add kernel_mul_mv_ext_f32_f32_... specialization (needed for some MoE models)
  • Fix the llama-batched-bench total speed report (cont batched-bench : fix unified KV cache handling + pp timing #15562)
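The multi-workgroup split in the first bullet is the classic split-K style flash-attention reduction: each workgroup processes a slice of the KV sequence and emits a partial output together with its running max and exp-sum, and the follow-up reduce kernel merges them with a numerically stable rescale. A minimal Python sketch of that merge step (the function name and the `(m, s, o)` layout are illustrative, not the actual Metal kernel interface):

```python
import math

def reduce_partials(partials):
    """Merge per-workgroup partial attention results.

    Each partial is (m, s, o): the max score seen by that workgroup,
    the sum of exp(score - m) over its KV slice, and the
    exp-weighted partial output vector (a list of floats).
    """
    m_all = max(m for m, _, _ in partials)   # global max over all slices
    s_all = 0.0
    o_all = [0.0] * len(partials[0][2])
    for m, s, o in partials:
        w = math.exp(m - m_all)              # rescale slice to the global max
        s_all += s * w
        for i, v in enumerate(o):
            o_all[i] += v * w
    return [v / s_all for v in o_all]        # final softmax normalization
```

Because each partial carries its own local max, the rescale keeps the merge numerically stable no matter how the KV sequence is split across workgroups.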

TODO

  • Allow to adjust nkpsg
  • Add comments
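On the first TODO item: nkpsg presumably controls how many KV positions each workgroup covers, which in turn fixes how many partial results the reduce kernel has to merge. A hypothetical tuning heuristic for picking the workgroup count (the name, default, and cap are illustrative, not values from the PR):

```python
def n_fa_workgroups(n_kv, nkpsg=256, max_wg=32):
    """Split n_kv KV-cache positions into chunks of ~nkpsg each,
    capped at max_wg workgroups. Short sequences keep a single
    workgroup so the reduce pass can be skipped entirely."""
    if n_kv <= nkpsg:
        return 1
    return min(max_wg, -(-n_kv // nkpsg))  # ceiling division
```

Making nkpsg adjustable would let this split be tuned per head size and GPU, trading parallelism against reduction overhead.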

Perf M2 Ultra

Parallel performance with up to 8 sequences is significantly improved; the longer the sequence, the higher the gain. This is a comparison against master - the PR includes the speed-up from #15541, which improves large-batch prompt processing for MoE.

scripts/compare-commits.sh master gg/metal-fa-vec-opt-2 llama-bench -m ./models/qwen3-30b-a3b-coder/ggml-model-q4_0.gguf -m ./models/gpt-oss-20b/ggml-model-mxfp4.gguf -m models/gpt-oss-120b/ggml-model-mxfp4.gguf -m ./models/qwen3-4b-instruct-2507/ggml-model-q8_0.gguf -m ./models/gemma-3-1b-it/ggml-model-q4_0.gguf -m models/gemma-3-4b/ggml-model-q4_0.gguf -m models/deepseek-v2-lite-chat/ggml-model-q4_k.gguf -fa 1 -t 1 -ub 2048 -d 0,512,1024,2048,4096,8192,16384,32768 -n 128 -r 1 -p 0
| Model | Test | t/s master | t/s gg/metal-fa-vec-opt-2 | Speedup |
|-------|------|------------|---------------------------|---------|
| deepseek2 16B Q4_K_M | tg128 | 116.16 | 116.50 | 1.00 |
| deepseek2 16B Q4_K_M | tg128@d512 | 107.84 | 109.84 | 1.02 |
| deepseek2 16B Q4_K_M | tg128@d1024 | 100.70 | 104.67 | 1.04 |
| deepseek2 16B Q4_K_M | tg128@d2048 | 92.85 | 106.26 | 1.14 |
| deepseek2 16B Q4_K_M | tg128@d4096 | 80.13 | 97.56 | 1.22 |
| deepseek2 16B Q4_K_M | tg128@d8192 | 63.33 | 83.86 | 1.32 |
| deepseek2 16B Q4_K_M | tg128@d16384 | 44.12 | 66.05 | 1.50 |
| deepseek2 16B Q4_K_M | tg128@d32768 | 27.40 | 46.16 | 1.69 |
| gemma3 1B Q4_0 | tg128 | 209.10 | 209.63 | 1.00 |
| gemma3 1B Q4_0 | tg128@d512 | 188.61 | 183.54 | 0.97 |
| gemma3 1B Q4_0 | tg128@d1024 | 185.13 | 201.75 | 1.09 |
| gemma3 1B Q4_0 | tg128@d2048 | 179.45 | 200.37 | 1.12 |
| gemma3 1B Q4_0 | tg128@d4096 | 170.35 | 197.92 | 1.16 |
| gemma3 1B Q4_0 | tg128@d8192 | 155.22 | 193.96 | 1.25 |
| gemma3 1B Q4_0 | tg128@d16384 | 130.38 | 186.47 | 1.43 |
| gemma3 1B Q4_0 | tg128@d32768 | 98.65 | 172.59 | 1.75 |
| gemma3 4B Q4_0 | tg128 | 124.28 | 124.31 | 1.00 |
| gemma3 4B Q4_0 | tg128@d512 | 113.91 | 112.14 | 0.98 |
| gemma3 4B Q4_0 | tg128@d1024 | 105.78 | 116.80 | 1.10 |
| gemma3 4B Q4_0 | tg128@d2048 | 103.59 | 116.77 | 1.13 |
| gemma3 4B Q4_0 | tg128@d4096 | 99.46 | 115.01 | 1.16 |
| gemma3 4B Q4_0 | tg128@d8192 | 92.45 | 112.76 | 1.22 |
| gemma3 4B Q4_0 | tg128@d16384 | 81.00 | 108.57 | 1.34 |
| gemma3 4B Q4_0 | tg128@d32768 | 64.09 | 101.14 | 1.58 |
| gpt-oss 120B MXFP4 MoE | tg128 | 80.10 | 80.16 | 1.00 |
| gpt-oss 120B MXFP4 MoE | tg128@d512 | 75.01 | 79.04 | 1.05 |
| gpt-oss 120B MXFP4 MoE | tg128@d1024 | 75.02 | 78.28 | 1.04 |
| gpt-oss 120B MXFP4 MoE | tg128@d2048 | 72.28 | 74.44 | 1.03 |
| gpt-oss 120B MXFP4 MoE | tg128@d4096 | 69.73 | 73.28 | 1.05 |
| gpt-oss 120B MXFP4 MoE | tg128@d8192 | 67.01 | 70.38 | 1.05 |
| gpt-oss 120B MXFP4 MoE | tg128@d16384 | 59.98 | 65.34 | 1.09 |
| gpt-oss 120B MXFP4 MoE | tg128@d32768 | 50.94 | 57.36 | 1.13 |
| gpt-oss 20B MXFP4 MoE | tg128 | 116.39 | 116.84 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128@d512 | 114.74 | 114.89 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128@d1024 | 112.47 | 113.89 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg128@d2048 | 110.12 | 108.47 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg128@d4096 | 104.85 | 106.40 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg128@d8192 | 98.95 | 102.07 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg128@d16384 | 92.30 | 95.11 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg128@d32768 | 76.43 | 83.01 | 1.09 |
| qwen3 4B Q8_0 | tg128 | 100.56 | 100.18 | 1.00 |
| qwen3 4B Q8_0 | tg128@d512 | 96.24 | 96.25 | 1.00 |
| qwen3 4B Q8_0 | tg128@d1024 | 92.75 | 92.73 | 1.00 |
| qwen3 4B Q8_0 | tg128@d2048 | 86.92 | 92.25 | 1.06 |
| qwen3 4B Q8_0 | tg128@d4096 | 77.07 | 87.32 | 1.13 |
| qwen3 4B Q8_0 | tg128@d8192 | 62.90 | 78.21 | 1.24 |
| qwen3 4B Q8_0 | tg128@d16384 | 46.08 | 64.68 | 1.40 |
| qwen3 4B Q8_0 | tg128@d32768 | 30.06 | 48.09 | 1.60 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 89.09 | 88.96 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | tg128@d512 | 83.39 | 84.58 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128@d1024 | 79.82 | 81.10 | 1.02 |
| qwen3moe 30B.A3B Q4_0 | tg128@d2048 | 73.54 | 80.89 | 1.10 |
| qwen3moe 30B.A3B Q4_0 | tg128@d4096 | 64.85 | 78.36 | 1.21 |
| qwen3moe 30B.A3B Q4_0 | tg128@d8192 | 52.71 | 69.64 | 1.32 |
| qwen3moe 30B.A3B Q4_0 | tg128@d16384 | 38.37 | 56.77 | 1.48 |
| qwen3moe 30B.A3B Q4_0 | tg128@d32768 | 24.48 | 42.07 | 1.72 |
gpt-oss-120b master

main: n_kv_max = 264192, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 1, n_gpu_layers = -1, n_threads = 1, n_threads_batch = 1
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|
|     0 |     32 |    1 |     32 |    0.000 |     0.00 |    0.402 |    79.69 |    0.402 |
|     0 |     32 |    2 |     64 |    0.000 |     0.00 |    0.569 |   112.47 |    0.569 |
|     0 |     32 |    4 |    128 |    0.000 |     0.00 |    0.870 |   147.19 |    0.870 |
|     0 |     32 |    8 |    256 |    0.000 |     0.00 |    1.638 |   156.25 |    1.639 |
|  1024 |     32 |    1 |   1056 |    0.980 |  1044.79 |    0.431 |    74.29 |    1.411 |
|  1024 |     32 |    2 |   2112 |    0.892 |  1147.39 |    0.609 |   105.04 |    1.502 |
|  1024 |     32 |    4 |   4224 |    0.893 |  1146.54 |    0.927 |   138.06 |    1.820 |
|  1024 |     32 |    8 |   8448 |    0.887 |  1154.32 |    1.768 |   144.77 |    2.655 |
|  2048 |     32 |    1 |   2080 |    1.802 |  1136.42 |    0.418 |    76.61 |    2.220 |
|  2048 |     32 |    2 |   4160 |    1.670 |  1226.42 |    0.608 |   105.35 |    2.277 |
|  2048 |     32 |    4 |   8320 |    1.672 |  1224.55 |    0.939 |   136.32 |    2.611 |
|  2048 |     32 |    8 |  16640 |    1.732 |  1182.58 |    1.781 |   143.76 |    3.513 |
|  4096 |     32 |    1 |   4128 |    3.510 |  1166.82 |    0.432 |    74.08 |    3.942 |
|  4096 |     32 |    2 |   8256 |    3.484 |  1175.76 |    0.630 |   101.59 |    4.114 |
|  4096 |     32 |    4 |  16512 |    3.483 |  1176.09 |    0.981 |   130.50 |    4.464 |
|  4096 |     32 |    8 |  33024 |    3.486 |  1175.15 |    1.864 |   137.34 |    5.349 |
|  8192 |     32 |    1 |   8224 |    7.556 |  1084.23 |    0.456 |    70.12 |    8.012 |
|  8192 |     32 |    2 |  16448 |    7.525 |  1088.61 |    0.684 |    93.62 |    8.209 |
|  8192 |     32 |    4 |  32896 |    7.536 |  1087.10 |    1.085 |   118.00 |    8.620 |
|  8192 |     32 |    8 |  65792 |    7.550 |  1084.98 |    2.077 |   123.27 |    9.627 |
| 16384 |     32 |    1 |  16416 |   17.332 |   945.32 |    0.506 |    63.30 |   17.837 |
| 16384 |     32 |    2 |  32832 |   17.333 |   945.26 |    0.788 |    81.24 |   18.121 |
| 16384 |     32 |    4 |  65664 |   17.365 |   943.49 |    1.329 |    96.31 |   18.694 |
| 16384 |     32 |    8 | 131328 |   17.367 |   943.39 |    2.493 |   102.68 |   19.860 |
| 32768 |     32 |    1 |  32800 |   43.822 |   747.75 |    0.604 |    52.95 |   44.426 |
| 32768 |     32 |    2 |  65600 |   43.802 |   748.10 |    1.007 |    63.57 |   44.808 |
| 32768 |     32 |    4 | 131200 |   43.853 |   747.23 |    1.782 |    71.84 |   45.634 |
| 32768 |     32 |    8 | 262400 |   43.824 |   747.71 |    3.416 |    74.94 |   47.240 |

gpt-oss-120b PR

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|
|     0 |     32 |    1 |     32 |    0.000 |     0.00 |    0.404 |    79.24 |    0.404 |
|     0 |     32 |    2 |     64 |    0.000 |     0.00 |    0.574 |   111.56 |    0.574 |
|     0 |     32 |    4 |    128 |    0.000 |     0.00 |    0.873 |   146.59 |    0.873 |
|     0 |     32 |    8 |    256 |    0.000 |     0.00 |    1.423 |   179.85 |    1.424 |
|  1024 |     32 |    1 |   1056 |    0.726 |  1410.42 |    0.412 |    77.76 |    1.138 |
|  1024 |     32 |    2 |   2112 |    0.705 |  1452.52 |    0.597 |   107.14 |    1.302 |
|  1024 |     32 |    4 |   4224 |    0.706 |  1449.94 |    0.919 |   139.29 |    1.625 |
|  1024 |     32 |    8 |   8448 |    0.700 |  1462.11 |    1.508 |   169.80 |    2.208 |
|  2048 |     32 |    1 |   2080 |    1.288 |  1589.73 |    0.434 |    73.71 |    1.722 |
|  2048 |     32 |    2 |   4160 |    1.284 |  1595.11 |    0.629 |   101.82 |    1.912 |
|  2048 |     32 |    4 |   8320 |    1.282 |  1597.98 |    0.976 |   131.10 |    2.258 |
|  2048 |     32 |    8 |  16640 |    1.281 |  1598.60 |    1.622 |   157.86 |    2.903 |
|  4096 |     32 |    1 |   4128 |    2.708 |  1512.32 |    0.440 |    72.66 |    3.149 |
|  4096 |     32 |    2 |   8256 |    2.706 |  1513.77 |    0.644 |    99.37 |    3.350 |
|  4096 |     32 |    4 |  16512 |    2.707 |  1512.85 |    1.005 |   127.32 |    3.713 |
|  4096 |     32 |    8 |  33024 |    2.706 |  1513.43 |    1.682 |   152.16 |    4.389 |
|  8192 |     32 |    1 |   8224 |    5.975 |  1371.06 |    0.458 |    69.94 |    6.432 |
|  8192 |     32 |    2 |  16448 |    5.972 |  1371.63 |    0.674 |    95.02 |    6.646 |
|  8192 |     32 |    4 |  32896 |    5.972 |  1371.64 |    1.062 |   120.57 |    7.034 |
|  8192 |     32 |    8 |  65792 |    5.971 |  1371.89 |    1.792 |   142.88 |    7.763 |
| 16384 |     32 |    1 |  16416 |   14.195 |  1154.25 |    0.492 |    65.10 |   14.686 |
| 16384 |     32 |    2 |  32832 |   14.192 |  1154.45 |    0.735 |    87.08 |   14.927 |
| 16384 |     32 |    4 |  65664 |   14.192 |  1154.47 |    1.181 |   108.39 |   15.373 |
| 16384 |     32 |    8 | 131328 |   14.198 |  1153.95 |    2.015 |   127.04 |   16.213 |
| 32768 |     32 |    1 |  32800 |   37.492 |   873.99 |    0.560 |    57.10 |   38.053 |
| 32768 |     32 |    2 |  65600 |   37.522 |   873.29 |    0.861 |    74.33 |   38.383 |
| 32768 |     32 |    4 | 131200 |   37.491 |   874.02 |    1.416 |    90.40 |   38.907 |
| 32768 |     32 |    8 | 262400 |   37.535 |   873.01 |    2.486 |   102.99 |   40.020 |

---

qwen3 30b a3b master

main: n_kv_max = 264192, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 1, n_gpu_layers = -1, n_threads = 1, n_threads_batch = 1
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|
|     0 |     32 |    1 |     32 |    0.000 |     0.00 |    0.423 |    75.59 |    0.424 |
|     0 |     32 |    2 |     64 |    0.000 |     0.00 |    0.538 |   118.94 |    0.538 |
|     0 |     32 |    4 |    128 |    0.000 |     0.00 |    0.725 |   176.49 |    0.725 |
|     0 |     32 |    8 |    256 |    0.000 |     0.00 |    1.044 |   245.28 |    1.044 |
|  1024 |     32 |    1 |   1056 |    0.899 |  1139.40 |    0.455 |    70.25 |    1.354 |
|  1024 |     32 |    2 |   2112 |    0.715 |  1432.46 |    0.583 |   109.80 |    1.298 |
|  1024 |     32 |    4 |   4224 |    0.715 |  1431.71 |    0.824 |   155.37 |    1.539 |
|  1024 |     32 |    8 |   8448 |    0.718 |  1425.56 |    1.179 |   217.08 |    1.898 |
|  2048 |     32 |    1 |   2080 |    1.720 |  1190.55 |    0.488 |    65.61 |    2.208 |
|  2048 |     32 |    2 |   4160 |    1.448 |  1414.43 |    0.600 |   106.74 |    2.047 |
|  2048 |     32 |    4 |   8320 |    1.445 |  1416.90 |    0.884 |   144.80 |    2.329 |
|  2048 |     32 |    8 |  16640 |    1.443 |  1419.59 |    1.308 |   195.67 |    2.751 |
|  4096 |     32 |    1 |   4128 |    3.238 |  1264.88 |    0.550 |    58.17 |    3.788 |
|  4096 |     32 |    2 |   8256 |    3.180 |  1288.08 |    0.668 |    95.86 |    3.848 |
|  4096 |     32 |    4 |  16512 |    3.181 |  1287.63 |    1.005 |   127.32 |    4.186 |
|  4096 |     32 |    8 |  33024 |    3.258 |  1257.25 |    1.538 |   166.50 |    4.795 |
|  8192 |     32 |    1 |   8224 |    7.652 |  1070.55 |    0.671 |    47.66 |    8.323 |
|  8192 |     32 |    2 |  16448 |    7.591 |  1079.11 |    0.800 |    80.00 |    8.391 |
|  8192 |     32 |    4 |  32896 |    7.613 |  1076.02 |    1.300 |    98.49 |    8.913 |
|  8192 |     32 |    8 |  65792 |    7.668 |  1068.33 |    2.119 |   120.81 |    9.787 |
| 16384 |     32 |    1 |  16416 |   20.417 |   802.47 |    0.907 |    35.27 |   21.324 |
| 16384 |     32 |    2 |  32832 |   20.413 |   802.64 |    1.062 |    60.25 |   21.475 |
| 16384 |     32 |    4 |  65664 |   20.454 |   801.02 |    1.843 |    69.44 |   22.297 |
| 16384 |     32 |    8 | 131328 |   20.431 |   801.92 |    3.212 |    79.71 |   23.643 |
| 32768 |     32 |    1 |  32800 |   61.786 |   530.35 |    1.422 |    22.50 |   63.208 |
| 32768 |     32 |    2 |  65600 |   61.811 |   530.13 |    1.619 |    39.53 |   63.431 |
| 32768 |     32 |    4 | 131200 |   61.759 |   530.57 |    2.858 |    44.79 |   64.617 |
| 32768 |     32 |    8 | 262400 |   61.786 |   530.34 |    5.426 |    47.18 |   67.212 |


qwen3 30b a3b PR

main: n_kv_max = 264192, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 1, n_gpu_layers = -1, n_threads = 1, n_threads_batch = 1
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|
|     0 |     32 |    1 |     32 |    0.000 |     0.00 |    0.425 |    75.36 |    0.425 |
|     0 |     32 |    2 |     64 |    0.000 |     0.00 |    0.529 |   120.88 |    0.530 |
|     0 |     32 |    4 |    128 |    0.000 |     0.00 |    0.718 |   178.36 |    0.718 |
|     0 |     32 |    8 |    256 |    0.000 |     0.00 |    0.956 |   267.65 |    0.957 |
|  1024 |     32 |    1 |   1056 |    0.472 |  2168.84 |    0.458 |    69.87 |    0.930 |
|  1024 |     32 |    2 |   2112 |    0.449 |  2279.58 |    0.561 |   113.99 |    1.011 |
|  1024 |     32 |    4 |   4224 |    0.451 |  2269.84 |    0.791 |   161.74 |    1.243 |
|  1024 |     32 |    8 |   8448 |    0.449 |  2280.20 |    1.077 |   237.59 |    1.527 |
|  2048 |     32 |    1 |   2080 |    0.907 |  2259.22 |    0.456 |    70.24 |    1.362 |
|  2048 |     32 |    2 |   4160 |    0.904 |  2264.99 |    0.582 |   110.04 |    1.486 |
|  2048 |     32 |    4 |   8320 |    0.902 |  2269.60 |    0.824 |   155.39 |    1.726 |
|  2048 |     32 |    8 |  16640 |    0.902 |  2269.57 |    1.148 |   223.09 |    2.050 |
|  4096 |     32 |    1 |   4128 |    2.115 |  1936.44 |    0.478 |    67.00 |    2.593 |
|  4096 |     32 |    2 |   8256 |    2.111 |  1940.60 |    0.620 |   103.28 |    2.730 |
|  4096 |     32 |    4 |  16512 |    2.111 |  1940.11 |    0.892 |   143.49 |    3.003 |
|  4096 |     32 |    8 |  33024 |    2.111 |  1940.15 |    1.285 |   199.15 |    3.397 |
|  8192 |     32 |    1 |   8224 |    5.468 |  1498.08 |    0.528 |    60.55 |    5.997 |
|  8192 |     32 |    2 |  16448 |    5.464 |  1499.21 |    0.696 |    92.00 |    6.160 |
|  8192 |     32 |    4 |  32896 |    5.462 |  1499.84 |    1.031 |   124.19 |    6.493 |
|  8192 |     32 |    8 |  65792 |    5.469 |  1497.95 |    1.551 |   165.02 |    7.020 |
| 16384 |     32 |    1 |  16416 |   16.038 |  1021.58 |    0.627 |    51.01 |   16.665 |
| 16384 |     32 |    2 |  32832 |   16.041 |  1021.41 |    0.850 |    75.29 |   16.891 |
| 16384 |     32 |    4 |  65664 |   16.039 |  1021.53 |    1.306 |    97.99 |   17.345 |
| 16384 |     32 |    8 | 131328 |   16.037 |  1021.66 |    2.110 |   121.32 |   18.147 |
| 32768 |     32 |    1 |  32800 |   52.989 |   618.39 |    0.826 |    38.76 |   53.814 |
| 32768 |     32 |    2 |  65600 |   52.974 |   618.56 |    1.200 |    53.34 |   54.174 |
| 32768 |     32 |    4 | 131200 |   52.977 |   618.54 |    1.935 |    66.15 |   54.912 |
| 32768 |     32 |    8 | 262400 |   52.935 |   619.02 |    3.395 |    75.41 |   56.330 |

@github-actions github-actions bot added the examples, ggml (changes relating to the ggml tensor library for machine learning), and Apple Metal labels Aug 25, 2025
Base automatically changed from gg/metal-mmid-opt to master August 26, 2025 09:46
@ggerganov ggerganov force-pushed the gg/metal-fa-vec-opt-2 branch from aed06a9 to 6d0b222 Compare August 26, 2025 09:47
@ggerganov ggerganov merged commit b3964c1 into master Aug 26, 2025
55 of 56 checks passed
@ggerganov ggerganov deleted the gg/metal-fa-vec-opt-2 branch August 26, 2025 11:22
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 27, 2025
* metal : optmize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci