sync : llama.cpp #1344
Conversation
This commit adds two new command-line options to test-backend-ops.cpp that allow users to list all available GGML operations and to show test coverage of these operations. The motivation for this is that it can be useful to quickly see which operations are currently covered by tests and which are not. It might also be useful when using the `support` mode.
* CUDA: fastdiv, launch bounds for mmvq + q8_1 quant
* ggml WebGPU: remove userdata from request adapter callback

  This commit removes the `userdata` parameter from the WebGPU request adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function captures the `webgpu_context` directly. The motivation for this change is to simplify the code and improve readability.

* inline the callback lambda into the RequestAdapter call

  This commit removes the callback lambda variable and inlines it directly into the RequestAdapter call.
I think glslang will translate an access like `x[i][1].z` to `OpAccessChain ... x, i, 1, 2` / `OpLoad float16_t ...` rather than loading all of `x[i]` in a single `OpLoad`. Change the code to explicitly load the vector/matrix.
* ggml-cpu: clean up s390x simd
  (cherry picked from commit 0da4b6aa07d96b758812d17b2c82267632fa4ba5)
* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <[email protected]>
* CANN: Switch to stream synchronization

  Switch to stream synchronization because events are not effective.

* CANN: add comments

Co-authored-by: hipudding <[email protected]>
* ggml: allow casting between f32 and i32
* fix cuda
* add vulkan
* fix CPU non-cont
* add non-cont test case
* add note
* extend test number range
* correct note
* add cont version for vulkan
…s too large (llama/15868)

* cuda : fix supports_op condition for get_rows when src1->ne2 > 1
* ggml : add comment about ggml_get_rows
* cuda : add FIXME [no ci]
* cuda : update support condition
* vulkan: sort graph to allow more parallel execution

  Add a backend proc to allow the backend to modify the graph. The Vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes and doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
* CUDA: Add mul_mat_id support for mmf

  Add support for mul_mat_id for bs < 16

* Review: use warp_size, fix should_use_mmf condition
* Launch one block per expert, stride along n_expert_used
* templatize mul_mat_id
* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids
* Reduce compile times by dividing mmf into f16, bf16 and f32 variants
* Divide mmf by ncols_dst
* Add missing files
* Fix MUSA/HIP builds
…(issue 15846) (llama/15886)
* CANN: implement LRU cache for ACL graphs in CANN backend

  - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
  - Graphs are loaded on demand and evicted using an LRU policy when capacity is exceeded.
  - Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
  - Ensures reuse of graphs, reducing graph reconstruction overhead in the CANN backend.

* fix typo
* The LRU cache capacity can be configured via an env variable
* refactor acl graph
* refactor && fix review comments

Signed-off-by: noemotiovon <[email protected]>
* CANN: Add ROPE sin/cos cache for reuse

  Introduce a sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match.

  - Added sin_cache / cos_cache pointers and position_length tracking
  - Introduced cache validity flags and properties: (ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
  - Accelerates ROPE by eliminating repeated sin/cos generation

  This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency.

* fix typo

Signed-off-by: noemotiovon <[email protected]>
Co-authored-by: hipudding <[email protected]>
* tests : filter out no-ops from coverage report

  This commit is a follow-up to #15745 to address the feedback on how no-op operations should be filtered out from the coverage report.

  Regarding the feedback about the UNARY and GLU sub-operations not being handled, I'm not exactly sure what should be done. They are included in the coverage; for example ABS, ELU, EXP, GELU, GEGLU, GEGLU_ERF etc. are in the list of covered operations:

  ```console
  $ ./build/bin/test-backend-ops --show-coverage
  Operations covered by tests (89):
    ✓ ABS ✓ ACC ✓ ADD ✓ ADD1 ✓ ADD_ID ✓ ARANGE ✓ ARGMAX ✓ ARGSORT
    ✓ CLAMP ✓ CONCAT ✓ CONV_2D ✓ CONV_2D_DW ✓ CONV_3D ✓ CONV_TRANSPOSE_1D
    ✓ CONV_TRANSPOSE_2D ✓ COS ✓ COUNT_EQUAL ✓ CPY ✓ CROSS_ENTROPY_LOSS
    ✓ CROSS_ENTROPY_LOSS_BACK ✓ DIAG_MASK_INF ✓ DIV ✓ DUP ✓ ELU ✓ EXP
    ✓ FLASH_ATTN_EXT ✓ GATED_LINEAR_ATTN ✓ GEGLU ✓ GEGLU_ERF ✓ GEGLU_QUICK
    ✓ GELU ✓ GELU_ERF ✓ GELU_QUICK ✓ GET_ROWS ✓ GET_ROWS_BACK ✓ GROUP_NORM
    ✓ HARDSIGMOID ✓ HARDSWISH ✓ IM2COL ✓ IM2COL_3D ✓ L2_NORM ✓ LEAKY_RELU
    ✓ LOG ✓ MEAN ✓ MUL ✓ MUL_MAT ✓ MUL_MAT_ID ✓ NEG ✓ NORM ✓ OPT_STEP_ADAMW
    ✓ OPT_STEP_SGD ✓ OUT_PROD ✓ PAD ✓ PAD_REFLECT_1D ✓ POOL_2D ✓ REGLU ✓ RELU
    ✓ REPEAT ✓ REPEAT_BACK ✓ RMS_NORM ✓ RMS_NORM_BACK ✓ ROLL ✓ ROPE ✓ ROPE_BACK
    ✓ RWKV_WKV6 ✓ RWKV_WKV7 ✓ SCALE ✓ SET ✓ SET_ROWS ✓ SGN ✓ SIGMOID ✓ SILU
    ✓ SILU_BACK ✓ SIN ✓ SOFT_MAX ✓ SOFT_MAX_BACK ✓ SQR ✓ SQRT ✓ SSM_CONV
    ✓ SSM_SCAN ✓ STEP ✓ SUB ✓ SUM ✓ SUM_ROWS ✓ SWIGLU ✓ SWIGLU_OAI ✓ TANH
    ✓ TIMESTEP_EMBEDDING ✓ UPSCALE

  Operations without tests (14):
    ✗ ADD_REL_POS ✗ CUSTOM ✗ DIAG ✗ DIAG_MASK_ZERO ✗ FLASH_ATTN_BACK
    ✗ GET_REL_POS ✗ IM2COL_BACK ✗ MAP_CUSTOM1 ✗ MAP_CUSTOM2 ✗ MAP_CUSTOM3
    ✗ POOL_1D ✗ POOL_2D_BACK ✗ WIN_PART ✗ WIN_UNPART

  Coverage Summary:
    Total operations: 103
    Tested operations: 89
    Untested operations: 14
    Coverage: 86.4%
  ```

  Refs: ggml-org/llama.cpp#15745

* use of ggml_op enum values instead of strcmp
* CANN: Fix ggml_cann_set_device to avoid redundant device switches

  - Added a check to skip aclrtSetDevice if the current device is already set.
  - Prevents unnecessary context switches while keeping thread/device consistency.

* CANN: add device default id
* remove unsupported vulkan devices
* make this happen during selection instead
* pass by reference
…a/16018)

* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant" (reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166)
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests

Co-authored-by: Neha Abbas <[email protected]>
Signed-off-by: noemotiovon <[email protected]>
* metal : improve F32, F16 and BF16 mat-vec multiplication
* metal : make the NSG a function constant in mul_mv kernels
* metal : use function constants for mul_mv_ext kernels
* metal : remove NW template argument
* metal : adjust constants
* CUDA: Optimize PAD_REFLECT_1D

  feat: add more test cases for PAD_REFLECT_1D

* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
* Apply suggestion from JohannesGaessler
* optimize
* use a concise expression to further speed up the cuda kernel

Co-authored-by: Johannes Gäßler <[email protected]>
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations

Co-authored-by: Li He <[email protected]>
When compiling with GGML_STATIC=ON, the build process would produce a binary that was still dynamically linked to OpenMP. This defeats the purpose of a static build:

```console
$ cmake -B build \
    -DBUILD_SHARED_LIBS=OFF \
    -DLLAMA_CURL=OFF \
    -DGGML_CCACHE=OFF \
    -DGGML_NATIVE=OFF \
    -DGGML_STATIC=ON

$ ldd llama-server
        linux-vdso.so.1 (0x0000e1a434e3b000)
        libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
        /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)
```

This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES` to prioritize `.a` files, forcing CMake to link the static version of the library.

Signed-off-by: Adrien Gallouët <[email protected]>
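The `CMAKE_FIND_LIBRARY_SUFFIXES` technique described above can be sketched like this (a hedged illustration of the standard CMake pattern, not the exact change in the commit):

```cmake
# Prefer static archives when GGML_STATIC is requested, so that
# find_package(OpenMP) resolves libgomp.a instead of libgomp.so.
if (GGML_STATIC)
    set(CMAKE_FIND_LIBRARY_SUFFIXES ".a" ${CMAKE_FIND_LIBRARY_SUFFIXES})
endif()

find_package(OpenMP)
```

Prepending `.a` rather than replacing the list keeps shared-library fallback intact for dependencies that ship no static archive.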
Generalize the Linux check to `__linux__` to support non-glibc systems (like musl). Also, return `false` on unknown/untested OSes. Without this commit, the code compiles (with warnings) but fails:

```console
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
....
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned  = unknown
print_info: model type      = 4B
Illegal instruction (core dumped)
```

Signed-off-by: Adrien Gallouët <[email protected]>
* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
```cpp
// See https://github.com/KhronosGroup/Vulkan-Hpp?tab=readme-ov-file#extensions--per-device-function-pointers-
#define VULKAN_HPP_DISPATCH_LOADER_DYNAMIC 1
```
I think this causes an error in vcpkg CI now, with Android and static library linkage:
```
ld.lld: error: duplicate symbol: vk::detail::defaultDispatchLoaderDynamic
>>> defined at test-cmake.cpp:6 (/mnt/vcpkg-ci/b/vcpkg-ci-ggml/src/e98bd99269-a8d27b3c76.clean/examples/test-cmake/test-cmake.cpp:6)
>>> CMakeFiles/test-cmake.dir/test-cmake.cpp.o:(vk::detail::defaultDispatchLoaderDynamic)
>>> defined at ggml-vulkan.cpp:14 (/mnt/vcpkg-ci/b/ggml/src/v0.9.1-72a0cec7be.clean/src/ggml-vulkan/ggml-vulkan.cpp:14)
>>> ggml-vulkan.cpp.o:(.data._ZN2vk6detail28defaultDispatchLoaderDynamicE+0x0) in archive /mnt/vcpkg-ci/installed/arm64-android/debug/lib/libggml-vulkan.a
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
```
So ggml now exposes a symbol from the `vk` namespace.

I'm not a Vulkan expert. Is this the right thing to do in a library under C++ ODR rules? In a ggml test port in vcpkg this is done in the "executable" for Android. AFAIU the final executable is the only place to do it reliably with regard to the ODR.

(Admittedly this is a linker error about duplicate symbols, not a compiler error about ODR.)
FTR the test port builds the example executable with this change:
```diff
diff --git a/examples/test-cmake/CMakeLists.txt b/examples/test-cmake/CMakeLists.txt
index d6bc0cc4..395a63c7 100644
--- a/examples/test-cmake/CMakeLists.txt
+++ b/examples/test-cmake/CMakeLists.txt
@@ -8,3 +8,9 @@ find_package(ggml CONFIG REQUIRED)
 set(TEST_TARGET test-cmake)
 add_executable(test-cmake test-cmake.cpp)
 target_link_libraries(test-cmake PRIVATE ggml::ggml)
+
+if(ANDROID AND TARGET ggml::ggml-vulkan)
+    # Instantiates VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE
+    find_package(Vulkan REQUIRED)
+    target_link_libraries(test-cmake PRIVATE Vulkan::Vulkan)
+endif()
diff --git a/examples/test-cmake/test-cmake.cpp b/examples/test-cmake/test-cmake.cpp
index 029c8898..9d4bbe19 100644
--- a/examples/test-cmake/test-cmake.cpp
+++ b/examples/test-cmake/test-cmake.cpp
@@ -1,5 +1,11 @@
 #include "ggml-backend.h"
+#if defined(ANDROID)
+#define VULKAN_HPP_DISPATCH_LOADER_DYNAMIC 1
+#include <vulkan/vulkan.hpp>
+VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE
+#endif
+
 int main(void) {
     ggml_backend_load_all();
     return 0;
```
The problem is on the vcpkg side, because on Android we force VULKAN_HPP_DISPATCH_LOADER_DYNAMIC and VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE while compiling with Vulkan. This creates the duplicate symbols. I fixed it in this PR: microsoft/vcpkg#47464
If a second lib does the same, we are back at the start. That's why it cannot be in the libs.
You should have mentioned that in the PR when I asked for a solution, and elaborated more. Now I understand.
```cpp
// See https://github.com/KhronosGroup/Vulkan-Hpp?tab=readme-ov-file#extensions--per-device-function-pointers-
VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE
```
Actually, this is the implementation.