Conversation

danbev
Member

@danbev danbev commented Sep 22, 2025

This PR consists of three commits, but the first two are part of #16004.

This commit adds support for testing the ggml-cpu repack feature,
which enables repacking quantized data into a more optimal layout
for matrix multiplication on specific hardware architectures.

The motivation is to enable testing a CPU backend that uses
repacked data against a reference CPU backend that does not.

Building:

```console
$ cmake -B build \
    -DGGML_CPU_REF_BACKEND=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_CPU_ALL_VARIANTS=ON
```

List available CPU architectures/variants:

```console
$ ./build/bin/test-backend-ops cpu-variants --list
CPU variants:
  CPU-haswell     - 12th Gen Intel(R) Core(TM) i7-1260P
  CPU-sse42       - 12th Gen Intel(R) Core(TM) i7-1260P
  CPU-x64         - 12th Gen Intel(R) Core(TM) i7-1260P
  CPU-alderlake   - 12th Gen Intel(R) Core(TM) i7-1260P
  CPU-sandybridge - 12th Gen Intel(R) Core(TM) i7-1260P
```

Run tests:

```console
$ ./build/bin/test-backend-ops cpu-variants \
    --variant CPU-alderlake \
    -o "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"

Testing CPU variant 'CPU-alderlake' against cpu-ref backend...

repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  14491/14491 tests passed
```

All matrix multiplication tests can be run by specifying `-o "MUL_MAT,MUL_MAT_ID"`.
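For context, "repacking" here means rearranging the blocks of a quantized tensor so that a matmul kernel can process several rows with contiguous loads. Below is a simplified, hypothetical sketch of the idea (the block type and the actual q4_0_8x8 layout in ggml differ):

```cpp
#include <cstdint>

// Hypothetical q4_0-like block: one scale plus 32 packed 4-bit weights.
// Illustration only; this is not ggml's block_q4_0 or its 8x8 layout.
struct blk { uint16_t d; uint8_t qs[16]; };

// Interleave groups of 8 rows block-by-block so that the k-th blocks of
// 8 consecutive rows end up adjacent in memory.
void repack_8_rows(const blk * src, blk * dst, int rows, int blocks_per_row) {
    for (int r0 = 0; r0 + 8 <= rows; r0 += 8) {
        for (int b = 0; b < blocks_per_row; ++b) {
            for (int r = 0; r < 8; ++r) {
                *dst++ = src[(r0 + r) * blocks_per_row + b];
            }
        }
    }
}
```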

This commit introduces a CPU reference implementation for GGML,
designed primarily for testing and validation purposes.

The motivation for this addition is to have a pure C CPU backend
implementation that does not use any hardware-specific optimizations
or intrinsics. This allows testing the CPU backend variants
against the reference implementation to ensure correctness.

This commit adds support for testing the ggml-cpu repack feature,
which enables repacking quantized data into a more optimal layout
for matrix multiplication on specific hardware architectures.

The motivation is to enable testing a CPU backend that uses
repacked data against a reference CPU backend that does not.

Building:
```console
$ cmake -B build \
    -DGGML_CPU_REF_BACKEND=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_CPU_ALL_VARIANTS=ON
```

List available CPU architectures/variants:
```console
$ ./build/bin/test-backend-ops cpu-variants --list
CPU variants:
  CPU-alderlake   - 12th Gen Intel(R) Core(TM) i7-1260P
```
Run tests:
```console
$ ./build-ref/bin/test-backend-ops cpu-variants \
    --variant CPU-alderlake \
    -o "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"

Testing CPU variant 'CPU-alderlake' against cpu-ref backend...

repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
repack: repack tensor a with q4_0_8x8
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  14491/14491 tests passed
```
All matrix multiplication tests can be run by specifying
`-o "MUL_MAT"`, but it may be harder to spot the ones that use repacking.
@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 22, 2025
@danbev danbev marked this pull request as ready for review September 24, 2025 08:37
@danbev danbev requested a review from slaren as a code owner September 24, 2025 08:37
@ggerganov
Member

For some reason, the reference CPU backend does not seem to be using the generic implementations. Here is what I do on Mac:

```console
cmake .. -DGGML_CPU_REF_BACKEND=ON -DGGML_BACKEND_DL=ON -DGGML_BLAS=OFF -DGGML_METAL=OFF
make -j && ./bin/test-backend-ops cpu-variants --list

load_backend: loaded CPU backend from llama.cpp/build-cpu-ref/bin/libggml-cpu.so
load_backend: loaded CPU backend from llama.cpp/build-cpu-ref/bin/libggml-cpu-ref.so
CPU variants:
  CPU             - Apple M4 Max

make -j && ./bin/test-backend-ops cpu-variants --variant CPU -o MUL_MAT -p "q4_0"
```

If I put prints like this, I see only the Arm implementation being executed:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index aadbb487e..fff646064 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -138,6 +138,7 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
 //===================================== Dot products =================================
 
 void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("arm\n");
     const int qk = QK8_0;
     const int nb = n / qk;
 
diff --git a/ggml/src/ggml-cpu/quants.c b/ggml/src/ggml-cpu/quants.c
index 365cb36d2..3694ae145 100644
--- a/ggml/src/ggml-cpu/quants.c
+++ b/ggml/src/ggml-cpu/quants.c
@@ -113,6 +113,7 @@ void quantize_row_q8_K_generic(const float * GGML_RESTRICT x, void * GGML_RESTRI
 //===================================== Dot products =================================
 
 void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("generic\n");
     const int qk = QK8_0;
     const int nb = n / qk;
```

@danbev
Member Author

danbev commented Sep 24, 2025

@ggerganov Thanks for trying this out. I'll take a closer look shortly 👍

With the fix in 22ef44d I was able to see both arm and generic output, though I did have to add the generic printf to ggml-quants.c:

```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/quants.c b/ggml/src/ggml-cpu/arch/arm/quants.c
index aadbb487e..689148ae7 100644
--- a/ggml/src/ggml-cpu/arch/arm/quants.c
+++ b/ggml/src/ggml-cpu/arch/arm/quants.c
@@ -138,6 +138,7 @@ void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, in
 //===================================== Dot products =================================
 
 void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
+    printf("arm...\n");
     const int qk = QK8_0;
     const int nb = n / qk;
 
diff --git a/ggml/src/ggml-quants.c b/ggml/src/ggml-quants.c
index 727932123..058b0f6a6 100644
--- a/ggml/src/ggml-quants.c
+++ b/ggml/src/ggml-quants.c
@@ -197,6 +197,7 @@ void quantize_row_q5_1_ref(const float * GGML_RESTRICT x, block_q5_1 * GGML_REST
 
 // reference implementation for deterministic creation of model files
 void quantize_row_q8_0_ref(const float * GGML_RESTRICT x, block_q8_0 * GGML_RESTRICT y, int64_t k) {
+    printf("generic....\n");
     assert(k % QK8_0 == 0);
     const int nb = k / QK8_0;
```

I've also stepped through this in the debugger:
```console
(lldb) target create "build/bin/test-backend-ops"
Current executable set to '/Users/danbev/work/ai/llama.cpp/build/bin/test-backend-ops' (arm64).
(lldb) settings set -- target.run-args  "cpu-variants" "--variant" "CPU" "-o" "MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1)"
(lldb) br set -f ggml-backend.cpp -l 2035
Breakpoint 1: where = libggml-base.dylib`::ggml_backend_compare_graph_backend(ggml_backend_t, ggml_backend_t, ggml_cgraph *, ggml_backend_eval_callback, void *, ggml_tensor *) + 668 at ggml-backend.cpp:2035:40, address = 0x0000000000022734
(lldb) r
Process 58647 launched: '/Users/danbev/work/ai/llama.cpp/build/bin/test-backend-ops' (arm64)
ggml_backend_load_best: failed to find ggml_backend_score in /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu-ref.so
load_backend: loaded CPU backend from /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3)
load_backend: loaded CPU backend from /Users/danbev/work/ai/llama.cpp/build/bin/libggml-cpu-ref.so
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU-ref (Apple M3)
Testing CPU variant 'CPU' against cpu-ref backend...

Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x000000010041a734 libggml-base.dylib`ggml_backend_compare_graph_backend(backend1=0x00006000014a8090, backend2=0x00006000014a8000, graph=0x0000000100224020, callback=(test-backend-ops`test_case::eval(ggml_backend*, ggml_backend*, char const*, printer*)::'lambda'(int, ggml_tensor*, ggml_tensor*, void*)::__invoke(int, ggml_tensor*, ggml_tensor*, void*) at test-backend-ops.cpp:1204), user_data=0x000000016fdfddb8, test_node=0x0000000000000000) at ggml-backend.cpp:2035:40
   2032	            struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
   2033	            struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
   2034
-> 2035	            ggml_backend_graph_compute(backend1, &g1v);
   2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
   2038	            if (ggml_is_view_op(t1->op)) {
Target 0: (test-backend-ops) stopped.
(lldb) n
arm...
arm...
arm...
(... "arm..." repeated, 64 lines in total ...)
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010041a740 libggml-base.dylib`ggml_backend_compare_graph_backend(...) at ggml-backend.cpp:2036:40
   2033	            struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);
   2034
   2035	            ggml_backend_graph_compute(backend1, &g1v);
-> 2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
   2038	            if (ggml_is_view_op(t1->op)) {
   2039	                continue;
Target 0: (test-backend-ops) stopped.
(lldb) n
generic....
generic....
generic....
(... "generic...." repeated, 64 lines in total ...)
Process 58647 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010041a748 libggml-base.dylib`ggml_backend_compare_graph_backend(...) at ggml-backend.cpp:2038:33
   2035	            ggml_backend_graph_compute(backend1, &g1v);
   2036	            ggml_backend_graph_compute(backend2, &g2v);
   2037
-> 2038	            if (ggml_is_view_op(t1->op)) {
   2039	                continue;
   2040	            }
   2041
Target 0: (test-backend-ops) stopped.
```

I realized that this was because I had not disabled GGML_LLAMAFILE when building. Setting this to OFF then printed the output from the prints you added. I'm looking into disabling this in the build now. I've disabled LLAMAFILE, HBM, OpenMP, and KleidiAI for the CPU reference backend now.

Set GGML_SYSTEM_ARCH to cpu-ref when GGML_CPU_REF_BACKEND is enabled
to force a generic backend implementation.
Disable HBM, OpenMP, KleidiAI for the CPU ref backend.
Comment on lines 1992 to 1994
```cpp
// A problem arises in `ggml_backend_compare_graph_backend` where the graph
// is copied, and ggml_backend_buffer_repack_buffer_type does not implement
// the get_tensor function, but even if it did it would return the repacked data
```
Member

The solution for this would be to use the CPU reference implementation as the main backend to copy data from, rather than the tested backend. Basically, the order of the backends in test-backend-ops should be swapped. This would also solve a problem where if the backend has a buggy buffer implementation, it can lead to corrupted tensor data when the data is copied from the original backend to the CPU backend. This caused issues before when testing some backends.
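As a rough illustration of the suggestion (simplified from the ggml_backend_compare_graph_backend excerpt visible in the debugger session above; the variable names ref_backend, test_backend, g1, and g2 are assumptions), putting the reference backend first makes its buffers the source of truth that data is copied from:

```cpp
// Sketch: compare the graphs node-by-node, with the reference backend as
// backend1 so the original tensor data lives in its buffers and is copied
// to the backend under test, not the other way around.
for (int i = 0; i < ggml_graph_n_nodes(g1); i++) {
    struct ggml_cgraph g1v = ggml_graph_view(g1, i, i + 1);
    struct ggml_cgraph g2v = ggml_graph_view(g2, i, i + 1);

    ggml_backend_graph_compute(ref_backend,  &g1v); // reference result
    ggml_backend_graph_compute(test_backend, &g2v); // backend under test

    // ... read both node outputs back with ggml_backend_tensor_get and compare ...
}
```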

Member Author

Oh nice, I'll give that a try, thanks!

Member

I'll take a look at these changes later today.

@danbev
Member Author

danbev commented Sep 25, 2025

@slaren Thanks for your reviews!
I've tried to apply them. I've changed the order of the backends, which works, but I'm trying to figure out how I can get the extra buffer types (repack) to work with this solution. I'll continue looking into it, but wanted to let you know so you won't be surprised if you try it out.

@slaren slaren left a comment
Member

Strictly speaking, this is not enough to ensure that only the C implementation is used in the CPU-ref backend, because the base ABI may include some vector instructions. For example, the Linux AMD64 ABI includes SSE2, and if you add a feature flag for this instruction set, it will show up in the CPU-ref backend:

```diff
diff --git a/ggml/include/ggml-cpu.h b/ggml/include/ggml-cpu.h
index 9edd48513..004acea31 100644
--- a/ggml/include/ggml-cpu.h
+++ b/ggml/include/ggml-cpu.h
@@ -75,6 +75,7 @@ extern "C" {
     //
 
     // x86
+    GGML_BACKEND_API int ggml_cpu_has_sse2       (void);
     GGML_BACKEND_API int ggml_cpu_has_sse3       (void);
     GGML_BACKEND_API int ggml_cpu_has_ssse3      (void);
     GGML_BACKEND_API int ggml_cpu_has_avx        (void);
diff --git a/ggml/src/ggml-cpu/ggml-cpu.c b/ggml/src/ggml-cpu/ggml-cpu.c
index c13129084..06c8f54df 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.c
+++ b/ggml/src/ggml-cpu/ggml-cpu.c
@@ -3428,6 +3428,14 @@ int ggml_cpu_has_llamafile(void) {
 #endif
 }
 
+int ggml_cpu_has_sse2(void) {
+#if defined(__SSE2__)
+    return 1;
+#else
+    return 0;
+#endif
+}
+
 int ggml_cpu_has_sse3(void) {
 #if defined(__SSE3__)
     return 1;
diff --git a/ggml/src/ggml-cpu/ggml-cpu.cpp b/ggml/src/ggml-cpu/ggml-cpu.cpp
index 90dd65c41..6f42b0a8f 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.cpp
+++ b/ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -506,6 +506,9 @@ static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r
         ggml_cpu_init();
 
         std::vector<ggml_backend_feature> features;
+        if (ggml_cpu_has_sse2()) {
+            features.push_back({ "SSE2", "1" });
+        }
         if (ggml_cpu_has_sse3()) {
             features.push_back({ "SSE3", "1" });
         }
```

Other than that, I think this could be good enough for a first version. To be useful as part of a CI test, however, test-backend-ops will need to be modified to be able to load every CPU backend variant supported by the system, not just the one selected by ggml_backend_load_best.

This commit adds a function to load all CPU backend variants from
dynamic libraries.

The motivation for this is that currently only the "best" CPU backend is
shown when listing available backends. This change allows users to see
all available CPU variants, and then test them individually.
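A minimal sketch of what such loading could look like (assuming the public ggml_backend_load() API and the usual libggml-cpu-* library naming; the actual commit may differ):

```cpp
#include <filesystem>
#include <string>

#include "ggml-backend.h"

// Load every CPU variant library found in `dir`, not just the best-scoring
// one picked by ggml_backend_load_best(). Each successful load registers
// another backend that test-backend-ops can then enumerate and test.
static void load_all_cpu_variants(const std::string & dir) {
    for (const auto & entry : std::filesystem::directory_iterator(dir)) {
        const std::string name = entry.path().filename().string();
        if (name.find("ggml-cpu") != std::string::npos) {
            ggml_backend_load(entry.path().string().c_str());
        }
    }
}
```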
@danbev
Member Author

danbev commented Sep 26, 2025

> Other than that, I think this could be good enough for a first version.

Do you have any suggestions on how to also avoid compiler optimizations due to the availability of SSE2, or is it perhaps enough that the cpu-ref backend does not use any explicit SIMD operations?

@slaren
Member

slaren commented Sep 26, 2025

Compiler optimizations are not important; we are trying to test our code, not the compiler. The calling convention of AMD64 may actually require using SSE registers, so it may not be possible to disable SSE completely. It may be possible to do something like this:

```c
#ifdef GGML_CPU_GENERIC
#undef __SSE2__
#endif
```
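To illustrate why that works (a hypothetical feature-gated file, not an actual ggml source): undefining __SSE2__ before the dispatch means the scalar branch is compiled even on x86-64, where the compiler otherwise always predefines the macro:

```cpp
#ifdef GGML_CPU_GENERIC
#undef __SSE2__   // hide the baseline-ISA macro from the checks below
#endif

#if defined(__SSE2__)
// vectorized implementation would be selected here
#else
// generic scalar C implementation is selected instead
#endif
```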

This commit contains an experimental hack to try to force the usage
of the repack buffer if it is available on the backend device.

The sole purpose of this is to get some feedback from others and
hopefully get help coming up with a proper way to do this.
@danbev
Member Author

danbev commented Sep 26, 2025

@slaren When you get a chance, would you be able to help me with the repacking?
In caa91b5 I've been able to force this to work, but I'm not sure what the best way to do this is.
