sync : llama.cpp #1344
Conversation
This commit adds two new command-line options to test-backend-ops.cpp that allow users to list all available GGML operations and to show test coverage of these operations. The motivation for this is that it can be useful to quickly see which operations are currently covered by tests and which are not. It might also be useful when using the `support` mode.
* CUDA: fastdiv, launch bounds for mmvq + q8_1 quant
* ggml WebGPU: remove userdata from request adapter callback

  This commit removes the `userdata` parameter from the WebGPU request adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function captures the `webgpu_context` directly. The motivation for this change is to simplify the code and improve readability.

* inline the callback lambda into the RequestAdapter call

  This commit removes the callback lambda variable and inlines it directly into the RequestAdapter call.
I think glslang will translate an access like `x[i][1].z` to `OpAccessChain ... x, i, 1, 2` / `OpLoad float16_t ...` rather than loading all of `x[i]` in a single `OpLoad`. Change the code to explicitly load the vector/matrix.
* ggml-cpu: clean up s390x simd
  (cherry picked from commit 0da4b6aa07d96b758812d17b2c82267632fa4ba5)
* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <[email protected]>
* CANN: Switch to stream synchronization

  Switch to stream synchronization because events are not effective.

* CANN: add comments

Co-authored-by: hipudding <[email protected]>
* ggml: allow casting between f32 and i32
* fix cuda
* add vulkan
* fix CPU non-cont
* add non-cont test case
* add note
* extend test number range
* correct note
* add cont version for vulkan
…s too large (llama/15868)

* cuda : fix supports_op condition for get_rows when src1->ne2 > 1
* ggml : add comment about ggml_get_rows
* cuda : add FIXME [no ci]
* cuda : update support condition
* vulkan: sort graph to allow more parallel execution

  Add a backend proc to allow the backend to modify the graph. The Vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes and doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
* CUDA: Add mul_mat_id support for mmf

  Add support for mul_mat_id for bs < 16

* Review: use warp_size, fix should_use_mmf condition
* Launch one block per expert, stride along n_expert_used
* templatize mul_mat_id
* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids
* Reduce compile times by dividing mmf into f16, bf16 and f32 variants
* Divide mmf by ncols_dst
* Add missing files
* Fix MUSA/HIP builds
…(issue 15846) (llama/15886)
* CANN: implement LRU cache for ACL graphs in CANN backend

  - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
  - Graphs are loaded on demand and evicted using an LRU policy when capacity is exceeded.
  - Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
  - Ensures reuse of graphs, reducing graph reconstruction overhead in the CANN backend.

* fix typo
* The LRU cache capacity can be configured via an env variable
* refactor acl graph
* refactor && fix review comments

Signed-off-by: noemotiovon <[email protected]>
* CANN: Add ROPE sin/cos cache for reuse

  Introduce a sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match.

  - Added sin_cache / cos_cache pointers and position_length tracking
  - Introduced cache validity flags and properties: (ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
  - Accelerates ROPE by eliminating repeated sin/cos generation

  This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency.

* fix typo

Signed-off-by: noemotiovon <[email protected]>
Co-authored-by: hipudding <[email protected]>
* tests : filter out no-ops from coverage report

  This commit is a follow-up to #15745 to address the feedback on how no-op operations should be filtered out from the coverage report.

  Regarding the feedback about the UNARY and GLU sub-operations not being handled, I'm not exactly sure what should be done. They are included in the coverage; for example ABS, ELU, EXP, GELU, GEGLU, GEGLU_ERF etc. are in the list of covered operations:

  ```console
  $ ./build/bin/test-backend-ops --show-coverage
  Operations covered by tests (89):
    ✓ ABS ✓ ACC ✓ ADD ✓ ADD1 ✓ ADD_ID ✓ ARANGE ✓ ARGMAX ✓ ARGSORT
    ✓ CLAMP ✓ CONCAT ✓ CONV_2D ✓ CONV_2D_DW ✓ CONV_3D ✓ CONV_TRANSPOSE_1D
    ✓ CONV_TRANSPOSE_2D ✓ COS ✓ COUNT_EQUAL ✓ CPY ✓ CROSS_ENTROPY_LOSS
    ✓ CROSS_ENTROPY_LOSS_BACK ✓ DIAG_MASK_INF ✓ DIV ✓ DUP ✓ ELU ✓ EXP
    ✓ FLASH_ATTN_EXT ✓ GATED_LINEAR_ATTN ✓ GEGLU ✓ GEGLU_ERF ✓ GEGLU_QUICK
    ✓ GELU ✓ GELU_ERF ✓ GELU_QUICK ✓ GET_ROWS ✓ GET_ROWS_BACK ✓ GROUP_NORM
    ✓ HARDSIGMOID ✓ HARDSWISH ✓ IM2COL ✓ IM2COL_3D ✓ L2_NORM ✓ LEAKY_RELU
    ✓ LOG ✓ MEAN ✓ MUL ✓ MUL_MAT ✓ MUL_MAT_ID ✓ NEG ✓ NORM ✓ OPT_STEP_ADAMW
    ✓ OPT_STEP_SGD ✓ OUT_PROD ✓ PAD ✓ PAD_REFLECT_1D ✓ POOL_2D ✓ REGLU ✓ RELU
    ✓ REPEAT ✓ REPEAT_BACK ✓ RMS_NORM ✓ RMS_NORM_BACK ✓ ROLL ✓ ROPE ✓ ROPE_BACK
    ✓ RWKV_WKV6 ✓ RWKV_WKV7 ✓ SCALE ✓ SET ✓ SET_ROWS ✓ SGN ✓ SIGMOID ✓ SILU
    ✓ SILU_BACK ✓ SIN ✓ SOFT_MAX ✓ SOFT_MAX_BACK ✓ SQR ✓ SQRT ✓ SSM_CONV
    ✓ SSM_SCAN ✓ STEP ✓ SUB ✓ SUM ✓ SUM_ROWS ✓ SWIGLU ✓ SWIGLU_OAI ✓ TANH
    ✓ TIMESTEP_EMBEDDING ✓ UPSCALE

  Operations without tests (14):
    ✗ ADD_REL_POS ✗ CUSTOM ✗ DIAG ✗ DIAG_MASK_ZERO ✗ FLASH_ATTN_BACK
    ✗ GET_REL_POS ✗ IM2COL_BACK ✗ MAP_CUSTOM1 ✗ MAP_CUSTOM2 ✗ MAP_CUSTOM3
    ✗ POOL_1D ✗ POOL_2D_BACK ✗ WIN_PART ✗ WIN_UNPART

  Coverage Summary:
    Total operations: 103
    Tested operations: 89
    Untested operations: 14
    Coverage: 86.4%
  ```

  Refs: ggml-org/llama.cpp#15745

* use of ggml_op enum values instead of strcmp
* CANN: Fix ggml_cann_set_device to avoid redundant device switches

  - Added a check to skip aclrtSetDevice if the current device is already set.
  - Prevents unnecessary context switches while keeping thread/device consistency.

* CANN: add device default id
* remove unsupported vulkan devices
* make this happen during selection instead
* pass by reference
…a/16018)

* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant" (reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166)
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests

Co-authored-by: Neha Abbas <[email protected]>
Signed-off-by: noemotiovon <[email protected]>
* metal : improve F32, F16 and BF16 mat-vec multiplication
* metal : make the NSG a function constant in mul_mv kernels
* metal : use function constants for mul_mv_ext kernels
* metal : remove NW template argument
* metal : adjust constants
* CUDA: Optimize PAD_REFLECT_1D

  feat: add more test cases for PAD_REFLECT_1D

* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
* Apply suggestion from JohannesGaessler
* optimize
* use a concise expression to further speed up the cuda kernel

Co-authored-by: Johannes Gäßler <[email protected]>
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations

Co-authored-by: Li He <[email protected]>
When compiling with GGML_STATIC=ON, the build process would produce a binary that was still dynamically linked to OpenMP. This defeats the purpose of a static build:

```console
$ cmake -B build \
    -DBUILD_SHARED_LIBS=OFF \
    -DLLAMA_CURL=OFF \
    -DGGML_CCACHE=OFF \
    -DGGML_NATIVE=OFF \
    -DGGML_STATIC=ON

$ ldd llama-server
        linux-vdso.so.1 (0x0000e1a434e3b000)
        libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
        /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)
```

This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES` to prioritize `.a` files, forcing CMake to link the static version of the library.

Signed-off-by: Adrien Gallouët <[email protected]>
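The `CMAKE_FIND_LIBRARY_SUFFIXES` technique described above can be sketched like this (a hedged illustration of the standard CMake pattern, not the exact change in the commit):

```cmake
# Prefer static archives when GGML_STATIC is requested, so that
# find_package(OpenMP) resolves libgomp.a instead of libgomp.so.
if (GGML_STATIC)
    set(CMAKE_FIND_LIBRARY_SUFFIXES ".a" ${CMAKE_FIND_LIBRARY_SUFFIXES})
endif()

find_package(OpenMP)
```

Prepending `.a` rather than replacing the list keeps shared-library fallback intact for dependencies that ship no static archive.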
Generalize the Linux check to `__linux__` to support non-glibc systems (like musl). Also, return `false` on unknown/untested OSes. Without this commit, the code compiles (with warnings) but fails:

```console
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
....
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned  = unknown
print_info: model type      = 4B
Illegal instruction (core dumped)
```

Signed-off-by: Adrien Gallouët <[email protected]>
* ggml : refactor forward_dup for cpu backend
* clean up a bit
* add quant/dequant perf test
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
```cpp
// See https://github.com/KhronosGroup/Vulkan-Hpp?tab=readme-ov-file#extensions--per-device-function-pointers-
#define VULKAN_HPP_DISPATCH_LOADER_DYNAMIC 1
```
I think this causes an error in vcpkg CI now, with Android and static library linkage:
```
ld.lld: error: duplicate symbol: vk::detail::defaultDispatchLoaderDynamic
>>> defined at test-cmake.cpp:6 (/mnt/vcpkg-ci/b/vcpkg-ci-ggml/src/e98bd99269-a8d27b3c76.clean/examples/test-cmake/test-cmake.cpp:6)
>>> CMakeFiles/test-cmake.dir/test-cmake.cpp.o:(vk::detail::defaultDispatchLoaderDynamic)
>>> defined at ggml-vulkan.cpp:14 (/mnt/vcpkg-ci/b/ggml/src/v0.9.1-72a0cec7be.clean/src/ggml-vulkan/ggml-vulkan.cpp:14)
>>> ggml-vulkan.cpp.o:(.data._ZN2vk6detail28defaultDispatchLoaderDynamicE+0x0) in archive /mnt/vcpkg-ci/installed/arm64-android/debug/lib/libggml-vulkan.a
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
```
So ggml now exposes a symbol from the `vk` namespace.

I'm not a Vulkan expert. Is this the right thing to do in a library under C++ ODR rules? In a ggml test port in vcpkg this is done in the "executable" for Android. AFAIU the final executable is the only place to do it reliably with regard to the ODR.

(Admittedly this is a linker error about duplicate symbols, not a compiler error about ODR.)
FTR the test port builds the example executable with this change:
```diff
diff --git a/examples/test-cmake/CMakeLists.txt b/examples/test-cmake/CMakeLists.txt
index d6bc0cc4..395a63c7 100644
--- a/examples/test-cmake/CMakeLists.txt
+++ b/examples/test-cmake/CMakeLists.txt
@@ -8,3 +8,9 @@ find_package(ggml CONFIG REQUIRED)
 set(TEST_TARGET test-cmake)
 add_executable(test-cmake test-cmake.cpp)
 target_link_libraries(test-cmake PRIVATE ggml::ggml)
+
+if(ANDROID AND TARGET ggml::ggml-vulkan)
+    # Instantiates VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE
+    find_package(Vulkan REQUIRED)
+    target_link_libraries(test-cmake PRIVATE Vulkan::Vulkan)
+endif()
diff --git a/examples/test-cmake/test-cmake.cpp b/examples/test-cmake/test-cmake.cpp
index 029c8898..9d4bbe19 100644
--- a/examples/test-cmake/test-cmake.cpp
+++ b/examples/test-cmake/test-cmake.cpp
@@ -1,5 +1,11 @@
 #include "ggml-backend.h"
+#if defined(ANDROID)
+#define VULKAN_HPP_DISPATCH_LOADER_DYNAMIC 1
+#include <vulkan/vulkan.hpp>
+VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE
+#endif
+
 int main(void) {
     ggml_backend_load_all();
     return 0;
```
The problem is on the vcpkg side, because on Android we force VULKAN_HPP_DISPATCH_LOADER_DYNAMIC and VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE while compiling with Vulkan. This creates the duplicate symbols. I fixed it in this PR: microsoft/vcpkg#47464
If a second lib does the same, we are back at the start. That's why it cannot be in the libs.
You should have mentioned that in the PR when I asked for a solution, and elaborated more. Now I understand.
```cpp
// See https://github.com/KhronosGroup/Vulkan-Hpp?tab=readme-ov-file#extensions--per-device-function-pointers-
VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE
```
Actually, this is the implementation.