
Conversation

taronaeo (Collaborator)

This PR cleans up the zDNN codebase by refactoring the operations into individual files for better readability and easier collaboration. It also adds the zDNN backend documentation and lists zDNN as an available backend in README.md.

This PR should have no performance impact since it is purely a refactor. However, I've still run tests just in case.
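For anyone who wants to reproduce the numbers below, here is a minimal build sketch, assuming an installed zDNN library and the `GGML_ZDNN` CMake option covered in the new backend docs:

```sh
# Sketch: build llama.cpp with the zDNN backend enabled
# (assumes libzdnn is installed and the GGML_ZDNN option from the new docs)
cmake -B build -DGGML_ZDNN=ON
cmake --build build --config Release -j
```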

Performance

| model | size | params | backend | threads | test | t/s |
| ----- | ---: | -----: | ------- | ------: | :--- | ---: |
| granite 3B all F32 | 9.44 GiB | 2.53 B | zDNN,BLAS | 8 | pp512 | 215.78 ± 0.33 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | zDNN,BLAS | 8 | tg128 | 4.70 ± 0.02 |
| granite 3B F16 | 4.72 GiB | 2.53 B | zDNN,BLAS | 8 | pp512 | 217.33 ± 1.52 |
| granite 3B F16 | 4.72 GiB | 2.53 B | zDNN,BLAS | 8 | tg128 | 4.70 ± 0.05 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | zDNN,BLAS | 8 | pp512 | 216.35 ± 0.18 |
| granite 3B BF16 | 4.72 GiB | 2.53 B | zDNN,BLAS | 8 | tg128 | 4.63 ± 0.06 |

Note

Tests were conducted on an IBM z17 Mainframe with 40 IFLs (cores) and 128 GB Memory on a shared R&D LPAR.
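The table follows llama-bench's output format; a typical invocation that would produce such rows looks like the sketch below (the model path is a placeholder, and llama-bench runs the pp512 and tg128 tests by default):

```sh
# Hypothetical reproduction; the model filename is a placeholder.
# llama-bench defaults to pp512 and tg128, so only the thread count is set.
./build/bin/llama-bench -m granite-3b-f16.gguf -t 8
```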

test-backend-ops

```sh
./build/bin/test-backend-ops -b zDNN | grep -v "not supported"
ggml_zdnn_init: allocating                                                                                                                                        
ggml_zdnn_init: found 1 device                                                                                                                                    
ggml_zdnn_init: picking default device: zDNN                                                                                                                      
ggml_zdnn_init: NNPA name: zDNN                                                                                                                                   
ggml_zdnn_init: NNPA_PARMBLKFORMAT_0 = true                                                                                                                       
ggml_zdnn_init: NNPA_PARMBLKFORMAT_1 = true                                                                                                                       
Testing 3 devices                                                                                                                                                 
                                                                                                                                                                  
Backend 1/3: zDNN                                                                                                                                                 
  Device description: IBM Z Neural Network Processing Assist (NNPA)                                                                                               
  Device memory: 0 MB (0 MB free)                                                                                                                                 
                                                                                                                                                                  
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK                                                                       
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=9,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=4,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=1,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
  MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
ggml_zdnn_free: deallocating
  14491/14491 tests passed
  Backend zDNN: OK
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK
```
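In the test names above, `m`, `n`, and `k` are the matrix dimensions, `bs` the batch sizes, `nr` the broadcast repeat factors, and `per` the dimension permutation; the remaining flags select tensor-layout variants. As a rough illustration of what a single case such as `MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256)` maps to in the ggml API, here is a minimal sketch (not the actual test harness):

```cpp
// Minimal sketch of the tensor shapes behind one MUL_MAT test case.
// In ggml_mul_mat, A is k x m, B is k x n, and the result is m x n.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // ne[0] is the row length, so both operands share the inner dimension k
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 256, 16); // k=256, m=16
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 256, 1);  // k=256, n=1
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // c has shape 16 x 1 (m x n)

    (void) c; // this only builds the node; computing it needs a graph + backend
    ggml_free(ctx);
    return 0;
}
```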

@github-actions bot added the documentation, ggml, and IBM zDNN labels on Sep 22, 2025
taronaeo (Collaborator, Author)

  • CI / ggml-ci-x64-cpu-amx (pull_request) seems to have consistent failures of `cat: /home/ggml/results/llama.cpp/qwen3_0_6b-imatrix-sum.log: No such file or directory` across multiple PR CI runs.
  • CI / ggml-ci-x64-nvidia-t4-vulkan (pull_request) and CI / ggml-ci-x64-nvidia-t4-vulkan-coopmat1 (pull_request) seem to be hitting resource problems where compilation is terminated prematurely.
  • CI / ggml-ci-mac-metal seems to have no runners picking it up.

I'll cancel them and re-run to see whether it's a transient problem.

ggerganov (Member)

Ignore the AMX failure - it's an issue in the AMX backend. The Mac runner is up now; it will take some time to catch up with all the queued workflows, but should be good after that.

I will take a look at the Vulkan issues, but none of these are a problem for this PR.

taronaeo merged commit 264f1b5 into ggml-org:master on Sep 23, 2025
115 of 122 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 23, 2025
* origin/master: (39 commits)
ci : disable AMD workflows + update NVIDIA workflows (ggml-org#16200)
ci : enable Vulkan workflow on Mac (ggml-org#16194)
ggml-cpu: Respect cpumask settings (ggml-org#16164)
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (ggml-org#15928)
zdnn: refactor codebase + add docs (ggml-org#16178)
codeowners : add @danbev to model-conversion example [no ci] (ggml-org#16190)
devops: add s390x containers (ggml-org#15915)
ggml-cpu : fix typo in gemm comments [no ci] (ggml-org#16189)
feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (ggml-org#16177)
clang-tidy : disable warning about performance enum size (ggml-org#16127)
ggml : implement set_rows with i32 index (ggml-org#16159)
codeowners : update + cleanup (ggml-org#16174)
common : enable `--offline` mode without curl support (ggml-org#16137)
webui : fix handling incomplete chunks (ggml-org#16107)
embedding : fix typos in README (ggml-org#16171)
common : remove unused local variables (ggml-org#16140)
ggml : extend ggml_can_fuse to work with non-sequential nodes (ggml-org#16123)
ggml : add ggml_op_is_empty (ggml-org#16122)
codeowners : update ownership for @ngxson and @allozuar (ggml-org#16128)
Vulkan: add conv_transpose_2d operation (ggml-org#16022)
...
struct pushed a commit to struct/llama.cpp that referenced this pull request Sep 26, 2025
* zdnn: initial matmul refactor

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: rm static from funcs

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: update ggml-zdnn.h

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: change header files to hpp

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: switch to common.hpp

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: move mulmat forward around

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: rm inline from utils

Signed-off-by: Aaron Teo <[email protected]>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <[email protected]>

* docs: add zDNN docs

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
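Taken together, these commits move each operation into its own translation unit with shared declarations in common.hpp. A toy sketch of that one-op-per-file pattern follows; all names and types here are hypothetical stand-ins, not the actual ggml-zdnn sources:

```cpp
// mul_mat.cpp -- toy "one op per file" layout; Tensor and zdnn_mul_mat are
// hypothetical stand-ins, not the real ggml-zdnn declarations. The shared
// declarations (here inlined for self-containment) would normally live in
// common.hpp and be included by every op file.
#include <cstddef>
#include <vector>

struct Tensor {                 // stand-in tensor type
    int rows = 0, cols = 0;
    std::vector<float> data;   // row-major
};

// the single operation this translation unit owns; the real backend would
// convert inputs to the zDNN stickified layout and call zdnn_matmul_op
void zdnn_mul_mat(const Tensor & a, const Tensor & b, Tensor & out) {
    out.rows = a.rows;
    out.cols = b.cols;
    out.data.assign((std::size_t) out.rows * out.cols, 0.0f);
    for (int i = 0; i < a.rows; ++i)
        for (int k = 0; k < a.cols; ++k)
            for (int j = 0; j < b.cols; ++j)
                out.data[(std::size_t) i * out.cols + j] +=
                    a.data[(std::size_t) i * a.cols + k] *
                    b.data[(std::size_t) k * b.cols + j];
}
```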