CUDA error: unknown error when offloading to gfx1035 #4770

Closed
tdavie opened this issue Jan 4, 2024 · 2 comments

tdavie commented Jan 4, 2024

I've encountered an error when offloading layers to the iGPU on commit a919280, on an AMD Ryzen 7 7735HS (Radeon 680M) laptop running Fedora. I've seen a similar issue reported here, but I am using a different hardware configuration and see a distinct CUDA error.
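
For reference, here is a minimal HIP device query (my own sketch, not from llama.cpp; build line assumed for ROCm 5.7 on Fedora) that can be used to confirm the runtime enumerates the iGPU independently of llama.cpp:

// hip_query.cpp — minimal device query, a diagnostic sketch.
// Build (assumed): hipcc hip_query.cpp -o hip_query
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    int count = 0;
    hipError_t err = hipGetDeviceCount(&count);
    if (err != hipSuccess) {
        std::printf("hipGetDeviceCount failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) == hipSuccess) {
            // gcnArchName should report the ISA, e.g. "gfx1035".
            std::printf("device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
        }
    }
    return 0;
}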

rocm packages
$ dnf list --installed | grep "hip\|rocm"
hip-devel.noarch                                     5.7.1-1.fc39                        @updates         
hipblas.x86_64                                       5.7.1-1.fc40                        @rawhide         
hipblas-devel.x86_64                                 5.7.1-1.fc40                        @rawhide         
hipcc.noarch                                         5.7.1-1.fc39                        @updates         
hsakmt.x86_64                                        1.0.6-34.rocm5.7.0.fc39             @updates         
hsakmt-devel.x86_64                                  1.0.6-34.rocm5.7.0.fc39             @updates         
rocm-cmake.noarch                                    5.7.0-1.fc39                        @updates         
rocm-comgr.x86_64                                    17.0-3.fc39                         @updates         
rocm-comgr-devel.x86_64                              17.0-3.fc39                         @updates         
rocm-device-libs.x86_64                              17.1-1.fc39                         @updates         
rocm-hip.x86_64                                      5.7.1-1.fc39                        @updates         
rocm-hip-devel.x86_64                                5.7.1-1.fc39                        @updates         
rocm-rpm-macros-modules.x86_64                       1.0-7.fc39                          @updates         
rocm-runtime.x86_64                                  5.7.1-1.fc39                        @updates         
rocm-runtime-devel.x86_64                            5.7.1-1.fc39                        @updates         
rocm-smi.x86_64                                      5.7.1-1.fc39                        @updates         
rocminfo.x86_64                                      5.7.0-1.fc39                        @updates
rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 7735HS with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 7735HS with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4829                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    13982956(0xd55cec) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    13982956(0xd55cec) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    13982956(0xd55cec) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1035                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5761(0x1681)                       
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   29184                              
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 115                                
  SDMA engine uCode::      47                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    2097152(0x200000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    2097152(0x200000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1035         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             
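
One caveat when reading the rocminfo output above: the HSA runtime reporting gfx1035 does not guarantee that hipBLAS/rocBLAS actually ships kernels for it. A quick way to list the architectures the installed rocBLAS was built for (library path assumed; Fedora may place it elsewhere):

$ ls /usr/lib64/rocblas/library/ | grep -o 'gfx[0-9a-f]*' | sort -u

If gfx1035 is missing from that list, GEMM calls like the one that fails below may have no matching kernel.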
compilation
$ make -j LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1035 CC=/usr/bin/hipcc CXX=/usr/bin/clang++
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   unknown
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi
I NVCCFLAGS:  
I LDFLAGS:   -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
I CC:        HIP version: 5.7.31921-
I CXX:       clang version 17.0.6 (Fedora 17.0.6-1.fc39)

/usr/bin/hipcc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion    -c ggml.c -o ggml.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c llama.cpp -o llama.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/common.cpp -o common.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/sampling.cpp -o sampling.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/grammar-parser.cpp -o grammar-parser.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/console.cpp -o console.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi --offload-arch=gfx1035 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml-cuda.o ggml-cuda.cu
/usr/bin/hipcc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
/usr/bin/hipcc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/train.cpp -o train.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion  -c tests/test-c.c -o tests/test-c.o
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/build-info.cpp -o build-info.o
ggml.c:1203:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1203 |     GGML_F16_VEC_REDUCE(sumf, sum);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:835:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  835 | #define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
      |                                     ^
ggml.c:825:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
  825 | #define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
      |                                 ^
ggml.c:771:11: note: expanded from macro 'GGML_F32x8_REDUCE'
  771 |     res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
      |         ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1251:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1251 |         GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:835:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  835 | #define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
      |                                     ^
ggml.c:825:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
  825 | #define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
      |                                 ^
ggml.c:771:11: note: expanded from macro 'GGML_F32x8_REDUCE'
  771 |     res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
      |         ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-cuda.cu:569:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  569 | }
      | ^
ggml-cuda.cu:809:20: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  809 |     if (blockIdx.z < ne02) { // src0
      |         ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:56: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                                             ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:35: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                        ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:9899:103: warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9899 | static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                       ^
ggml-cuda.cu:9906:106: warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9906 | static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                          ^
ggml-cuda.cu:10048:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10048 |             } break;
       |               ^~~~~
ggml-cuda.cu:10041:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10041 |             } break;
       |               ^~~~~
ggml-cuda.cu:10017:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10017 |             } break;
       |               ^~~~~
ggml-cuda.cu:10002:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10002 |             } break;
       |               ^~~~~
ggml-cuda.cu:9985:13: warning: 'break' will never be executed [-Wunreachable-code-break]
  9985 |             break;
       |             ^~~~~
ggml-cuda.cu:4433:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4433 |     mul_mat_q4_K(
      |     ^
ggml-cuda.cu:4433:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
ggml-cuda.cu:4500:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4500 | mul_mat_q5_K(
      | ^
ggml-cuda.cu:4500:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
ggml-cuda.cu:4569:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4569 |     mul_mat_q6_K(
      |     ^
ggml-cuda.cu:4569:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
2 warnings generated.
17 warnings generated when compiling for gfx1035.
ggml-cuda.cu:569:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  569 | }
      | ^
ggml-cuda.cu:809:20: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  809 |     if (blockIdx.z < ne02) { // src0
      |         ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:56: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                                             ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:35: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                        ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:9899:103: warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9899 | static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                       ^
ggml-cuda.cu:9906:106: warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9906 | static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                          ^
ggml-cuda.cu:10048:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10048 |             } break;
       |               ^~~~~
ggml-cuda.cu:10041:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10041 |             } break;
       |               ^~~~~
ggml-cuda.cu:10017:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10017 |             } break;
       |               ^~~~~
ggml-cuda.cu:10002:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10002 |             } break;
       |               ^~~~~
ggml-cuda.cu:9985:13: warning: 'break' will never be executed [-Wunreachable-code-break]
  9985 |             break;
       |             ^~~~~
11 warnings generated when compiling for host.
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/main/main.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o main -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/quantize/quantize.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/quantize-stats/quantize-stats.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize-stats -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/perplexity/perplexity.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o perplexity -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/embedding/embedding.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o embedding -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi pocs/vdot/vdot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o vdot -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi pocs/vdot/q8dot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o q8dot -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o train-text-from-scratch -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o convert-llama2c-to-ggml -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/simple/simple.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o simple -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/batched/batched.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/batched-bench/batched-bench.cpp build-info.o ggml.o llama.o common.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched-bench -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/save-load-state/save-load-state.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o save-load-state -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iexamples/server examples/server/server.cpp examples/llava/clip.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o server -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas   -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/gguf/gguf.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o gguf -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/llama-bench/llama-bench.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llama-bench -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/llava/llava-cli.cpp examples/llava/clip.cpp examples/llava/llava.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llava-cli -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas  -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/baby-llama/baby-llama.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o baby-llama -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/beam-search/beam-search.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o beam-search -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/speculative/speculative.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o speculative -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/infill/infill.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o infill -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/tokenize/tokenize.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o tokenize -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/benchmark/benchmark-matmult.cpp build-info.o ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o benchmark-matmult -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/parallel/parallel.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o parallel -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/finetune/finetune.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o finetune -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/export-lora/export-lora.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o export-lora -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/lookahead/lookahead.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o lookahead -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/lookup/lookup.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o lookup -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 

====  Run ./main -h for help.  ====
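
As a sanity check that AMDGPU_TARGETS=gfx1035 actually took effect, the code objects embedded in the build can be inspected (roc-obj-ls ships with ROCm; the strings check is cruder but dependency-free):

$ roc-obj-ls ./main
$ strings ggml-cuda.o | grep -o 'gfx10[0-9]*' | sort -u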

Running with any number of layers offloaded results in the error below.

$ ./main -m ~/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf -p "what is the meaning of life, the universe, and everything?" -ngl 1
Log start
main: build = 1766 (a919280)
main: built with HIP version: 5.7.31921- for x86_64-redhat-linux-gnu
main: seed  = 1704360510
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 10.3, VMM: no
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/user/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = berkeley-nest_starling-lm-7b-alpha
llama_model_loader: - kv   2:                       llama.context_length u32              = 8192
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = berkeley-nest_starling-lm-7b-alpha
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|end_of_turn|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|end_of_turn|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: system memory used  = 4032.99 MiB
llm_load_tensors: VRAM used           =  132.50 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2.00 MB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 207.50 MiB (model: 132.50 MiB, context: 75.00 MiB)
CUDA error: unknown error
  current device: 0, in function ggml_cuda_mul_mat_mat_batched_cublas at ggml-cuda.cu:8640
  hipblasGemmBatchedEx(g_cublas_handles[g_main_device], HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
GGML_ASSERT: ggml-cuda.cu:226: !"CUDA error"
[New LWP 9043]
[New LWP 9045]
[New LWP 9046]
[New LWP 9047]
[New LWP 9048]
[New LWP 9049]
[New LWP 9050]
[New LWP 9051]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f469c71cd43 in wait4 () from /lib64/libc.so.6
#0  0x00007f469c71cd43 in wait4 () from /lib64/libc.so.6
#1  0x000000000054d505 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#2  0x0000000000555d48 in ggml_cuda_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#3  0x000000000054f3a3 in ggml_cuda_compute_forward ()
#4  0x0000000000443cf1 in ggml_compute_forward ()
#5  0x0000000000433efc in ggml_graph_compute_thread ()
#6  0x00000000004337a8 in ggml_graph_compute ()
#7  0x0000000000577f5b in ggml_backend_cpu_graph_compute ()
#8  0x0000000000574a07 in ggml_backend_graph_compute ()
#9  0x00000000004e00c1 in llama_decode_internal(llama_context&, llama_batch) ()
#10 0x00000000004e0954 in llama_decode ()
#11 0x0000000000526e17 in llama_init_from_gpt_params(gpt_params&) ()
#12 0x0000000000415df3 in main ()
[Inferior 1 (process 9042) detached]
Aborted (core dumped)
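
In case it helps triage: a minimal standalone hipBLAS GEMM test (my own sketch, not from llama.cpp; header path and build line assumed for ROCm 5.7 on Fedora) should show whether the failure is in the ROCm stack itself rather than in ggml-cuda.cu:

// gemm_check.cpp — standalone hipBLAS GEMM smoke test, a sketch.
// Build (assumed): hipcc --offload-arch=gfx1035 gemm_check.cpp -lhipblas -o gemm_check
#include <cstdio>
#include <vector>
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>

#define HIP_CHECK(x) do { hipError_t e_ = (x); if (e_ != hipSuccess) { \
    std::printf("HIP error: %s (line %d)\n", hipGetErrorString(e_), __LINE__); \
    return 1; } } while (0)

int main() {
    const int n = 64;
    std::vector<float> host(n * n, 1.0f);
    float *a = nullptr, *b = nullptr, *c = nullptr;
    HIP_CHECK(hipMalloc(&a, n * n * sizeof(float)));
    HIP_CHECK(hipMalloc(&b, n * n * sizeof(float)));
    HIP_CHECK(hipMalloc(&c, n * n * sizeof(float)));
    HIP_CHECK(hipMemcpy(a, host.data(), n * n * sizeof(float), hipMemcpyHostToDevice));
    HIP_CHECK(hipMemcpy(b, host.data(), n * n * sizeof(float), hipMemcpyHostToDevice));

    hipblasHandle_t handle;
    if (hipblasCreate(&handle) != HIPBLAS_STATUS_SUCCESS) {
        std::printf("hipblasCreate failed\n");
        return 1;
    }
    const float alpha = 1.0f, beta = 0.0f;
    // Same class of call as the failing hipblasGemmBatchedEx, minus the batching.
    hipblasStatus_t st = hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                                      n, n, n, &alpha, a, n, b, n, &beta, c, n);
    HIP_CHECK(hipDeviceSynchronize());
    std::printf("hipblasSgemm status: %d (%s)\n", (int) st,
                st == HIPBLAS_STATUS_SUCCESS ? "ok" : "failed");
    hipblasDestroy(handle);
    return st == HIPBLAS_STATUS_SUCCESS ? 0 : 1;
}

Also, since rocBLAS builds commonly ship kernels for gfx1030 but not gfx1035, the workaround often suggested for RDNA2 iGPUs is to make the runtime report the nearest officially supported ISA (untested on my machine so far):

$ HSA_OVERRIDE_GFX_VERSION=10.3.0 ./main -m ~/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf -p "what is the meaning of life, the universe, and everything?" -ngl 1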

Let me know if you need any further details. I appreciate all the contributors' great work on this program!

github-actions bot commented Mar 18, 2024

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Mar 18, 2024
github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 2, 2024