CUDA error: unknown error when offloading to gfx1035 #4770

Closed
tdavie opened this issue Jan 4, 2024 · 2 comments

tdavie commented Jan 4, 2024

I've encountered an error when offloading layers to the iGPU on commit a919280, on an AMD Ryzen 7 7735HS (Radeon 680M) laptop running Fedora. I've seen a similar issue reported here, but I am using a different hardware configuration and see a distinct CUDA error.
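
For reference, here is a minimal HIP device query (my own sketch, not from llama.cpp; build line assumed for ROCm 5.7 on Fedora) that can be used to confirm the runtime enumerates the iGPU independently of llama.cpp:

// hip_query.cpp — minimal device query, a diagnostic sketch.
// Build (assumed): hipcc hip_query.cpp -o hip_query
#include <cstdio>
#include <hip/hip_runtime.h>

int main() {
    int count = 0;
    hipError_t err = hipGetDeviceCount(&count);
    if (err != hipSuccess) {
        std::printf("hipGetDeviceCount failed: %s\n", hipGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) == hipSuccess) {
            // gcnArchName should report the ISA, e.g. "gfx1035".
            std::printf("device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
        }
    }
    return 0;
}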

rocm packages
$ dnf list --installed | grep "hip\|rocm"
hip-devel.noarch                                     5.7.1-1.fc39                        @updates         
hipblas.x86_64                                       5.7.1-1.fc40                        @rawhide         
hipblas-devel.x86_64                                 5.7.1-1.fc40                        @rawhide         
hipcc.noarch                                         5.7.1-1.fc39                        @updates         
hsakmt.x86_64                                        1.0.6-34.rocm5.7.0.fc39             @updates         
hsakmt-devel.x86_64                                  1.0.6-34.rocm5.7.0.fc39             @updates         
rocm-cmake.noarch                                    5.7.0-1.fc39                        @updates         
rocm-comgr.x86_64                                    17.0-3.fc39                         @updates         
rocm-comgr-devel.x86_64                              17.0-3.fc39                         @updates         
rocm-device-libs.x86_64                              17.1-1.fc39                         @updates         
rocm-hip.x86_64                                      5.7.1-1.fc39                        @updates         
rocm-hip-devel.x86_64                                5.7.1-1.fc39                        @updates         
rocm-rpm-macros-modules.x86_64                       1.0-7.fc39                          @updates         
rocm-runtime.x86_64                                  5.7.1-1.fc39                        @updates         
rocm-runtime-devel.x86_64                            5.7.1-1.fc39                        @updates         
rocm-smi.x86_64                                      5.7.1-1.fc39                        @updates         
rocminfo.x86_64                                      5.7.0-1.fc39                        @updates
rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 7735HS with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 7735HS with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4829                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    13982956(0xd55cec) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    13982956(0xd55cec) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    13982956(0xd55cec) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1035                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5761(0x1681)                       
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   29184                              
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 115                                
  SDMA engine uCode::      47                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    2097152(0x200000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    2097152(0x200000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1035         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             
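
One caveat when reading the rocminfo output above: the HSA runtime reporting gfx1035 does not guarantee that hipBLAS/rocBLAS actually ships kernels for it. A quick way to list the architectures the installed rocBLAS was built for (library path assumed; Fedora may place it elsewhere):

$ ls /usr/lib64/rocblas/library/ | grep -o 'gfx[0-9a-f]*' | sort -u

If gfx1035 is missing from that list, GEMM calls like the one that fails below may have no matching kernel.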
compilation
$ make -j LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1035 CC=/usr/bin/hipcc CXX=/usr/bin/clang++
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   unknown
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi
I NVCCFLAGS:  
I LDFLAGS:   -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
I CC:        HIP version: 5.7.31921-
I CXX:       clang version 17.0.6 (Fedora 17.0.6-1.fc39)

/usr/bin/hipcc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion    -c ggml.c -o ggml.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c llama.cpp -o llama.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/common.cpp -o common.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/sampling.cpp -o sampling.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/grammar-parser.cpp -o grammar-parser.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/console.cpp -o console.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi --offload-arch=gfx1035 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml-cuda.o ggml-cuda.cu
/usr/bin/hipcc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
/usr/bin/hipcc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/train.cpp -o train.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion  -c tests/test-c.c -o tests/test-c.o
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/build-info.cpp -o build-info.o
ggml.c:1203:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1203 |     GGML_F16_VEC_REDUCE(sumf, sum);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:835:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  835 | #define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
      |                                     ^
ggml.c:825:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
  825 | #define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
      |                                 ^
ggml.c:771:11: note: expanded from macro 'GGML_F32x8_REDUCE'
  771 |     res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
      |         ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1251:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1251 |         GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:835:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  835 | #define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
      |                                     ^
ggml.c:825:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
  825 | #define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
      |                                 ^
ggml.c:771:11: note: expanded from macro 'GGML_F32x8_REDUCE'
  771 |     res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
      |         ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-cuda.cu:569:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  569 | }
      | ^
ggml-cuda.cu:809:20: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  809 |     if (blockIdx.z < ne02) { // src0
      |         ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:56: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                                             ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:35: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                        ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:9899:103: warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9899 | static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                       ^
ggml-cuda.cu:9906:106: warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9906 | static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                          ^
ggml-cuda.cu:10048:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10048 |             } break;
       |               ^~~~~
ggml-cuda.cu:10041:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10041 |             } break;
       |               ^~~~~
ggml-cuda.cu:10017:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10017 |             } break;
       |               ^~~~~
ggml-cuda.cu:10002:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10002 |             } break;
       |               ^~~~~
ggml-cuda.cu:9985:13: warning: 'break' will never be executed [-Wunreachable-code-break]
  9985 |             break;
       |             ^~~~~
ggml-cuda.cu:4433:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4433 |     mul_mat_q4_K(
      |     ^
ggml-cuda.cu:4433:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
ggml-cuda.cu:4500:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4500 | mul_mat_q5_K(
      | ^
ggml-cuda.cu:4500:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
ggml-cuda.cu:4569:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4569 |     mul_mat_q6_K(
      |     ^
ggml-cuda.cu:4569:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
2 warnings generated.
17 warnings generated when compiling for gfx1035.
ggml-cuda.cu:569:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  569 | }
      | ^
ggml-cuda.cu:809:20: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  809 |     if (blockIdx.z < ne02) { // src0
      |         ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:56: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                                             ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:35: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                        ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:9899:103: warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9899 | static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                       ^
ggml-cuda.cu:9906:106: warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9906 | static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                          ^
ggml-cuda.cu:10048:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10048 |             } break;
       |               ^~~~~
ggml-cuda.cu:10041:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10041 |             } break;
       |               ^~~~~
ggml-cuda.cu:10017:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10017 |             } break;
       |               ^~~~~
ggml-cuda.cu:10002:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 10002 |             } break;
       |               ^~~~~
ggml-cuda.cu:9985:13: warning: 'break' will never be executed [-Wunreachable-code-break]
  9985 |             break;
       |             ^~~~~
11 warnings generated when compiling for host.
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/main/main.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o main -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/quantize/quantize.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/quantize-stats/quantize-stats.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize-stats -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/perplexity/perplexity.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o perplexity -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/embedding/embedding.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o embedding -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi pocs/vdot/vdot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o vdot -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi pocs/vdot/q8dot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o q8dot -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o train-text-from-scratch -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o convert-llama2c-to-ggml -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/simple/simple.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o simple -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/batched/batched.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/batched-bench/batched-bench.cpp build-info.o ggml.o llama.o common.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched-bench -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/save-load-state/save-load-state.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o save-load-state -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iexamples/server examples/server/server.cpp examples/llava/clip.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o server -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas   -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/gguf/gguf.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o gguf -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/llama-bench/llama-bench.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llama-bench -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/llava/llava-cli.cpp examples/llava/clip.cpp examples/llava/llava.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llava-cli -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas  -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/baby-llama/baby-llama.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o baby-llama -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/beam-search/beam-search.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o beam-search -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/speculative/speculative.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o speculative -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/infill/infill.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o infill -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/tokenize/tokenize.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o tokenize -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/benchmark/benchmark-matmult.cpp build-info.o ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o benchmark-matmult -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/parallel/parallel.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o parallel -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/finetune/finetune.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o finetune -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/export-lora/export-lora.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o export-lora -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/lookahead/lookahead.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o lookahead -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/lookup/lookup.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o lookup -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas 

====  Run ./main -h for help.  ====
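
As a sanity check that AMDGPU_TARGETS=gfx1035 actually took effect, the code objects embedded in the build can be inspected (roc-obj-ls ships with ROCm; the strings check is cruder but dependency-free):

$ roc-obj-ls ./main
$ strings ggml-cuda.o | grep -o 'gfx10[0-9]*' | sort -u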

Running with any number of layers offloaded results in the error below.

$ ./main -m ~/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf -p "what is the meaning of life, the universe, and everything?" -ngl 1
Log start
main: build = 1766 (a919280)
main: built with HIP version: 5.7.31921- for x86_64-redhat-linux-gnu
main: seed  = 1704360510
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 10.3, VMM: no
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/user/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = berkeley-nest_starling-lm-7b-alpha
llama_model_loader: - kv   2:                       llama.context_length u32              = 8192
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = berkeley-nest_starling-lm-7b-alpha
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|end_of_turn|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|end_of_turn|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: system memory used  = 4032.99 MiB
llm_load_tensors: VRAM used           =  132.50 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2.00 MB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 207.50 MiB (model: 132.50 MiB, context: 75.00 MiB)
CUDA error: unknown error
  current device: 0, in function ggml_cuda_mul_mat_mat_batched_cublas at ggml-cuda.cu:8640
  hipblasGemmBatchedEx(g_cublas_handles[g_main_device], HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
GGML_ASSERT: ggml-cuda.cu:226: !"CUDA error"
[New LWP 9043]
[New LWP 9045]
[New LWP 9046]
[New LWP 9047]
[New LWP 9048]
[New LWP 9049]
[New LWP 9050]
[New LWP 9051]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f469c71cd43 in wait4 () from /lib64/libc.so.6
#0  0x00007f469c71cd43 in wait4 () from /lib64/libc.so.6
#1  0x000000000054d505 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#2  0x0000000000555d48 in ggml_cuda_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#3  0x000000000054f3a3 in ggml_cuda_compute_forward ()
#4  0x0000000000443cf1 in ggml_compute_forward ()
#5  0x0000000000433efc in ggml_graph_compute_thread ()
#6  0x00000000004337a8 in ggml_graph_compute ()
#7  0x0000000000577f5b in ggml_backend_cpu_graph_compute ()
#8  0x0000000000574a07 in ggml_backend_graph_compute ()
#9  0x00000000004e00c1 in llama_decode_internal(llama_context&, llama_batch) ()
#10 0x00000000004e0954 in llama_decode ()
#11 0x0000000000526e17 in llama_init_from_gpt_params(gpt_params&) ()
#12 0x0000000000415df3 in main ()
[Inferior 1 (process 9042) detached]
Aborted (core dumped)
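
In case it helps triage: a minimal standalone hipBLAS GEMM test (my own sketch, not from llama.cpp; header path and build line assumed for ROCm 5.7 on Fedora) should show whether the failure is in the ROCm stack itself rather than in ggml-cuda.cu:

// gemm_check.cpp — standalone hipBLAS GEMM smoke test, a sketch.
// Build (assumed): hipcc --offload-arch=gfx1035 gemm_check.cpp -lhipblas -o gemm_check
#include <cstdio>
#include <vector>
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>

#define HIP_CHECK(x) do { hipError_t e_ = (x); if (e_ != hipSuccess) { \
    std::printf("HIP error: %s (line %d)\n", hipGetErrorString(e_), __LINE__); \
    return 1; } } while (0)

int main() {
    const int n = 64;
    std::vector<float> host(n * n, 1.0f);
    float *a = nullptr, *b = nullptr, *c = nullptr;
    HIP_CHECK(hipMalloc(&a, n * n * sizeof(float)));
    HIP_CHECK(hipMalloc(&b, n * n * sizeof(float)));
    HIP_CHECK(hipMalloc(&c, n * n * sizeof(float)));
    HIP_CHECK(hipMemcpy(a, host.data(), n * n * sizeof(float), hipMemcpyHostToDevice));
    HIP_CHECK(hipMemcpy(b, host.data(), n * n * sizeof(float), hipMemcpyHostToDevice));

    hipblasHandle_t handle;
    if (hipblasCreate(&handle) != HIPBLAS_STATUS_SUCCESS) {
        std::printf("hipblasCreate failed\n");
        return 1;
    }
    const float alpha = 1.0f, beta = 0.0f;
    // Same class of call as the failing hipblasGemmBatchedEx, minus the batching.
    hipblasStatus_t st = hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                                      n, n, n, &alpha, a, n, b, n, &beta, c, n);
    HIP_CHECK(hipDeviceSynchronize());
    std::printf("hipblasSgemm status: %d (%s)\n", (int) st,
                st == HIPBLAS_STATUS_SUCCESS ? "ok" : "failed");
    hipblasDestroy(handle);
    return st == HIPBLAS_STATUS_SUCCESS ? 0 : 1;
}

Also, since rocBLAS builds commonly ship kernels for gfx1030 but not gfx1035, the workaround often suggested for RDNA2 iGPUs is to make the runtime report the nearest officially supported ISA (untested on my machine so far):

$ HSA_OVERRIDE_GFX_VERSION=10.3.0 ./main -m ~/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf -p "what is the meaning of life, the universe, and everything?" -ngl 1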

Let me know if you need any further details. I appreciate all the contributors' great work on this program!

github-actions bot commented Mar 18, 2024

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Mar 18, 2024
github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 2, 2024