does not compile on CUDA 10 anymore #4123


Closed
whoreson opened this issue Nov 18, 2023 · 26 comments

@whoreson
Contributor

Ever since this got merged:
https://github.com/ggerganov/llama.cpp/pull/3370

@whoreson
Contributor Author

The Makefile needs to be modified, because CUDA 10's nvcc has neither the --forward-unknown-to-host-compiler option nor -arch=native.
cuda10.patch
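A minimal sketch of gating those flags on the nvcc major version, for anyone patching their own Makefile. The version string is hard-coded here so the snippet runs without a CUDA toolkit; a real Makefile would parse `nvcc --version`, and compute_62 is only a placeholder for your card's actual compute capability:

```shell
# CUDA 10's nvcc predates --forward-unknown-to-host-compiler and -arch=native,
# so only add them when the toolkit is new enough.
NVCC_VER="10.2"         # stand-in; really: nvcc --version | sed -n 's/.*release \([0-9.]*\).*/\1/p'
MAJOR="${NVCC_VER%%.*}" # major version, e.g. 10
if [ "$MAJOR" -ge 11 ]; then
    NVCCFLAGS="--forward-unknown-to-host-compiler -arch=native"
else
    NVCCFLAGS="-arch=compute_62"   # placeholder: set to your GPU's compute capability
fi
echo "$NVCCFLAGS"
```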

@WeirdConstructor
Contributor

I second this with CUDA 11 on Ubuntu 22.04. I did not succeed in installing CUDA 12 on my Ubuntu 22.04 machine, so I am stuck with 11.
I added the following ugly hack to my Makefile, which seems to work on my system:

ifdef WEICON_BROKEN
	NVCCFLAGS += -arch=compute_86
else
	NVCCFLAGS += -arch=native
endif
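If you would rather not hard-code compute_86, newer drivers let nvidia-smi report the card's compute capability, which maps directly onto an -arch value. A rough sketch; note the query flag needs a reasonably recent nvidia-smi, and the capability is hard-coded below so the snippet runs without a GPU:

```shell
# With a GPU present, something like:
#   cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)"
cap="8.6"                                        # hard-coded stand-in
arch="compute_$(printf '%s' "$cap" | tr -d '.')" # "8.6" -> "compute_86"
echo "NVCCFLAGS += -arch=$arch"                  # what you'd put in the Makefile instead of -arch=native
```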

@Ph0rk0z

Ph0rk0z commented Nov 18, 2023

So there is hope for me building this on Windows 8.1 with cuBLAS?

@rvandernoort

The Makefile needs to be modified, because CUDA 10's nvcc has neither the --forward-unknown-to-host-compiler option nor -arch=native. cuda10.patch

Hi, I tried to use your patch to compile on my Nvidia Jetson Nano, but I'm getting some new errors because of it. The Jetson runs CUDA 10.2; any idea what is wrong?

make LLAMA_CUBLAS=1
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   aarch64
I UNAME_M:   aarch64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation 
I NVCCFLAGS: --compiler-options="  " -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib 
I CC:        cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX:       g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=native    -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=native  -Wno-array-bounds -Wno-format-truncation  -c common/console.cpp -o console.o
nvcc --compiler-options="  " -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(5970): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined

ggml-cuda.cu(6617): error: identifier "CUBLAS_COMPUTE_16F" is undefined

ggml-cuda.cu(7552): error: identifier "CUBLAS_COMPUTE_16F" is undefined

ggml-cuda.cu(7586): error: identifier "CUBLAS_COMPUTE_16F" is undefined

4 errors detected in the compilation of "/tmp/tmpxft_00002e62_00000000-6_ggml-cuda.cpp1.ii".
Makefile:440: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1

@chenxuuu

I applied the fix from ggml-org/whisper.cpp#1018, but I still get the same error:

user@ubuntu:~/llama.cpp$ make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   aarch64
I UNAME_M:   aarch64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/usr/local/cuda-10.2/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -mcpu=armv8.3-a
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/usr/local/cuda-10.2/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -mcpu=armv8.3-a  -Wno-array-bounds -Wno-format-truncation
I NVCCFLAGS: --compiler-options="  " -use_fast_math -arch=compute_62 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/usr/local/cuda-10.2/targets/aarch64-linux/lib
I CC:        cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX:       g++ (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

nvcc --compiler-options="  " -use_fast_math -arch=compute_62 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(6947): error: identifier "CUBLAS_COMPUTE_16F" is undefined

ggml-cuda.cu(7923): error: identifier "CUBLAS_COMPUTE_16F" is undefined

ggml-cuda.cu(7957): error: identifier "CUBLAS_COMPUTE_16F" is undefined

3 errors detected in the compilation of "/tmp/tmpxft_00007758_00000000-6_ggml-cuda.cpp1.ii".
Makefile:457: recipe for target 'ggml-cuda.o' failed
make: *** [ggml-cuda.o] Error 1

@FantasyGmm
Contributor

FantasyGmm commented Dec 21, 2023

0001-fix-old-jetson-compile-error.patch
This patch should be useful. I checked the CUDA 10 documentation and modified some of the code. I now have it building successfully on a TX2 (Ubuntu 18, JetPack 4, CUDA 10), and the performance is quite good.
You will need to compile and install GCC 8, the latest GCC supported by CUDA 10; that resolves the C-compiler part of the errors.

@ggml-org ggml-org deleted a comment from whoreson Dec 21, 2023
@rvandernoort

It looks like a promising patch, thanks! I can only test this in the new year, unfortunately, but I'll let you know the results then.

@whoreson
Contributor Author

Nice, that patch does fix the compile issue. However, something else is up:

current device: 0
GGML_ASSERT: ggml-cuda.cu:8498: !"cuBLAS error"

But it actually dies at various lines. Hmm I'll check past revisions or something.

@whoreson
Contributor Author

Okay, it's broken since bcc0eb4

Which is the "per-layer KV cache + quantum K cache" update.

@he-man86

he-man86 commented Jan 4, 2024

0001-fix-old-jetson-compile-error.patch This patch should be useful. I checked the CUDA 10 documentation and modified some of the code. I now have it building successfully on a TX2 (Ubuntu 18, JetPack 4, CUDA 10), and the performance is quite good. You will need to compile and install GCC 8, the latest GCC supported by CUDA 10; that resolves the C-compiler part of the errors.

I tried the patch on my Nano (Ubuntu 18 with CUDA 10.2), but it doesn't work for me. I believe my setup is OK; I also updated gcc and g++ to version 8.

Any ideas what is going wrong?

@FantasyGmm
Contributor

FantasyGmm commented Jan 4, 2024


I tested it on a Jetson TX2 and compiled GCC 8.5 myself. Do not use the gcc-8 from the apt sources; it does not work. I have submitted the contents of the patch to the repository, so you can build directly from the latest code.

@he-man86

he-man86 commented Jan 4, 2024


Thanks a lot for the help! I'm not sure which repository you mean, though. Do you have one with a correctly compiled GCC 8.5? I gave building it myself a quick try, but there are some parameters I'm not sure how to set.

@FantasyGmm
Contributor

FantasyGmm commented Jan 5, 2024


sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/gcc-8.5.0
./contrib/download_prerequisites
mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
make -j6
sudo make install
gcc -v

This takes a very long time and a lot of disk space; you can delete the gcc source folder after make install.

@rvandernoort

rvandernoort commented Jan 11, 2024

UPDATE: Managed to compile now! I needed to point make at the new gcc installation:

export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++

I've installed gcc 8.5 from source

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/aarch64-unknown-linux-gnu/8.5.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: ../configure -enable-checking=release -enable-languages=c,c++
Thread model: posix
gcc version 8.5.0 (GCC)

and after removing this line in the Makefile to get rid of an error (#MK_CXXFLAGS += -mcpu=native) and using CUDA_DOCKER_ARCH=sm_52, I still get the following error, similar to the one with cmake that I've described in #3880:

make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_52
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   aarch64
I UNAME_M:   aarch64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation
I NVCCFLAGS: -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_52 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 
I LDFLAGS:   -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib -L/usr/local/cuda/targets/aarch64-linux/lib -L/usr/lib/wsl/lib 
I CC:        cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
I CXX:       g++ (GCC) 8.5.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion    -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation -c common/sampling.cpp -o sampling.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation -c common/grammar-parser.cpp -o grammar-parser.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation -c common/build-info.cpp -o build-info.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation -c common/console.cpp -o console.o
nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_52 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi" -c ggml-cuda.cu -o ggml-cuda.o
ggml-cuda.cu(598): warning: function "warp_reduce_sum(half2)" was declared but never referenced

ggml-cuda.cu(619): warning: function "warp_reduce_max(half2)" was declared but never referenced

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
ggml-quants.c:3725:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
ggml-quants.c:403:27: error: invalid initializer
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
ggml-quants.c:3725:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:3749:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:3749:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s8_x2  vld1q_s8_x2
                           ^
ggml-quants.c:3751:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:406:27: error: invalid initializer
 #define ggml_vld1q_s8_x2  vld1q_s8_x2
                           ^
ggml-quants.c:3751:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
ggml-quants.c:3757:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
ggml-quants.c:3758:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\
                 ^
ggml-quants.c:3759:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:4365:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
                                    ^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:4383:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:4384:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:4384:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:4385:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;
                                                ^~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:5244:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:5246:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                     ^
ggml-quants.c:5253:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
                     ^
ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:5840:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
                                    ^~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:5848:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:5849:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                                              ^~~~~~~~~~~~~~~~
ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:
ggml-quants.c:403:27: error: invalid initializer
 #define ggml_vld1q_s16_x2 vld1q_s16_x2
                           ^
ggml-quants.c:6506:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                         ^~~~~~~~~~~~~~~~~
ggml-quants.c:404:27: error: invalid initializer
 #define ggml_vld1q_u8_x2  vld1q_u8_x2
                           ^
ggml-quants.c:6520:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’
             ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]
 #define ggml_vld1q_u8_x4  vld1q_u8_x4
                           ^
ggml-quants.c:6521:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:405:27: error: invalid initializer
 #define ggml_vld1q_u8_x4  vld1q_u8_x4
                           ^
ggml-quants.c:6521:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:407:27: error: invalid initializer
 #define ggml_vld1q_s8_x4  vld1q_s8_x4
                           ^
ggml-quants.c:6522:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’
             ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                                        ^~~~~~~~~~~~~~~~
ggml-quants.c:6547:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
             q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
                     ^
ggml-quants.c: In function ‘ggml_vec_dot_iq2_xxs_q8_K’:
ggml-quants.c:7264:17: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
             q8b = ggml_vld1q_s8_x4(q8); q8 += 64;
                 ^
cc1: some warnings being treated as errors
Makefile:552: recipe for target 'ggml-quants.o' failed
make: *** [ggml-quants.o] Error 1

@FantasyGmm
Contributor

UPDATE: Managed to compile now! Needed to export the gcc installation for make by: […]

g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation -c common/build-info.cpp -o build-info.o

g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wno-array-bounds -Wno-format-truncation -c common/console.cpp -o console.o

nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -use_fast_math --forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -arch=sm_52 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi" -c ggml-cuda.cu -o ggml-cuda.o

ggml-cuda.cu(598): warning: function "warp_reduce_sum(half2)" was declared but never referenced



ggml-cuda.cu(619): warning: function "warp_reduce_max(half2)" was declared but never referenced



cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o

cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -mcpu=native -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o

ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:

ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]

 #define ggml_vld1q_s16_x2 vld1q_s16_x2

                           ^

ggml-quants.c:3725:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’

         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);

                                         ^~~~~~~~~~~~~~~~~

ggml-quants.c:403:27: error: invalid initializer

 #define ggml_vld1q_s16_x2 vld1q_s16_x2

                           ^

ggml-quants.c:3725:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’

         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);

                                         ^~~~~~~~~~~~~~~~~

ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]

 #define ggml_vld1q_u8_x2  vld1q_u8_x2

                           ^

ggml-quants.c:3749:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’

             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;

                                              ^~~~~~~~~~~~~~~~

ggml-quants.c:404:27: error: invalid initializer

 #define ggml_vld1q_u8_x2  vld1q_u8_x2

                           ^

ggml-quants.c:3749:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’

             const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;

                                              ^~~~~~~~~~~~~~~~

ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]

 #define ggml_vld1q_s8_x2  vld1q_s8_x2

                           ^

ggml-quants.c:3751:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’

             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;

                                        ^~~~~~~~~~~~~~~~

ggml-quants.c:406:27: error: invalid initializer

 #define ggml_vld1q_s8_x2  vld1q_s8_x2

                           ^

ggml-quants.c:3751:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’

             ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;

                                        ^~~~~~~~~~~~~~~~

ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’

         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\

                 ^

ggml-quants.c:3757:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’

             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);

             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’

         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\

                 ^

ggml-quants.c:3758:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’

             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);

             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ggml-quants.c:3743:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’

         q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;\

                 ^

ggml-quants.c:3759:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’

             SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);

             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:

ggml-quants.c:404:27: error: invalid initializer

 #define ggml_vld1q_u8_x2  vld1q_u8_x2

                           ^

ggml-quants.c:4365:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’

         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);

                                    ^~~~~~~~~~~~~~~~

ggml-quants.c:404:27: error: invalid initializer

 #define ggml_vld1q_u8_x2  vld1q_u8_x2

                           ^

ggml-quants.c:4383:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’

             const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;

                                              ^~~~~~~~~~~~~~~~

ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]

 #define ggml_vld1q_s8_x4  vld1q_s8_x4

                           ^

ggml-quants.c:4384:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’

             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;

                                                ^~~~~~~~~~~~~~~~

ggml-quants.c:407:27: error: invalid initializer

 #define ggml_vld1q_s8_x4  vld1q_s8_x4

                           ^

ggml-quants.c:4384:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’

             const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;

                                                ^~~~~~~~~~~~~~~~

ggml-quants.c:407:27: error: invalid initializer

 #define ggml_vld1q_s8_x4  vld1q_s8_x4

                           ^

ggml-quants.c:4385:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’

             const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;

                                                ^~~~~~~~~~~~~~~~

ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:

ggml-quants.c:404:27: error: invalid initializer

 #define ggml_vld1q_u8_x2  vld1q_u8_x2

                           ^

ggml-quants.c:5244:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’

             const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;

                                              ^~~~~~~~~~~~~~~~

ggml-quants.c:5246:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’

             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;

                     ^

ggml-quants.c:5253:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’

             q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;

                     ^

ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:

ggml-quants.c:404:27: error: invalid initializer

 #define ggml_vld1q_u8_x2  vld1q_u8_x2

                           ^

ggml-quants.c:5840:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’

         ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);

                                    ^~~~~~~~~~~~~~~~

ggml-quants.c:404:27: error: invalid initializer

 #define ggml_vld1q_u8_x2  vld1q_u8_x2

                           ^

ggml-quants.c:5848:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’

             const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;

                                              ^~~~~~~~~~~~~~~~

ggml-quants.c:407:27: error: invalid initializer

 #define ggml_vld1q_s8_x4  vld1q_s8_x4

                           ^

ggml-quants.c:5849:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’

             const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;

                                              ^~~~~~~~~~~~~~~~

ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:

ggml-quants.c:403:27: error: invalid initializer

 #define ggml_vld1q_s16_x2 vld1q_s16_x2

                           ^

ggml-quants.c:6506:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’

         const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);

                                         ^~~~~~~~~~~~~~~~~

ggml-quants.c:404:27: error: invalid initializer

 #define ggml_vld1q_u8_x2  vld1q_u8_x2

                           ^

ggml-quants.c:6520:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’

             ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;

                                        ^~~~~~~~~~~~~~~~

ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]

 #define ggml_vld1q_u8_x4  vld1q_u8_x4

                           ^

ggml-quants.c:6521:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’

             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;

                                        ^~~~~~~~~~~~~~~~

ggml-quants.c:405:27: error: invalid initializer

 #define ggml_vld1q_u8_x4  vld1q_u8_x4

                           ^

ggml-quants.c:6521:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’

             ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;

                                        ^~~~~~~~~~~~~~~~

ggml-quants.c:407:27: error: invalid initializer

 #define ggml_vld1q_s8_x4  vld1q_s8_x4

                           ^

ggml-quants.c:6522:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’

             ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;

                                        ^~~~~~~~~~~~~~~~

ggml-quants.c:6547:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’

             q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;

                     ^

ggml-quants.c: In function ‘ggml_vec_dot_iq2_xxs_q8_K’:

ggml-quants.c:7264:17: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’

             q8b = ggml_vld1q_s8_x4(q8); q8 += 64;

                 ^

cc1: some warnings being treated as errors

Makefile:552: recipe for target 'ggml-quants.o' failed

make: *** [ggml-quants.o] Error 1

Your cc is GCC 7, not GCC 8.

@oiwn
Copy link

oiwn commented Feb 26, 2024

Any updates?

@otaGran
Copy link

otaGran commented Feb 26, 2024

I just got my TX2 working with the latest commit of the master branch (a33e6a0, 2024-02-26). Here is what I did.

  1. A factory reset of the TX2 to JetPack 4.6.4, the last version that still supports the Jetson TX2. JetPack 4.6.4 provides CUDA 10.2 and GCC 7.

  2. Enable all six cores by sudo nvpmodel -m 0, use jetson-fan-ctl to keep the fan running, and jetson-stats to monitor the usage.

  3. Compile and install GCC 8.5 following @FantasyGmm's guide. I have made a copy here:

    wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
    sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
    cd /usr/local/gcc-8.5.0
    ./contrib/download_prerequisites
    mkdir build
    cd build
    sudo ../configure -enable-checking=release -enable-languages=c,c++
    make -j6
    sudo make install
    
  4. Set the correct gcc/g++

    export CC=/usr/local/bin/gcc
    export CXX=/usr/local/bin/g++
    
  5. Changed the line in Makefile from

    MK_NVCCFLAGS  += -O3
    

    to

    MK_NVCCFLAGS += -maxrregcount=80
    

    The original -O3 will cause nvcc to report an error of "nvcc fatal : redefinition of argument 'optimize'."

    The -maxrregcount=80 is a workaround for the error too many resources for launch during the inference. I'm not a CUDA expert, the number 80 is from this link.

  6. make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6

  7. When running llama.cpp, I still need -ngl 33 (using llama-2-7b) to explicitly offload all layers to the Jetson TX2 GPU.

./main -m llama-2-7b.Q4_0.gguf -ngl 33 -c 256 -b 512 -n 128 --keep 48

llama_print_timings: load time = 15632.56 ms
llama_print_timings: sample time = 3.56 ms / 24 runs ( 0.15 ms per token, 6735.90 tokens per second)
llama_print_timings: prompt eval time = 13273.26 ms / 145 tokens ( 91.54 ms per token, 10.92 tokens per second)
llama_print_timings: eval time = 5457.13 ms / 23 runs ( 237.27 ms per token, 4.21 tokens per second)
llama_print_timings: total time = 32417.66 ms / 168 tokens

@whoreson
Copy link
Contributor Author

whoreson commented Mar 2, 2024

Hmm, it does compile with CUDA 10.2 (but not with CUDA 10.1 which I previously used). I didn't even bother compiling a proper gcc, just disabled the version check in /cuda-toolkit/targets/x86_64-linux/include/crt/host_config.h

Then first compiled ggml-cuda.cu by hand like so:

~/cuda-10.2/cuda-toolkit/bin/nvcc --compiler-options="" -use_fast_math -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 --use_fast_math --compiler-options="-I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wmissing-declarations -Wno-unused-function -Wno-multichar -Wno-format-truncation -Wno-array-bounds -pthread    -Wno-pedantic -march=native -mtune=native " -c ggml-cuda.cu -o ggml-cuda.o

And continued with make LLAMA_CUBLAS=1 as usual.

@whoreson
Copy link
Contributor Author

2bf8d0f broke it on CUDA 10.2
@slaren @JohannesGaessler

@slaren
Copy link
Member

slaren commented Mar 21, 2024

@whoreson this is getting a bit tiresome. Are you going to ask people to harass me over this again? Let's be clear: I have no interest in supporting ancient versions of CUDA. If this is important for you, you are welcome to fix it yourself and open a PR.

@JohannesGaessler
Copy link
Collaborator

I have no intention to support CUDA 10. As slaren said, if you want it supported you are free to put in the effort yourself and I will then happily review your PRs.

@whoreson

This comment was marked as off-topic.

@whoreson

This comment was marked as off-topic.

@github-actions github-actions bot added the stale label Apr 22, 2024
Copy link
Contributor

github-actions bot commented May 7, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@zhaofeng-shu33
Copy link

zhaofeng-shu33 commented Mar 13, 2025

Version 81bc (dating back to 2023-12-07) compiles successfully on Win10 + VS2019 + NVIDIA Toolkit 10.2 with the help of the following patch for ggml-cuda.cu.

#include <cuda_fp16.h>
+#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
+#define CUBLAS_COMPUTE_16F CUDA_R_16F
+#define CUBLAS_COMPUTE_32F CUDA_R_32F

The -DLLAMA_CUBLAS=ON option needs to be given.

@anuragdogra2192
Copy link

anuragdogra2192 commented Mar 24, 2025

I made version 81bc work on a Jetson Nano (Ubuntu 18.04) with CUDA 10.2 and GCC 8.5,
by defining the below before including <cuda_runtime.h> in ggml-cuda.cu:

#if CUDA_VERSION < 11000
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#define CUBLAS_COMPUTE_16F CUDA_R_16F
#define CUBLAS_COMPUTE_32F CUDA_R_32F
#endif
