Description
Name and Version
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Device 2: Tesla P40, compute capability 6.1, VMM: yes
Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 5145 (12b17501)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Quad Nvidia Tesla P40 on dual Xeon E5-2699v4 (two cards per CPU)
Models
Llama-3.3-70B-Instruct-GGUF
Qwen2.5-72B-Instruct-GGUF
gemma-3-27b-it-Q8_0.gguf
QwQ-32B-Q8_0.gguf
Problem description & steps to reproduce
I updated and built llama.cpp after sticking with the same version for a couple of months, and since then llama-server fails to generate output with Llama 3.3 70B or Qwen 2.5 72B split across all four cards. It also fails to generate output after starting with smaller models on two cards only, like Gemma 3 27B, Mistral Small 24B, Qwen 2.5 Coder 32B, and QwQ 32B.
If I run the 27-32B models on CUDA0 and CUDA1 they invariably fail, but generation works (mostly) fine with any of the following combinations (a quick way to cycle through the pairs is sketched after this list):
CUDA0,CUDA2
CUDA0,CUDA3
CUDA1,CUDA2
CUDA1,CUDA3
CUDA2,CUDA3
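That sketch, for reference; just a rough loop, with the model path, prompt, and token count as placeholders, and assuming a recent build where -no-cnv is available:
for pair in CUDA0,CUDA1 CUDA0,CUDA2 CUDA0,CUDA3 CUDA1,CUDA2 CUDA1,CUDA3 CUDA2,CUDA3; do
  echo "=== $pair ==="
  ./build/bin/llama-cli -m /models/QwQ-32B-Q8_0.gguf \
    -fa -sm row --no-mmap -ngl 99 --device "$pair" \
    -p "Hello" -n 32 -no-cnv
done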
When this happens, nvtop shows GPU load on one GPU only for the smaller models that I configure to run on two GPUs, and on two GPUs only for the larger models that are configured to run on all four.
The worst part is that once this happens, llama.cpp is unable to initialize CUDA devices until I reboot the server. If I run llama-cli afterwards, I get the following:
llama-cli
ggml_cuda_init: failed to initialize CUDA: unknown error
build: 5145 (12b17501) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
gguf_init_from_file: failed to open GGUF file 'models/7B/ggml-model-f16.gguf'
llama_model_load: error loading model: llama_model_loader: failed to load model from models/7B/ggml-model-f16.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/7B/ggml-model-f16.gguf'
main: error: unable to load model
Meanwhile, nvidia-smi and nvtop continue to work normally when this happens, without a reboot.
I don't remember the exact version I was running before, so I checked out b4686 from February (I think I was on b45xx) and recompiled, and indeed 70B models work without issue. I deleted the build directory, then configured and built again. To confirm, I ran llama-cli after building:
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Device 2: Tesla P40, compute capability 6.1, VMM: yes
Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 4686 (7b891bdc)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
I ran the same llama-server command:
llama-server -m /models/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf \
-fa -sm row --no-mmap \
-ngl 99 -ngld 99 --port 9002 -c 10000 \
--device CUDA0,CUDA1,CUDA2,CUDA3 --tensor-split 1,1,1,1 \
--slots --metrics --numa distribute -t 40
and generation worked fine.
I checked out b5145 (I've been trying tags since b5131), recompiled as described below, confirmed the version with llama-cli --version, and ran Llama 3.3 70B using the same command above. In the time it took me to type all this, this is all the output I got from llama-server:
",H@2C%#6H<+$D+A'FD8CG1F8#.H7)'%8#<H(#9'#.)A932+C7%/4==E$3/C".5;33
Compile
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON -DLLAMA_CURL=OFF -DCMAKE_CXX_FLAGS="-O3 -flto" -DCMAKE_C_FLAGS="-O3 -flto"
cmake --build build --config Release -j 80
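A stripped-down configuration for comparison might look like this; a sketch only, with just the CUDA backend enabled and no LTO or forced MMQ, to help rule out the extra build options:
cmake -B build-minimal -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build-minimal --config Release -j 80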
First Bad Commit
No response
Relevant log output
Sometimes I get one of the error messages shown below; other times there are no error messages at all. To be honest I'm not keeping careful track, and whether an error appears could depend on which tag I'm using since b5131. I have tried at least two tags a day for the past 3 days.
Sometimes, I get the following:
~/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: unknown error
current device: 0, in function launch_fattn at /home/ali/llama.cpp/ggml/src/ggml-cuda/template-instances/../fattn-common.cuh:870
cudaGetLastError()
Other times, I get the following error:
/home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: unknown error
current device: 0, in function alloc at /home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:472
cuMemSetAccess((CUdeviceptr)((char *)(pool_addr) + pool_size), reserve_size, &access, 1)
Activity
Title changed from "Eval bug: Unable to run Llama 3.3 70B or Nemotron 3.1 70B on recent releases" to "Eval bug: Quad P40 unable to run 70B models on recent releases"
segmond commented on Apr 17, 2025
Did you git fetch/pull before rebuilding? If so, I would encourage you to delete the directory and clone a fresh copy from GitHub. If you keep having the issue, try disabling fa and sm row to see if one of those options is triggering it. Does a smaller model like 8B Llama cause the same issue? If so, I can try it later tonight when I get home; I have 3 P40s. If it keeps breaking, then try to bisect which commit the bug came in on.
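For example, something along these lines (a sketch reusing the server command from the report, with -fa dropped and the default layer split) would test both options at once:
llama-server -m /models/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf \
  -sm layer --no-mmap -ngl 99 --port 9002 -c 10000 \
  --device CUDA0,CUDA1,CUDA2,CUDA3 --tensor-split 1,1,1,1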
FullstackSensei commented on Apr 17, 2025
I spent several hours trying to narrow it down this morning. I tried several tags, always doing a git reset --hard before checking out a tag. The following tests were done on b5146 after shutting down and powering on the server to make sure nothing was lingering in memory. I installed Nvidia DCGM and ran
dcgmi diag -r 4
and all tests passed without issue (including stress testing VRAM). I switched from llama-server to llama-cli to test things a bit faster, stopped installing built binaries, and even deleted the previously installed libllama.so. All testing done today was run straight from /build-tag/bin.
Haven't tried with 8B yet, but I tested Gemma-3-27B-Q8, Qwen-2.5-Coder-32B-Q8, and QwQ-32B-Q8, each split across all combinations of two and three cards (including permutations of which device comes first).
I don't know if the shutdown or updating to b5146 changed something, but these results are very repeatable. I do not get any error messages with llama-cli as I did with llama-server, but I also haven't had to restart the server once due to CUDA initialization errors.
Checked the device tree, and CUDA0 and CUDA1 are on one socket, and CUDA2 and CUDA3 are on the other socket.
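For reference, the mapping can be confirmed with:
nvidia-smi topo -m
The CPU Affinity / NUMA Affinity columns show which socket each GPU hangs off.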
The llama-cli command I'm running is otherwise unchanged between tests; I'm just changing --device and --tensor-split (always setting used devices to 1 and unused ones to 0) based on the combinations described above.
I'll grab a fresh copy of the source in a new directory tonight and repeat my tests. In the meantime, please let me know if there's anything more specific I could help with. Really appreciate the help!!!
JohannesGaessler commented on Apr 17, 2025
Please do a git bisect and identify the exact commit that introduced the problem.
FullstackSensei commented on Apr 17, 2025
@JohannesGaessler Thanks for mentioning git bisect. I didn't know this existed and will definitely use it for work going forward.
I was doing a manual binary search this morning, but the process was quite tedious because it often required restarting the server: I get "ggml_cuda_init: failed to initialize CUDA: unknown error" once this happens. I can prevent it if I Ctrl-C quickly when I see inference is not working correctly (only one GPU spikes in load in nvtop). I wouldn't even know how to detect this in an automated way :\
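Maybe something along these lines could work as a git bisect run script? Just a sketch; the model path, prompt, and the grep on "CUDA error" are guesses on my part, and it still can't recover a wedged GPU without a reboot:
#!/bin/sh
# bisect-test.sh: exit 0 = good commit, 1 = bad, 125 = skip (build failed); sketch only
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF || exit 125
cmake --build build --config Release -j 80 || exit 125
# short generation on the failing device pair; model path and prompt are placeholders
out=$(timeout 180 ./build/bin/llama-cli -m /models/QwQ-32B-Q8_0.gguf \
      -fa -sm row -ngl 99 --device CUDA0,CUDA1 \
      -p "Hello" -n 32 -no-cnv 2>&1)
status=$?
echo "$out" | grep -q "CUDA error" && exit 1
[ "$status" -ne 0 ] && exit 1
# note: garbage-but-successful output would still need a manual check
exit 0
Then roughly: git bisect start; git bisect bad b5145; git bisect good b4686; git bisect run ./bisect-test.sh.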
github-actions commented on Jun 1, 2025
This issue was closed because it has been inactive for 14 days since being marked as stale.