Description
Name and Version
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Device 2: Tesla P40, compute capability 6.1, VMM: yes
Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 5145 (12b17501)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Quad Nvidia Tesla P40 on dual Xeon E5-2699v4 (two cards per CPU)
Models
Llama-3.3-70B-Instruct-GGUF
Qwen2.5-72B-Instruct-GGUF
gemma-3-27b-it-Q8_0.gguf
QwQ-32B-Q8_0.gguf
Problem description & steps to reproduce
I updated and built llama.cpp after sticking with the same version for a couple of months, and since then llama-server fails to generate output with Llama 3.3 70B or Qwen 2.5 72B split across all four cards. It also fails to generate output after starting with smaller models on two cards only, like Gemma 3 27B, Mistral Small 24B, Qwen 2.5 Coder 32B, and QwQ 32B.
If I run the 27-32B models on CUDA0 and CUDA1 they invariably fail, but generation works (mostly) fine with any of the following combinations (a quick way to cycle through the pairs is sketched after this list):
CUDA0,CUDA2
CUDA0,CUDA3
CUDA1,CUDA2
CUDA1,CUDA3
CUDA2,CUDA3
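That sketch, for reference; just a rough loop, with the model path, prompt, and token count as placeholders, and assuming a recent build where -no-cnv is available:
for pair in CUDA0,CUDA1 CUDA0,CUDA2 CUDA0,CUDA3 CUDA1,CUDA2 CUDA1,CUDA3 CUDA2,CUDA3; do
  echo "=== $pair ==="
  ./build/bin/llama-cli -m /models/QwQ-32B-Q8_0.gguf \
    -fa -sm row --no-mmap -ngl 99 --device "$pair" \
    -p "Hello" -n 32 -no-cnv
done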
When this happens, nvtop shows GPU load on one GPU only for the smaller models that I configure to run on two GPUs, and on two GPUs only for the larger models that are configured to run on all four.
The worst part is that once this happens, llama.cpp is unable to initialize CUDA devices until I reboot the server. If I run llama-cli afterwards, I get the following:
llama-cli
ggml_cuda_init: failed to initialize CUDA: unknown error
build: 5145 (12b17501) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
gguf_init_from_file: failed to open GGUF file 'models/7B/ggml-model-f16.gguf'
llama_model_load: error loading model: llama_model_loader: failed to load model from models/7B/ggml-model-f16.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'models/7B/ggml-model-f16.gguf'
main: error: unable to load model
Meanwhile, nvidia-smi and nvtop continue to work normally when this happens, without a reboot.
I don't remember the exact version I was running before, so I checked out b4686 from February (I think I was on b45xx) and recompiled, and indeed 70B models work without issue. I deleted the build directory, then configured and built again. To confirm, I ran llama-cli after building:
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Device 2: Tesla P40, compute capability 6.1, VMM: yes
Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 4686 (7b891bdc)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
I ran the same llama-server command:
llama-server -m /models/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf \
-fa -sm row --no-mmap \
-ngl 99 -ngld 99 --port 9002 -c 10000 \
--device CUDA0,CUDA1,CUDA2,CUDA3 --tensor-split 1,1,1,1 \
--slots --metrics --numa distribute -t 40
and generation worked fine.
I checked out b5145 (I've been trying tags since b5131), recompiled as described below, confirmed the version with llama-cli --version, and ran Llama 3.3 70B using the same command above. In the time it took me to type all this, this is all the output I got from llama-server:
",H@2C%#6H<+$D+A'FD8CG1F8#.H7)'%8#<H(#9'#.)A932+C7%/4==E$3/C".5;33
Compile
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_FORCE_MMQ=ON -DLLAMA_CURL=OFF -DCMAKE_CXX_FLAGS="-O3 -flto" -DCMAKE_C_FLAGS="-O3 -flto"
cmake --build build --config Release -j 80
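A stripped-down configuration for comparison might look like this; a sketch only, with just the CUDA backend enabled and no LTO or forced MMQ, to help rule out the extra build options:
cmake -B build-minimal -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build build-minimal --config Release -j 80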
First Bad Commit
No response
Relevant log output
Sometimes I get one of the error messages shown below; other times there are no error messages at all. To be honest I'm not keeping careful track, and whether an error appears could depend on which tag I'm using since b5131. I have tried at least two tags a day for the past 3 days.
Sometimes, I get the following:
~/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: unknown error
current device: 0, in function launch_fattn at /home/ali/llama.cpp/ggml/src/ggml-cuda/template-instances/../fattn-common.cuh:870
cudaGetLastError()
Other times, I get the following error:
/home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: unknown error
current device: 0, in function alloc at /home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:472
cuMemSetAccess((CUdeviceptr)((char *)(pool_addr) + pool_size), reserve_size, &access, 1)
Activity
Title changed from "Eval bug: Unable to run Llama 3.3 70B or Nemotron 3.1 70B on recent releases" to "Eval bug: Quad P40 unable to run 70B models on recent releases"
segmond commented on Apr 17, 2025
Did you git fetch/pull before rebuilding? If so, I would encourage you to delete the directory and clone a fresh copy from GitHub. If you keep having the issue, try disabling fa and sm row to see if one of those options is triggering it. Does a smaller model like 8B Llama cause the same issue? If so, I can try it later tonight when I get home; I have 3 P40s. If it keeps breaking, then try to bisect which commit the bug came in on.
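For example, something along these lines (a sketch reusing the server command from the report, with -fa dropped and the default layer split) would test both options at once:
llama-server -m /models/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q8_0-00001-of-00002.gguf \
  -sm layer --no-mmap -ngl 99 --port 9002 -c 10000 \
  --device CUDA0,CUDA1,CUDA2,CUDA3 --tensor-split 1,1,1,1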
FullstackSensei commented on Apr 17, 2025
I spent several hours trying to narrow it down this morning. I tried several tags, always doing a git reset --hard before checking out a tag. The following tests were done on b5146 after shutting down and powering on the server to make sure nothing was lingering in memory. I installed Nvidia DCGM and ran
dcgmi diag -r 4
and all tests passed without issue (including stress testing VRAM). I switched from llama-server to llama-cli to test things a bit faster, stopped installing built binaries, and even deleted the previously installed libllama.so. All testing done today was run straight from /build-tag/bin.
Haven't tried with 8B yet, but I tested Gemma-3-27B-Q8, Qwen-2.5-Coder-32B-Q8, and QwQ-32B-Q8, each split across all combinations of two and three cards (including permutations of which device comes first).
I don't know if the shutdown or updating to b5146 changed something, but these results are very repeatable. I do not get any error messages with llama-cli as I did with llama-server, but I also haven't had to restart the server once due to CUDA initialization errors.
Checked the device tree, and CUDA0 and CUDA1 are on one socket, and CUDA2 and CUDA3 are on the other socket.
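For reference, the mapping can be confirmed with:
nvidia-smi topo -m
The CPU Affinity / NUMA Affinity columns show which socket each GPU hangs off.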
The llama-cli command I'm running is otherwise unchanged between tests; I'm just changing --device and --tensor-split (always setting used devices to 1 and unused ones to 0) based on the combinations described above.
I'll grab a fresh copy of the source in a new directory tonight and repeat my tests. In the meantime, please let me know if there's anything more specific I could help with. Really appreciate the help!!!
JohannesGaessler commented on Apr 17, 2025
Please do a git bisect and identify the exact commit that introduced the problem.
FullstackSensei commented on Apr 17, 2025
@JohannesGaessler Thanks for mentioning git bisect. I didn't know this existed and will definitely use it for work going forward.
I was doing a manual binary search this morning, but the process was quite tedious because it often required restarting the server: I get "ggml_cuda_init: failed to initialize CUDA: unknown error" once this happens. I can prevent it if I Ctrl-C quickly when I see inference is not working correctly (only one GPU spikes in load in nvtop). I wouldn't even know how to detect this in an automated way :\
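Maybe something along these lines could work as a git bisect run script? Just a sketch; the model path, prompt, and the grep on "CUDA error" are guesses on my part, and it still can't recover a wedged GPU without a reboot:
#!/bin/sh
# bisect-test.sh: exit 0 = good commit, 1 = bad, 125 = skip (build failed); sketch only
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF || exit 125
cmake --build build --config Release -j 80 || exit 125
# short generation on the failing device pair; model path and prompt are placeholders
out=$(timeout 180 ./build/bin/llama-cli -m /models/QwQ-32B-Q8_0.gguf \
      -fa -sm row -ngl 99 --device CUDA0,CUDA1 \
      -p "Hello" -n 32 -no-cnv 2>&1)
status=$?
echo "$out" | grep -q "CUDA error" && exit 1
[ "$status" -ne 0 ] && exit 1
# note: garbage-but-successful output would still need a manual check
exit 0
Then roughly: git bisect start; git bisect bad b5145; git bisect good b4686; git bisect run ./bisect-test.sh.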
github-actions commented on Jun 1, 2025
This issue was closed because it has been inactive for 14 days since being marked as stale.