
./main GGUF CUBLAS allocating GPU memory but not using it #2716


Closed · quarterturn opened this issue Aug 22, 2023 · 8 comments · Fixed by #2727

Comments

@quarterturn

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

  1. pulled latest commit
  2. built via "make clean && LLAMA_BUILD_SERVER=1 make -j LLAMA_CUBLAS=1"
  3. re-converted the original Meta 70B Chat model to FP16 and then quantized it to both 5_1 and 6 versions (roughly as sketched after this list)
  4. ran main via "./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --n-gpu-layers 83 -c 4096 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1"
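
For reference, step 3 would have looked roughly like the sketch below with the stock llama.cpp conversion and quantization tools of the time; the paths, output file names, and exact quantization type spellings (q5_1, q6_k) are assumptions rather than values taken from this report (quantize prints the accepted type list).

python3 convert.py /path/to/llama-2-70b-chat/ --outtype f16 --outfile ./models/llama2-70b-chat-ggml/ggml-chat-model-f16.bin
./quantize ./models/llama2-70b-chat-ggml/ggml-chat-model-f16.bin ./models/llama2-70b-chat-ggml/ggml-chat-model-q5_1.bin q5_1
./quantize ./models/llama2-70b-chat-ggml/ggml-chat-model-f16.bin ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin q6_k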

Current Behavior

main allocates layers to the GPUs

llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla P40) as main device
llama_model_load_internal: mem required  = 1282.30 MB (+ 1280.00 MB per state)
llama_model_load_internal: allocating batch_size x (1280 kB + n_ctx x 256 B) = 576 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 80 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 83/83 layers to GPU
llama_model_load_internal: total VRAM used: 55617 MB
llama_new_context_with_model: kv self size  = 1280.00 MB

But top shows main has allocated 55GB of system RAM and is also using a single thread at 100% CPU.

Environment and Context

gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           62
Model name:                      Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz
Stepping:                        4
CPU MHz:                         3200.000
CPU max MHz:                     3200.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4799.99
Virtualization:                  VT-x
L1d cache:                       768 KiB
L1i cache:                       768 KiB
L2 cache:                        6 MiB
L3 cache:                        60 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
  • Operating System, e.g. for Linux:

$ uname -a
Linux pve 5.15.108-1-pve #1 SMP PVE 5.15.108-2 (2023-07-20T10:06Z) x86_64 GNU/Linux
  • SDK version, e.g. for Linux:
$ python3 --version
$ make --version
$ g++ --version

Failure Information (for bugs)

No failures are produced, other than that, running on a single CPU core, it is going to take forever to respond.

Steps to Reproduce

See above

@staviq
Contributor

staviq commented Aug 22, 2023

Have you tried letting it use more than one thread? You didn't specify -t, and I'm not sure whether that's intentional or not.

@quarterturn
Author

Have you tried letting it use more than one thread? You didn't specify -t, and I'm not sure whether that's intentional or not.

Yes. I have tried with "-t 32", and I think the default is equivalent to "-t 24".

@staviq
Contributor

staviq commented Aug 22, 2023

I checked, and it defaults to the number of your CPU cores, so omitting -t should be equivalent to -t $(nproc).

But I managed to reproduce this at 519c981: it hangs without an initial prompt.

Try giving it a prompt, like this: ./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --n-gpu-layers 83 -c 4096 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -p "Hi"

EDIT: Also reproducible on c63bb1d.

Paging @ggerganov @slaren: this is a legitimate issue. main with -ins but without -p or -f results in a freeze at 100% CPU usage and never returns control to the user. This might have to wait for #2694.
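
A minimal reproduction pair, using the same model path as above (trimming the other flags is an assumption; they should not be needed to trigger the hang):

# hangs at 100% CPU on a single core: -ins with no prompt
./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --n-gpu-layers 83 -ins
# returns control to the user: any prompt via -p (or -f) works around it
./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --n-gpu-layers 83 -ins -p "Hi"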

@staviq
Contributor

staviq commented Aug 23, 2023 via email

@quarterturn
Author

Confirmed: using -p "Hello" works with a fresh pull and compile:

llama_print_timings:        load time = 139803.77 ms
llama_print_timings:      sample time =  1108.92 ms /   292 runs   (    3.80 ms per token,   263.32 tokens per second)
llama_print_timings: prompt eval time =  1735.07 ms /    32 tokens (   54.22 ms per token,    18.44 tokens per second)
llama_print_timings:        eval time = 54193.79 ms /   291 runs   (  186.23 ms per token,     5.37 tokens per second)
(base) qtr@pve:~/llama.cpp$ ./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin -t 32 --n-gpu-layers 83 -c 4096 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 32 -p "Hello"

OK to close the issue?

@klosax
Contributor

klosax commented Aug 23, 2023

The initial problem was using -ins without any prompt (-p or -f).
PR #2727 will fix this.

@staviq
Contributor

staviq commented Aug 23, 2023

OK to close the issue?

Leave it open; @klosax's pull request will close it. After that pull request you will no longer have to specify a dummy prompt, and when the issue gets closed that way, you will know the fix is in master.

Also, you are using .bin models, which are pretty much semi-deprecated now that the GGUF format exists. I'd recommend updating after the pull request gets merged; you should be able to convert your .bin models to the new .gguf format with the convert-llama-ggmlv3-to-gguf.py script.
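
For reference, the conversion would look roughly like the sketch below; the flag names are from memory of that script and should be checked against its --help (in particular, a 70B model may need extra flags such as --gqa, if the script version supports them).

python3 convert-llama-ggmlv3-to-gguf.py --input ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --output ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.gguf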

@quarterturn
Author

Yeah still .bin format. I wasn't aware of the conversion script until now, thank you.
