
./main GGUF CUBLAS allocating GPU memory but not using it #2716


Closed · quarterturn opened this issue Aug 22, 2023 · 8 comments · Fixed by #2727

Comments

@quarterturn

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

  1. pulled latest commit
  2. built via "make clean && LLAMA_BUILD_SERVER=1 make -j LLAMA_CUBLAS=1"
  3. re-converted the original Meta 70B Chat model to FP16 and then quantized it to both 5_1 and 6 versions (roughly as sketched after this list)
  4. ran main via "./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --n-gpu-layers 83 -c 4096 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1"
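
For reference, step 3 would have looked roughly like the sketch below with the stock llama.cpp conversion and quantization tools of the time; the paths, output file names, and exact quantization type spellings (q5_1, q6_k) are assumptions rather than values taken from this report (quantize prints the accepted type list).

python3 convert.py /path/to/llama-2-70b-chat/ --outtype f16 --outfile ./models/llama2-70b-chat-ggml/ggml-chat-model-f16.bin
./quantize ./models/llama2-70b-chat-ggml/ggml-chat-model-f16.bin ./models/llama2-70b-chat-ggml/ggml-chat-model-q5_1.bin q5_1
./quantize ./models/llama2-70b-chat-ggml/ggml-chat-model-f16.bin ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin q6_k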

Current Behavior

main allocates layers to the GPUs

llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla P40) as main device
llama_model_load_internal: mem required  = 1282.30 MB (+ 1280.00 MB per state)
llama_model_load_internal: allocating batch_size x (1280 kB + n_ctx x 256 B) = 576 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 80 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 83/83 layers to GPU
llama_model_load_internal: total VRAM used: 55617 MB
llama_new_context_with_model: kv self size  = 1280.00 MB

But top shows main has allocated 55GB of system RAM and is also using a single thread at 100% CPU.

Environment and Context

gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           62
Model name:                      Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz
Stepping:                        4
CPU MHz:                         3200.000
CPU max MHz:                     3200.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        4799.99
Virtualization:                  VT-x
L1d cache:                       768 KiB
L1i cache:                       768 KiB
L2 cache:                        6 MiB
L3 cache:                        60 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
  • Operating System, e.g. for Linux:

$ uname -a
Linux pve 5.15.108-1-pve #1 SMP PVE 5.15.108-2 (2023-07-20T10:06Z) x86_64 GNU/Linux
  • SDK version, e.g. for Linux:
$ python3 --version
$ make --version
$ g++ --version

Failure Information (for bugs)

No failures are produced, other than that, running on a single CPU core, it is going to take forever to respond.

Steps to Reproduce

See above

@staviq
Contributor

staviq commented Aug 22, 2023

Have you tried letting it use more than one thread? You didn't specify -t, and I'm not sure whether that's intentional or not.

@quarterturn
Author

Have you tried letting it use more than one thread? You didn't specify -t, and I'm not sure whether that's intentional or not.

Yes. I have tried with "-t 32", and I think the default is equivalent to "-t 24".

@staviq
Contributor

staviq commented Aug 22, 2023

I checked, and it defaults to the number of your CPU cores, so omitting -t should be equivalent to -t $(nproc).

But I managed to reproduce this at 519c981: it hangs without an initial prompt.

Try giving it a prompt, like this: ./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --n-gpu-layers 83 -c 4096 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -p "Hi"

EDIT: Also reproducible on c63bb1d.

Paging @ggerganov @slaren: this is a legitimate issue. main with -ins but without -p or -f results in a freeze at 100% CPU usage and never returns control to the user. This might have to wait for #2694.
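
A minimal reproduction pair, using the same model path as above (trimming the other flags is an assumption; they should not be needed to trigger the hang):

# hangs at 100% CPU on a single core: -ins with no prompt
./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --n-gpu-layers 83 -ins
# returns control to the user: any prompt via -p (or -f) works around it
./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --n-gpu-layers 83 -ins -p "Hi"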

@staviq
Contributor

staviq commented Aug 23, 2023 via email

@quarterturn
Author

Confirmed: using -p "Hello" works with a fresh pull and compile:

llama_print_timings:        load time = 139803.77 ms
llama_print_timings:      sample time =  1108.92 ms /   292 runs   (    3.80 ms per token,   263.32 tokens per second)
llama_print_timings: prompt eval time =  1735.07 ms /    32 tokens (   54.22 ms per token,    18.44 tokens per second)
llama_print_timings:        eval time = 54193.79 ms /   291 runs   (  186.23 ms per token,     5.37 tokens per second)
(base) qtr@pve:~/llama.cpp$ ./main -m ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin -t 32 --n-gpu-layers 83 -c 4096 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 32 -p "Hello"

OK to close the issue?

@klosax
Contributor

klosax commented Aug 23, 2023

The initial problem was using -ins without any prompt (-p or -f).
PR #2727 will fix this.

@staviq
Contributor

staviq commented Aug 23, 2023

OK to close the issue?

Leave it open; @klosax's pull request will close it. After that pull request you will no longer have to specify a dummy prompt, and when the issue gets closed that way, you will know the fix is in master.

Also, you are using .bin models, which are pretty much semi-deprecated now that the GGUF format exists. I'd recommend updating after the pull request gets merged; you should be able to convert your .bin models to the new .gguf format with the convert-llama-ggmlv3-to-gguf.py script.
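
For reference, the conversion would look roughly like the sketch below; the flag names are from memory of that script and should be checked against its --help (in particular, a 70B model may need extra flags such as --gqa, if the script version supports them).

python3 convert-llama-ggmlv3-to-gguf.py --input ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.bin --output ./models/llama2-70b-chat-ggml/ggml-chat-model-q6.gguf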

@quarterturn
Author

Yeah still .bin format. I wasn't aware of the conversion script until now, thank you.
