./main GGUF CUBLAS allocating GPU memory but not using it #2716
Comments
Have you tried letting it use more than one thread? Because you didn't specify |
Yes. I have tried with "-t 32", and I think the default is equivalent to "-t 24". |
I checked and it defaults to the number of your CPU cores, so no
But I managed to reproduce at 519c981: it hangs without an initial prompt. Try giving it a prompt, like this
EDIT: Also reproducible on c63bb1d. Paging @ggerganov @slaren, that's a legit issue, |
Looks like that fixes it.
Any idea why the "view it on GitHub" link does not work from the notification email? Did you delete the comment?
…On Tue, 22 Aug 2023 at 23:43, klosax ***@***.***> wrote:
This solution may work:
Directly following this line
https://github.com/ggerganov/llama.cpp/blob/46ef5b5fcf4c366e1fb27726b6394adbbf8fd0ea/examples/main/main.cpp#L198
Insert:
    // Should not run without any tokens
    if (embd_inp.size() == 0) {
        embd_inp.push_back(llama_token_bos(ctx));
    }
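
For anyone reading along, here is a minimal standalone sketch of the same guard, so the idea can be tried without building llama.cpp. It is not the actual patch: `token_t`, `ensure_nonempty_prompt`, and the `bos_token` parameter are hypothetical names used only for illustration, with `bos_token` standing in for whatever `llama_token_bos(ctx)` returns.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the token type; llama.cpp itself uses llama_token.
using token_t = int;

// If the tokenized prompt is empty, seed it with a beginning-of-sequence token
// so the evaluation loop always has at least one token to process.
// bos_token is an assumed parameter standing in for llama_token_bos(ctx).
inline void ensure_nonempty_prompt(std::vector<token_t> & embd_inp, token_t bos_token) {
    if (embd_inp.empty()) {
        embd_inp.push_back(bos_token);
    }
    // Post-condition: the prompt is never empty after this call.
    assert(!embd_inp.empty());
}
```

Calling it on an empty vector leaves exactly one token (the BOS value passed in); calling it on a prompt that already has tokens leaves the prompt unchanged.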
|
Confirmed using "-p Hello" works with a fresh pull and compile:
OK to close the issue? |
The initial problem was using |
Leave it open. @klosax's pull request will close it, and once that pull request is merged you will no longer have to specify a dummy prompt. When the issue gets closed this way, you will know it's fixed in master. Also, you are using |
Yeah still .bin format. I wasn't aware of the conversion script until now, thank you. |
Expected Behavior
Current Behavior
main allocates layers to the GPUs, but top shows that main has also allocated 55 GB of system RAM and is using a single thread at 100% CPU.
Environment and Context
gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
$ lscpu
$ uname -a
Failure Information (for bugs)
No failures are produced, but running on a single CPU core it is going to take forever to respond.
Steps to Reproduce
See above