Prompt eval is 5x slower than in Ollama and maxes out the CPU #12237
Comments
Your actual problem is that you are not fully offloading to the GPU, so some of the compute is forced onto the CPU layers. Your model has 41 layers and you only offloaded 30 with `-ngl 30`... if you want full performance, you have to fit the whole model in the GPU; that's all there is to it.
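As a rough sketch (not from the original thread), full offload just means an `-ngl` value at or above the model's 41 layers, assuming the model actually fits in the 12 GB of VRAM mentioned in the report; the model path is a placeholder:

```sh
# Offload all 41 layers; any -ngl value at or above the layer count gives full offload.
# Whether this fits in 12 GB VRAM depends on context size and KV cache settings.
llama-server -m ./Mistral-Small-3-Q4_K_M.gguf -ngl 41 -c 8192
```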
This is correct, but the same number of layers is offloaded to the GPU with ollama, yet the performance is much better.
I played some more with the build flags, looking at https://github.com/ollama/ollama/blob/main/CMakeLists.txt. With a CUDA build configured closer to Ollama's, I get prompt-processing speeds matching my `ollama` numbers. However, adding the BLAS flags back reintroduces the slowdown.
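For reference, a comparison along these lines could be built roughly as follows. The flags are llama.cpp's current CMake options, but the exact set used in the original comment is not shown above, so treat this as an assumption:

```sh
# CUDA-only build: large batches keep the matrix multiplies on the GPU.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Same build with OpenBLAS enabled, which is the configuration that exhibited
# the slow prompt processing discussed in this issue.
cmake -B build-blas -DGGML_CUDA=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build-blas --config Release -j
```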
Hi,

ollama: total duration: 57.016716541s

llama.cpp output: llama_perf_sampler_print: sampling time = 395.77 ms / 1247 runs ( 0.32 ms per token, 3150.80 tokens per second)
@cpg314 You might be able to get both prompt processing and text generation to be fast at the same time by using the new tensor buffer type override option (#11397). But it's difficult to give you a specific guide without looking at the specific model and logs. Generally, CPU ops can be offloaded to the GPU under certain conditions (see `ggml/src/ggml-cuda/ggml-cuda.cu`, lines 3261 to 3288 at commit 58d0fc2).
This makes the prompt-processing speed fast (i.e. large batches use more compute). But when you enable BLAS, it interferes with this logic and these ops are no longer offloaded to the GPU for large batches, which hurts the PP performance. On the other hand, it helps the TG performance because there is less memory transfer over the PCI bus. So with some adjustments it might be possible to get the best of both worlds.
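A minimal sketch of such an override, assuming #11397 is the `--override-tensor` (`-ot`) flag; the pattern below is purely hypothetical and the right split depends on the specific model and logs:

```sh
# Hypothetical: fully offload, but pin the FFN weights of layers 30-40 to the
# CPU buffer so they stay in system RAM. The value is a regex over tensor names
# mapped to a backend buffer type.
llama-server -m ./Mistral-Small-3-Q4_K_M.gguf -ngl 99 \
  -ot "blk\.(3[0-9]|40)\.ffn_.*=CPU"
```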
I am running the same Q4_K_M model (Mistral Small 3) in `llama.cpp` and `ollama`, seemingly with the same configuration, on an NVIDIA RTX 3060 with 12 GB VRAM and an AMD Ryzen 9 7900. However, I get significantly faster (5x) prompt eval with `ollama` (512 t/s vs 100 t/s), while the eval time is ~2x faster with `llama.cpp`. This is reproducible across time and inputs.
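Not part of the original report, but one way to measure prompt-processing and generation speed directly, outside the server, is `llama-bench`; the model path is a placeholder and `-ngl 30` mirrors the setting used here:

```sh
# pp512 reports prompt processing throughput (t/s), tg128 reports token generation (t/s).
llama-bench -m ./Mistral-Small-3-Q4_K_M.gguf -ngl 30 -p 512 -n 128
```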
Ollama
llama-server
Interestingly, `llama-server` seems to be using all my CPU cores during prompt evaluation, no matter what value I use for the `-t` flag. It is nevertheless clearly using the GPU, as removing `-ngl 30` massively extends the running time.
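For context, a sketch of an invocation along the lines discussed here (the model path, thread count, and port are placeholders; `-ngl 30` and `-t` are the flags referenced above):

```sh
# -t sets the CPU thread count, which appears to be ignored during prompt eval here.
llama-server -m ./Mistral-Small-3-Q4_K_M.gguf -ngl 30 -t 6 --port 8080
```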
Logs comparison

Ollama:
llama-server:
The layers are distributed similarly on the devices: 0-9 CPU, 10-39 CUDA0, and 40 on CPU.
Two smoking guns I see:
- The `CUDA0 compute buffer size` is 790 MiB with `ollama` but only 90 MiB with `llama-server`, and the CPU compute buffer size is absent with `ollama` but at 260 MiB for `llama-server`.
- `ollama` prints `tensor 'token_embd.weight' (q4_K) (and 92 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead`, while `llama-server` shows `tensor 'token_embd.weight' (q4_K) (and 92 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead`. Why is the preferred buffer type different?

Do you know what controls this? There are no other log messages than the above regarding the CPU.
Anything else that could explain the discrepancy in performance?
Versions