Prompt eval is 5x slower than in Ollama and maxes out the CPU #12237


Open
cpg314 opened this issue Mar 6, 2025 · 5 comments

Comments

@cpg314

cpg314 commented Mar 6, 2025

I am running the same Q4_K_M model (Mistral Small 3) in llama.cpp and ollama, seemingly with the same configuration, on an NVIDIA RTX 3060 12 GB VRAM and an AMD Ryzen 9 7900.

However, prompt eval is significantly (about 5x) faster with ollama: 512 t/s vs ~100 t/s, while generation (eval) is ~2x faster with llama.cpp.
This is reproducible across time and inputs.

Ollama

total duration:       3.046232704s
load duration:        11.114031ms
prompt eval count:    1119 token(s)
prompt eval duration: 2.185s
prompt eval rate:     512.13 tokens/s
eval count:           9 token(s)
eval duration:        847ms
eval rate:            10.63 tokens/s

llama-server

$ llama-server -m /ollama/data/ollama/models/blobs/sha256-dd3af152229f92a3d61f3f115217c9c72f4b94d8be6778156dab23f894703c28 --port 8080 -ngl 30 -fa --temp 0.15 -c 2048  -ctk q4_0 -ctv q4_0 -t 12
prompt eval time =    8734.21 ms /   971 tokens (    9.00 ms per token,   111.17 tokens per second)
       eval time =    1075.76 ms /    19 tokens (   56.62 ms per token,    17.66 tokens per second)
      total time =    9809.97 ms /   990 tokens

Interestingly, llama-server seems to be using all my CPU cores during prompt evaluation, no matter what value I use for the -t flag:

[screenshot: CPU usage during prompt evaluation, with all cores busy]

It is nevertheless clearly using the GPU, as removing -ngl 30 massively extends the running time.
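
One guess on the threading (not verified on this setup): if the CPU side goes through OpenMP or a BLAS library, those use their own thread pools, which are controlled by environment variables rather than by -t. A minimal sketch of capping them, assuming an OpenMP/OpenBLAS build; <model> stands for the same blob path as above:

# only has an effect if the build actually uses OpenMP and/or links OpenBLAS
$ OMP_NUM_THREADS=12 OPENBLAS_NUM_THREADS=12 llama-server -m <model> --port 8080 -ngl 30 -fa --temp 0.15 -c 2048 -ctk q4_0 -ctv q4_0 -t 12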

Logs comparison

Ollama:

system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=12

llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/41 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  4121.89 MiB
llm_load_tensors:        CUDA0 model buffer size =  9540.47 MiB

llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'q4_0', type_v = 'q4_0', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =    22.50 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    67.50 MiB

llama_new_context_with_model: KV self size  =   90.00 MiB, K (q4_0):   45.00 MiB, V (q4_0):   45.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.52 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   791.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_new_context_with_model: graph nodes  = 1127
llama_new_context_with_model: graph splits = 114 (with bs=512), 3 (with bs=1)

llama-server:

system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloaded 30/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  4121.89 MiB
load_tensors:        CUDA0 model buffer size =  9540.47 MiB

llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'q4_0', type_v = 'q4_0', n_layer = 40, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    67.50 MiB
llama_kv_cache_init:        CPU KV buffer size =    22.50 MiB

llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   266.00 MiB
llama_init_from_model:      CUDA0 compute buffer size =   160.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    22.01 MiB
llama_init_from_model: graph nodes  = 1127
llama_init_from_model: graph splits = 164 (with bs=512), 3 (with bs=1)

The layers are distributed similarly across the devices in both cases: layers 0-9 on the CPU, layers 10-39 on CUDA0, and layer 40 on the CPU.

A few smoking guns I see:

  • The CUDA0 compute buffer size is 791 MiB with ollama but only 160 MiB with llama-server.
    The CPU compute buffer size is absent with ollama, but 266 MiB for llama-server.
  • ollama prints tensor 'token_embd.weight' (q4_K) (and 92 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead, while llama-server shows tensor 'token_embd.weight' (q4_K) (and 92 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead. Why is the preferred buffer type different?
  • Why is the number of graph splits different (164 vs 114)?

Do you know what controls this? There are no log messages regarding the CPU other than the ones above.

Anything else that could explain the discrepancy in performance?

Versions

$ ollama --version
ollama version is 0.5.7-0-ga420a45-dirty
Warning: client version is 0.5.7
$ llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 4779 (d7cfe1ffe)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
@cb88

cb88 commented Mar 7, 2025

Your actual problem is that you are not fully offloading to the GPU, so some of the compute is forced onto the CPU.

Your model has 41 layers and you only offloaded 30 with -ngl 30. If you want full performance, you have to fit the whole model on the GPU; that's all there is to it.
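
For illustration, full offload would look like the sketch below (assuming the model actually fits in VRAM, which it will not on a 12 GB card given the ~13.6 GiB of model buffers in the logs; <model> is a placeholder, and -ngl 99 simply exceeds the layer count so every layer is offloaded):

$ llama-server -m <model> -ngl 99 -fa -c 2048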

@cpg314
Author

cpg314 commented Mar 7, 2025

This is correct, but the same number of layers is offloaded to the GPU with ollama, yet the performance is much better.
Both logs show

llm_load_tensors: offloaded 30/41 layers to GPU

@cpg314
Author

cpg314 commented Mar 7, 2025

I played some more with the build flags, looking at https://github.com/ollama/ollama/blob/main/CMakeLists.txt

With

$ cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON -DGGML_LLAMAFILE=ON -DGGML_LTO=ON

I get

prompt eval time =    1792.28 ms /   973 tokens (    1.84 ms per token,   542.88 tokens per second)
       eval time =    1223.68 ms /    12 tokens (  101.97 ms per token,     9.81 tokens per second)
      total time =    3015.96 ms /   985 tokens
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:      CUDA0 compute buffer size =   791.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_init_from_model: graph nodes  = 1127
llama_init_from_model: graph splits = 114 (with bs=512), 3 (with bs=1)

matching my ollama results.

However, adding -DGGML_BLAS=ON (as in this AUR package) causes the problem described in the issue, i.e. 5x slower prompt eval but 2x faster eval.
Is this expected?
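
For reference, the build above simply leaves the BLAS backend out; spelling that out explicitly would look like this (GGML_BLAS defaults to OFF, so the extra flag is only for clarity):

$ cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON -DGGML_LLAMAFILE=ON -DGGML_LTO=ON -DGGML_BLAS=OFF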

@github-actions github-actions bot added the stale label Apr 7, 2025
@nigelzzz

nigelzzz commented Apr 9, 2025

Hi,
I have the same issue on a Raspberry Pi 5: prompt eval rate is 27.45 tokens/s with ollama vs 2.27 tokens/s with llama.cpp.

ollama output:

total duration: 57.016716541s
load duration: 35.149005ms
prompt eval count: 948 token(s)
prompt eval duration: 34.540598779s
prompt eval rate: 27.45 tokens/s
eval count: 137 token(s)
eval duration: 22.437804718s
eval rate: 6.11 tokens/s

llama.cpp output:

llama_perf_sampler_print: sampling time = 395.77 ms / 1247 runs ( 0.32 ms per token, 3150.80 tokens per second)
llama_perf_context_print: load time = 8495.91 ms
llama_perf_context_print: prompt eval time = 383770.11 ms / 873 tokens ( 439.60 ms per token, 2.27 tokens per second)
llama_perf_context_print: eval time = 64054.85 ms / 373 runs ( 171.73 ms per token, 5.82 tokens per second)
llama_perf_context_print: total time = 449074.87 ms / 1246 tokens
Prompt eval rate: 27.45 tokens/s (ollama) vs 2.27 tokens/s (llama.cpp).

@ggerganov
Member

@cpg314 You might be able to get both prompt processing and text generation to be fast at the same time by using the new tensor buffer type override option (#11397). But it's difficult to give you a specific guide without looking at the specific model and logs.
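
As a rough illustration of what that option looks like on the command line (treat this as a sketch only: the -ot / --override-tensor patterns are model-dependent, <model> is a placeholder, and the right split has to be found by experimenting):

$ llama-server -m <model> -ngl 99 -fa -c 2048 -ot "ffn_.*=CPU"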

Generally, CPU ops can be offloaded to the GPU under certain conditions:

static bool ggml_backend_cuda_device_supports_buft(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft) {
    return (ggml_backend_buft_is_cuda(buft) || ggml_backend_buft_is_cuda_split(buft)) && buft->device == dev;
}

// effective batch size of an op: the dimension that grows with the number of tokens processed at once
static int64_t get_op_batch_size(const ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_GET_ROWS:
            return 0;
        case GGML_OP_MUL_MAT:
            return op->ne[1];
        case GGML_OP_MUL_MAT_ID:
        case GGML_OP_ROPE:
        case GGML_OP_ROPE_BACK:
            return op->ne[2];
        default:
            return ggml_nrows(op);
    }
}

// ops whose effective batch size is at least 32 are offloaded to the GPU,
// even when their weights live in host memory
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    const int min_batch_size = 32;

    return get_op_batch_size(op) >= min_batch_size;

    GGML_UNUSED(dev);
}

This makes the prompt-processing speed fast (i.e. large batches use more compute). But when you enable BLAS, it interferes with this logic and these ops are no longer offloaded to the GPU for large batches, which hurts the PP perf. On the other hand, it helps the TG perf because there is less memory transfer over the PCI bus. So with some adjustments it might be possible to get the best of both worlds.
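
To quantify the trade-off between two builds, something like llama-bench could be used (a sketch; -p measures prompt processing and -n text generation, <model> is a placeholder):

$ llama-bench -m <model> -ngl 30 -fa 1 -p 512 -n 128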

@github-actions github-actions bot removed the stale label Apr 10, 2025