Replies: 1 comment
I have the same issue, but I am on an M4 Max. For every other model, offloading layers onto the GPU (for me, wired memory; e.g., -ngl 16) increases speed. With Maverick, however, any time I use ANY of my GPU memory for the model, it slows down. The only exception is if I can fit the entire model into memory, which then runs the fastest.

Otherwise, running with just -ngl 0 is the fastest (often 8-12 t/s at its peak; I am running Q4 and using mmap). The relationship is pretty clear: the higher the -ngl value, the worse the speed; as -ngl decreases, speeds increase. It's really bizarre! --mlock did not solve the issue, nor does offloading certain tensors to the CPU versus the GPU.

What's also weird is that the performance I get with -ngl 0 suggests the model is still being served from RAM (indeed, "cached files" shows all remaining RAM is caching the model), even though it is neither in the process's memory nor in wired memory. Somehow that gives me the fastest speeds, while using GPU memory hurts performance. I don't quite get it!
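If anyone wants to quantify that relationship, a single llama-bench sweep over several -ngl values makes the pattern easy to see. This is a minimal sketch; the model filename is a placeholder and the particular -ngl values are just examples:

```sh
# Sweep several -ngl values in one run (llama-bench accepts comma-separated lists);
# -p and -n are the prompt-processing and generation token counts per test.
# Maverick-Q4.gguf is a placeholder -- substitute your own quant.
./llama-bench -m Maverick-Q4.gguf -ngl 0,8,16,32,99 -p 512 -n 128
```

The resulting table shows whether prompt processing and generation both degrade as -ngl rises, or only one of them.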
---
I know inference uses less compute on Maverick than on, say, Llama 70B. Shouldn't the same apply to prompt processing?

Prompt processing does speed up going from 70B to Maverick when running CPU-only. But after adding a GPU, 70B gets a huge speed boost while Maverick actually slows down a little.

Machine is an EPYC 7F52 (16 cores) + 1x RTX 3090 (PCIe x16 gen3).
Maverick CPU only:
```
CUDA_VISIBLE_DEVICES=-1 ./llama-server -m Maverick.gguf -c 16384

prompt eval time = 54376.33 ms / 1611 tokens (33.75 ms per token, 29.63 tokens per second)
       eval time = 34414.90 ms /  310 tokens (111.02 ms per token, 9.01 tokens per second)
```
Maverick CPU + GPU:
```
./llama-server -m Maverick.gguf -c 16384 -ngl 49 -ot ".ffn_.*_exps.*=CPU"

prompt eval time = 71585.41 ms / 1611 tokens (44.44 ms per token, 22.50 tokens per second)
       eval time = 10805.00 ms /  297 tokens (36.38 ms per token, 27.49 tokens per second)
```
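For reference, the -ot override above keeps the routed-expert FFN tensors in system RAM while attention and the shared tensors go to VRAM. A variation I have not tried would be to keep only some layers' experts on the CPU and fit the rest in VRAM; the layer range below is purely illustrative:

```sh
# Keep only the routed-expert tensors of layers 24-47 on the CPU (range is illustrative);
# everything else, including attention and shared-expert tensors, stays on the GPU.
./llama-server -m Maverick.gguf -c 16384 -ngl 49 \
  -ot "blk\.(2[4-9]|3[0-9]|4[0-7])\.ffn_.*_exps.*=CPU"
```

Whether that helps depends on how much expert traffic it takes off the PCIe bus, so it would need measuring.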
Llama3.3 70b CPU only:
```
CUDA_VISIBLE_DEVICES=-1 ./llama-server -m Llama-3.3.gguf -c 16384

prompt eval time = 196771.44 ms / 1622 tokens (121.31 ms per token, 8.24 tokens per second)
```
Llama3.3 70b CPU + GPU:
```
./llama-server -m Llama-3.3.gguf -c 16384 -ngl 20

prompt eval time = 13547.21 ms / 1617 tokens (8.38 ms per token, 119.36 tokens per second)
```
On Maverick, my PCIe bandwidth was basically saturated at ~14 GB/s for the whole 54 seconds of prompt eval.
Just wondering whether this is expected because Maverick is so huge, whether I have bad settings, or whether further optimizations are possible?
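One thing that might be worth testing, given the saturated bus: if the CPU-resident expert weights are being copied to the GPU for every micro-batch during prompt processing, a larger batch/micro-batch should amortize each transfer over more prompt tokens. This is a hedged sketch rather than a verified fix; -b and -ub are llama.cpp's logical and physical batch-size flags, and the values are just examples:

```sh
# Same offload setup, but with larger logical (-b) and physical (-ub) batch sizes
# so that each per-batch expert-weight transfer over PCIe covers more prompt tokens.
./llama-server -m Maverick.gguf -c 16384 -ngl 49 \
  -ot ".ffn_.*_exps.*=CPU" -b 4096 -ub 4096
```

If prompt eval speeds up while the bus stays saturated, the transfers were the bottleneck; if nothing changes, the limit is elsewhere.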