Replies: 1 comment
I have the same issue, but I am on an M4 Max. For every other model, offloading layers onto the GPU (for me, wired memory; e.g., -ngl 16) increases speed. With Maverick, however, any time I use ANY of my GPU memory for the model, it slows down. The only exception is if I can fit the entire model into memory, which then runs the fastest.

Otherwise, running with just -ngl 0 is the fastest (often 8-12 t/s at its peak; I am running Q4 and using mmap). The relationship is pretty clear: the higher the -ngl value, the worse the speed; as -ngl decreases, speeds increase. It's really bizarre! --mlock did not solve the issue, nor does offloading certain tensors to the CPU versus the GPU.

What's also weird is that the performance I get with -ngl 0 suggests the model is still being served from RAM (indeed, "cached files" shows all remaining RAM is caching the model), even though it is neither in the process's memory nor in wired memory. Somehow that gives me the fastest speeds, while using GPU memory hurts performance. I don't quite get it!
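If anyone wants to quantify that relationship, a single llama-bench sweep over several -ngl values makes the pattern easy to see. This is a minimal sketch; the model filename is a placeholder and the particular -ngl values are just examples:

```sh
# Sweep several -ngl values in one run (llama-bench accepts comma-separated lists);
# -p and -n are the prompt-processing and generation token counts per test.
# Maverick-Q4.gguf is a placeholder -- substitute your own quant.
./llama-bench -m Maverick-Q4.gguf -ngl 0,8,16,32,99 -p 512 -n 128
```

The resulting table shows whether prompt processing and generation both degrade as -ngl rises, or only one of them.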
---
I know inference uses less compute on Maverick than on, say, Llama 70B. Shouldn't the same apply to prompt processing?

Prompt processing does speed up going from 70B to Maverick when running CPU-only. But after adding a GPU, 70B gets a huge speed boost while Maverick actually slows down a little.

Machine is an EPYC 7F52 (16 cores) + 1x RTX 3090 (PCIe x16 gen3).
Maverick CPU only:
```
CUDA_VISIBLE_DEVICES=-1 ./llama-server -m Maverick.gguf -c 16384

prompt eval time = 54376.33 ms / 1611 tokens (33.75 ms per token, 29.63 tokens per second)
       eval time = 34414.90 ms /  310 tokens (111.02 ms per token, 9.01 tokens per second)
```
Maverick CPU + GPU:
```
./llama-server -m Maverick.gguf -c 16384 -ngl 49 -ot ".ffn_.*_exps.*=CPU"

prompt eval time = 71585.41 ms / 1611 tokens (44.44 ms per token, 22.50 tokens per second)
       eval time = 10805.00 ms /  297 tokens (36.38 ms per token, 27.49 tokens per second)
```
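For reference, the -ot override above keeps the routed-expert FFN tensors in system RAM while attention and the shared tensors go to VRAM. A variation I have not tried would be to keep only some layers' experts on the CPU and fit the rest in VRAM; the layer range below is purely illustrative:

```sh
# Keep only the routed-expert tensors of layers 24-47 on the CPU (range is illustrative);
# everything else, including attention and shared-expert tensors, stays on the GPU.
./llama-server -m Maverick.gguf -c 16384 -ngl 49 \
  -ot "blk\.(2[4-9]|3[0-9]|4[0-7])\.ffn_.*_exps.*=CPU"
```

Whether that helps depends on how much expert traffic it takes off the PCIe bus, so it would need measuring.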
Llama3.3 70b CPU only:
```
CUDA_VISIBLE_DEVICES=-1 ./llama-server -m Llama-3.3.gguf -c 16384

prompt eval time = 196771.44 ms / 1622 tokens (121.31 ms per token, 8.24 tokens per second)
```
Llama3.3 70b CPU + GPU:
```
./llama-server -m Llama-3.3.gguf -c 16384 -ngl 20

prompt eval time = 13547.21 ms / 1617 tokens (8.38 ms per token, 119.36 tokens per second)
```
On Maverick, my PCIe bandwidth was basically saturated at ~14 GB/s for the whole 54 seconds of prompt eval.
Just wondering whether this is expected because Maverick is so huge, whether I have bad settings, or whether further optimizations are possible?
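One thing that might be worth testing, given the saturated bus: if the CPU-resident expert weights are being copied to the GPU for every micro-batch during prompt processing, a larger batch/micro-batch should amortize each transfer over more prompt tokens. This is a hedged sketch rather than a verified fix; -b and -ub are llama.cpp's logical and physical batch-size flags, and the values are just examples:

```sh
# Same offload setup, but with larger logical (-b) and physical (-ub) batch sizes
# so that each per-batch expert-weight transfer over PCIe covers more prompt tokens.
./llama-server -m Maverick.gguf -c 16384 -ngl 49 \
  -ot ".ffn_.*_exps.*=CPU" -b 4096 -ub 4096
```

If prompt eval speeds up while the bus stays saturated, the transfers were the bottleneck; if nothing changes, the limit is elsewhere.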