llama : add option to override model tensor buffers #11397
Conversation
Is there a chance that the direction you're taking these changes might allow for scheduling specific threads to work on specific tensors? With R1 coming out, I'm very interested in reviving my work on trying to improve memory locality to increase CPU inference speeds.
No, that's something that would need to be handled at a lower level in the CPU backend.
Thanks for the reply @slaren. I figured it wouldn't directly help, but that maybe you'd be adding useful metadata to tensor objects that could help coordinate affinity in the future. I'll start a fresh branch and see how far I get.
I'll also try to pull this branch and test it to see what the speedup and sysmem savings look like.
Quick, non-scientific initial test with Deepseek R1 at q6 on llama-server with -ot exps=CPU: -ngl 0 = 4.65 t/s. So there is definitely major speedup potential in this patch. I can't offload all 62 layers for this model because I only have 24GB VRAM, but I expect the trend would continue in the same general direction. This is without dropping caches, so it's inefficient, but I didn't have the time to do a proper drop/reload cycle since it takes so long to be read back into memory on each test run.
@bmtwl
What are the shared expert tensors called in
I believe the pattern
Thanks - I'll give this a try later in the week. This PR, together with this Reddit post: https://old.reddit.com/r/LocalLLaMA/comments/1ibbloy/158bit_deepseek_r1_131gb_dynamic_gguf/ opens up the interesting possibility of quantising up/gate projections to q2_k and down projections to q4_k (or something similar), then keeping everything else as . Sadly I need to move some stuff about to get space to upscale the fp8 download to bf16 before I can try it, but will report back when I do.
It might be worth trying
Just being able to split the experts between NUMA nodes would make a big difference, but I'm not sure how easy that would be, as IIRC the experts' tensors are all in one huge tensor now?
During normal operation, when I fit a model between RAM and VRAM, does the offloading follow a set layer sequence (layer 0 is chosen first to be offloaded to GPU, then layer 1, etc.)? Between GPU offloading and RAM, which takes priority?
Do you remember how much of a speedup? No need for extensive benchmarks, just a rough % estimate.
I can't seem to offload more than 29 layers of R1 (unsloth's UD-IQ2_XXS) via RPC. 29 layers and below work fine, but 30 just crashes my rpc_server, with no error output. It is not an issue of VRAM: even with the context set very low, so that it takes up nowhere near my GPU's limits, it still crashes.
I had a similar problem where if I used a single GPU (via . If I didn't use either of these, it tried to allocate this 1.4TB monster buffer:
After some searching I found this issue: and recompiled using . (It's likely nothing to do with this PR, but thought it might help!)
I figured it out: you have to reorder the devices so the local and mainly these:
This means this works: --device "RPC[IP1:PORT1],RPC[IP1:PORT2],RPC[IP1:PORT1],RPC[IP2:PORT2],CUDA0,CUDA1" But if I don't do this, I get OOM errors with plenty of VRAM left, like you had.
I'm testing this with and without #11446: without it, on unsloth's UD-IQ2_XXS, I was only able to offload 29 layers, and with it I was able to allocate only 28 (on a Q4_K_S quant). This is not a VRAM issue: it would have plenty of spare VRAM, it would even get past allocation and reach warmup, where the rpc-server would then just crash. The other issue is performance: the more layers I allocate, the worse performance gets, while bmtwl shows a performance increase with more layers offloaded with non-RPC-based offloading.
I am able to load the model with
But as soon as I send the prompt I receive:
Without the . Testing with 4x RTX 3090 and 320GiB RAM. Built with
Maybe try
No luck, still the same issue. Oddly enough, the issue only happens when sending more than 450 tokens.
It's trying to allocate a tensor of size 2^64, which suggests there is an integer overflow somewhere. If you set the environment variable
It is the tensor. Is it possible to try to force this particular one to be allocated into the GPU buffer?
This is most likely a bug, we need to understand why it is happening and fix it. Since you mentioned that it only happens with large prompts, I suspect that this is caused by a zero-sized tensor. When evaluating a batch where no logits are required (which happens when evaluating a prompt that needs to be split into multiple ubatches), zero-size tensors are created to skip the calculation of the logits.
diff --git a/ggml/src/ggml-alloc.c b/ggml/src/ggml-alloc.c
index 9a3bf9f29..470ef13e6 100644
--- a/ggml/src/ggml-alloc.c
+++ b/ggml/src/ggml-alloc.c
@@ -179,6 +179,9 @@ static size_t ggml_dyn_tallocr_alloc(struct ggml_dyn_tallocr * alloc, size_t siz
// this should never happen
GGML_LOG_ERROR("%s: not enough space in the buffer to allocate %zu bytes, largest block available %zu bytes\n",
__func__, size, max_avail);
+ GGML_LOG_ERROR("%s: tensor: %s, shape: %ld %ld %ld %ld, size: %zu",
+ __func__, tensor->name, tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3],
+ ggml_nbytes(tensor));
GGML_ABORT("not enough space in the buffer");
}
}
Ok nvm, I think I see the problem. I will push a possible fix soon.
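For readers following along, here is a minimal, self-contained sketch (illustrative only, neither the actual ggml allocator code nor the fix) of how a zero-sized dimension can turn a size computation of the form (n_elements - 1) * stride into a request of roughly 2^64 bytes once it is carried out in unsigned arithmetic:
// Illustrative only: shows how unsigned (size_t) arithmetic on a zero-sized
// dimension wraps around to a value close to 2^64. The names and the size
// formula are simplified assumptions, not the real ggml_nbytes().
#include <cstdint>
#include <cstdio>

int main() {
    int64_t ne = 0;      // a dimension of a zero-size tensor
    size_t  nb = 4096;   // assumed row stride in bytes
    // (ne - 1) is -1; converted to size_t it wraps around to 2^64 - 1.
    size_t size = sizeof(float) + (size_t)(ne - 1) * nb;
    std::printf("requested allocation: %zu bytes\n", size); // ~1.8e19 bytes
    return 0;
}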
I'll upload the modified thread-pool code later when I have time. Please note upfront that the code I wrote is very messy and doesn't account for every build environment; it's just for your reference.
Bump 😃 Sorry for the bump, but this PR is really essential for me to test the MLA stuff using the full-sized
ggml_tensor * llm_graph_context::build_attn_mha(
ggml_cgraph * gf,
ggml_tensor * q,
ggml_tensor * k,
ggml_tensor * v,
ggml_tensor * kq_b,
ggml_tensor * kq_mask,
bool v_trans,
float kq_scale) const {
//const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa(il);
//const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa(il);
//const int64_t n_head = hparams.n_head(il);
//const int64_t n_head_kv = hparams.n_head_kv(il);
//const auto & n_embd_head_k = hparams.n_embd_head_k;
//const auto & n_embd_head_v = hparams.n_embd_head_v;
const auto n_embd_head_v = v_trans ? v->ne[1] : v->ne[0];
const auto n_tokens = q->ne[1];
const auto n_head = q->ne[2];
const auto n_kv = k->ne[1];
I think it should be way cleaner to add now, as nearly all the ugliness came from those hard-coded GQA assumptions from the GGUF file (i.e. MLA converts into MQA or MHA depending on whether you use the "naive" method or not). I'm not sure if it might be worth factoring out the tensor-name regex stuff for use with this PR that aims to do something similar for the
I'm not sure if I am using it correctly, but on my Mac, overriding the buffers seems to lead to double the allocation. I am testing with:
make -j && ./bin/llama-cli -m ../models/deepseek-v2-lite-chat/ggml-model-q8_0.gguf -ot "ffn_.*"=CPU -lv 1
And I see in the output:
If I don't pass the
The
With Metal you would need to disable mmap to see lower memory usage, since the entire file or a large fraction of it will remain mapped.
@slaren Thanks (and sorry for the bump again)!
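For anyone doing this through the API rather than the CLI, a minimal sketch (assumed usage, not code from this PR) of disabling mmap when loading, which should have the same effect as passing --no-mmap:
// Minimal sketch: load a model with mmap disabled so that tensors kept on the
// CPU are read into regular buffers instead of staying file-mapped.
// Assumes llama.h from a recent llama.cpp build; error handling omitted.
#include "llama.h"

llama_model * load_without_mmap(const char * model_path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.use_mmap = false; // same effect as the --no-mmap CLI flag
    return llama_model_load_from_file(model_path, mparams);
}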
Sorry to hijack the thread, but how would you suggest running DeepSeek-R3-UD-Q2_K_XL.gguf on a system with 192GB RAM and 128GB VRAM with multiple GPUs (VRAM ordered per CUDA_VISIBLE_DEVICES: 24/24/32/48 GB)? Would running
EDIT: It seems to work but uses just 10-12 GB of VRAM on each GPU.
You would need to increase the layers offloaded to fill the VRAM of each GPU as much as possible.
On my 2x3090s, it was this way.
Does llama.cpp support offloading all routed experts to CPU host memory? That is the same as ktransformers, so would the two solutions have the same performance?
How would one override tensor buffers when using the API from include/llama.h? The public interface currently looks like this:
struct llama_model_tensor_buft_override {
const char * pattern;
ggml_backend_buffer_type_t buft;
};
struct llama_model_params {
// NULL-terminated list of devices to use for offloading (if NULL, all available devices are used)
ggml_backend_dev_t * devices;
// NULL-terminated list of buffer types to use for tensors that match a pattern
const struct llama_model_tensor_buft_override * tensor_buft_overrides;
// ... etc ...
};
As I understand it, since
Maybe the API could be updated to something like this (based on ggml/include/ggml-backend.h#L130):
enum ggml_backend_dev_type {
// CPU device using system memory
GGML_BACKEND_DEVICE_TYPE_CPU,
// GPU device using dedicated memory
GGML_BACKEND_DEVICE_TYPE_GPU,
// accelerator devices intended to be used together with the CPU backend (e.g. BLAS or AMX)
GGML_BACKEND_DEVICE_TYPE_ACCEL
};
struct llama_model_tensor_buft_override {
const char * pattern;
ggml_backend_dev_type buft;
};
This would allow users of the public interface to create their own
You need to use the ggml API to obtain the buffer types, in the same way the llama.cpp examples do it. The llama.cpp API includes ggml; in fact, when you include
I see, thank you
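For reference, a minimal sketch (assumed usage, not an official example) of wiring this up through the public headers: the buffer type is obtained via the ggml-backend API and passed in llama_model_params.tensor_buft_overrides, with a NULL-pattern entry terminating the list:
// Minimal sketch (assumed usage): keep tensors whose names match "exps" on the
// CPU while offloading the remaining layers, mirroring the CLI's -ot exps=CPU.
// Assumes the CPU backend is registered; error handling omitted.
#include "llama.h"
#include "ggml-backend.h"

llama_model * load_with_cpu_experts(const char * model_path) {
    // Resolve the CPU buffer type through the ggml API.
    ggml_backend_dev_t         cpu_dev  = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
    ggml_backend_buffer_type_t cpu_buft = ggml_backend_dev_buffer_type(cpu_dev);

    // The overrides list is terminated by an entry with a NULL pattern.
    static const llama_model_tensor_buft_override overrides[] = {
        { "exps",  cpu_buft },
        { nullptr, nullptr  },
    };

    llama_model_params mparams    = llama_model_default_params();
    mparams.n_gpu_layers          = 99;
    mparams.tensor_buft_overrides = overrides;

    return llama_model_load_from_file(model_path, mparams);
}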
Is it possible to use --override-tensor to specify which layers go to GPU 0 and which to GPU 1? With Qwen3-30B-A3B, it's too large to fit on my single 4070 Ti Super, but a perfectly usable quant would fit if I plug in an old 3060. I can't find a definite answer, but I assume it would be like DeepSeek, where some layers should be prioritized to the beefier card. Does that make sense for a pure GPU setup? I understand that layers need to be processed sequentially, otherwise GPU-to-GPU communication overhead kicks in frequently, but I can't find how much of an impact that is.
This is possible with the ik_llama.cpp fork, e.g. you can put tensors/layers exactly where you want them across multiple CUDA devices or the CPU. I'm currently experimenting with the Qwen3-235B-A22B MoE to fit a quant perfectly in 24GB VRAM + 96GB RAM. Example command and logs (partial logs shown for brevity):
I'm not 100% clear if mainline llama.cpp allows specifying anything other than
It's perfectly fine, and a good idea, to place some layers on CUDA0 and other layers on CUDA1, and it will perform well. No need to worry about P2P, NVLINK, etc., as this is not the tensor-parallel/data-parallel approach that vLLM and sglang may use for, say, 8x or 16x GPU nodes.
The ik_llama implementation is just a copy-paste of this PR, far from being only "inspired" by it as claimed.
Okay, I had a moment to circle back around and test this out with a recent version of mainline llama.cpp. It does seem to allow you to specify e.g. . I'd recommend setting
Then you can piece together as you like with either one long . Note it doesn't print out unmatched layers even with . Here is another quick example. It is quite flexible and handy for some models if you want to offload, say, only attention and kv-cache to CPU, etc. Example of using -ot to place exact tensors/layers on different CPU/GPU backends:
./build/bin/llama-server \
--verbosity 1 \
--model /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf \
-fa \
--n-gpu-layers 99 \
--ctx-size 8192 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-ot attn=CPU \
-ot blk\.[0-5]\.ffn.*=CUDA0 \
-nkvo \
--threads 16 \
--host 127.0.0.1 \
--port 8088
tensor blk.0.attn_norm.weight buffer type overriden to CPU
tensor blk.0.attn_q.weight buffer type overriden to CPU
tensor blk.0.attn_k.weight buffer type overriden to CPU
tensor blk.0.attn_v.weight buffer type overriden to CPU
tensor blk.0.attn_output.weight buffer type overriden to CPU
tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
tensor blk.0.ffn_down.weight buffer type overriden to CUDA0
tensor blk.0.ffn_up.weight buffer type overriden to CUDA0
tensor blk.1.attn_norm.weight buffer type overriden to CPU
tensor blk.1.attn_q.weight buffer type overriden to CPU
tensor blk.1.attn_k.weight buffer type overriden to CPU
tensor blk.1.attn_v.weight buffer type overriden to CPU
tensor blk.1.attn_output.weight buffer type overriden to CPU
tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
tensor blk.1.ffn_down.weight buffer type overriden to CUDA0
tensor blk.1.ffn_up.weight buffer type overriden to CUDA0
tensor blk.2.attn_norm.weight buffer type overriden to CPU
tensor blk.2.attn_q.weight buffer type overriden to CPU
tensor blk.2.attn_k.weight buffer type overriden to CPU
tensor blk.2.attn_v.weight buffer type overriden to CPU
tensor blk.2.attn_output.weight buffer type overriden to CPU
tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
tensor blk.2.ffn_down.weight buffer type overriden to CUDA0
tensor blk.2.ffn_up.weight buffer type overriden to CUDA0
tensor blk.3.attn_norm.weight buffer type overriden to CPU
tensor blk.3.attn_q.weight buffer type overriden to CPU
tensor blk.3.attn_k.weight buffer type overriden to CPU
tensor blk.3.attn_v.weight buffer type overriden to CPU
tensor blk.3.attn_output.weight buffer type overriden to CPU
tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
tensor blk.3.ffn_down.weight buffer type overriden to CUDA0
tensor blk.3.ffn_up.weight buffer type overriden to CUDA0
tensor blk.4.attn_norm.weight buffer type overriden to CPU
tensor blk.4.attn_q.weight buffer type overriden to CPU
tensor blk.4.attn_k.weight buffer type overriden to CPU
tensor blk.4.attn_v.weight buffer type overriden to CPU
tensor blk.4.attn_output.weight buffer type overriden to CPU
tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
tensor blk.4.ffn_down.weight buffer type overriden to CUDA0
tensor blk.4.ffn_up.weight buffer type overriden to CUDA0
tensor blk.5.attn_norm.weight buffer type overriden to CPU
tensor blk.5.attn_q.weight buffer type overriden to CPU
tensor blk.5.attn_k.weight buffer type overriden to CPU
tensor blk.5.attn_v.weight buffer type overriden to CPU
tensor blk.5.attn_output.weight buffer type overriden to CPU
tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
tensor blk.5.ffn_down.weight buffer type overriden to CUDA0
tensor blk.5.ffn_up.weight buffer type overriden to CUDA0
tensor blk.6.attn_norm.weight buffer type overriden to CPU
tensor blk.6.attn_q.weight buffer type overriden to CPU
tensor blk.6.attn_k.weight buffer type overriden to CPU
tensor blk.6.attn_v.weight buffer type overriden to CPU
tensor blk.6.attn_output.weight buffer type overriden to CPU
tensor blk.7.attn_norm.weight buffer type overriden to CPU
tensor blk.7.attn_q.weight buffer type overriden to CPU
tensor blk.7.attn_k.weight buffer type overriden to CPU
tensor blk.7.attn_v.weight buffer type overriden to CPU
tensor blk.7.attn_output.weight buffer type overriden to CPU
.
.
.
@slaren Hey, sorry, I don't understand what appears as "beef" between the two forks. I recognize there is history way beyond me. I was confused if this . I appreciate everyone, thanks!
Adds command line parameter --override-tensor (-ot) that allows changing the buffer type where a model tensor is allocated. This gives the user fine-grained control over which tensors are offloaded to each device.
How is this useful: for example, to force the experts in MoE models to stay on the CPU while offloading the rest to the GPU, you could use -ngl 99 -ot exps=CPU. This may allow more efficient offloading schemes.
The syntax is <tensor name pattern>=<buffer type>. Currently the pattern is just a string search (edit: this is no longer the case, it is a C++ regex search), i.e. any tensor that contains the characters in <tensor name pattern> will be matched and loaded into the given buffer type. Multiple overrides can be given by separating them with commas, or by passing the -ot option multiple times. To see which tensors are being matched, enable debugging output with -v.
At this point it is just a demo, feel free to experiment and report if you find any interesting uses.
Edit: added regex support, for example to keep experts of layers 20-99 in the CPU you could use -ot "[2-9][0-9]\.ffn_.*_exps\.=CPU" (see the sketch below).
TODO:
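As a rough illustration of the regex matching described above (not this PR's actual implementation), the snippet below tests the example pattern against a couple of made-up tensor names; any matching tensor would be placed in the buffer type named on the right-hand side of the = sign:
// Rough sketch: test tensor names against the example override pattern
// "[2-9][0-9]\.ffn_.*_exps\." and report which buffer type each would get.
// The tensor names below are made up for illustration.
#include <cstdio>
#include <regex>

int main() {
    const std::regex pattern(R"([2-9][0-9]\.ffn_.*_exps\.)");
    const char * names[] = {
        "blk.19.ffn_gate_exps.weight", // layer 19: no match, stays on the default backend
        "blk.42.ffn_down_exps.weight", // layer 42: matches, would be overridden to CPU
    };
    for (const char * name : names) {
        std::printf("%-32s -> %s\n", name, std::regex_search(name, pattern) ? "CPU" : "default");
    }
    return 0;
}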