There are several patterns used to allocate memory for a list of fixed-size tensors, such as model weights:
- Manually calculating the number of elements of each tensor and adding it all up
- Creating the tensors in a `no-alloc` context, adding them to a list or map, or obtaining them by name from the `ggml_context` with `ggml_get_tensor`, summing their sizes and finally allocating them (the last one is $O(N^2)$; see the sketch after this list)
- Creating the tensors in a `no-alloc` context, allocating the weights manually with `ggml-alloc`, first with a measure allocator and then again with the exact memory requirements (current llama.cpp `finetune`)
- Creating the tensors in a `no-alloc` context, then enumerating the tensors in the context and summing their sizes (new `finetune` in ggml-org/llama.cpp#3605)
- Creating a `ggml_context` with a lot of memory and hoping for the best
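For reference, a minimal sketch of the lookup-by-name variant, assuming the weights were created in a `no-alloc` context and given names like `weight.%d` (a made-up naming scheme): since `ggml_get_tensor` walks the context's object list on every call, summing N tensors this way is the $O(N^2)$ case.

```c
#include "ggml.h"
#include <stdio.h>

// Sum the sizes of n_weights tensors by looking each one up by name.
// Every ggml_get_tensor() call is a linear scan, hence O(N^2) overall.
static size_t sum_weight_sizes_by_name(struct ggml_context * ctx, int n_weights) {
    size_t total = 0;
    for (int i = 0; i < n_weights; i++) {
        char name[GGML_MAX_NAME];
        snprintf(name, sizeof(name), "weight.%d", i);         // hypothetical naming scheme
        struct ggml_tensor * t = ggml_get_tensor(ctx, name);  // linear scan per lookup
        total += ggml_nbytes(t);                               // error handling omitted
    }
    return total; // allocate a buffer of this size, then point each t->data into it
}
```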
This becomes significantly more complicated when the weights have to be split between different backends (current llama.cpp and ggml-backend wip).
For something so basic, this is a lot more complicated than it should be, and we should have a single, standard way to do it. At the most basic level, it could simply be a function that automatically allocates all the tensors created in a `no-alloc` context with the exact memory requirements. Support for multiple backends will be more complicated.
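As a rough caller-side sketch of what that could look like (the helper name `ggml_ctx_alloc_all_tensors` is hypothetical, purely for illustration): the weights are created in a `no-alloc` context, and a single call computes the exact total size, allocates one buffer and sets every data pointer.

```c
#include "ggml.h"

// Hypothetical API under discussion, not part of ggml:
// enumerate all tensors in ctx, allocate one buffer of the exact total size,
// set each tensor's data pointer into it and return the buffer.
void * ggml_ctx_alloc_all_tensors(struct ggml_context * ctx);

static void load_model_example(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 2*ggml_tensor_overhead(), // metadata only, no tensor data
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * w0 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
    struct ggml_tensor * w1 = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);

    // one call would replace the manual size computation, allocation and pointer fix-up
    void * buf = ggml_ctx_alloc_all_tensors(ctx);

    // ... read the weight data into w0->data and w1->data ...
    (void) w0; (void) w1; (void) buf;
}
```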
This could also be useful for debugging operations in compute contexts, where it might be desirable to allocate memory for every tensor in the graph to be able to inspect the results of each op later.
Yes, we should consolidate the different ways of allocating memory.
> At the most basic level, it could be simply a function to automatically allocate all the tensors created in a no-alloc context with the exact memory requirements.
Either this, or even just a function that returns the required memory for a context by doing a loop similar to the one in ggml-org/llama.cpp#3605, would be helpful.
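A minimal sketch of that size-only variant, assuming the context enumeration functions added in ggml-org/llama.cpp#3605 (`ggml_get_first_tensor` / `ggml_get_next_tensor`); the function name is made up and alignment padding is omitted:

```c
#include "ggml.h"

// Return the total data size required by all tensors created in ctx.
static size_t ggml_ctx_required_size(struct ggml_context * ctx) {
    size_t total = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        total += ggml_nbytes(t); // exact size of each tensor's data
    }
    return total;
}
```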