ggml : improve memory allocation for weights and similar lists of tensors #578

Closed · slaren opened this issue Oct 13, 2023 · 2 comments
Labels: refactoring

slaren (Member) commented Oct 13, 2023

There are several patterns used to allocate memory for a list of fixed-size tensors, such as model weights:

  • Manually calculating the number of elements of each tensor and adding them all up
  • Creating the tensors in a no-alloc context, adding them to a list or map, or obtaining them by name from a ggml_context with ggml_get_tensor, summing their sizes, and finally allocating them (the last variant is $O(N^2)$)
  • Creating the tensors in a no-alloc context, then allocating the weights manually with ggml-alloc, first with a measure allocator and then again with the exact memory requirements (current llama.cpp finetune)
  • Creating the tensors in a no-alloc context, then enumerating the tensors in the context and summing their sizes (new finetune in ggml : add context enumeration functions llama.cpp#3605); the no-alloc setup these patterns share is sketched just after this list
  • Creating a ggml_context with a lot of memory and hoping for the best
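
For concreteness, here is a minimal sketch of the no-alloc setup that patterns 2–4 start from; the function name and tensor shapes are illustrative, not taken from any existing code:

```c
#include "ggml.h"

// a no-alloc context stores tensor metadata only; each tensor's data
// pointer stays NULL until backing memory is provided by other means
static struct ggml_context * make_weights_ctx(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 2*ggml_tensor_overhead(), // metadata for 2 tensors
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,  // do not allocate tensor data
    };
    struct ggml_context * ctx = ggml_init(params);

    // shapes are arbitrary placeholders
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096); // weight
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);       // bias
    (void) w; (void) b; // silence unused-variable warnings in this sketch

    return ctx;
}
```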

This becomes significantly more complicated when the weights have to be split between different backends (current llama.cpp and ggml-backend wip).

For something so basic, this is a lot more complicated than it should be, and we should have a standard way to do it. At the most basic level, it could be simply a function to automatically allocate all the tensors created in a no-alloc context with the exact memory requirements. Support for multiple backends will be more complicated.
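
As a rough illustration only, a single-backend sketch of what such a function could do, built on the context enumeration functions from llama.cpp#3605; the helper name and alignment constant are made up for this example, and plain malloc covers only the CPU case:

```c
#include <stdlib.h>
#include "ggml.h"

// hypothetical helper: allocate one block with the exact total size and
// point every tensor of a no-alloc context into it (caller frees the block)
static void * ctx_alloc_all_tensors(struct ggml_context * ctx) {
    const size_t align = 32; // illustrative alignment

    // pass 1: compute the exact size, with per-tensor alignment padding
    size_t total = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        total += (ggml_nbytes(t) + align - 1) & ~(align - 1);
    }

    char * buf = malloc(total);
    if (buf == NULL) {
        return NULL;
    }

    // pass 2: bind each tensor to its offset within the block
    size_t offs = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        t->data = buf + offs;
        offs   += (ggml_nbytes(t) + align - 1) & ~(align - 1);
    }

    return buf;
}
```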

This could also be useful for debugging operations in compute contexts, where it might be desirable to allocate memory for every tensor in the graph to be able to inspect the results of each op later.

@ggerganov ggerganov added the refactoring Refactoring label Oct 13, 2023
@ggerganov ggerganov moved this to Todo in ggml : roadmap Oct 13, 2023
ggerganov (Member) commented

Yes, we should consolidate the different ways of allocating memory.

> At the most basic level, it could be simply a function to automatically allocate all the tensors created in a no-alloc context with the exact memory requirements.

Either this, or even just a function that returns the required memory for a context by doing a loop similar to the one in ggml-org/llama.cpp#3605, would be helpful.

slaren (Member, Author) commented Jan 29, 2024

I consider this fixed with ggml_backend_alloc_ctx_tensors.
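
A minimal usage sketch of that function with the CPU backend; the tensor shapes are illustrative, and the headers follow the ggml layout at the time:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static void load_model_example(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 2*ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);
    ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
    ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);

    ggml_backend_t backend = ggml_backend_cpu_init();

    // one call replaces the ad-hoc patterns above: it computes the exact
    // size, allocates a backend buffer, and binds every tensor to it
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // ... load weights into the tensors, build graphs, compute ...

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
}
```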

@ggerganov ggerganov moved this from Todo to Done in ggml : roadmap Jan 30, 2024