GPU Memory Leak #5873
Comments
I have noticed that cublasCreate is called, but there is no cublasDestroy. Could this be the problem?
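For reference, cuBLAS handles are meant to be paired with a destroy call; a minimal sketch of the lifecycle (illustration only, not the actual llama.cpp code):

```cpp
#include <cublas_v2.h>
#include <cstdio>

int main() {
    cublasHandle_t handle;

    // cublasCreate allocates GPU-side resources tied to the handle.
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // ... GEMM calls would use the handle here ...

    // Without a matching cublasDestroy, those resources stay allocated
    // until the process exits.
    cublasDestroy(handle);
    return 0;
}
```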
See also: #5576
Added a new PR with corrected code to release cuBLAS GPU memory: #5898
I don't understand how a process that has died could still occupy GPU memory. This looks like a non-issue.
I misunderstood and thought that the GPU memory stayed occupied after the process ends. Currently, the CUDA backend cannot be cleaned up completely because we have some global objects which are never destroyed. The problem is acknowledged and related to #3960. Keep track of that issue for further updates on this.
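As a rough illustration of the kind of pattern being described (hypothetical names, not the actual backend code): a globally cached CUDA allocation with no corresponding free can only be reclaimed when the process exits.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical global pool: allocated on first use and never freed, so the
// memory stays reserved for reuse by later inferences in the same process.
static void * g_pool = nullptr;

void * get_pool(size_t size) {
    if (g_pool == nullptr) {
        cudaMalloc(&g_pool, size); // no matching cudaFree anywhere
    }
    return g_pool;
}
```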
Thank you for acknowledging it. This is really a huge issue if you want llama.cpp to be used in professional production software and not only for playing with LLMs. I have added a PR for releasing cuBLAS to try to help (this freed 15% of the still-occupied GPU memory), but I was not able to find the other 85%. I hope that you will find some time for this and give it high priority. This problem should be solved before any further improvements to the library.
Update: LLaVA is an even bigger problem, because when using it the amount of GPU memory that is not freed doubles (1.8 GB of GPU memory is not freed in the above example)!
To clarify, this is not what happens. Some resources are never freed, but they are not leaked either; they are reused for future inferences. Memory usage does not increase indefinitely.
Thank you for the clarification, slaren, I can confirm that this is the case. However, it seems that LLaVA allocates a separate, second pool of GPU memory, because the amount of GPU memory that is not freed doubles when using LLaVA. This is why I first thought that it keeps allocating extra GPU memory.
slaren, any news on this? Do you think that you will have some time to fix the CUDA backend?
|
After disposing of everything, the GPU memory is still not freed. When running the same code a second time, it crashes with a message that it cannot allocate enough memory. There are some things which are not freed from GPU memory.

You can test the issue by calling

ggml_backend_cuda_get_device_memory(0, out freemem, out totalmem);

before and after using a llama model. freemem will show how much memory is still allocated.

See also: SciSharp/LLamaSharp#575
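For anyone who wants to reproduce the measurement from C/C++ rather than through LLamaSharp, here is a minimal sketch (the model path and n_gpu_layers value are placeholders, and the exact llama.h init calls may differ slightly between versions):

```cpp
#include <cstdio>
#include "llama.h"
#include "ggml-cuda.h"

static void print_vram(const char * tag) {
    size_t free_mem = 0, total_mem = 0;
    ggml_backend_cuda_get_device_memory(0, &free_mem, &total_mem);
    printf("%s: free %zu MiB / total %zu MiB\n",
           tag, free_mem / (1024 * 1024), total_mem / (1024 * 1024));
}

int main() {
    llama_backend_init();
    print_vram("before load");

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload all layers to the GPU
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    print_vram("after load");

    llama_free_model(model);
    llama_backend_free();
    print_vram("after free"); // per this report, free memory stays well below "before load"

    return 0;
}
```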