
GPU Memory Leak #5873


Closed
zsogitbe opened this issue Mar 4, 2024 · 12 comments

Comments

@zsogitbe

zsogitbe commented Mar 4, 2024

After disposing of everything, the GPU memory is still not freed. When the same code is run a second time, it crashes with a message that it cannot allocate enough memory. There are some things that are not freed from GPU memory.

You can test the issue by calling ggml_backend_cuda_get_device_memory(0, out freemem, out totalmem); before and after using a llama model. freemem will show how much memory is still allocated.
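
For reference, a minimal C++ sketch of this kind of before/after check against the llama.cpp C API (roughly the API as of early 2024); the model path, the n_gpu_layers value, and the exact setup around inference are placeholders and assumptions, not part of the original report:

```cpp
// Sketch only: measure free VRAM before and after loading/freeing a model.
// Assumes a CUDA build of llama.cpp; "model.gguf" is a placeholder path.
#include <cstdio>
#include "llama.h"
#include "ggml-cuda.h"

static size_t free_vram_bytes(void) {
    size_t free_mem = 0, total_mem = 0;
    ggml_backend_cuda_get_device_memory(0, &free_mem, &total_mem);
    return free_mem;
}

int main(void) {
    const size_t before = free_vram_bytes();

    llama_backend_init();
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload all layers to the GPU
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);

    // ... create a context, run inference, free the context ...

    llama_free_model(model);
    llama_backend_free();

    const size_t after = free_vram_bytes();
    printf("free VRAM: %zu MiB before, %zu MiB after\n",
           before >> 20, after >> 20);
    return 0;
}
```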

See this also: SciSharp/LLamaSharp#575

@zsogitbe
Author

zsogitbe commented Mar 5, 2024

I have noticed that cublasCreate is called, but there is no cublasDestroy. Could this be the problem?
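
(As a side note, the usual pairing looks like the sketch below; this is only an illustration of the cuBLAS handle lifetime, not llama.cpp's actual backend code.)

```cpp
// Illustrative sketch of the create/destroy pairing (not llama.cpp code).
#include <cublas_v2.h>

void cublas_handle_lifetime(void) {
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        return; // handle creation failed; nothing to release
    }
    // ... use the handle for GEMM calls ...
    cublasDestroy(handle); // frees the resources held by the handle
}
```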

@zsogitbe
Author

zsogitbe commented Mar 5, 2024

See also: #5576

@zsogitbe
Author

zsogitbe commented Mar 6, 2024

Added a new PR with corrected code to release the cuBLAS GPU memory: #5898

@ggerganov
Member

I don't understand how a process that has died could still occupy GPU memory. This looks like a non-llama.cpp-related issue.

@zsogitbe
Author

zsogitbe commented Mar 6, 2024

You are obviously a genius, Georgi, so I am sure that you will figure it out. Please run the main example code in llama.cpp, free all memory, then stop at the end in the debugger and check your GPU memory status. You will see that GPU memory is not freed.
This is a huge problem for production applications, because GPU memory is expensive and precious and you cannot just keep filling it with unused data.
We need to load the library and then allow for several inferences, but if each inference leaves hundreds of MBs of GPU memory behind, you will soon not be able to do inference on the GPU...
[screenshot: GPU memory usage after inference]
In this example, nearly 1 GB stayed on the GPU after running the inference, and it is impossible to free it without exiting the app (ending the main process).

@ggerganov
Member

I misunderstood and thought that the GPU memory stayed occupied after the process ends.

Currently, the CUDA backend cannot be cleaned up completely because we have some global objects which are never destroyed. The problem is acknowledged and related to #3960. Keep track of that issue for further updates on this.

@zsogitbe
Author

zsogitbe commented Mar 6, 2024

Thank you for acknowledging it. This is really a huge issue if you want llama.cpp to be used in professional production software and not only for playing with LLMs. I have added a PR for releasing cuBLAS to try to help (this freed 15% of the still-occupied GPU memory), but I was not able to find the other 85%. I hope that you will find some time for this and give it high priority. This problem should be solved before any further improvements to the library.

@zsogitbe
Author

zsogitbe commented Mar 6, 2024

Update: LLaVA is an even bigger problem, because when using it the amount of GPU memory that is not freed doubles (1.8 GB of GPU memory is not freed in the above example)!

@slaren
Member

slaren commented Mar 6, 2024

We need to load the library and then allow for several inferences, but if each inference leaves hundreds of MBs of GPU memory behind, you will soon not be able to do inference on the GPU

To clarify, this is not what happens. Some resources are never freed, but they are not leaked either; they are reused for future inferences. Memory usage does not increase indefinitely.
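
(As an illustration of what "never freed but not leaked" can look like, here is a hypothetical sketch, not the actual ggml-cuda code: a lazily grown buffer that is kept for reuse instead of being released after each inference.)

```cpp
// Hypothetical sketch: a global scratch buffer that is grown on demand and
// reused by every inference. It is not released until the process exits, so
// tools report it as "still allocated", but it does not grow per call.
#include <cuda_runtime.h>
#include <cstddef>

static void * g_scratch      = nullptr;
static size_t g_scratch_size = 0;

void * get_scratch(size_t size) {
    if (size > g_scratch_size) {
        cudaFree(g_scratch);          // no-op when g_scratch is nullptr
        cudaMalloc(&g_scratch, size); // grow to the largest size requested
        g_scratch_size = size;
    }
    return g_scratch;                 // reused by subsequent calls
}
```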

@zsogitbe
Author

zsogitbe commented Mar 6, 2024

Thank you for the clarification, slaren; I can confirm that this is the case. But it seems that LLaVA allocates a separate, second pool of GPU memory, because the amount of GPU memory that is not freed doubles when using LLaVA. This is why I first thought that it keeps allocating extra GPU memory.

@zsogitbe
Author

zsogitbe commented Mar 14, 2024

slaren, any news on this? Do you think that you will have some time to fix the CUDA backend?

@ggerganov
Member

Currently, the CUDA backend cannot be cleaned up completely because we have some global objects which are never destroyed. The problem is acknowledged and related to #3960. Keep track of that issue for further updates on this.
