Fix 2 cuda memory leaks in ggml-cuda.cu #5576
Conversation
```diff
@@ -3217,6 +3225,7 @@ int main(int argc, char **argv)
     sigemptyset (&sigint_action.sa_mask);
     sigint_action.sa_flags = 0;
     sigaction(SIGINT, &sigint_action, NULL);
+    sigaction(SIGUSR1, &sigint_action, NULL);
```
This can be removed before we merge. I was having trouble getting some of my analysis tools to work with SIGINT, so SIGUSR1 made it easier to iterate while testing the leak fix.
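For context, a minimal sketch of how such a handler could be wired for testing, assuming a `shutdown_handler` that calls the proposed `ggml_free_cublas`; the handler body here is illustrative, not the PR's actual server.cpp code:

```cpp
#include <csignal>
#include <cstdlib>
#include <unistd.h>

extern "C" void ggml_free_cublas(void); // teardown proposed in this PR

// Debug-only handler: ggml_free_cublas and std::exit are not
// async-signal-safe, which is tolerable here only because this exists to
// let leak checkers observe a clean shutdown path.
static void shutdown_handler(int /*signo*/) {
    ggml_free_cublas(); // free cuBLAS handles and the CUDA memory pool
    std::exit(0);
}

int main() {
    struct sigaction sa{};
    sa.sa_handler = shutdown_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT,  &sa, nullptr);
    sigaction(SIGUSR1, &sa, nullptr); // the testing hook from the diff above
    for (;;) pause();                 // stand-in for the server loop
}
```

With this in place, `kill -USR1 <pid>` exercises the same shutdown path as Ctrl-C, which is convenient when an analysis tool intercepts SIGINT.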
This fixes 2 memory leaks in ggml-cuda.cu:
- cuMemUnmap is now called for the pool allocation
- cublasDestroy is called to release the cuBLAS handles
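For readers following along, a hedged reconstruction of what this teardown amounts to; the global names mirror those in ggml-cuda.cu but are simplified stand-ins here, and the exact patch may differ:

```cpp
#include <cuda.h>
#include <cublas_v2.h>
#include <cstdio>
#include <cstdlib>

#define GGML_CUDA_MAX_DEVICES 16

// Simplified stand-ins for the globals in ggml-cuda.cu (assumptions).
static int            g_device_count = 0;
static cublasHandle_t g_cublas_handles[GGML_CUDA_MAX_DEVICES] = {nullptr};
static CUdeviceptr    g_cuda_pool_addr[GGML_CUDA_MAX_DEVICES] = {0};
static size_t         g_cuda_pool_used[GGML_CUDA_MAX_DEVICES] = {0}; // mapped bytes
static size_t         g_cuda_pool_size[GGML_CUDA_MAX_DEVICES] = {0}; // reserved bytes

#define CU_CHECK(err)                                            \
    do {                                                         \
        CUresult e_ = (err);                                     \
        if (e_ != CUDA_SUCCESS) {                                \
            fprintf(stderr, "CUDA driver error %d\n", (int) e_); \
            exit(1);                                             \
        }                                                        \
    } while (0)

extern "C" void ggml_free_cublas(void) {
    for (int id = 0; id < g_device_count; ++id) {
        // Leak 2: the cuBLAS handle created at init time.
        if (g_cublas_handles[id] != nullptr) {
            cublasDestroy(g_cublas_handles[id]);
            g_cublas_handles[id] = nullptr;
        }
        // Leak 1: the VMM pool. The physical cuMemCreate handles are
        // released right after cuMemMap, so unmapping the mapped range and
        // freeing the address reservation is all that remains here.
        if (g_cuda_pool_addr[id] != 0) {
            CU_CHECK(cuMemUnmap(g_cuda_pool_addr[id], g_cuda_pool_used[id]));
            CU_CHECK(cuMemAddressFree(g_cuda_pool_addr[id], g_cuda_pool_size[id]));
            g_cuda_pool_addr[id] = 0;
            g_cuda_pool_used[id] = 0;
            g_cuda_pool_size[id] = 0;
        }
    }
}
```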
Force-pushed from 5ee7e7a to 9ee3ba6.
Nits:
- An extra blank line was added at the end of server.cpp, but the newline at the end of ggml-cuda.cu is missing
- ggml_free_cublas is declared three different times?
With the new ggml-backend interface, something like ggml_free_cublas isn't really supposed to be part of the API, but neither is ggml_init_cublas...
ggml-cuda.cu (outdated)

```cpp
extern "C" GGML_CALL void ggml_free_cublas(void);

GGML_CALL void ggml_free_cublas(void) {
    for (int id = 0; id < g_device_count; ++id) {
#if !defined(GGML_USE_HIPBLAS)
```
I think this should be

```diff
-#if !defined(GGML_USE_HIPBLAS)
+#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__))
```

but then the condition for vmm in ggml-cuda.cu should also be changed.
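To illustrate the point (a sketch, not the actual layout of ggml-cuda.cu): the allocation-side VMM guard and the teardown guard have to agree, otherwise one side compiles driver-API calls for a pool the other side never created. The combined check distinguishes HIP-on-AMD, which lacks the CUDA VMM API, from HIP-on-NVIDIA:

```cpp
// Sketch only: both sites should test the same condition.
#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__))
// allocation side: cuMemAddressReserve / cuMemCreate / cuMemMap
#endif

#if !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__))
// teardown side: cuMemUnmap / cuMemAddressFree
#endif
```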
done.
The correct way to fix would be:
I don't really see the point of doing this at this point and in this way. This memory is intentionally never freed; freeing it just before the application ends achieves nothing, since the OS would reclaim it anyway.
We use this code in a downstream project as a library, in a long-running process, and we need the library not to leak when we unload models and the system is idle, so that the GPU resources get freed up and the GPU is allowed to return to a low-power state. My intent in wiring this into server.cpp is simply to demonstrate the leak and the fix so folks in this community can repro. Depending on how this ultimately gets wired into the shutdown logic, it should apply to any/all uses, I would think.
I agree, I don't think I've wired ggml_free_cublas in correctly, but I'm not sure what the right pattern is for this, so I'm hoping maintainers can direct me (perhaps automatically wired up in an existing shutdown/free routine, or maybe as a new API in ggml.h?).
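For what it's worth, one possible shape for the API route, sketched under the assumption that the declaration would sit next to ggml_init_cublas; the placement is an assumption, and maintainers may prefer hooking this into a backend free path instead:

```cpp
// Hypothetical header sketch; mirrors the existing init entry point.
#ifdef __cplusplus
extern "C" {
#endif

GGML_API void ggml_init_cublas(void); // existing API
GGML_API void ggml_free_cublas(void); // proposed counterpart from this PR

#ifdef __cplusplus
}
#endif
```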
I have corrected PR ggml-org#5576, which causes a crash, and streamlined the code. Unfortunately, this does not free all occupied GPU memory yet (only 15% of it). We still need to find the objects which are not freed after releasing GPU memory.
This PR causes a crash in the pooling code (cuMemUnmap). I have corrected the code and streamlined it for llama.cpp: #5898
Closing in favor of #5898
In our downstream usage of the server, we noticed that it wouldn't fully unload the GPU when idle. Using the CUDA memory leak detection tool, I was able to find where the leaks were located:
```sh
compute-sanitizer --tool memcheck --leak-check full ./bin/server ...
```
I'm not sure where the best place is to wire this up, so I'm open to suggestions from reviewers. However, as coded it does pass the memcheck tool without leaks, and when we integrate this into our usage of server.cpp we do see the GPU unload when it goes idle.
This fixes 2 memory leaks in ggml-cuda.cu:
- cuMemUnmap is now called for the pool allocation
- cublasDestroy is called to release the cuBLAS handles