[CUDACachingAlloc/GPUInference] Implement garbage collection without GPU sync #74261
This pull request was exported from Phabricator. Differential Revision: D34482514

cc @mcarilli
cc @mwootton
…GPU sync (pytorch#74261)

Summary:
Pull Request resolved: pytorch#74261

### Goal

Implement a cheap way to reclaim GPU memory (garbage collection) without incurring a GPU sync.

### Why do we need this?

Currently, there are only two ways to reclaim GPU memory blocks already assigned to a particular stream.

- `release_available_cached_blocks(params)`: frees blocks exceeding `CachingAllocatorConfig::max_split_size()` until the request can be satisfied. Issue: if `max_split_size` is unset (the default), this function is a no-op. Even when it is set, reclamation is quite conservative (e.g., blocks under `max_split_size` are never freed).
- `release_cached_blocks()`: waits for all in-flight events and then reclaims blocks. Issue: waiting for all events is very expensive, as it will likely stall all GPU operations. Many GPU applications that do not properly handle potential GPU throttling would suffer or crash.

### Proposed idea

- If the garbage collection threshold is set, try to reclaim some memory blocks *without* synchronization. This should be safe, since `release_available_cached_blocks` essentially does the same thing (only less aggressively).
- GC is triggered only when a `malloc` request cannot be served from the block pool. There is no need to free blocks while the block pool is functioning just fine.
- Prioritize reclaiming blocks that have not been reused for a long time. Reclamation stops once the used memory capacity drops below the threshold.
- This code path is entirely optional; by default it is never invoked.

Test Plan:
- Unit tests
- Manually checked that GPU memory usage stays at the level indicated to the garbage collector; if not, the caching allocator at least keeps trying to free blocks.

Reviewed By: jianyuh

Differential Revision: D34482514

fbshipit-source-id: 1d29b589752d489ecbf9bd7d27fb624f980e5666
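The reclamation policy described above (trigger only on a failed `malloc`, free the least recently reused idle blocks, stop once usage drops below the threshold) can be sketched roughly as follows. This is a simplified illustration, not the actual allocator code: `Block`, `gc_count`, and `garbage_collect` are hypothetical stand-ins for the real `CUDACachingAllocator` internals, and freeing is modeled by zeroing a size field rather than calling `cudaFree`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for the allocator's cached-block metadata.
struct Block {
  int64_t size = 0;
  uint32_t gc_count = 0;  // ages each time a GC pass skips this block
  bool in_use = false;    // blocks still referenced by a stream are never freed
};

// Free the "oldest" idle blocks until total usage falls below threshold_bytes.
// Returns the number of bytes reclaimed. No GPU sync is needed because only
// blocks with no outstanding stream uses are considered.
int64_t garbage_collect(std::vector<Block>& pool, int64_t total_allocated,
                        int64_t threshold_bytes) {
  int64_t reclaimed = 0;

  // Collect idle blocks and order them so the least recently reused
  // (highest gc_count) are freed first.
  std::vector<Block*> candidates;
  for (auto& b : pool) {
    if (!b.in_use && b.size > 0) candidates.push_back(&b);
  }
  std::sort(candidates.begin(), candidates.end(),
            [](const Block* a, const Block* b) {
              return a->gc_count > b->gc_count;
            });

  for (Block* b : candidates) {
    if (total_allocated - reclaimed <= threshold_bytes) break;  // under threshold: stop
    reclaimed += b->size;  // the real allocator would cudaFree here
    b->size = 0;
  }

  // Surviving idle blocks age by one pass.
  for (auto& b : pool) {
    if (!b.in_use && b.size > 0) b.gc_count++;
  }
  return reclaimed;
}
```

The key property matching the summary: in-use blocks are never touched (so no event wait is required), and the loop exits as soon as usage is back under the threshold, which keeps the pass cheap.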
Hey @jaewonlee-fb.
Same summary as above; cherry picked from commit 05780f1 (fbshipit-source-id: d5eae62ac60b94b0bca851f9d233a092d086e3c2).
```diff
@@ -102,6 +102,7 @@ struct DeviceStats {
 // cudaMalloc)..
 struct BlockInfo {
   int64_t size = 0;
+  int32_t gc_counter = 0;
```
Review comment: seems a typo? `gc_counter` -> `gc_count`

Reply: oh, it's not used actually.
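For context on how a threshold like the one in the summary tends to be consumed: GC fires only when allocations exceed a configured fraction of device memory, and is skipped entirely when the fraction is unset, keeping the path opt-in. A minimal sketch of that trigger check (`should_garbage_collect` and its parameter names are illustrative, not the allocator's real API):

```cpp
#include <cassert>
#include <cstdint>

// Returns true when a failed malloc should trigger the optional GC pass.
// The threshold is a fraction of total device memory; a non-positive
// fraction means the feature is disabled (the default).
bool should_garbage_collect(double gc_threshold_fraction,
                            int64_t device_total_bytes,
                            int64_t allocated_bytes) {
  if (gc_threshold_fraction <= 0.0) return false;  // default: GC disabled
  const int64_t threshold_bytes =
      static_cast<int64_t>(gc_threshold_fraction * device_total_bytes);
  return allocated_bytes > threshold_bytes;
}
```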