denera
Collaborator

@denera denera commented Apr 23, 2025

Description

When initialize_ub() and destroy_ub() are called as a pair more than once in the same process (e.g. in-process restarts), the cuBLAS workspace allocation is mishandled and grows exponentially with each re-initialization. This PR safeguards the workspace expansion in initialize_ub() to avoid this leak.
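
For readers unfamiliar with the failure mode, here is a minimal sketch of how an unguarded expansion can compound across repeated initializations, and how a size check avoids it. This is not the Transformer Engine code: every name and constant below (`BASE_WORKSPACE_BYTES`, `NUM_STREAMS`, `WorkspaceState`, `initialize_unguarded`, `initialize_guarded`, `destroy`) is hypothetical, the workspace is modeled as a plain `bytearray` rather than a GPU tensor, and the sketch assumes that teardown leaves the workspace untouched.

```python
# Hypothetical constants, chosen only to keep the numbers small.
BASE_WORKSPACE_BYTES = 32  # size of the single-stream workspace
NUM_STREAMS = 3            # number of concurrent streams to provision for


class WorkspaceState:
    """Stand-in for process-wide state that survives init/destroy cycles."""

    def __init__(self):
        self.workspace = bytearray(BASE_WORKSPACE_BYTES)


def initialize_unguarded(state):
    # Expands whatever workspace currently exists. If a previous call already
    # expanded it, the size is multiplied again.
    state.workspace = state.workspace * NUM_STREAMS


def initialize_guarded(state):
    # Expands only when the workspace is not already at the target size.
    target = BASE_WORKSPACE_BYTES * NUM_STREAMS
    if len(state.workspace) != target:
        state.workspace = bytearray(target)


def destroy(state):
    # Assumption for this sketch: teardown does not reset the workspace.
    pass


def sizes_after_restarts(init_fn, restarts):
    """Run init/destroy pairs and record the workspace size after each pair."""
    state = WorkspaceState()
    sizes = []
    for _ in range(restarts):
        init_fn(state)
        destroy(state)
        sizes.append(len(state.workspace))
    return sizes


if __name__ == "__main__":
    print(sizes_after_restarts(initialize_unguarded, 3))  # [96, 288, 864]
    print(sizes_after_restarts(initialize_guarded, 3))    # [96, 96, 96]
```

In the unguarded variant the size is multiplied by the stream count on every cycle, so it grows geometrically with the number of restarts. The guarded variant is idempotent, so repeated cycles leave the allocation at a fixed size.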

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@denera denera added bug Something isn't working 2.3.0 labels Apr 23, 2025
@denera denera requested a review from ksivaman April 23, 2025 22:22
@denera denera self-assigned this Apr 23, 2025
…ponential growth across repeat initializations

Signed-off-by: Alp Dener <[email protected]>
@denera denera force-pushed the bugfix/initialize-ub-cublas-workspace-leak branch from e2c1ee4 to 1d8e404 Compare April 25, 2025 17:17
@denera
Collaborator Author

denera commented Apr 25, 2025

/te-ci pytorch L0 L1

@ksivaman
Member

Confirmed offline that this fixes the issue of GPU memory not being reclaimed after user buffer cleanup (destroy_ub).

@ksivaman
Member

Pipeline 27525544

@denera denera merged commit 4e9c2c3 into NVIDIA:main Apr 28, 2025
11 checks passed