pxl-th
Member

@pxl-th pxl-th commented May 4, 2023

TODO: add detailed PR description

Builds upon #419
Fixes #418

@mzy2240

mzy2240 commented May 9, 2023

I am wondering whether this PR also fixes the synchronization issue in #298.

@pxl-th
Member Author

pxl-th commented May 11, 2023

In general, this PR fixes use-after-free issues. Previously one could hit the following scenario:

  • you call a HIP-based kernel (e.g. rocBLAS gemm);
  • the GC frees the arrays you passed as arguments before the kernel finishes;
  • you get a use-after-free error and/or a hang.
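
As a rough illustration of the scenario above, here is a toy Python model (not the code from this PR, and not real HIP calls). `Buffer`, `Stream`, `free_now`, `free_ordered` and `run` are hypothetical names invented for the sketch. It assumes a stream that runs queued work in order, and contrasts an eager free with a free that is queued behind kernels already launched:

```python
class UseAfterFree(RuntimeError):
    """Raised when a queued kernel touches a buffer that was already released."""


class Buffer:
    """Toy stand-in for a device array (hypothetical, for illustration only)."""

    def __init__(self, data):
        self.data = list(data)
        self.freed = False

    def read(self):
        if self.freed:
            raise UseAfterFree("kernel touched freed memory")
        return self.data


class Stream:
    """Toy stand-in for a GPU stream: work is queued first and run later, in order."""

    def __init__(self):
        self.pending = []

    def launch(self, kernel, *buffers):
        # Asynchronous launch: nothing executes until synchronize() is called.
        self.pending.append(lambda: kernel(*buffers))

    def free_now(self, buf):
        # Eager release, ignoring work that is still queued on the stream.
        buf.freed = True

    def free_ordered(self, buf):
        # Stream-ordered release: the free runs after kernels launched before it.
        self.pending.append(lambda: setattr(buf, "freed", True))

    def synchronize(self):
        while self.pending:
            self.pending.pop(0)()


def run(free_method_name):
    """Launch one kernel, release its argument right away, then synchronize."""
    stream = Stream()
    a = Buffer([1, 2, 3])
    out = []
    stream.launch(lambda buf: out.append(sum(buf.read())), a)
    getattr(stream, free_method_name)(a)  # plays the role of the GC finalizer
    stream.synchronize()
    return out
```

In this model `run("free_now")` raises `UseAfterFree`, while `run("free_ordered")` completes, because the release is only applied once the kernel has read its argument.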

One issue with HIP-based allocations at the moment, though, is that they are drawn from a HIP memory pool that grows on demand, and the total size of the pool can be limited only on ROCm 5.5+, via hipDeviceSetLimit.

When the pool grows to nearly 100% of the total memory available on the device (because the GC is not keeping up), it kills our HSA queue, since the resources needed for a kernel dispatch can no longer be allocated.

Interestingly, on ROCm 5.4 you can get the memory limit of the pool but not set it.

And constraining pool growth from the Julia side seems unreliable at the moment.
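
To make the idea of constraining pool growth from the language-runtime side concrete, here is a toy Python sketch. It is not the Julia or HIP implementation; `Pool`, `Block` and the `limit` parameter are hypothetical. The assumption is a soft limit that forces a garbage collection before refusing an allocation. The reference cycle only mimics garbage that has not been collected yet:

```python
import gc


class Block:
    """Toy allocation that returns its bytes to the pool when finalized."""

    def __init__(self, pool, nbytes):
        self.pool = pool
        self.nbytes = nbytes
        # Self-reference so the block lingers until a GC pass, imitating
        # unreachable arrays whose finalizers have not run yet.
        self.cycle = self

    def __del__(self):
        self.pool.used -= self.nbytes


class Pool:
    """Toy memory pool with a soft size limit enforced on the runtime side."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def alloc(self, nbytes):
        if self.used + nbytes > self.limit:
            # Near the limit: run the GC so pending finalizers can free memory.
            gc.collect()
        if self.used + nbytes > self.limit:
            raise MemoryError("pool limit exceeded even after a GC pass")
        self.used += nbytes
        return Block(self, nbytes)
```

With `Pool(limit=100)`, repeatedly calling `alloc(30)` and dropping the result keeps `used` at or below the limit, because each overflow triggers a collection first. The sketch does not capture why this is unreliable in practice; it only shows the shape of the approach.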

@pxl-th
Member Author

pxl-th commented Jun 16, 2023

Superseded by #423

@pxl-th pxl-th closed this Jun 16, 2023
2 participants