
Commit 701bd77

rraminenpragupta authored and committed
Clean up CUDA state between tests (#2335)
This PR fixes the unit test test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction, which was failing (FAILED [0.1163s]) with:

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```

This error occurs only on the gfx1101 arch. It comes from an integer overflow: another unit test, test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel, creates a tensor with a huge numel, which leaves torch.cuda.max_memory_reserved() inflated when test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction is run afterward. To avoid this, we introduce torch.cuda.empty_cache() and torch.cuda.reset_peak_memory_stats() to clean up the CUDA state.

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295

(cherry picked from commit f86d184)
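For context, here is a minimal sketch of the failure mechanism. It assumes the test sizes its probe tensor roughly as a fraction of total device memory minus torch.cuda.max_memory_reserved(); the exact arithmetic in test_set_per_process_memory_fraction may differ, and `probe_size` is a hypothetical helper used only for illustration:

```python
# Sketch of the failure mechanism; the size arithmetic is an assumption,
# not the exact code of test_set_per_process_memory_fraction.
import torch

def probe_size(fraction: float = 0.5) -> int:
    # The test limits itself to a fraction of device memory and sizes a
    # probe tensor from what is left after currently reserved memory.
    total = torch.cuda.get_device_properties(0).total_memory
    return int(total * fraction) - torch.cuda.max_memory_reserved()

if torch.cuda.is_available():
    # If an earlier test (e.g. test_randint_generation_for_large_numel) left a
    # huge peak behind, max_memory_reserved() can exceed fraction * total and
    # probe_size() goes negative, which is exactly the
    # "Trying to create tensor with negative dimension" failure above.
    torch.cuda.empty_cache()              # return cached blocks to the device
    torch.cuda.reset_peak_memory_stats()  # clear the stale peak counters
    tmp = torch.empty(probe_size(), dtype=torch.int8, device="cuda")
```

The commit below applies the same two resets at the start of the test, gated on ROCm builds running on gfx1101.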
1 parent 5650382 · commit 701bd77

File tree

1 file changed (+3, -0 lines)


test/test_cuda.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -467,6 +467,9 @@ def test_out_of_memory_retry(self):
         IS_JETSON, "oom reporting has issues on jetson igx due to partial nvml support"
     )
     def test_set_per_process_memory_fraction(self):
+        if torch.version.hip and ('gfx1101' in torch.cuda.get_device_properties(0).gcnArchName):
+            torch.cuda.empty_cache()
+            torch.cuda.reset_peak_memory_stats()
         orig = torch.cuda.get_per_process_memory_fraction(0)
         torch.cuda.reset_peak_memory_stats(0)
         try:
```
