
Commit f86d184

Clean up CUDA state between tests (#2335)
This PR fixes the unit test test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction, which failed as follows (FAILED [0.1163s]):

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```

The error occurs only on the gfx1101 arch. It comes from an integer overflow: when another unit test, test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel, creates a tensor with a huge numel, torch.cuda.max_memory_reserved() is left inflated, and test_set_per_process_memory_fraction then computes a negative allocation size when run afterward. To avoid this, we introduced torch.cuda.empty_cache() and torch.cuda.reset_peak_memory_stats() calls to clean up the CUDA state.

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
1 parent: b622862
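To make the failure mode concrete, here is a minimal sketch, not the actual test code: it assumes the test sizes its allocation roughly as fraction * total_memory - max_memory_reserved(); both that expression and the 0.499 fraction are illustrative assumptions only.

```python
import torch

# A minimal sketch of the failure mode, NOT the actual test code.
# Assumption: the test sizes its allocation roughly as
#   fraction * total_memory - max_memory_reserved()
# (the exact expression and the 0.499 fraction are illustrative only).
if torch.cuda.is_available():
    device = 0
    total_memory = torch.cuda.get_device_properties(device).total_memory

    # If a previous test (e.g. test_randint_generation_for_large_numel)
    # reserved a huge amount of memory, the peak statistic stays inflated
    # until it is explicitly reset.
    stale_peak = torch.cuda.max_memory_reserved(device)

    # With an inflated peak this difference can go negative, which is what
    # torch.empty() then rejects as a "negative dimension".
    application = int(total_memory * 0.499) - stale_peak
    print("computed allocation size:", application)

    # The fix applied in this PR: drop cached blocks and reset the peak
    # statistics so the next test starts from a clean slate.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    print("peak reserved after reset:", torch.cuda.max_memory_reserved(device))
```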

File tree: 1 file changed (+3, −0 lines)


test/test_cuda.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -437,6 +437,9 @@ def test_out_of_memory_retry(self):
         IS_JETSON, "oom reporting has issues on jetson igx due to partial nvml support"
     )
     def test_set_per_process_memory_fraction(self):
+        if torch.version.hip and ('gfx1101' in torch.cuda.get_device_properties(0).gcnArchName):
+            torch.cuda.empty_cache()
+            torch.cuda.reset_peak_memory_stats()
         orig = torch.cuda.get_per_process_memory_fraction(0)
         try:
             # test invalid fraction value.
```
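Note that the cleanup is gated on torch.version.hip and a gfx1101 gcnArchName check, so the extra torch.cuda.empty_cache() / torch.cuda.reset_peak_memory_stats() calls only run on the configuration where the inflated peak statistic was observed; other devices keep the original test behavior.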

0 commit comments
