Commit 90e7375

generatedunixname499836121 authored and facebook-github-bot committed
Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
Summary: Since `cuBLAS` workspaces are already per-stream, there should be no kernel execution overlap with `cuBLASLt` kernels. This PR reuses the `cuBLAS` workspaces for `cuBLASLt`, with the following benefits:

+ Caching: `cuBLAS` workspaces were already cached, so `cuBLASLt` now gets caching as well.
+ A "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and increasing them independently was difficult due to downstream OOMs; see also #120925.
+ Fixes broken behavior with the memtracker: pytorch/pytorch#139442 attempted to handle the peaky allocation behavior that broke memtracker equivalence tests, but it did not seem to fully work; the cached/reused `cuBLAS` workspace here appears to fix it.
+ One environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt`, without a confusing separate `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider.

X-link: pytorch/pytorch#145130

Approved by: https://github.com/ngimel

Reviewed By: izaitsevfb

Differential Revision: D71711852

fbshipit-source-id: 4f57539b8f37f1f4c92a57c19276e84f81bffa23
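As an illustration of the last point, here is a minimal sketch (not part of this commit) of how a user configures the now-unified workspace; the tensor shapes are arbitrary, and the `:SIZE_KiB:COUNT` reading of the value follows the cuBLAS documentation:

    import os

    # Set before the first CUDA context is created; ":4096:8" requests eight
    # 4096 KiB workspace buffers. After this change the same variable also
    # sizes cuBLASLt workspaces, so no separate CUBLASLT_WORKSPACE_SIZE is needed.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    import torch

    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    # cuBLAS and cuBLASLt matmuls now draw from the same per-stream cached workspace.
    c = a @ b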
1 parent 10a7be3 commit 90e7375

File tree

1 file changed: +10 -0 lines changed

userbenchmark/dynamo/dynamobench/common.py

Lines changed: 10 additions & 0 deletions
@@ -3592,6 +3592,16 @@ def run(runner, args, original_dir=None):
         # some of the models do not support use_deterministic_algorithms
         torch.use_deterministic_algorithms(True)
         os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
+        # TODO(eqy): revisit when cuBLASLt workspace size is bumped
+        # if args.only is not None and args.only in {
+        #     "DebertaForQuestionAnswering",
+        #     "RobertaForQuestionAnswering",
+        #     "nvidia_deeprecommender",
+        #     "volo_d1_224",
+        # }:
+        #     # These seem unhappy with numerics of larger cuBLASLt workspace
+        #     # sizes following #145130 (due to enabling split-k?)
+        #     torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
         torch.backends.cudnn.deterministic = True
         torch.backends.cudnn.allow_tf32 = False
         torch.backends.cudnn.benchmark = False
