rocm7.1_internal_testing_IFU_2025-09-09 #2677
Merged: pragupta merged 1,503 commits into rocm7.1_internal_testing from rocm7.1_internal_testing_IFU_2025-09-09 on Sep 24, 2025.
Conversation
…lt (pytorch#159889)" This reverts commit 4ae57d4. Reverted pytorch#159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](pytorch#159889 (comment)))
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native-target optimizations. That is, it fails with `-march=znver2` but succeeds with `-march=znver1`. I assume some operator fusing is being done by GCC. Small differences, such as using `vmovdqa`, can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb The greatest differences are consistent and the same on both CPU architectures: ``` Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed) Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed) ``` Hence I assume this is within the expected tolerances, especially as `complex128` and all other types pass. Pull Request resolved: pytorch#152424 Approved by: https://github.com/malfet
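A minimal sketch of the kind of tolerance comparison discussed above, with illustrative tensors standing in for the baddbmm outputs; the relaxed bounds are assumptions for illustration, not the values used in the actual test:

```python
import torch

# Illustrative stand-ins for the reference and natively-optimized baddbmm outputs.
expected = torch.randn(2, 3, 4, dtype=torch.complex64)
actual = expected + 3.0e-05  # perturbation on the order of the observed mismatch

# The default complex64 tolerances (atol=1e-05, rtol=1.3e-06) reject a difference
# of this size; relaxed bounds accept it as numerical noise.
torch.testing.assert_close(actual, expected, atol=5e-05, rtol=5e-06)
```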
This reverts commit 90b0864. Reverted pytorch#160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](pytorch#160449 (comment)))
Many users want a config that forces all CUDA ops to be captured by cudagraphs; when that is not possible, PT2 should error. This PR adds `torch._inductor.triton.cudagraph_or_error` for that purpose (default: False), along with an environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` to control it. Pull Request resolved: pytorch#161862 Approved by: https://github.com/ezyang, https://github.com/mlazos
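A minimal usage sketch of the toggle described above, assuming the environment variable is honored at compile time as the PR states (the function and shapes are illustrative):

```python
import os
import torch

# Assumed: setting the variable before compilation enables the
# error-instead-of-silent-fallback behavior described in the PR.
os.environ["TORCHINDUCTOR_CUDAGRAPH_OR_ERROR"] = "1"

@torch.compile(mode="reduce-overhead")  # cudagraph-backed compile mode
def f(x):
    return x.sin() + x.cos()

if torch.cuda.is_available():
    # Any op that cannot be captured by cudagraphs should now raise
    # rather than silently fall back.
    f(torch.randn(8, device="cuda"))
```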
…ytorch#162044)" This reverts commit cd529b6. Reverted pytorch#162044 on behalf of https://github.com/jeffdaily due to mi200 backlog is purged, and mi300 runners are failing in GHA download ([comment](pytorch#162044 (comment)))
# Motivation https://github.com/pytorch/pytorch/pull/143553/files#diff-6492991193449e118ff0c8d42ca544cc38a73604e505ff246a3c711aeab91748R1345 makes `largeTensorTest` malfunction on XPU. This PR aims to fix it. Pull Request resolved: pytorch#161988 Approved by: https://github.com/EikanWang, https://github.com/albanD
…h#161907) `CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites that with a single path causing dependencies such as protobuf or Abseil not being found. Instead prepend the path to the existing value. This fixes a test failure: > pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package > self.assertTrue(so_path.exists()) > AssertionError: False is not true Caused by: ``` /software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory /software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory collect2: error: ld returned 1 exit status ``` Pull Request resolved: pytorch#161907 Approved by: https://github.com/Skylion007
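A hedged sketch of the idea behind the fix: a hypothetical helper that prepends to `CMAKE_PREFIX_PATH` instead of replacing it, so previously discoverable dependencies stay visible to CMake (the helper name is illustrative, not from the PR):

```python
import os

def prepend_cmake_prefix_path(new_path: str) -> str:
    """Hypothetical helper: prepend new_path to CMAKE_PREFIX_PATH rather than
    overwriting it, so CMake can still locate protobuf, Abseil, and friends."""
    existing = os.environ.get("CMAKE_PREFIX_PATH", "")
    value = new_path if not existing else f"{new_path}{os.pathsep}{existing}"
    os.environ["CMAKE_PREFIX_PATH"] = value
    return value

# Example: make a temporary install prefix visible without hiding system paths.
prepend_cmake_prefix_path("/tmp/aoti_package_install")
```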
I found a number of places that seem to want forwarding references, but the type signatures do not reflect that. Pull Request resolved: pytorch#161094 Approved by: https://github.com/malfet
Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#162164 Approved by: https://github.com/bdhirsh, https://github.com/albanD, https://github.com/wconstab
Fixes pytorch#161868 Pull Request resolved: pytorch#162106 Approved by: https://github.com/jansel, https://github.com/zou3519
…ch#158747) This is part of our effort to integrate the Composable Kernel library into the Inductor backend. Currently we have a submodule, but we would prefer commit-pin control over the library, as with Triton. We intentionally avoid putting all installation logic in CI scripts so that locally built versions also get this functionality. The idea is to make CK a PyTorch dependency in the 2.9 release so people can use it with Inductor and AOT Inductor, then gradually step away from submodule usage. Right now, CK usage in SDPA/GEMM is tied to submodule files. This PR is a remake of pytorch#156192 due to a branch error. Pull Request resolved: pytorch#158747 Approved by: https://github.com/jeffdaily Co-authored-by: Jithun Nair <[email protected]> Co-authored-by: Jack Taylor <[email protected]> Co-authored-by: Max Podkorytov <[email protected]> Co-authored-by: Copilot <[email protected]>
[PEP 735](https://peps.python.org/pep-0735) introduces the [dependency-groups] table for a number of use-cases one of which includes specifying development dependencies for projects. Pull Request resolved: pytorch#161216 Approved by: https://github.com/seemethere
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](intel/torch-xpu-ops@83c5a5a), includes: - Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed - Fallback lu_factor kernel to CPU for single batch - Enable aten::linalg_inv and aten::linalg_inv_ex on XPU Pull Request resolved: pytorch#162062 Approved by: https://github.com/EikanWang
) This PR implements the semantics change to `torch._dynamo.error_on_graph_break`: - ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~ - `error_on_graph_break` is a new internal `torch.compile `setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks. - `error_on_graph_break` does nothing when `fullgraph=True` - `error_on_graph_break` does NOT guarantee a single graph Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation: - `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled - `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time - `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time Pull Request resolved: pytorch#161747 Approved by: https://github.com/mlazos ghstack dependencies: pytorch#161739
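A sketch of the three modes listed above; the context-manager usage of `torch._dynamo.error_on_graph_break` is assumed from the PR description rather than taken from documented API:

```python
import torch

def fn(x):
    torch._dynamo.graph_break()  # deliberate graph break for illustration
    return x + 1

x = torch.randn(4)

# fullgraph=False, error_on_graph_break=False (default): tracing resumes
# after the break and the call still succeeds.
print(torch.compile(fn, fullgraph=False)(x))

torch._dynamo.reset()  # drop cached code so the next trace sees the new setting

# fullgraph=False, error_on_graph_break=True: the break becomes an error,
# but the toggle can still be flipped at compile time.
try:
    with torch._dynamo.error_on_graph_break(True):  # assumed usage
        torch.compile(fn, fullgraph=False)(x)
except Exception as e:
    print("graph break surfaced as an error:", type(e).__name__)

# fullgraph=True: always errors on a graph break, regardless of the toggle.
```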
…he CUDACachingAllocator (pytorch#158352) ## Introduction During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG of work (we call it the **capturing graph**). We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture. This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path. ## Terms * **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it. * **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`. ## When can we reuse a block during capture? ### Strong Rule (Graph-Wide Safety) This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph. > A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph. Why it's safe: This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness. ### Per-stream Rule (A Practical Optimization) The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check. In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream. > Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S. In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins. ## Implementation * On `free(block)` during capture * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail. * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path. * Otherwise, store the marker handles and keep the block in the capture-private structures. * On `allocate(stream)` during capture (attempt per-stream reclaim) * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`. * For each deferred block, check whether it is allocated on this stream and each of its free markers is a predecessor of the terminal. * If yes, hand the block to S for immediate reuse within the same capture. * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances. * On capture end * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture. ## Examples (2 streams) <img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" /> * Case 0 — Unsafe The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails. Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this. * Case 1 — Reusable on stream 1 Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1. * Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator` This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable. * Case 3 — Safe (strong rule holds) In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block. * Case 4 — Freeing after a join See the note below. ## Edge Case: Freeing after a join Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join, see @galv's [comments here](pytorch#158352 (review)). In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join, so we must wait for the subsequent join before the block can be reused. ## Thanks Thanks to @galv for his great idea around graph parsing and empty nodes. Pull Request resolved: pytorch#158352 Approved by: https://github.com/ngimel, https://github.com/eqy Co-authored-by: Jeff Daily <[email protected]>
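A usage sketch for the experimental flag, assuming it is accepted through the standard allocator settings string like other caching-allocator options; that plumbing is an assumption, not something stated above:

```python
import os
import torch

# Assumed: the flag is parsed from PYTORCH_CUDA_ALLOC_CONF alongside the other
# caching-allocator options, and must be set before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "graph_capture_record_stream_reuse:True"

if torch.cuda.is_available():
    static_in = torch.zeros(1 << 20, device="cuda")
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        # Temporaries freed inside capture may now be reused within the same
        # capture when the allocator can prove safety from capture ordering.
        out = (static_in * 2).relu().sum()
    g.replay()
```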
…orch#161984) Added a helper API to tell if the world is entirely within a P2P domain or crosses the network. This is mainly for nblocks tuning purposes (in later PRs). Pull Request resolved: pytorch#161984 Approved by: https://github.com/ngimel ghstack dependencies: pytorch#161983
so that the signal calls do not step on each other's toes. Pull Request resolved: pytorch#162026 Approved by: https://github.com/ngimel
…161407) Summary: Creates a fallback path for `torch._grouped_mm`, using the naive for loop implementation (or bmm). For the sake of keeping the PR small, this PR only enables SM80+ (CUDA capability 8.0 and up), since I am testing this on an A100 machine. In future PRs, we can increase the coverage of the fallback to: 1. float32 and float16, which will extend the GPU coverage 2. cpu Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: pytorch#161407 Approved by: https://github.com/drisspg, https://github.com/eqy
…61717) Summary: Moves the `torch._grouped_mm` fallback from cuda-only code to a place where it can be used by multiple backends. Specifically: 1. make the fallback path and util functions reusable and move them to `ATen/native/GroupedMMUtils.h` 2. register a backend-agnostic kernel to composite explicit autograd key 3. refactor the grouped_mm tests to their own test case and enable CPU At the end of this PR, here is the support matrix: * CUDA SM90+: fast path with test coverage (no change) * CUDA SM80+: fallback with test coverage (no change) * CPU: fallback works, but without test coverage (new in this PR) * other SM versions and other backends: will probably already work, but let's leave this to future PRs * float32/float16: will probably already work, but let's leave this to future PRs Test Plan: ```bash pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: pytorch#161717 Approved by: https://github.com/ngimel, https://github.com/drisspg ghstack dependencies: pytorch#161407
…62059) Summary: Enables `torch.float32` and `torch.float16` options in `torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`, `mat_b`, and `out_dtype` are `torch.bfloat16`. Saving for future PRs: 1. enabling testing on more platforms 2. supporting out_dtype != mat_a.dtype 3. opinfo 4. better compile support Test Plan: ```bash // on A100 and H100 pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x // on H100 pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: pytorch#162059 Approved by: https://github.com/ngimel, https://github.com/eqy ghstack dependencies: pytorch#161407, pytorch#161717
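A hedged sketch of the 3d-3d case exercised by these tests; `torch._grouped_mm` is a private op, and the exact call signature and layout requirements are assumed from the test names rather than from a documented API:

```python
import torch

if torch.cuda.is_available():
    G, M, K, N = 4, 16, 32, 8
    a = torch.randn(G, M, K, device="cuda", dtype=torch.bfloat16)
    # b is laid out column-major per group, which the fast kernels typically expect.
    b = torch.randn(G, N, K, device="cuda", dtype=torch.bfloat16).transpose(-2, -1)

    # Fast path on SM90+, naive per-group fallback elsewhere (per the PRs above).
    out = torch._grouped_mm(a, b)

    # The fallback is described as equivalent to a batched matmul over the groups.
    ref = torch.bmm(a, b)
    torch.testing.assert_close(out, ref, atol=1e-2, rtol=1e-2)
```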
I don't have a failing test case, but I just saw an extra guard somewhere. Pull Request resolved: pytorch#162105 Approved by: https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
…pytorch#161688) Fixes pytorch#161080 torch.export.export fails with TypeError: expand() got an unexpected keyword argument 'implicit' when calling torch.expand_copy(..., implicit=True). This happened because expand_copy = _make_copy_from_view(aten.expand) registers aten.expand as the decomposition path for aten.expand_copy, which doesn’t accept the implicit argument. I have added an explicit decomposition for aten.expand_copy in torch/_decomp/decompositions.py to ignore the implicit argument, and a simple unit test to demonstrate the bug being fixed. Pull Request resolved: pytorch#161688 Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou
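A minimal sketch of the shape of such a fix (not the merged code): an explicit decomposition that accepts and ignores the `implicit` keyword instead of forwarding it to `Tensor.expand`:

```python
import torch
from torch import Tensor

def expand_copy_decomp(self: Tensor, size, *, implicit: bool = False) -> Tensor:
    # `implicit` is accepted for signature compatibility and deliberately ignored,
    # since Tensor.expand does not take it.
    return self.expand(size).clone()

x = torch.randn(1, 3)
y = expand_copy_decomp(x, (4, 3), implicit=True)  # no TypeError on this path
assert y.shape == (4, 3)
```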
…ch#162073) for 2.9 🙏 Pull Request resolved: pytorch#162073 Approved by: https://github.com/drisspg
…h.is_complex (pytorch#161951) The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values. Fixes pytorch#161859 Pull Request resolved: pytorch#161951 Approved by: https://github.com/zou3519
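An illustration of the behavior such a docs example demonstrates (not necessarily the exact snippet merged in the PR):

```python
import torch

print(torch.is_complex(torch.tensor([1 + 2j], dtype=torch.complex64)))  # True
print(torch.is_complex(torch.tensor([1.0, 2.0], dtype=torch.float32)))  # False
print(torch.is_complex(torch.tensor([1, 2], dtype=torch.int64)))        # False
```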
…orch#161355) Pull Request resolved: pytorch#161355 Approved by: https://github.com/zou3519
Update cpp-httplib with better error handling, bugfixes, and performance. Header only library update. Pull Request resolved: pytorch#162181 Approved by: https://github.com/jansel
Summary: att Test Plan: ci Rollback Plan: Reviewed By: minjang Differential Revision: D80828148 Pull Request resolved: pytorch#161798 Approved by: https://github.com/minjang, https://github.com/SherlockNoMad
Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
This reverts commit 2c03f0a. Reverted pytorch#162007 on behalf of https://github.com/jeanschmidt due to Breaks internal builds see [D81588372](https://www.internalfb.com/diff/D81588372), @malfet may you help the author? ([comment](pytorch#162007 (comment)))
This reverts commit b40d943. Reverted pytorch#162001 on behalf of https://github.com/jeanschmidt due to break a few internal tests ([comment](pytorch#161999 (comment)))
fbgemm_gpu was failing to clone due to missing submodule commit. ``` + pushd fbgemm/fbgemm_gpu ~/pytorch/fbgemm/fbgemm_gpu ~/pytorch + git checkout 7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8 --recurse-submodules fatal: failed to unpack tree object b1281b8b08d973a7064f864f47eeb30f3e2596e9 error: Submodule 'external/composable_kernel' could not be updated. error: Cannot update submodule: external/composable_kernel ``` Log File [inductor-periodic · pytorch/pytorch@5babb4d](https://github.com/pytorch/pytorch/actions/runs/17536630806/job/49802458834) Pull Request resolved: pytorch#162385 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>
Implements mps sparse mul operation as well as enables other operations such as: 1. copy_ 2. div 3. sum 4. floor 5. power 6. sub 7. floor_divide Pull Request resolved: pytorch#162349 Approved by: https://github.com/pearu, https://github.com/malfet Co-authored-by: Nikita Shulga <[email protected]>
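A small sketch of the newly enabled elementwise sparse multiply, runnable only where an MPS device is available; the tensor values are illustrative:

```python
import torch

if torch.backends.mps.is_available():
    i = torch.tensor([[0, 1, 1], [2, 0, 2]])
    v = torch.tensor([3.0, 4.0, 5.0])
    a = torch.sparse_coo_tensor(i, v, (2, 3), device="mps")
    b = torch.sparse_coo_tensor(i, 2 * v, (2, 3), device="mps")
    # Elementwise sparse * sparse multiply, now implemented for MPS.
    print((a * b).to_dense().cpu())
```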
All these UTs are working as is, just removing the skip - test_p2p_ipc - test_repros.py: working, added fp8 support - test_activation_checkpointing.py - test_content_store.py - test_cuda_multigpu.py - test_compute_comm_reordering.py - test_segment_reductions.py - test_dataloader.py - test_math_ops.py - test_loop_ordering.py - test_control_flow.py - distributed_test.py - test_mem_tracker.py - test_fsdp_optim_state.py - test_fully_shard_mixed_precision.py: skipped for < ROCm7.0 - test_aot_inductor_custom_ops.py - test_c10d_ops_nccl.py - test_eager_transforms.py - test_sparse_csr.py - test_inductor_collectives.py - test_fake_tensor.py - test_cupy_as_tensor.py - test_cuda.py: enable UTs that are working - test_matmul_cuda.py: enable UTs that are working Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#161715 Approved by: https://github.com/msaroufim Co-authored-by: Mark Saroufim <[email protected]>
Our compiler is generating inefficient code for the offsetCalc in certain situations. The root cause for this needs to be identified; for now, specialized unrolling based on 'dims' notably helps perf. Pull Request resolved: pytorch#161700 Approved by: https://github.com/jeffdaily
many tests taking >30 min and causing timeouts Pull Request resolved: pytorch#162496 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>
…pytorch#162387) Pull Request resolved: pytorch#162387 Approved by: https://github.com/Skylion007
# why - gather everything up to make choices, without running potentially expensive generators - enables overrides where we toss the entire list of configs from inductor, without having to enumerate it (expensive) # what - add a holding class that just gets all the components necessary to generate a ChoiceCaller - use that class to generate ChoiceCallers - this does not (yet) add the override function, but just prepares the scene ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569) Pull Request resolved: pytorch#161347 Approved by: https://github.com/eellison
…pytorch#161348) \# why - every callsite just executes the generator on the spot - previous pr adds the ability to add an override before expensive generators are executed, so we don't need this generator anymore \# what - rather than yielding the ChoiceCaller, just return the list of all valid ChoiceCallers \# testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574) Pull Request resolved: pytorch#161348 Approved by: https://github.com/eellison ghstack dependencies: pytorch#161347
# why - enable us to override the default configs, or fall back to them through subclassing InductorChoices # what - override (private) function - the default implementation takes the kernel template choice (ktc) generator for every template and just executes the generator - future overrides can decide to replace those generators, or filter out choices - the 2nd expensive step (maybe_append_choices, choice_or_none) is handled outside this function, in the main V.choices.get_mm_configs this means that any overriding benefits from not generating expensive templates that aren't going to be used # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520570](https://our.internmc.facebook.com/intern/diff/D81520570) Pull Request resolved: pytorch#161349 Approved by: https://github.com/eellison ghstack dependencies: pytorch#161347, pytorch#161348
…pytorch#162238) # why - unnecessary as we only ever need to know the dtype and maybe the device - we already take in the kernel inputs which have the device - enable us to specify the layout after finding all the configs but before generating the ChoiceCallers # what - replace all calls in template_heuristics that used to take Layout with now just taking out_dtype # testing ci Differential Revision: [D81820115](https://our.internmc.facebook.com/intern/diff/D81820115) Pull Request resolved: pytorch#162238 Approved by: https://github.com/eellison ghstack dependencies: pytorch#161347, pytorch#161348, pytorch#161349
When visualizing the schedules using `_PipelineScheduleExecution`, we don't provide any spacing between dependencies, so when visualizing `DualPipeV` it looks like this: <img width="3168" height="486" alt="image" src="https://github.com/user-attachments/assets/d2c881ad-4ee0-46b6-ac03-13e5600b5a55" /> While it has the correct order of operations, it does not show the dependencies correctly. As shown in the original implementation, it should look something like this: <img width="3542" height="384" alt="image" src="https://github.com/user-attachments/assets/c930fa98-848e-4951-a58b-c81f41092d14" /> This allows an option to add spacing to the visualizer, so it is easier to see dependencies. After change: <img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/7708367e-bdb4-46e8-a7c4-f19e18047f59" /> Pull Request resolved: pytorch#160474 Approved by: https://github.com/fegin
Correct the rounding scheme for nearest in quantile. Pull Request resolved: pytorch#162423 Approved by: https://github.com/soulitzer
…ec is ok (pytorch#160580) My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps: (1) [this PR]: hardcode aten.detach in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop. (2) [assuming success of 1]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to python entirely (except for some code that probably needs to allocate a pyobject for the output DTensor, from C++) I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec. So I'm mostly hoping that DTensorSpecs don't generally get mutated This by itself does seem to speed up `alias` by quite a bit (roughly 2.5x speedup, from ~336us -> 133us): **aten.detach(plain_tensor)** ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790> _ = x.detach() 4.80 us 1 measurement, 100000 runs , 1 thread ``` **aten.detach(DTensor) [before this PR]** ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750> _ = x_dt.detach() 336.40 us 1 measurement, 1000 runs , 1 thread ``` **aten.detach(DTensor) [after this PR]** ``` <torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520> _ = x_dt.detach() Median: 133.45 us 2 measurements, 1000 runs per measurement, 1 thread ``` benchmark script: ``` import torch import torch.distributed as dist from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard from torch.testing._internal.distributed.fake_pg import FakeStore import torch.utils.benchmark as benchmark fake_store = FakeStore() dist.init_process_group("fake", store=fake_store, rank=0, world_size=2) mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,)) x = torch.randn(4, 4, requires_grad=True) x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False) t0 = benchmark.Timer( stmt='_ = x_dt.detach()', globals={'x_dt': x_dt}, ) print(t0.blocked_autorange()) dist.destroy_process_group() ``` Pull Request resolved: pytorch#160580 Approved by: https://github.com/ezyang
This reverts commit 3ea6868. Reverted pytorch#162349 on behalf of https://github.com/malfet due to Fails trunk tests, with uint8 sum ([comment](pytorch#162349 (comment)))
…ytorch#161394) This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format consumed by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are: - Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit` - Added `--benchmark-name` argument for customizing the benchmark name in output - Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for a more predictable output file name Sample command to run a single operator: `python -m pt.mm_test --use-compile` Pull Request resolved: pytorch#161394 Approved by: https://github.com/jbschlosser
Pull Request resolved: pytorch#158686 Approved by: https://github.com/eellison
…2432) We update it to call write_atomic instead of file.write Pull Request resolved: pytorch#162432 Approved by: https://github.com/oulgen
Fixes pytorch#162357 Fixes pytorch#160970 Fixes pytorch#161038 Fixes pytorch#160951 Fixes pytorch#161698 These tests were introduced in pytorch#160765 and they are all flaky when `torch._inductor.aot_compile` uses multiple threads (the default option). The issue can be reproduced by running them locally multiple times. For example, ``` pytest --flake-runs 10 --flake-finder -v inductor/test_fxir_backend.py -k test_aoti_fx_add (output logs at P1938386961) ... Captured stdout call (seven runs): inductor [('async_compile_cache_miss', 1)] graph_break [] Captured stdout call (three runs): inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)] graph_break [] ================== short test summary info ================== FAILED [0.4834s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__' FAILED [0.4576s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__' FAILED [0.4613s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__' ================== 3 failed, 7 passed in 12.89s ================== ``` Setting `compile_threads` to 1 will get rid of the test flakiness, but there might be underlying issues from pytorch#160765. Pull Request resolved: pytorch#162472 Approved by: https://github.com/angelayi, https://github.com/Skylion007
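For reference, a minimal sketch of the single-threaded workaround mentioned above, using the existing inductor `compile_threads` config knob:

```python
from torch._inductor import config as inductor_config

# Force single-threaded inductor compilation; per the note above this removes
# the flakiness, pointing at a concurrency issue rather than the tests themselves.
with inductor_config.patch(compile_threads=1):
    ...  # run the affected aot_compile call / test body here
```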
I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it. The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple. Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object. Fixes pytorch#161610 Pull Request resolved: pytorch#161645 Approved by: https://github.com/anijain2305, https://github.com/StrongerXi
…sting_IFU_2025-09-09 # Conflicts: # .ci/docker/ci_commit_pins/triton.txt # .ci/docker/requirements-ci.txt # aten/src/ATen/Context.cpp # aten/src/ATen/cuda/detail/OffsetCalculator.cuh # aten/src/ATen/cuda/tunable/GemmHipblaslt.h # aten/src/ATen/native/ConvUtils.h # aten/src/ATen/native/Convolution.cpp # aten/src/ATen/native/Normalization.cpp # aten/src/ATen/native/cuda/Blas.cpp # aten/src/ATen/native/miopen/Conv_miopen.cpp # requirements.txt # test/distributed/_tools/test_fsdp2_mem_tracker.py # test/distributed/tensor/parallel/test_tp_examples.py # test/dynamo/test_activation_checkpointing.py # test/dynamo/test_structured_trace.py # test/inductor/test_aot_inductor.py # test/inductor/test_combo_kernels.py # test/test_matmul_cuda.py # test/test_sparse.py # torch/_higher_order_ops/triton_kernel_wrap.py # torch/_inductor/choices.py # torch/_inductor/codegen/triton.py # torch/testing/_internal/common_cuda.py
This reverts commit 69a25f6.
Jenkins build for 9e7df766290def1ac0112fc758a6fa1ea126e95a commit is in progress
Re-attempt of landing this IFU PR: #2625
Snapshot of rocm7.1_internal_testing before we tried to land this PR: https://github.com/ROCm/pytorch/tree/rocm7.1_internal_testing_snapshot_2025_09_24
rocm_base: 681e60e
Tested this PR on MI300x using registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16623_ubuntu24.04_py3.12_pytorch_rocm7.1_internal_testing_681e60e1
Ran the following UTs:
test_nn, test_torch, test_cuda, test_ops, test_unary_ufuncs, test_autograd, inductor/test_torchinductor
All ran fine, attaching logs!
default_ut.log
Successful wheel build job with this branch: http://rocm-ci.amd.com/view/preview/job/pytorch2.8-manylinux-wheels-preview/116/