@pragupta pragupta commented Sep 24, 2025

Re-attempt of landing this IFU PR: #2625
Snapshot of rocm7.1_internal_testing before we tried to land this PR: https://github.com/ROCm/pytorch/tree/rocm7.1_internal_testing_snapshot_2025_09_24

rocm_base: 681e60e

Tested this PR on MI300x using registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16623_ubuntu24.04_py3.12_pytorch_rocm7.1_internal_testing_681e60e1

Ran the following UTs:
test_nn, test_torch, test_cuda, test_ops, test_unary_ufuncs, test_autograd, inductor/test_torchinductor

All ran fine, attaching logs!
default_ut.log

Successful wheel build job with this branch: http://rocm-ci.amd.com/view/preview/job/pytorch2.8-manylinux-wheels-preview/116/

pytorchmergebot and others added 30 commits September 4, 2025 13:13
…lt (pytorch#159889)"

This reverts commit 4ae57d4.

Reverted pytorch#159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](pytorch#159889 (comment)))
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native targeted optimizations. I.e. it fails with `-march=znver2` but succeeds with `-march=znver1`.

I assume some operator fusing is being used by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb

The greatest differences are consistent and the same on both CPU architectures:
```
Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed)
```

Hence I assume this is within the expected tolerances, especially as `complex128` and all other types pass.
Pull Request resolved: pytorch#152424
Approved by: https://github.com/malfet
This reverts commit 90b0864.

Reverted pytorch#160449 on behalf of https://github.com/jeanschmidt due to Already discussed with @ezyang about the internal quirks and errors ([comment](pytorch#160449 (comment)))
Many users want a config to force all cuda ops captured by cudagraph. When not possible, pt2 should error.

This PR adds `torch._inductor.triton.cudagraph_or_error` for that (default as False). Also added an environment variable `TORCHINDUCTOR_CUDAGRAPH_OR_ERROR` to control.

Pull Request resolved: pytorch#161862
Approved by: https://github.com/ezyang, https://github.com/mlazos
…ytorch#162044)"

This reverts commit cd529b6.

Reverted pytorch#162044 on behalf of https://github.com/jeffdaily due to mi200 backlog is purged, and mi300 runners are failing in GHA download ([comment](pytorch#162044 (comment)))
…h#161907)

`CMAKE_PREFIX_PATH` is a list of paths used to find dependencies. The test overwrites that with a single path, so dependencies such as protobuf or Abseil are not found.

Instead prepend the path to the existing value.

This fixes a test failure:
> pytorch-v2.7.1/test/inductor/test_aot_inductor_package.py", line 242, in test_compile_after_package
>    self.assertTrue(so_path.exists())
> AssertionError: False is not true

Caused by:
```
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::utility: No such file or directory
/software/binutils/2.42-GCCcore-13.3.0/bin/ld: cannot find -labsl::variant: No such file or directory
collect2: error: ld returned 1 exit status
```
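
The prepend-instead-of-overwrite fix described above can be sketched as follows. This is a minimal illustration, assuming the value is passed through the environment, where CMake splits `CMAKE_PREFIX_PATH` on the platform's path separator. The helper name `prepend_prefix_path` and the example paths are hypothetical and not taken from the test.

```python
import os


def prepend_prefix_path(env, new_path, var="CMAKE_PREFIX_PATH"):
    """Return a copy of env with new_path placed first in var.

    Existing entries are kept so dependencies such as protobuf or
    Abseil can still be found.
    """
    updated = dict(env)
    existing = updated.get(var, "")
    # Only add a separator when there is something to keep.
    updated[var] = new_path + os.pathsep + existing if existing else new_path
    return updated


# Hypothetical paths, for illustration only.
base_env = {"CMAKE_PREFIX_PATH": os.pathsep.join(["/opt/protobuf", "/opt/abseil"])}
new_env = prepend_prefix_path(base_env, "/tmp/torch_install")
print(new_env["CMAKE_PREFIX_PATH"].split(os.pathsep))
```

Returning a copy leaves the caller's environment untouched, which keeps the sketch safe to use when only a subprocess should see the extra path.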

Pull Request resolved: pytorch#161907
Approved by: https://github.com/Skylion007
I found a number of places that seem to want forwarding
references but the type signature does not reflect that

Pull Request resolved: pytorch#161094
Approved by: https://github.com/malfet
…ch#158747)

This is part of our effort to integrate the Composable Kernel (CK) library into the Inductor backend. Currently we have a submodule, but we would prefer to have commit pin control over the library, as with Triton. We intentionally avoid putting all installation logic in CI scripts so that locally built versions also have this functionality.

The idea is to have CK as a pytorch dependency in pytorch 2.9 release to allow people to use it with inductor and AOT inductor and then gradually step away from submodule usage. Right now CK usage in SDPA/Gemm is tied to submodule files.

This PR is a remake of pytorch#156192, needed because of a branch error.

Pull Request resolved: pytorch#158747
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Jack Taylor <[email protected]>
Co-authored-by: Max Podkorytov <[email protected]>
Co-authored-by: Copilot <[email protected]>
[PEP 735](https://peps.python.org/pep-0735) introduces the
[dependency-groups] table for a number of use-cases one of
which includes specifying development dependencies for projects.

Pull Request resolved: pytorch#161216
Approved by: https://github.com/seemethere
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@83c5a5](intel/torch-xpu-ops@83c5a5a), includes:

- Revert "Disable xccl timer avoid drlm hang" because XPU time event issue has been fixed
- Fallback lu_factor kernel to CPU for single batch
- Enable aten::linalg_inv and aten::linalg_inv_ex on XPU
Pull Request resolved: pytorch#162062
Approved by: https://github.com/EikanWang
)

This PR implements the semantics change to `torch._dynamo.error_on_graph_break`:
- ~`torch.compile` now has a new `error_on_graph_break` kwarg that serves as a lower-priority toggle for erroring/continuing on graph breaks~
- `error_on_graph_break` is a new internal `torch.compile` setting that is lower-priority than `fullgraph`. It allows the user to toggle erroring/continuing on graph breaks.
- `error_on_graph_break` does nothing when `fullgraph=True`
- `error_on_graph_break` does NOT guarantee a single graph

Followup [DONE]: need to change the programming model docs to reflect the 3 graph break modes for compilation:
- `fullgraph=True`: enforce one graph, no graph breaks, cannot be toggled
- `fullgraph=False, error_on_graph_break=True`: errors on graph breaks, latter can be toggled during compile time
- `fullgraph=False, error_on_graph_break=False`: resumes tracing on graph breaks, latter can be toggled during compile time
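
The precedence between the two settings can be summarized with a small sketch. `graph_break_mode` is a hypothetical helper written only for illustration; it is not a PyTorch API and it only mirrors the three modes listed above.

```python
def graph_break_mode(fullgraph, error_on_graph_break):
    """Map the two settings to the behavior described above."""
    if fullgraph:
        # fullgraph takes priority; error_on_graph_break does nothing here.
        return "single graph enforced"
    if error_on_graph_break:
        return "error on graph break"
    return "resume tracing on graph break"


for fg in (True, False):
    for eogb in (True, False):
        print(fg, eogb, "->", graph_break_mode(fg, eogb))
```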

Pull Request resolved: pytorch#161747
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#161739
…he CUDACachingAllocator (pytorch#158352)

## Introduction

During CUDA Graph capture, the CUDA caching allocator currently defers reclaiming blocks until capture ends. This is because CUDA forbids querying events recorded during capture (the CUDA operation is not executed during the capture stage), so the allocator cannot use its normal event-based logic. However, capture records a DAG (we call it **capturing graph**) of work. We can use the capturing graph to determine when a block’s old lifetime is fully before future work, and safely reuse it within the same capture.

This PR adds an experimental flag `graph_capture_record_stream_reuse: True|False (default: False)`. When enabled, the allocator inserts lightweight free markers and uses capture ordering to decide if a freed block is safe to reuse during capture. If the proof cannot be established, we fall back to the existing post-capture path.

## Terms

* **Free marker**: A capture-legal no-op (created with `cudaGraphAddEmptyNode`) inserted after the last captured use of the block on each stream that used it.
* **Terminal**: The set of the latest operations of the stream (or the capturing graph). Any newly captured op on that stream will attach after all nodes in this set. For a stream currently capturing, it is the set of nodes returned in `dependencies_out` by `cudaStreamGetCaptureInfo`.

## When can we reuse a block during capture?

### Strong Rule (Graph-Wide Safety)

This rule provides a universal guarantee that a block is safe for reuse by any stream in the graph.

> A block is safe to reuse if every free marker is a predecessor of every terminal of all active streams in the graph.

Why it's safe:

This rule establishes a strict global ordering. Since any new operation on any stream must be appended after that stream's terminals, this condition guarantees that the block's new lifetime begins only after its old lifetime has completely ended everywhere. This prevents lifetime overlaps when the graph is replayed, ensuring correctness.

### Per-stream Rule (A Practical Optimization)

The strong rule, while safe, is often unnecessarily restrictive. The `DeviceCachingAllocator` introduces a crucial constraint that allows for a simpler check.

In `DeviceCachingAllocator`, `get_free_block` only returns blocks whose `block->stream == p.stream()`. In other words, we never reuse a block on a stream different from the allocation stream. This means we don't need to verify safety across the entire graph. We only need to confirm that the block is safe to reuse from the perspective of its own allocation stream.

> Reuse a block for allocations on stream S if every free marker is a predecessor of every node in the terminal set of S.

In short, a block is considered **reusable** on stream S as long as all markers marking it "free" are guaranteed to complete before any new work that might need it on stream S begins.
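
Both rules reduce to reachability checks on the capturing graph. The following is a minimal sketch on a toy graph, not the allocator's C++ code: the graph is a plain dict of successor lists, all node and function names are made up, and a node is assumed to be trivially ordered before itself (so a marker that is currently a stream's tail counts as preceding it).

```python
def reaches(successors, src, dst):
    """True if dst is reachable from src (a node reaches itself)."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(successors.get(node, ()))
    return False


def reusable_on_stream(successors, free_markers, terminals):
    """Per-stream rule: every free marker precedes every terminal of S."""
    return all(reaches(successors, m, t) for m in free_markers for t in terminals)


def reusable_anywhere(successors, free_markers, terminals_by_stream):
    """Strong rule: the per-stream rule holds for all active streams."""
    return all(
        reusable_on_stream(successors, free_markers, terms)
        for terms in terminals_by_stream.values()
    )


# Toy capture with two streams; node names are made up.
unordered = {"use1": ["free1"], "free1": ["k1"], "use2": ["free2"], "free2": ["k2"]}
markers = ["free1", "free2"]
print(reusable_on_stream(unordered, markers, ["k1"]))  # free2 never reaches k1

# Stream 1 now waits on stream 2: both markers precede the join node.
joined = {**unordered, "free2": ["k2", "join1"], "k1": ["join1"]}
print(reusable_on_stream(joined, markers, ["join1"]))
print(reusable_anywhere(joined, markers, {"s1": ["join1"], "s2": ["k2"]}))
```

In the joined graph the per-stream rule holds for stream 1 while the strong rule still fails, because stream 2's terminal is not ordered after stream 1's free marker.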

## Implementation

* On `free(block)` during capture
  * For each stream in `block->stream_uses` and the allocation stream, insert a free marker (empty node) and make it that stream’s tail.
  * If we cannot place markers for all such streams (for example, a stream is not in capture), defer to the post-capture path.
  * Otherwise, store the marker handles and keep the block in the capture-private structures.
* On `allocate(stream)` during capture (attempt per-stream reclaim)
  * Query the allocation stream S’s terminal via `cudaStreamGetCaptureInfo`.
  * For each deferred block, check whether it is allocated on this stream, and each of its free markers is a predecessor of the terminal.
    * If yes, hand the block to S for immediate reuse within the same capture.
    * If no, keep it deferred; it will be reconsidered as capture progresses and S’s terminal advances.
* On capture end
  * Any still-deferred blocks follow the existing post-capture reclamation (event insertion/polling). External behavior remains unchanged if we cannot prove safety during capture.

## Examples (2 streams)

<img width="641" height="801" alt="pytorch-remove-cudagraph-defer-reclaiming (6)" src="https://github.com/user-attachments/assets/41adc835-d448-483b-99ba-b4341cb7d2a2" />

* Case 0 — Unsafe
The two frees are not ordered with respect to each other. For stream 1, the other stream’s free marker does not precede this stream’s terminal, so the per-stream condition fails.
Counterexample intuition for the unsafe setups: imagine `f2(x)` runs for a long time. If DeviceCachingAllocator reused block `x` on a stream whose terminal is not ordered after the free markers, the new lifetime could overlap the old one on replay, risking use-after-free or data corruption. The per-stream rule prevents exactly this.
* Case 1 — Reusable on stream 1
Stream 1’s terminal is after both frees, so every free marker precedes stream 1’s terminal. The block is reusable for allocations on stream 1.
* Case 2 — Not reusable on stream 2, but this cannot occur in `DeviceCachingAllocator`
This depicts reusing the block on stream 2 while stream 1’s free is not yet ordered before stream 2’s terminal. Though the block is not safe to reuse on stream 2, DeviceCachingAllocator will not choose that block for stream 2 anyway: `get_free_block` rejects blocks whose `stream != p.stream()`. So this case is unreachable.
* Case 3 — Safe (strong rule holds)
In this scenario, the terminal nodes of all streams are positioned after the block's free markers, satisfying the strong rule. This guarantees the block is safe for reuse by any stream in the capturing graph. However, since `DeviceCachingAllocator` only reuses a block on its original allocation stream, verifying this strong condition is unnecessary. We only need to ensure the per-stream rule is met for the specific stream requesting the block.
* Case 4 — Freeing after a join
See the note below.

## Edge Case: Freeing after a join

Our current dependency tracking has a limitation in scenarios where a block is freed after a stream join; see @galv's [comments here](pytorch#158352 (review)).

In case 4, we have a missed opportunity. Because the block's usage is not explicitly marked, we cannot determine that the block's actual last use may have occurred much earlier, long before the join. As a result, we must wait for the subsequent join before the block can be reused.

## Thanks
Thanks to @galv for his great idea around graph parsing and empty nodes.

Pull Request resolved: pytorch#158352
Approved by: https://github.com/ngimel, https://github.com/eqy

Co-authored-by: Jeff Daily <[email protected]>
…orch#161984)

Added a helper API to tell whether the world is entirely within a P2P domain or crosses the network.
This is mainly for nblocks tuning (in later PRs).

Pull Request resolved: pytorch#161984
Approved by: https://github.com/ngimel
ghstack dependencies: pytorch#161983
so that the signal calls do not step on each other's toes.

Pull Request resolved: pytorch#162026
Approved by: https://github.com/ngimel
…161407)

Summary:

Creates a fallback path for `torch._grouped_mm`, using the naive for
loop implementation (or bmm).

For the sake of keeping the PR small, this PR only enables SM80+ (CUDA
capability 8.0 and up), since I am testing this on an A100 machine. In
future PRs, we can increase the coverage of the fallback to:
1. float32 and float16, which will extend the GPU coverage
2. cpu
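
The naive for-loop idea can be illustrated in plain Python. This is only a sketch of the technique, not the ATen fallback: it covers a 2D x 3D case on nested lists and assumes that `offs` holds the cumulative end row of each group in `mat_a`. The function names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def grouped_mm_2d_3d(mat_a, mat_b, offs):
    """Loop over groups: each slice of mat_a rows times its own mat_b[g]."""
    out, start = [], 0
    for group, end in enumerate(offs):
        out.extend(matmul(mat_a[start:end], mat_b[group]))
        start = end
    return out


mat_a = [[1, 2], [3, 4], [5, 6]]              # 3 rows, K = 2
mat_b = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]  # 2 groups of (K, N) = (2, 2)
offs = [1, 3]                                 # group 0: row 0, group 1: rows 1-2
print(grouped_mm_2d_3d(mat_a, mat_b, offs))
```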

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_3d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_2d_2d -x
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm_3d_3d -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: pytorch#161407
Approved by: https://github.com/drisspg, https://github.com/eqy
…61717)

Summary:

Moves the `torch._grouped_mm` fallback from cuda-only code to a place
where it can be used by multiple backends. Specifically:
1. make the fallback path and util functions reusable and move them to
   `ATen/native/GroupedMMUtils.h`
2. register a backend-agnostic kernel to composite explicit autograd key
3. refactor the grouped_mm tests to their own test case and enable CPU

At the end of this PR, here is the support matrix:
* CUDA SM90+: fast path with test coverage (no change)
* CUDA SM80+: fallback with test coverage (no change)
* CPU: fallback works, but without test coverage (new in this PR)
* other SM versions and other backends: will probably already work, but
  let's leave this to future PRs
* float32/float16: will probably already work, but let's leave this to
  future PRs

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: pytorch#161717
Approved by: https://github.com/ngimel, https://github.com/drisspg
ghstack dependencies: pytorch#161407
…62059)

Summary:

Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.

Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support

Test Plan:

```bash
# on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
# on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: pytorch#162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: pytorch#161407, pytorch#161717
…pytorch#161688)

Fixes pytorch#161080
torch.export.export fails with TypeError: expand() got an unexpected keyword argument 'implicit' when calling torch.expand_copy(..., implicit=True). This happened because expand_copy = _make_copy_from_view(aten.expand) registers aten.expand as the decomposition path for aten.expand_copy, and aten.expand doesn’t accept the implicit argument.

I have added an explicit decomposition for aten.expand_copy in torch/_decomp/decompositions.py that ignores the implicit argument, and a simple unit test to demonstrate the bug being fixed.
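
The failure mode and the fix can be reproduced with plain Python stand-ins. Nothing below is the real torch code: `expand`, `make_copy_from_view`, and `expand_copy` are simplified, hypothetical analogues that only show why a generic forwarding wrapper breaks and why an explicit function that swallows `implicit` does not.

```python
def expand(x, size):
    """Stand-in for a view op that has no 'implicit' parameter."""
    return [x] * size


def make_copy_from_view(view_fn):
    """Stand-in for a generic wrapper that forwards every argument."""
    def copy_fn(*args, **kwargs):
        return list(view_fn(*args, **kwargs))
    return copy_fn


def expand_copy(x, size, *, implicit=False):
    """Explicit decomposition: accept 'implicit' but do not forward it."""
    return list(expand(x, size))


generic = make_copy_from_view(expand)
try:
    generic(1, 3, implicit=True)
except TypeError as exc:
    print("generic wrapper:", exc)

print("explicit decomposition:", expand_copy(1, 3, implicit=True))
```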
Pull Request resolved: pytorch#161688
Approved by: https://github.com/angelayi, https://github.com/can-gaa-hou
pytorch#161951)

…h.is_complex.

The PR proposes adding a simple, self-explanatory example to the documentation page. The example demonstrates the function's output for tensors with various data types, showing both True and False return values.

Fixes pytorch#161859

Pull Request resolved: pytorch#161951
Approved by: https://github.com/zou3519
Update cpp-httplib with better error handling, bugfixes, and performance. Header only library update.
Pull Request resolved: pytorch#162181
Approved by: https://github.com/jansel
Summary:
att
Test Plan:
ci
Rollback Plan:

Reviewed By: minjang

Differential Revision: D80828148

Pull Request resolved: pytorch#161798
Approved by: https://github.com/minjang, https://github.com/SherlockNoMad
This reverts commit 2c03f0a.

Reverted pytorch#162007 on behalf of https://github.com/jeanschmidt due to Breaks internal builds see [D81588372](https://www.internalfb.com/diff/D81588372), @malfet may you help the author? ([comment](pytorch#162007 (comment)))
jataylo and others added 24 commits September 9, 2025 15:44
fbgemm_gpu was failing to clone due to missing submodule commit.
```
+ pushd fbgemm/fbgemm_gpu
~/pytorch/fbgemm/fbgemm_gpu ~/pytorch
+ git checkout 7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8 --recurse-submodules
fatal: failed to unpack tree object b1281b8b08d973a7064f864f47eeb30f3e2596e9
error: Submodule 'external/composable_kernel' could not be updated.
error: Cannot update submodule:
	external/composable_kernel
```
Log File
[inductor-periodic · pytorch/pytorch@5babb4d](https://github.com/pytorch/pytorch/actions/runs/17536630806/job/49802458834)
Pull Request resolved: pytorch#162385
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
Implements mps sparse mul operation as well as enables other operations such as:
1. copy_
2. div
3. sum
4. floor
5. power
6. sub
7. floor_divide

Pull Request resolved: pytorch#162349
Approved by: https://github.com/pearu, https://github.com/malfet

Co-authored-by: Nikita Shulga <[email protected]>
All these UTs are working as is, just removing the skip
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working


Pull Request resolved: pytorch#161715
Approved by: https://github.com/msaroufim

Co-authored-by: Mark Saroufim <[email protected]>
Our compiler is generating inefficient code for the offsetCalc in certain situations.
The root-cause for this needs to be identified. For now specialized unrolling based on 'dims' notably helps perf.

Pull Request resolved: pytorch#161700
Approved by: https://github.com/jeffdaily
many tests taking >30 min and causing timeouts

Pull Request resolved: pytorch#162496
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
# why

- gather everything up to make choices, without running
  potentially expensive generators
- enables overrides where we toss the entire list of configs
  from inductor, without having to enumerate it (expensive)

# what

- add a holding class that just gets all the components necessary
  to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
  the scene
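
The holding-class idea can be sketched as follows. This is an assumption-heavy illustration rather than Inductor's code: `ChoiceParts`, `gather`, and the string returned by `build` are all hypothetical, and `build` merely stands in for the expensive step of producing a ChoiceCaller.

```python
from dataclasses import dataclass, field


@dataclass
class ChoiceParts:
    """Hypothetical holder: everything needed to build a choice later."""
    template: str
    config: dict = field(default_factory=dict)

    def build(self):
        # The expensive step runs only when a choice is actually wanted.
        return f"{self.template}({sorted(self.config.items())})"


def gather(templates, configs):
    """Cheap: collect components without building anything."""
    return [ChoiceParts(t, c) for t in templates for c in configs]


parts = gather(["mm", "mm_persistent"], [{"BLOCK_M": 64}, {"BLOCK_M": 128}])
# A caller can drop or replace entries here before paying for build().
kept = [p for p in parts if p.config["BLOCK_M"] == 128]
print([p.build() for p in kept])
```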

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: pytorch#161347
Approved by: https://github.com/eellison
…pytorch#161348)

\# why

- every callsite just executes the generator on the spot
- previous pr adds the ability to add an override before expensive
  generators are executed, so we don't need this generator anymore

\# what

- rather than yielding the ChoiceCaller, just return the list of all
  valid ChoiceCallers

\# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: pytorch#161348
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#161347
# why

- enable us to override the default configs, or fall back to them
  through subclassing InductorChoices

# what

- override (private) function
- default implementation takes the kernel template choice (ktc)
  generator for every template and just executes the generator
- future overrides can decide to replace those generators, or filter
  out choices

- the 2nd expensive step (maybe_append_choices, choice_or_none) is
  handled outside this function, in the main V.choices.get_mm_configs
  this means that any overriding benefits from not generating expensive
  templates that aren't going to be used
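
Overriding through subclassing could look roughly like this. The class and method names (`Choices`, `SmallBlocksOnly`, `_finalize_configs`) are hypothetical placeholders, not the names used in `InductorChoices`; the sketch only shows a default that runs every per-template generator and an override that filters the result.

```python
class Choices:
    """Stand-in for the default choices object."""

    def _finalize_configs(self, generators):
        # Default: run every per-template generator as-is.
        return [cfg for gen in generators.values() for cfg in gen()]


class SmallBlocksOnly(Choices):
    """Hypothetical override that filters the defaults."""

    def _finalize_configs(self, generators):
        defaults = super()._finalize_configs(generators)
        return [cfg for cfg in defaults if cfg["BLOCK_M"] <= 64]


def mm_configs():
    yield {"template": "mm", "BLOCK_M": 32}
    yield {"template": "mm", "BLOCK_M": 128}


generators = {"mm": mm_configs}
print(Choices()._finalize_configs(generators))
print(SmallBlocksOnly()._finalize_configs(generators))
```

An override could equally ignore `generators` and return its own list, which corresponds to replacing the defaults rather than filtering them.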

# testing

```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```

Differential Revision: [D81520570](https://our.internmc.facebook.com/intern/diff/D81520570)
Pull Request resolved: pytorch#161349
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#161347, pytorch#161348
…pytorch#162238)

# why

- unnecessary as we only ever need to know the dtype and maybe the
  device
- we already take in the kernel inputs which have the device
- enable us to specify the layout after finding all the configs
  but before generating the ChoiceCallers

# what

- replace all calls in template_heuristics that used to take Layout
  with now just taking out_dtype

# testing

ci

Differential Revision: [D81820115](https://our.internmc.facebook.com/intern/diff/D81820115)
Pull Request resolved: pytorch#162238
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#161347, pytorch#161348, pytorch#161349
When visualizing the schedules using `_PipelineScheduleExecution`, we don't provide any spacing between dependencies, so when visualizing `DualPipeV` it looks like this:

<img width="3168" height="486" alt="image" src="https://github.com/user-attachments/assets/d2c881ad-4ee0-46b6-ac03-13e5600b5a55" />

While it has the correct order of operations, it does not show the dependencies correctly. As shown in the original implementation, it should look something like this:

<img width="3542" height="384" alt="image" src="https://github.com/user-attachments/assets/c930fa98-848e-4951-a58b-c81f41092d14" />

This allows an option to add spacing to the visualizer, so it is easier to see dependencies. After change:

<img width="3633" height="486" alt="image" src="https://github.com/user-attachments/assets/7708367e-bdb4-46e8-a7c4-f19e18047f59" />

Pull Request resolved: pytorch#160474
Approved by: https://github.com/fegin
Correct the rounding scheme for nearest in quantile.

Pull Request resolved: pytorch#162423
Approved by: https://github.com/soulitzer
…ec is ok (pytorch#160580)

My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps:

(1) [this PR]: hardcode aten.detach in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop.

(2) [assuming success of 1]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to python entirely (except for some code that probably needs to allocate a pyobject for the output DTensor, from C++)

I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec. So I'm mostly hoping that DTensorSpecs don't generally get mutated.

This by itself does seem to speed up `alias` by quite a bit (roughly 2.5x speedup, from ~336us -> 133us):

**aten.detach(plain_tensor)**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790>
_ = x.detach()
  4.80 us
  1 measurement, 100000 runs , 1 thread
```

**aten.detach(DTensor) [before this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750>
_ = x_dt.detach()
  336.40 us
  1 measurement, 1000 runs , 1 thread
```

**aten.detach(DTensor) [after this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520>
_ = x_dt.detach()
  Median: 133.45 us
  2 measurements, 1000 runs per measurement, 1 thread
```

benchmark script:
```
import torch
import torch.distributed as dist
from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore
import torch.utils.benchmark as benchmark

fake_store = FakeStore()
dist.init_process_group("fake", store=fake_store, rank=0, world_size=2)

mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,))
x = torch.randn(4, 4, requires_grad=True)
x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)

t0 = benchmark.Timer(
    stmt='_ = x_dt.detach()',
    globals={'x_dt': x_dt},
)
print(t0.blocked_autorange())

dist.destroy_process_group()
```

Pull Request resolved: pytorch#160580
Approved by: https://github.com/ezyang
This reverts commit 3ea6868.

Reverted pytorch#162349 on behalf of https://github.com/malfet due to Fails trunk tests, with uint8 sum ([comment](pytorch#162349 (comment)))
…ytorch#161394)

This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement (fwd/bwd pass), improves the JSON output format used by the dashboard for reporting, and introduces some more CLI options. The new CLI flags introduced are:

- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name

Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
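
For readers unfamiliar with mutually exclusive flags, here is a minimal `argparse` sketch of the options listed above. The flag names and the `benchmark-results.json` default come from this description; the `store_true` actions, the `None` default for `--benchmark-name`, and the `build_parser` helper are assumptions, and the benchmark suite may enforce the exclusivity differently.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="operator benchmark (sketch)")
    # --use-compile and --use-jit cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--use-compile", action="store_true")
    mode.add_argument("--use-jit", action="store_true")
    parser.add_argument("--benchmark-name", default=None)
    parser.add_argument("--output-json-for-dashboard", default="benchmark-results.json")
    return parser


args = build_parser().parse_args(["--use-compile", "--benchmark-name", "mm"])
print(args.use_compile, args.use_jit, args.benchmark_name, args.output_json_for_dashboard)
```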
Pull Request resolved: pytorch#161394
Approved by: https://github.com/jbschlosser
…2432)

We update it to call write_atomic instead of file.write
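
The general technique behind an atomic write is to write to a temporary file and then rename it over the target. The sketch below shows that pattern with the standard library only; `write_atomic_sketch` is a hypothetical name and this is not necessarily how PyTorch's `write_atomic` is implemented.

```python
import os
import tempfile


def write_atomic_sketch(path, content):
    """Write content to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "cache.txt")
    write_atomic_sketch(target, "first")
    write_atomic_sketch(target, "second")
    with open(target) as f:
        result = f.read()
print(result)
```

Creating the temporary file in the target's own directory keeps both paths on the same filesystem, so a concurrent reader sees either the old file or the new one, never a partial write.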

Pull Request resolved: pytorch#162432
Approved by: https://github.com/oulgen
Fixes pytorch#162357
Fixes pytorch#160970
Fixes pytorch#161038
Fixes pytorch#160951
Fixes pytorch#161698

These tests were introduced in pytorch#160765 and they are all flaky when `torch._inductor.aot_compile` uses multiple threads (the default option).  The issue could be reproduced by running them locally multiple times.  For example,

```
pytest --flake-runs 10 --flake-finder -v inductor/test_fxir_backend.py -k test_aoti_fx_add
(output logs at P1938386961)
...
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
--------------------------------------------------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------------------------------------------------
inductor [('async_compile_cache_miss', 2), ('async_compile_cache_hit', 1)]
graph_break []
================================================================================================================================================= short test summary info ==================================================================================================================================================
FAILED [0.4834s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
FAILED [0.4576s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
FAILED [0.4613s] inductor/test_fxir_backend.py::AOTFxirTestCase::test_aoti_fx_add - AttributeError: 'NoneType' object has no attribute '__code__'
=============================================================================================================================================== 3 failed, 7 passed in 12.89s ===============================================================================================================================================
```

Setting `compile_threads` to 1 will get rid of the test flakiness, but there might be underlying issues from pytorch#160765.

Pull Request resolved: pytorch#162472
Approved by: https://github.com/angelayi, https://github.com/Skylion007
I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it.

The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple.

Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object.
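
The underlying Python behavior can be checked without Dynamo. A plain namedtuple instance has no `__dict__` and rejects new attributes, while a subclass that does not define `__slots__` accepts them. The class names below are made up for the example.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])


class TaggedPoint(Point):
    """Subclass without __slots__, so instances have a __dict__."""


plain = Point(1, 2)
try:
    plain.tag = "a"
    plain_accepts_attr = True
except AttributeError:
    plain_accepts_attr = False

tagged = TaggedPoint(1, 2)
tagged.tag = "a"  # allowed: stored in the instance __dict__

print(plain_accepts_attr, tagged.tag, tagged.x)
```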

Fixes pytorch#161610

Pull Request resolved: pytorch#161645
Approved by: https://github.com/anijain2305, https://github.com/StrongerXi
…sting_IFU_2025-09-09

# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/requirements-ci.txt
#	aten/src/ATen/Context.cpp
#	aten/src/ATen/cuda/detail/OffsetCalculator.cuh
#	aten/src/ATen/cuda/tunable/GemmHipblaslt.h
#	aten/src/ATen/native/ConvUtils.h
#	aten/src/ATen/native/Convolution.cpp
#	aten/src/ATen/native/Normalization.cpp
#	aten/src/ATen/native/cuda/Blas.cpp
#	aten/src/ATen/native/miopen/Conv_miopen.cpp
#	requirements.txt
#	test/distributed/_tools/test_fsdp2_mem_tracker.py
#	test/distributed/tensor/parallel/test_tp_examples.py
#	test/dynamo/test_activation_checkpointing.py
#	test/dynamo/test_structured_trace.py
#	test/inductor/test_aot_inductor.py
#	test/inductor/test_combo_kernels.py
#	test/test_matmul_cuda.py
#	test/test_sparse.py
#	torch/_higher_order_ops/triton_kernel_wrap.py
#	torch/_inductor/choices.py
#	torch/_inductor/codegen/triton.py
#	torch/testing/_internal/common_cuda.py
@pragupta pragupta merged commit 492e246 into rocm7.1_internal_testing Sep 24, 2025
88 of 90 checks passed
@pragupta pragupta deleted the rocm7.1_internal_testing_IFU_2025-09-09 branch September 24, 2025 21:27
@rocm-repo-management-api

Jenkins build for 9e7df766290def1ac0112fc758a6fa1ea126e95a commit is in progress
Links: Blue Ocean view / Build artifacts
