
Conversation

@pragupta (Owner) commented Oct 2, 2025

Fixes #ISSUE_NUMBER

pianpwk and others added 30 commits September 30, 2025 01:15
Summary:
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.
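As an illustration, a minimal sketch (an assumed repro shape, not taken from the PR) of the kind of data-dependent slice this targets, where both the output size and the storage offset depend on a runtime value; whether it compiles without data-dependent guard errors is exactly what the flexible-semantics codegen addresses:

```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # lets .item() produce an unbacked SymInt

@torch.compile(fullgraph=True)
def f(x, idx):
    n = idx.item()   # data-dependent scalar -> unbacked SymInt
    return x[n:]     # slice output size and storage offset both depend on n

out = f(torch.randn(16), torch.tensor(5))
```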

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/56218d85e2da09d9ede3809718ec989c2151632c

Rollback Plan:

Differential Revision: D80948073

Pull Request resolved: pytorch#161414
Approved by: https://github.com/laithsakka
…#164001)

The CUDACachingAllocator already does this, so there is precedent.
Pull Request resolved: pytorch#164001
Approved by: https://github.com/eqy
…torch#163988)

See also pytorch#163972, which was intended to be this PR.

Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help pytorch#163801 when users run on new devices like THOR and Spark.

Fixes pytorch#163801

Test Plan:

- Check the binary size increase against nightly or the v2.9 RC.
- Install the binary into a working THOR and GB200/GH100 machine (reproduce the original issue first on THOR), then install the binary built from this PR; we expect the issue to be gone without any additional user setting. Testing on GB200 is to ensure no regression.
- Reference: pytorch#119750 and pytorch/builder@5c814e2

Note: with this PR, torch.compile in the PyTorch world is expected to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the CUDA 13 ptxas binary.
However, as is, the Triton world does not know about this new CUDA 13 ptxas. So if a user assumes there is already a pytorch/bin/ptxas and deletes the ptxas shipped with Triton, then https://github.com/triton-lang/triton/blob/c6ad34f7eb42630533412d93ca2cc00a4b4f8f3c/python/triton/knobs.py#L216 will still complain that ptxas is not found, because Triton is unaware of the newly bundled binary.
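For context, a rough sketch of how such a hook can work (assuming the bundled binary sits next to the inductor runtime under a `bin/ptxas` path; `TRITON_PTXAS_PATH` is the knob Triton already honours, but the exact logic in `compile_tasks.py` may differ):

```python
import os


def _set_triton_ptxas_path() -> None:
    # Respect an explicit user override.
    if os.environ.get("TRITON_PTXAS_PATH") is not None:
        return
    # Assumed location of the ptxas bundled with the torch wheel.
    ptxas = os.path.abspath(
        os.path.join(os.path.dirname(__file__), "..", "bin", "ptxas")
    )
    if os.path.isfile(ptxas) and os.access(ptxas, os.X_OK):
        # Point Triton at the bundled CUDA 13 ptxas.
        os.environ["TRITON_PTXAS_PATH"] = ptxas
```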

Pull Request resolved: pytorch#163988
Approved by: https://github.com/atalman
Upgrade all the ROCm docker images to the ROCm 7.0 release version.

Pull Request resolved: pytorch#163140
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
----

- `cmake_dependent_option` condition should be `USE_ROCM OR (USE_CUDA AND NOT MSVC)` (similar to the one for flash attention)
- Default settings should be user overridable, i.e. even if one builds for SM_10, they should be able to pass `USE_FBGEMM_GENAI=0` and skip the build

Pull Request resolved: pytorch#164165
Approved by: https://github.com/Skylion007
…orch#163794)

Summary: Add an OSS user manual for the AOTI intermediate debug printer so we can link it in the PyTorch conference poster.

Test Plan: N/A

Differential Revision: D83171374

Pull Request resolved: pytorch#163794
Approved by: https://github.com/yushangdi
Every time viable strict is updated
Pull Request resolved: pytorch#164183
Approved by: https://github.com/seemethere
Fixes invalid f-strings detected by `ruff`.

Pull Request resolved: pytorch#164112
Approved by: https://github.com/Skylion007, https://github.com/mlazos
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.

Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results

Other misc:
- Validate accuracy of NCCL estimations (use ruisi's profiling instead?)

For a Llama 2D parallelism test, on forward we overlap all but 2 of the potentially hidden collectives. For backward, we overlap 217/269 of the potentially hidden collectives. If you increase `compute_overlap_multipler` (a fudge factor for inaccurate comms estimation), that goes down to all but 16 of the potentially hidden collectives.

fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3

bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51

Pull Request resolved: pytorch#163215
Approved by: https://github.com/IvanKobzarev
Preparatory refactor

Pull Request resolved: pytorch#163754
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: pytorch#163215
In comm-compute overlap we will have a graph with:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.

Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.

This PR adds `AugmentedGraphHelper`, which adds these APIs and allows querying for dependencies on the augmented graph.
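A toy sketch of what such an augmented-dependency helper might look like (class and method names here are illustrative, not the actual `AugmentedGraphHelper` API):

```python
from collections import defaultdict


class AugmentedDeps:
    """Tracks extra (implicit) edges and merged node sets on top of an FX graph."""

    def __init__(self):
        self.extra_deps = defaultdict(set)  # node -> nodes it must run after
        self.merge_sets = {}                # node -> the set of nodes merged with it

    def add_extra_dep(self, node, dep):
        # e.g. wait -> hiding_compute and hiding_compute -> all_gather
        self.extra_deps[node].add(dep)

    def merge(self, nodes):
        # Treat merged collective starts (or waits) as a single subgraph.
        group = set(nodes)
        for n in nodes:
            group |= self.merge_sets.get(n, {n})
        for n in group:
            self.merge_sets[n] = group

    def deps_of(self, node):
        # Union of explicit and extra deps over every node merged with `node`.
        group = self.merge_sets.get(node, {node})
        deps = set()
        for n in group:
            deps |= set(getattr(n, "all_input_nodes", ())) | self.extra_deps[n]
        return deps - group
```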

Pull Request resolved: pytorch#163959
Approved by: https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: pytorch#163215, pytorch#163754
tl;dr performs bucketing while preserving comm-compute overlap.

In comm-compute overlap we will have a graph with:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.

Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.

We perform bucketing while augmenting the graph with these relationships. This can be done separably from comm-compute overlap, so long as the hiding compute relationships are passed in.

TODO:
- need to instrument fx graph so inductor respects these relationships.
- the compile time of the bucketing search can be sped up significantly by limiting what portion of the graph we traverse through
- more memory aware handling

Pull Request resolved: pytorch#163960
Approved by: https://github.com/ruisizhang123, https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: pytorch#163215, pytorch#163754, pytorch#163959
Summary:
Original commit changeset: 06888d7ebff0

Original Phabricator Diff: D82932788

Restricted the test to SM90 for scaled_grouped_mm

Test Plan: TBD (will share the linux CI results)

Differential Revision: D83283991

Pull Request resolved: pytorch#163905
Approved by: https://github.com/angelayi
…puts (pytorch#163609)

I experimented with 3 paths to get a joint graph for a DTensorized module and inputs:

1. strict_export + aot_export_joint_with_descriptors
2. graph_capture + aot_export_joint_with_descriptors
3. aot_export_joint_with_descriptors alone

Added test to guard them.

Path 1 doesn't work, as the backward graph region is missing from the joint graph.
I am leaning towards making 2 the recommended path.
If 2 doesn't work going forward, we can fall back to 3.

Pull Request resolved: pytorch#163609
Approved by: https://github.com/tugsbayasgalan

Co-authored-by: suo <[email protected]>
Fixes
```
[4] ValueError: both buffer length (0) and count (-1) must not be 0
```

Test plan:

```
pytest test/distributed/test_serialization.py
```

Pull Request resolved: pytorch#164198
Approved by: https://github.com/amirafzali
…ch#157859)

Fixes pytorch#156052 and pytorch#156444.

This PR sets up the privateuseone key in Python to be used as a Python backend for PyTorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass with that device to hold arbitrary Python data as "device data" and use `torch.library` to register ops that take that Tensor.
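A rough usage sketch of that workflow (the setup helper is the one named in this PR and its import path is not shown here; the `torch.library` calls are existing public APIs):

```python
import torch

# setup_privateuseone_for_python_backend("npy")  # helper added by this PR (import path assumed)

lib = torch.library.Library("aten", "IMPL")

def npy_add(a, b):
    # Operate on the arbitrary Python payload the subclass holds as "device data".
    ...

lib.impl("add.Tensor", npy_add, "PrivateUse1")  # route aten.add for the new device to Python
```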

Changes done in this PR:

1. Register a vanilla DeviceGuard: I extended NoOpDeviceGuard to allow a device index of 0 and to not raise errors when event-related functions are accessed. Without those changes, calling backward raises errors. (The CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.)
2. The Tensor subclass check now allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment on the check suggests it was there to avoid a segfault when calling into ops that expect a storage; here we have a different device, so we will not call into those ops.
3. A Python function that invokes the other incantations to set up the privateuseone backend.

This took inspiration from https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: pytorch#157859
Approved by: https://github.com/albanD
This is a simple refactor that just moves some logic in `_precompile_config` to two new functions for separation of concerns. This will allow subclasses e.g. out of tree to configure options and metadata for triton.compile.

Pull Request resolved: pytorch#162406
Approved by: https://github.com/exclamaforte
Fixes pytorch#161089. Added '0' as an acceptable value for compute mode in _meta_registrations.py. Also added a test case in the test_export.py file.
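For illustration only, assuming the operator in question is `cdist` (the one ATen op that takes a `compute_mode` argument): calling the aten overload directly with `compute_mode=0` was rejected by the meta registration before this change.

```python
import torch

x = torch.randn(4, 3, device="meta")
y = torch.randn(5, 3, device="meta")
# compute_mode=0 corresponds to "use_mm_for_euclid_dist_if_necessary".
d = torch.ops.aten._cdist_forward(x, y, 2.0, 0)
print(d.shape)  # torch.Size([4, 5])
```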

Pull Request resolved: pytorch#161724
Approved by: https://github.com/albanD, https://github.com/angelayi
Fix part of pytorch#158917

Add an AMP integration document and OpenReg code as an example to explain the steps of integration.
Pull Request resolved: pytorch#162050
Approved by: https://github.com/albanD

Co-authored-by: FFFrog <[email protected]>
…h#164056)

The removed macros are not used in other places of the `pytorch` GitHub org.
Pull Request resolved: pytorch#164056
Approved by: https://github.com/albanD
This reverts commit 9f27b0c.

Reverted pytorch#164183 on behalf of https://github.com/malfet due to Hmm, didn't work that way ([comment](pytorch#164183 (comment)))
eellison and others added 28 commits October 3, 2025 00:19
scaled_mm already had `needs_exact_strides` in its op registration. Also added a test showing these strides are being respected.

Pull Request resolved: pytorch#164481
Approved by: https://github.com/drisspg, https://github.com/mlazos
Modified the `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of `multimem_reduce`, with each rank providing its own rank id as root.

Pull Request resolved: pytorch#164517
Approved by: https://github.com/ngimel
Adds suppressions so pyrefly will typecheck clean: pytorch#163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

---
step 1: uncomment lines in the `pyrefly.toml` file
before: https://gist.github.com/maggiemoss/911b4d0bc88bf8cf3ab91f67184e9d46

after:
```
INFO Checking project configured at `/Users/maggiemoss/python_projects/pytorch/pyrefly.toml`
INFO 0 errors (1,152 ignored)
```

Pull Request resolved: pytorch#164513
Approved by: https://github.com/oulgen
Test Plan: Sandcastle

Differential Revision: D83492704

Pull Request resolved: pytorch#164159
Approved by: https://github.com/Skylion007, https://github.com/mlazos
…ch#163213)

We want to refactor the internal bookkeeping of DeviceMesh so that:
The bookkeeping logic is simplified and made generic enough that it is easy to support new transformations like flattening non-contiguous dims, reshape, and unflatten (we leveraged the CuTe layout). This new layout also makes non-contiguous slicing, flatten, and transpose possible.

Concretely, in this PR, we do the following:
1. Use the `_MeshLayout` to handle all index operations rather than using a map to record mesh dims.
2. Removed `flatten_name_to_root_dims`, because now we can directly get the layout from a flattened device mesh.
3. Replaced `_get_slice_mesh_dims` with `_get_slice_mesh_layout`.
4. Use the newly added function `check_overlap` to check for layout overlap.
5. Use a new function `to_remapping_tensor` to use layout ranks as indices when the mesh tensor is not representable as CuTe. The reason is that the layout acts as the backend of mesh tensor bookkeeping (indexing indices), so it needs to be used as indices to remap back to the mesh tensor for new DeviceMesh generation and backend init. For example, in the 2K-to-4K case, the underlying layout is (2K, 1) but the actual values of the mesh tensor are [2K, 2K+1, ...]. While flattening or slicing, we need to remap the layout back to the new mesh tensor so it maps to the actual device allocation. For example, in the 2K-to-4K case, if the shape is (1K, 1K) with dim_names ("dp", "tp"), then when slicing "tp" the mesh tensor should be (2K, 2K+1, ..., 3K-1) or (3K, 3K+1, ..., 4K-1), not the global ranks generated from the layout (1K, 1).
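
A toy illustration of the CuTe-style (shape, stride) bookkeeping this leans on (not the actual `_MeshLayout` API): a layout maps linear indices to ranks through per-dim strides, so slicing and flattening become arithmetic on (shape, stride, offset).

```python
from math import prod

def layout_ranks(shape, stride, offset=0):
    # Enumerate the ranks of a (shape, stride) layout, last dim fastest.
    ranks = []
    for i in range(prod(shape)):
        idx, r = i, offset
        for s, st in reversed(list(zip(shape, stride))):
            r += (idx % s) * st
            idx //= s
        ranks.append(r)
    return ranks

# An 8-rank mesh viewed as ("dp", "tp") = (2, 4): dp stride 4, tp stride 1.
print(layout_ranks((2, 4), (4, 1)))        # [0, 1, 2, 3, 4, 5, 6, 7]
# Slicing "tp" of the second dp row is the same 1-D layout with offset 4.
print(layout_ranks((4,), (1,), offset=4))  # [4, 5, 6, 7]
```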

Verified that the loss curve is very close for DeepSeekV3 on torchtitan; note that an exact match is challenging because even if we run the baseline twice, the loss curves do not exactly match.

<img width="1113" height="490" alt="image" src="https://github.com/user-attachments/assets/7877b5a4-337e-4ad8-b878-2378f4f0f38d" />

The PR looks big indeed, but we don't change any existing behavior of DeviceMesh, so it is a pure refactor.

With this refactoring we also enable slicing and flattening of non-contiguous dims of a device mesh, which is hard to implement without the CuTe layout.

This is a continuation of pytorch#161106 (the original one got messed up by EasyCLA).

Pull Request resolved: pytorch#163213
Approved by: https://github.com/lw, https://github.com/fegin
This pull request adds support for running operator microbenchmarks on ROCm (AMD GPU) environments in the CI workflow. The main changes involve introducing new build and test jobs for ROCm in the `.github/workflows/operator_microbenchmark.yml` file.

Pull Request resolved: pytorch#164173
Approved by: https://github.com/huydhn
This PR moves the call to copy the generated code from `/tmp/...` so that it is still called if attempting to compile the generated code fails. In both cases now, the generated code will be copied across to `torch_compile_debug/run_.../torchinductor/output_code.py` which makes debugging bad generated code easier.
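For anyone reproducing this, a small sketch (`TORCH_COMPILE_DEBUG` is the existing switch that enables the debug directory; the exact `run_...` folder name varies per run):

```python
import os

os.environ["TORCH_COMPILE_DEBUG"] = "1"  # must be set before torch is imported

import torch

@torch.compile
def f(x):
    return x.sin() + 1

f(torch.randn(8))
# The generated code is copied to
# torch_compile_debug/run_<timestamp>/torchinductor/.../output_code.py,
# now even when compiling that generated code fails.
```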

Pull Request resolved: pytorch#161615
Approved by: https://github.com/eellison
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714687

Pull Request resolved: pytorch#164512
Approved by: https://github.com/jananisriram
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: pytorch#164319
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <[email protected]>
…ytorch#164539)

Because torch.testing.test_allclose is deprecated.
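For reference, the supported replacement pattern is `torch.testing.assert_close` (a minimal example, not taken from the diff):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = a + 1e-7
torch.testing.assert_close(a, b)                        # default dtype-based tolerances
torch.testing.assert_close(a, b, rtol=1e-5, atol=1e-6)  # or explicit tolerances
```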

Pull Request resolved: pytorch#164539
Approved by: https://github.com/mlazos
Mitigates pytorch#164574
Remove the unused CUDA_CHANNEL var; this was used before, when we installed pytorch via conda.

Please note: CUDA 13.0 failures are expected since the CI tries to build against prod and CUDA 13.0 is not available in prod yet.
Pull Request resolved: pytorch#164575
Approved by: https://github.com/malfet, https://github.com/Camyll
PR pytorch#164481 added unit test test_scaled_mm_preserves_strides in test/inductor/test_fp8.py. It was missing the adjustment for ROCm's F8 types on MI300.

Pull Request resolved: pytorch#164578
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
So this fixes at least two issues:
1) When we invoke the inductor backend, we apply pre-grad passes which try to find the correct fake mode to use. In the nested case, we run into a clash when there is a closure variable in the inductor region, because non-strict would have fakified this variable beforehand while the inner torch.compile would have created a fresh fake mode. This is not a problem in regular torch.compile because the inner torch.compile gets ignored. I don't know if we are supposed to inherit the fake mode from the parent context in this case, but we can avoid the problem if we just default to the eager backend, which is fine here because the point of export is to capture aten operators. Going to inductor would mean we lose the inner torch.compile ops.
2) There are custom torch function modes in export that track the number of torch fns executed, and the inner compile itself doesn't work because of a guard failure as this mode state gets changed. I noticed torch.cond fixes this problem by carefully stashing the torch function mode and deferring it to the backend. So the correct thing to do here is to just re-use the torch.cond implementation unconditionally.

So the things I did to fix the above were:
1) Always default to the eager backend when compile is invoked inside export. I needed to turn the way torch.cond sets up a fresh tracing env into a util that can be shared.
2) The previous eager backend for torch.cond was wrong because the context managers didn't actually persist until the backend was invoked.
3) torch.cond used to only disable the TorchFunctionMetadata torch function mode and stash it for later, but in fact we should do this for both TorchFunctionMetadata and PreDispatchTorchFunctionMode.

With the above fixes, we are able to export flex attention.

Pull Request resolved: pytorch#164171
Approved by: https://github.com/ydwu4
skips DTensorSpec.sizes/strides in metadata guard checks

Pull Request resolved: pytorch#163820
Approved by: https://github.com/azahed98
…ent to avoid slow paths (pytorch#164501)

Summary:
This diff adds the feature of allocating a large pinned memory segment upfront based on the provided config. This large segment is then used to serve all the small pinned memory requests to avoid expensive device level APIs (slow paths).

Example:

PYTORCH_CUDA_ALLOC_CONF=pinned_reserve_segment_size_mb:2048

This reserves a 2GB pinned memory segment for the process; all incoming small requests are then served from this segment, and no cudaHostAlloc/cudaHostRegister APIs are called.
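A minimal usage sketch (the allocator config string comes from this diff; setting it via the environment must happen before the first pinned allocation):

```python
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "pinned_reserve_segment_size_mb:2048"

import torch

# Small pinned allocations are now carved out of the reserved 2GB segment
# instead of triggering cudaHostAlloc/cudaHostRegister per request.
buf = torch.empty(1 << 20, pin_memory=True)
```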

Differential Revision: D83779074

Pull Request resolved: pytorch#164501
Approved by: https://github.com/yangw-dev
…sting_IFU_2025-10-03

# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/libtorch/build.sh
#	CMakeLists.txt
#	requirements-build.txt
#	test/test_matmul_cuda.py
#	torch/_inductor/runtime/triton_heuristics.py
#	torch/testing/_internal/common_utils.py
…5-10-03

[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-10-03