pragupta
Owner

Merged latest changes from upstream/main into rocm7.1_internal_testing on 2025-08-22

laithsakka and others added 30 commits August 16, 2025 00:54
keep existing unbacked semantics unchanged, just use guard_or_false instead of guard_size_obl

Pull Request resolved: pytorch#160250
Approved by: https://github.com/ColinPeppler, https://github.com/jingsh
This reverts commit e0488d9.

Reverted pytorch#160458 on behalf of https://github.com/wdvr due to need to rerun workflow generation (failing workflow-checks) ([comment](pytorch#160458 (comment)))
Which is manylinux2_28 compatible, even on aarch64 platform

archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix pytorch#160425
Pull Request resolved: pytorch#160458
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv
…ytorch#159790)

This is a similar change to pytorch#153986, this time adding flags to the hipcc command under `cpp_extension.py`.

The `-Wno-ignored-attributes` flag in particular avoids about 200MB of warning spam when building torchvision, like these:
```
In file included from D:\b\vision_main\torchvision\csrc\ops\hip\deform_conv2d_kernel.hip:72:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ATen.h:13:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/Functions.h:386:
In file included from D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax.h:21:
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\ATen/ops/_sparse_softmax_ops.h:18:8: warning: __declspec attribute 'dllimport' is not supported [-Wignored-attributes]
   18 | struct TORCH_API _sparse_softmax_int {
      |        ^~~~~~~~~
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:100:19: note: expanded from macro 'TORCH_API'
  100 | #define TORCH_API C10_IMPORT
      |                   ^~~~~~~~~~
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\torch\include\torch/headeronly/macros/Export.h:53:31: note: expanded from macro 'C10_IMPORT'
   53 | #define C10_IMPORT __declspec(dllimport)
      |                               ^~~~~~~~~
```

The `-fms-extensions` flag just seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html.

See also this downstream issue where these changes were tested: ROCm/TheRock#910.

Pull Request resolved: pytorch#159790
Approved by: https://github.com/jeffdaily
Summary:

as title

This is requested by the zoomer team so they can add stack trace information to the profiler results.

Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r  stack_traces
```

Rollback Plan:

Differential Revision: D80050233

Pull Request resolved: pytorch#160779
Approved by: https://github.com/angelayi
Set dynamo=True and enable fallback.

1. Implemented compatible behavior so that a BytesIO object is accepted as `f`
2. Update tests to explicitly set dynamo=False

pytorch#151693

Pull Request resolved: pytorch#159646
Approved by: https://github.com/titaiwangms
Fixes pytorch#160650.

I added type ignore comment to `LeafSpec` class inheritance in `torch/utils/_cxx_pytree.py` to handle `PyTreeSpec` being marked as final in optree's type stubs.

Pull Request resolved: pytorch#160652
Approved by: https://github.com/Skylion007
…0635)

My proposal here is to use GitHub Dependabot to make sure that the `transformers` version used in CI is always up to date.  To achieve this, this PR does 2 things:

1. Pin `transformers` version across all CI jobs to only one place at `.ci/docker/ci_commit_pins/huggingface.txt`.  This file is now a regular pip requirements instead of a pinned commit text.  There isn't any need to pin `transformers` to a specific commit and the file already refers to a stable version `v4.54.0`
2. Create `.github/dependabot.yml` to configure the bot to update `transformers` automatically when there is a new version.  Those labels will ensure that the right reviewers from torch.compile and Dev Infra are notified.  I'm not sure how to test this in a PR, but it feels OK to land and test this in main.  If this works, we should see a PR to update `v4.54.0` to the current latest `v4.55.0`

### Reference
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference
Pull Request resolved: pytorch#160635
Approved by: https://github.com/ZainRizvi
… add aten.sym_is_contiguous. (pytorch#159197)

This might cause some new DDEs at call sites that do not use is_contiguous_or_false() or sym_is_contiguous(),
but we want to find those call sites and handle them properly by explicitly calling is_contiguous_or_false() instead of is_contiguous() when appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context.

In pytorch#157472 we defined sym_is_contiguous as the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.

When people call is_contiguous, we do sym_is_contiguous().guard_bool().
When people call is_contiguous_or_false, we do sym_is_contiguous().guard_or_false().
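
The difference between the two guards can be illustrated with a toy model. This is only a sketch, not PyTorch's implementation: `ToySymBool` and `DataDependentError` are hypothetical stand-ins, and "unknown" is modeled as `None`.

```python
from typing import Optional


class DataDependentError(RuntimeError):
    """Hypothetical stand-in for the DDE raised when a guard is undecidable."""


class ToySymBool:
    """Toy symbolic boolean: True, False, or None when the value is unknown."""

    def __init__(self, value: Optional[bool]) -> None:
        self.value = value

    def guard_bool(self) -> bool:
        # Strict guard: an undecidable expression is an error.
        if self.value is None:
            raise DataDependentError("cannot decide symbolic expression")
        return self.value

    def guard_or_false(self) -> bool:
        # Lenient guard: an undecidable expression is treated as False.
        return False if self.value is None else self.value


def is_contiguous(sym: ToySymBool) -> bool:
    return sym.guard_bool()


def is_contiguous_or_false(sym: ToySymBool) -> bool:
    return sym.guard_or_false()


unknown = ToySymBool(None)
print(is_contiguous_or_false(unknown))  # False
try:
    is_contiguous(unknown)
except DataDependentError as exc:
    print(f"DDE: {exc}")
```

In this model, only the strict path can raise, which is why call sites that can tolerate a conservative `False` are the ones that would switch to the `_or_false` variant.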

One issue that was not handled well was this path:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }

  return sym_is_contiguous_default(memory_format);
}
```
Namely, if we call sym_is_contiguous_custom and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, we used to call is_contiguous(this, memory_format).

This used to go through load_pyobj_interpreter and end up in the Python is_contiguous call, which used implicit size-oblivious reasoning.
Once we removed that implicit size-oblivious reasoning, the right thing to call is
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even if the caller is using sym_is_contiguous.

So I had to define it for the pyinterpreter, and then I had to override it for nested tensors.

Pull Request resolved: pytorch#159197
Approved by: https://github.com/ezyang
Differential Revision: D80201622

Pull Request resolved: pytorch#160599
Approved by: https://github.com/bdhirsh
…unner-mypy` (pytorch#160806)

Like `MYPY`, linter `MYPYSTRICT` will need `--all-files` too.

See also:

- pytorch#160652 (comment)

Pull Request resolved: pytorch#160806
Approved by: https://github.com/seemethere
Summary:
- Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation.

Testing:
- Add test to verify structure and contents of tlparse artifact

Pull Request resolved: pytorch#160132
Approved by: https://github.com/xmfan
ghstack dependencies: pytorch#160260
…ts (pytorch#159865)

Changes:
(1) Replace UserDefinedSetVariable by UserDefinedObjectVariable in all binop calls

Test plan:
(1) The three tests from CPython `test_collections.py` ensure that Dynamo can trace through a dunder method (e.g. __add__, __ixor__, etc) defined in a user-defined class

Pull Request resolved: pytorch#159865
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#159365, pytorch#159366, pytorch#159368, pytorch#159483, pytorch#159902, pytorch#159864
…#160747)

Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but someone working with an alternative version of Triton may not match it. This moves the check from the 3.4 Triton version to any variant that has support for the TMA APIs.
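
The general idea of probing for a capability instead of comparing version strings can be sketched as follows. This is an illustration only: the `tma_api` attribute and the stand-in objects are hypothetical and do not reflect how Inductor or Triton actually expose TMA support.

```python
from types import SimpleNamespace


def supports_tma(triton_like: object) -> bool:
    # Hypothetical capability probe: look for the feature, not the version.
    return hasattr(triton_like, "tma_api")


# Stand-ins for different Triton builds (not real Triton modules).
release_3_4 = SimpleNamespace(version="3.4", tma_api=object())
custom_fork = SimpleNamespace(version="0.0-custom", tma_api=object())
older_build = SimpleNamespace(version="3.2")

for candidate in (release_3_4, custom_fork, older_build):
    print(candidate.version, supports_tma(candidate))
```

A version comparison would reject `custom_fork` even though it provides the feature, while the capability probe accepts it.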

Test Plan:
Testing the previously failing test `inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_welford_non_block_pointer_cuda`

Rollback Plan:

Differential Revision: D80348643

Pull Request resolved: pytorch#160747
Approved by: https://github.com/NikhilAPatel
Summary:
- Add TLParse artifact logging per op with output tensor shape, stride, and dtype for cross-rank aggregation.

Testing:
- Add test to verify structure and contents of tlparse artifact

Pull Request resolved: pytorch#160132
Approved by: https://github.com/xmfan
Remove CONDA_CMAKE from `.ci/docker/build.sh`
Pull Request resolved: pytorch#160832
Approved by: https://github.com/malfet
Purely a refactor: improve typing and get rid of some type errors. Mark certain fields as non-null, since in general they are not empty.

The goal of this stack of PRs is to move the save/load logic of guard serialization into separate, flat phases, instead of being embedded in guard creation. This way, we can put a try/catch around it and fail safely if certain guards are not serializable.

Pull Request resolved: pytorch#160530
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
Because numpy 1.22.4 had reached EOL 3 years ago.
Pull Request resolved: pytorch#160836
Approved by: https://github.com/malfet
pytorchmergebot and others added 28 commits August 21, 2025 22:38
…addmm (pytorch#155357)"

This reverts commit ce048de.

Reverted pytorch#155357 on behalf of https://github.com/seemethere due to This is causing buck builds to fail since we didn't add the definition of AT_USE_EIGEN_SPARSE in the buckbuild.bzl file, will follow-up and re-land this. ([comment](pytorch#155357 (comment)))
Bumps [uv](https://github.com/astral-sh/uv) from 0.8.4 to 0.8.6.
- [Release notes](https://github.com/astral-sh/uv/releases)
- [Changelog](https://github.com/astral-sh/uv/blob/main/CHANGELOG.md)
- [Commits](astral-sh/uv@0.8.4...0.8.6)

---
updated-dependencies:
- dependency-name: uv
  dependency-version: 0.8.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ch#160205)

Parallelize reading of data behind thread_count argument to HFStorageReader
Test plan: ensure existing tests pass and run a job successfully with these changes
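
A minimal sketch of gating parallel reads behind a thread-count argument, assuming a thread pool is used. The names `read_shard` and `read_all` are hypothetical and are not the HFStorageReader API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_shard(name: str) -> str:
    # Hypothetical stand-in for reading one file's worth of data.
    return f"data:{name}"


def read_all(names: List[str], thread_count: int = 1) -> List[str]:
    # With a single thread, keep the simple sequential path.
    if thread_count <= 1:
        return [read_shard(n) for n in names]
    # Otherwise fan the reads out; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(read_shard, names))


print(read_all(["a", "b", "c"], thread_count=2))
```

Keeping the default at one thread means existing callers see the same sequential behavior unless they opt in.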

Differential Revision: [D79478188](https://our.internmc.facebook.com/intern/diff/D79478188/)

Pull Request resolved: pytorch#160205
Approved by: https://github.com/meetv18
Summary: att - changed one of the tests to get rid of torcharrow dep.

Test Plan:
```
buck2 test //caffe2/test/cpp/nativert:layout_planner_tests
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D80108549

Pull Request resolved: pytorch#160942
Approved by: https://github.com/georgiaphillips, https://github.com/henryoier
This fixes an assertion we were running into in memory planning about the graph not being acyclic. The repro is very long, so it is hard to make a local test of it, but this fixes the repro I am looking at.

Pull Request resolved: pytorch#161205
Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
…61185)

Summary:
Removed `Model`, it's not being used anywhere so it's safe.

Removed `tensor_paths` and `constant_paths` fields in `ExportedProgram`
- BC: when the current deserializer loads a previously serialized EP (which comes with empty `tensor_paths` and `constant_paths`), it will just ignore those two fields
- FC: when the old deserializer loads a newly serialized EP (which doesn't come with `tensor_paths` and `constant_paths`), it will also ignore those two fields in `_dict_to_dataclass()`
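
The two compatibility directions can be sketched with plain dataclasses. This is a toy model under stated assumptions: `OldProgram`, `NewProgram`, and `dict_to_dataclass_sketch` are hypothetical and are not the actual `_dict_to_dataclass()` implementation; the sketch assumes unknown keys are dropped and missing keys fall back to defaults.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class OldProgram:
    # Hypothetical "old" schema that still declares the two fields.
    name: str
    tensor_paths: Dict[str, str] = field(default_factory=dict)
    constant_paths: Dict[str, str] = field(default_factory=dict)


@dataclass
class NewProgram:
    # Hypothetical "new" schema with the two fields removed.
    name: str


def dict_to_dataclass_sketch(cls: type, data: Dict[str, Any]) -> Any:
    # Keep only keys the target schema declares; missing keys use defaults.
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})


old_payload = {"name": "ep", "tensor_paths": {}, "constant_paths": {}}
new_payload = {"name": "ep"}

print(dict_to_dataclass_sketch(NewProgram, old_payload))  # BC direction
print(dict_to_dataclass_sketch(OldProgram, new_payload))  # FC direction
```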

Differential Revision: D80725094

Pull Request resolved: pytorch#161185
Approved by: https://github.com/SherlockNoMad
…ytorch#160373)

Following up on pytorch#152951 (comment), this removes a few lines added in that pull request, fixing link errors like
```
[7019/7028] Linking CXX shared library bin\torch_hip.dll
FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib
C:\Windows\system32\cmd.exe /C "cd . && D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1942 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100261~1.0\x64\rc.exe --mt=C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\Llvm\x64\bin\llvm-mt.exe --manifests  -- D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp  /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ."
LINK: command "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: error: undefined symbol: __declspec(dllimport) class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::native::transform_bias_rescale_qkv_cuda(class at::Tensor const &, class at::Tensor const &, __int64)
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_CUDA___transform_bias_rescale_qkv(class 0xE9BF7323::Tensor const &, class 0xE9BF7323::Tensor const &, __int64))
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterNestedTensorCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_NestedTensorCUDA___transform_bias_rescale_qkv(class 0xEFEB5304::Tensor const &, class 0xEFEB5304::Tensor const &, __int64))
```

The `native_transformers_hip_hip` and `native_transformers_hip_cpp` sources are okay to define (and are required) even if accelerated versions of these operations are not available.

I've tested downstream builds of torch with ROCm on native Windows via https://github.com/ROCm/TheRock both with and without aotriton and these changes were needed for the build to succeed in both cases. I have _not_ tested Linux, WSL, or with the HIP SDK.

Pull Request resolved: pytorch#160373
Approved by: https://github.com/alugorey, https://github.com/jeffdaily
Note: Adding a unit test for this is tricky, as an error in that specific unit test would cause test_utils.py to crash altogether.

Tested as follows:
1. Added x = 1/0 after guarded_code = compile_inner(code, one_graph, hooks, transform) in convert_frame.py
2. Printed exception_stack_trace and got: ['Traceback (most recent call last):\n  File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n    x = 1/0\n        ~^~\nZeroDivisionError: division by zero\n']
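
The manual check above can be approximated outside of Dynamo with the standard `traceback` module. This is a sketch only: `compile_with_trace` is a hypothetical function, and it is an assumption that the real code collects the trace in a similar way.

```python
import traceback
from typing import List, Optional


def compile_with_trace() -> Optional[List[str]]:
    # Hypothetical stand-in for the compile step with an injected failure.
    exception_stack_trace = None
    try:
        x = 1 / 0  # injected error, mirroring the manual test
    except ZeroDivisionError:
        # format_exc() returns the full traceback text as a single string.
        exception_stack_trace = [traceback.format_exc()]
    return exception_stack_trace


trace = compile_with_trace()
print(trace)
```

The resulting list holds one string that starts with `Traceback (most recent call last):` and ends with the `ZeroDivisionError` line, the same shape as the printed value in step 2.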

Pull Request resolved: pytorch#161096
Approved by: https://github.com/c00w
…59233)

Fixes pytorch#158076

Basically, the gemm template generates code like
```
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
            &(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
            &(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
            &(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
            static_cast<int64_t>(m_end + ((-1LL)*m_start)),
            static_cast<int64_t>(Nr),
            static_cast<int64_t>(k_end + ((-1LL)*k_start)),
            static_cast<int64_t>(196LL),
            static_cast<int64_t>(80LL),
            static_cast<int64_t>(Nc_blocks*Nr)
        );
```

However, when the input tensor W has a storage offset, this results in a double offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.
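
The double-offset arithmetic can be shown with a small toy model in plain Python. The sizes here are made up for illustration (10 elements, offset 6), and the helper names are hypothetical; the real values come from the generated C++ above.

```python
# Toy model: a flat storage of 10 elements, with W viewing it from offset 6.
storage = list(range(10))
storage_offset = 6


def buggy_index(i: int) -> int:
    # The base pointer already points at storage[storage_offset], and the
    # generated index expression adds the same offset a second time.
    base = storage_offset
    return base + (storage_offset + i)


def fixed_index(i: int) -> int:
    # With an offset-free copy of W, the index is relative to element 0.
    return i


w_copy = storage[storage_offset:]  # fresh "matrix" with no storage offset

print(buggy_index(0) >= len(storage))  # True: 12 is past the end of storage
print(w_copy[fixed_index(0)])          # 6: the first element of W
```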

The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), but I think it's a reasonable fix. So `cpp_gemm_template.py` should handle input matrices with storage offsets properly.

I think a good way to fix this issue is to create a new matrix that has no storage offset.

When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.

BTW I've also examined the FX IRs generated by `torch.compile()`, as well as the generated python module, and they are correct.

The newly-added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed. It also resolves the crash reported in pytorch#158076.

I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX. I'd appreciate it if someone could help verify the test.

Pull Request resolved: pytorch#159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok
…#161203)

Summary:
We use tempfile.NamedTemporaryFile to create a temporary pt2 file in `test_nativert.py`

However, it is not recognized as an allowed file format and a warning will be thrown.
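
One way to give a temporary file a recognizable extension is the `suffix` argument of `tempfile.NamedTemporaryFile`. This is a sketch under the assumption that the warning is keyed on the file extension; the actual change in `test_nativert.py` may differ.

```python
import os
import tempfile

# suffix makes the generated temporary name end with ".pt2".
with tempfile.NamedTemporaryFile(suffix=".pt2") as f:
    ext = os.path.splitext(f.name)[1]

print(ext)  # .pt2
```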

Test Plan:
CI

Rollback Plan:

Differential Revision: D80740916

Pull Request resolved: pytorch#161203
Approved by: https://github.com/angelayi
…orch#161036)

Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (requires_grad) to determine whether to HOP or not: https://github.com/pytorch/pytorch/blob/5ee464db5c4293ac09521f9069fa7d2106680a7f/torch/_dynamo/variables/misc.py#L671

Stared at this with @anijain2305 yesterday, `Tensor.__setitem__` can update tensor metadata, and we can just run the fake prop and extract the output metadata from the updated FakeTensor.

FIXES pytorch#160901

It should also be the root cause behind the issue in pytorch/torchtitan#1604 @bdhirsh  @ruisizhang123

Pull Request resolved: pytorch#161036
Approved by: https://github.com/anijain2305
ghstack dependencies: pytorch#160805
…rch#161137)

It doesn't make sense to have this default to Maxwell, which is too old.  All other places in CI/CD need to override this value.  IMO, it makes more sense to not set this at all and let CI/CD jobs set it for their own use cases instead.  This is partly responsible for the build failure in pytorch#160988
Pull Request resolved: pytorch#161137
Approved by: https://github.com/msaroufim
Optimize [zero_grad doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html) format and description.

## Test Result

### Before

<img width="996" height="534" alt="image" src="https://github.com/user-attachments/assets/e1db973c-57e8-4525-90e7-0500cde2263d" />

### After

<img width="890" height="496" alt="image" src="https://github.com/user-attachments/assets/5579c4fb-a857-4030-9303-34770083d1a5" />

Pull Request resolved: pytorch#161239
Approved by: https://github.com/janeyx99
…#161196)

Enable maximum MSVC compatibility for oneAPI headers.

The key context is `The /permissive- option is compatible with almost all of the header files from the latest Windows Kits` from https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

Pull Request resolved: pytorch#161196
Approved by: https://github.com/jansel
Changes:
1. Math-related build options are not supported by MSVC, so skip them on Windows.
2. Move all math-related build options to the `_get_ffast_math_flags` function.
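
A rough sketch of collecting math-related flags in one helper and skipping them on Windows. The function name `get_math_flags_sketch`, its parameters, and the specific GCC/Clang flags are illustrative assumptions, not the actual contents of `_get_ffast_math_flags`.

```python
from typing import List


def get_math_flags_sketch(is_windows: bool, use_fast_math: bool = False) -> List[str]:
    # MSVC does not accept GCC/Clang-style math flags, so emit none on Windows.
    if is_windows:
        return []
    # Illustrative flags only; the real helper may choose different ones.
    if use_fast_math:
        return ["-ffast-math"]
    return ["-fno-finite-math-only"]


print(get_math_flags_sketch(is_windows=True))
print(get_math_flags_sketch(is_windows=False, use_fast_math=True))
```

Keeping every math-related option in one function means the platform check lives in a single place.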

Pull Request resolved: pytorch#161197
Approved by: https://github.com/jansel
# Motivation
pytorch#160505 enables background threads for the XPU host allocator. However, it hangs on Windows during program exit. Disable it for now until we narrow down the issue.

Pull Request resolved: pytorch#161242
Approved by: https://github.com/EikanWang
Removes a redundant if statement. Does not impact logic so no test changes needed.

Pull Request resolved: pytorch#161215
Approved by: https://github.com/StrongerXi
…58568)

Adds support for FlightRecorder in ProcessGroupXCCL.

See intel/torch-xpu-ops#1867 for XCCL implementation and more details.

Pull Request resolved: pytorch#158568
Approved by: https://github.com/guangyey, https://github.com/fduwjj
Add magma build 13.0 for Windows
Add cuda_install.bat 13.0 for Windows build
pytorch#159779

Pull Request resolved: pytorch#161073
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <[email protected]>
pytorch#159779

CUDA 13.0.0
NVSHMEM 3.3.20
CUDNN 9.12.0.46

Adding x86 linux builds for CUDA 13.
Adding libtorch docker.
Package naming changed for CUDA 13 (removed postfix -cu13 for some packages).

Preparation checklist:
1. Update index https://download.pytorch.org/whl/nightly/cu130 with pypi packages
2. Update packaging name based on https://pypi.org/project/cuda-toolkit/ metadata

Pull Request resolved: pytorch#160956
Approved by: https://github.com/atalman

Co-authored-by: atalman <[email protected]>
This reverts commit 523bffd.

Reverted pytorch#149218 on behalf of https://github.com/atalman due to Lets not use no-cache flags on test binaries ([comment](pytorch#149218 (comment)))
…sting_IFU_2025-08-22

# Conflicts:
#	.ci/docker/requirements-ci.txt
#	aten/src/ATen/Context.cpp
#	aten/src/ATen/cuda/tunable/GemmHipblaslt.h
#	aten/src/ATen/native/Normalization.cpp
#	aten/src/ATen/native/cuda/Blas.cpp
#	requirements.txt
#	test/distributed/_tools/test_fsdp2_mem_tracker.py
#	test/dynamo/test_activation_checkpointing.py
#	test/dynamo/test_structured_trace.py
#	test/inductor/test_combo_kernels.py
#	test/test_matmul_cuda.py
#	torch/_higher_order_ops/triton_kernel_wrap.py
#	torch/_inductor/choices.py
#	torch/_inductor/codegen/triton.py
#	torch/testing/_internal/common_cuda.py
@pragupta pragupta closed this Aug 29, 2025
@pragupta pragupta deleted the rocm7.1_internal_testing_IFU_2025-08-22 branch August 29, 2025 23:18