
Conversation


@pragupta pragupta commented Oct 2, 2025

Fixes #ISSUE_NUMBER

atalman and others added 30 commits September 9, 2024 14:05
* [RELEASE-ONLY CHANGES] Branch Cut for Release 2.5

* fix_lint
…t RC (pytorch#135517)

[Release only] Temp changes to build triton from pin
Use [email protected] for create_release.yml (pytorch#135528)

Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007

Due to a broken sync between:
```
actions/upload-artifact@v2
and
actions/[email protected]
```

Pull Request resolved: pytorch#135528
Approved by: https://github.com/kit1980, https://github.com/malfet

(cherry picked from commit 9b76449)

Co-authored-by: atalman <[email protected]>
…hanges to build from pin (pytorch#135613)

* Revert "[RELEASE-ONLY CHANGES] Temp changes to build triton from pin for first RC (pytorch#135517)"

This reverts commit 4a3dabd.

* Build triton from release branch

* triton_pin

* fix

* Bump triton xpu pin and release version (pytorch#135638)

Similar to pytorch#135627

Pull Request resolved: pytorch#135638
Approved by: https://github.com/atalman

---------

Co-authored-by: chuanqiw <[email protected]>
This PR updates the documentation to reflect the changes introduced in PyTorch 2.5 related to the ONNX exporter.

Pull Request resolved: pytorch#135372
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 5e14586)

Co-authored-by: Xavier Dupré <[email protected]>
…36276)

Revert "[Release only] Temporary disable triton xpu build (pytorch#136206)"

This reverts commit 6b14e6c.
…)" (#… (pytorch#135625)

Revert "[fx] Bypass custom __setattr__ in Node.__init__ (pytorch#135079)" (pytorch#135562)

This reverts commit 66da3b3.

pytorch#135079 breaks internal tests and needs to be reverted. Reverting with mergebot doesn't work since this PR is technically part of a stack, but, according to @jansel, it should be possible to revert it individually.
Pull Request resolved: pytorch#135562
Approved by: https://github.com/jansel, https://github.com/seemethere

Co-authored-by: Ivan Zaitsev <[email protected]>
…ytorch#135868)

1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that
2. Update the `__array__` method so that it works for tensors on GPU

Fixes pytorch#135700
Pull Request resolved: pytorch#135786
Approved by: https://github.com/titaiwangms
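
For context on fix (1): in Python, a class that defines `__eq__` without `__hash__` gets `__hash__` set to `None` and becomes unhashable. A minimal sketch of that rule, illustrative only and not the actual `SymbolicTensor` code:

```python
# Defining __eq__ without __hash__ makes instances unhashable.
class WithEq:
    def __eq__(self, other):
        return isinstance(other, WithEq)

class WithoutEq:
    pass

try:
    {WithEq()}                  # set membership requires hashing
except TypeError as e:
    print(e)                    # unhashable type: 'WithEq'

print(isinstance(hash(WithoutEq()), int))  # True: default identity hash works
```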
…ch#135574) (pytorch#135935)

This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy`, reducing occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`.

Pull Request resolved: pytorch#135574
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
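
A minimal sketch of the simplification idea (not Inductor's actual code): once a variable's range is known, modular and floor-division terms collapse to simpler expressions.

```python
# For any i with 0 <= i < n and n <= m:
#   i % m == i    (ModularIndexing collapses to the variable itself)
#   i // m == 0   (FloorDiv collapses to zero)
def check_simplification(n: int, m: int) -> None:
    assert n <= m
    for i in range(n):
        assert i % m == i
        assert i // m == 0

check_simplification(n=8, m=16)   # holds for every i in [0, 8)
print("simplification identities verified")
```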
…rch#135561)

[inductor] [cpp] fix the input contiguous check in max-autotune (pytorch#134982)

## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm.

In this PR, we check whether the input is contiguous as follows:
If it has a `FixedLayout`, we know the accurate strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can get the fill order of the buffer to decide whether it's contiguous. In all other cases, we don't use the GEMM template, since we can't infer whether the input is contiguous.

## Additional context
The current GEMM template only supports the case where `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input, which is a `View` IR, the `convert_to_reinterpret_view` call fails:
https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L4712-L4715

And it finally runs into this `copy_input` and returns a `FlexibleLayout`.
https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L4722

When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1`, but it is later decided to be a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing the accuracy issue in this model.
The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/mkldnn_lowerings.py#L1051) which calls [slice_nd](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/codegen/cpp_template_kernel.py#L150) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L2288) invokes
[decide_layout](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L2135) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.

Pull Request resolved: pytorch#134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
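
A hedged sketch of the stride condition discussed above; `gemm_template_supports` is a hypothetical helper, not Inductor's API:

```python
import torch

# The CPP GEMM template requires the innermost stride to be 1.
def gemm_template_supports(t: torch.Tensor) -> bool:
    return t.stride(-1) == 1

x = torch.randn(3072, 196)        # contiguous: stride (196, 1)
y = torch.randn(196, 3072).t()    # size (3072, 196), stride (1, 3072),
                                  # the layout this PR guards against
print(gemm_template_supports(x))  # True
print(gemm_template_supports(y))  # False
```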
pytorch#136000)

[DCP] Fixes the stateless optimizer issue of distributed state_dict (pytorch#135535)

Some optimizers don't have any state, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issue.

fixes: pytorch#133415

Pull Request resolved: pytorch#135535
Approved by: https://github.com/wz337

Co-authored-by: Chien-Chin Huang <[email protected]>
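
A hedged illustration of the "stateless optimizer" case, assuming vanilla SGD keeps no per-parameter state (verify on your version):

```python
import torch

# With no momentum, SGD's state_dict carries an empty 'state' entry even
# after a step, which the distributed state_dict code must tolerate.
p = torch.nn.Parameter(torch.randn(2))
opt = torch.optim.SGD([p], lr=0.1)
p.sum().backward()
opt.step()
print(opt.state_dict()["state"])  # {} -- no state to save or load
```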
) and Drop final None values as inputs for nodes in exporter graph (pytorch#135520) (pytorch#136005)

* [ONNX] Update fake mode usage in onnx docs (pytorch#135512)

Update fake mode usage in onnx docs
Pull Request resolved: pytorch#135512
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit 66db61f)

* [ONNX] Drop final None values as inputs for nodes in exporter graph (pytorch#135520)

When a value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the ONNX graph. To avoid this, if we have a list of inputs and the final few are all `None`, we strip them from the graph.
Pull Request resolved: pytorch#135520
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <[email protected]>
(cherry picked from commit e2f9a83)
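
A hedged sketch of the stripping rule described above; `strip_trailing_nones` is a hypothetical helper, not the exporter's actual function:

```python
# Only the trailing run of None values is dropped; interior Nones must stay
# so the remaining inputs keep their positions.
def strip_trailing_nones(inputs):
    out = list(inputs)
    while out and out[-1] is None:
        out.pop()
    return out

print(strip_trailing_nones(["x", None, "w", None, None]))  # ['x', None, 'w']
```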
Update the documentation for autocast on CPU
Update torch-xpu-ops pin (ATen XPU implementation) (pytorch#135647)

Release cycle for PyTorch 2.5:
1. Fix a runtime error on Windows: torch_xpu_ops_unary_binary_kernels.dll fails to load because the binary size is too large.

Pull Request resolved: pytorch#135647
Approved by: https://github.com/EikanWang
…ytorch#136203)

[ONNX] Fix numpy method to return the correct type (pytorch#136162)

The previous implementation of the `numpy()` method returned `fp64` when the tensor was `fp32`. This is unexpected and seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to define the `numpy()` method explicitly and added tests to guard the behavior.

This needs to be cherry-picked into torch 2.5
Pull Request resolved: pytorch#136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre

(cherry picked from commit 67b14ce)
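
A hedged illustration of the dtype pitfall being fixed, using generic NumPy behavior rather than the ONNX code itself:

```python
import numpy as np

# Round-tripping through Python floats silently promotes to float64, so a
# dtype-preserving numpy() implementation has to be explicit about the type.
arr32 = np.array([1.0, 2.0], dtype=np.float32)
print(np.asarray(arr32).dtype)          # float32 -- dtype preserved
print(np.array(arr32.tolist()).dtype)   # float64 -- silent promotion
```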
…orrectness check (pytorch#135932) (pytorch#136262)

Fixes pytorch#135657.
Aligned with AMP BF16: use a multiplier of 3 for the Inductor AMP FP16 benchmark correctness check.

Pull Request resolved: pytorch#135932
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
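
A hedged sketch of the widened check; the helper and numbers are hypothetical, not the benchmark harness itself:

```python
# FP16 runs get 3x the base tolerance, matching what AMP BF16 already used.
def passes(ref: float, actual: float, base_tol: float = 1e-3,
           multiplier: int = 3) -> bool:
    return abs(ref - actual) <= multiplier * base_tol

print(passes(1.000, 1.002))  # True: within 3 * base_tol
print(passes(1.000, 1.005))  # False: beyond the widened tolerance
```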
…rch#135793)

Fixes pytorch#132380. Adjusts the torchbench and huggingface skip-model lists so that we can remove `--no-skip` when running benchmarks on the 3 suites.

Pull Request resolved: pytorch#135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel

(cherry picked from commit 7ec17b4)
# Motivation
Fixes pytorch#135726.
After merging two free blocks, I mistakenly decreased the active memory size by the merged block size; it should be decreased by the original block size instead.

# Additional Context
Add a UT to guard this scenario.

Pull Request resolved: pytorch#135818
Approved by: https://github.com/EikanWang

(cherry picked from commit e6b6835)
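
A hedged sketch of the accounting fix, with hypothetical allocator fields standing in for the real caching-allocator code:

```python
# After coalescing a freed block with a free neighbor, active memory must
# drop by the ORIGINAL block size, not the merged size.
def free_and_merge(active_bytes: int, block_size: int, neighbor_free: int):
    merged_size = block_size + neighbor_free  # coalesced free block
    # buggy version: active_bytes -= merged_size  (over-subtracts)
    active_bytes -= block_size                # fixed: original size only
    return active_bytes, merged_size

print(free_and_merge(active_bytes=1024, block_size=256, neighbor_free=256))
# (768, 512): active memory drops by 256; the free block is now 512 bytes
```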
…#133852) (pytorch#136139)

Small bug fix: pytorch#124592 replaced `torch.version.hip` with `device_props` but made a mistake in porting the original logic.

The original code was:
`if torch.version.hip is not None:`

Which was incorrectly replaced by:
`if self.device_props.type != "hip":`

Another occurrence of pytorch#130617.

Pull Request resolved: pytorch#133852
Approved by: https://github.com/masnesral, https://github.com/malfet

(cherry picked from commit da587de)
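
A hedged illustration of the inverted condition, with hypothetical stand-ins for the real objects:

```python
# The device_props port must preserve the truthiness of the original
# torch.version.hip check.
class DeviceProps:
    type = "hip"

props = DeviceProps()
is_hip_original = True                # stands in for: torch.version.hip is not None
is_hip_buggy = props.type != "hip"    # False -- logic inverted
is_hip_fixed = props.type == "hip"    # True  -- matches the original
print(is_hip_original, is_hip_buggy, is_hip_fixed)
```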
…s.h (pytorch#136426)

Fix stride comparison failure when a size value equals one in ForeachUtils.h (pytorch#134546)

When a size value equals one, the tensor's stride value needs to be skipped in the comparison.
@ezyang
Pull Request resolved: pytorch#134546
Approved by: https://github.com/janeyx99
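
A hedged Python sketch of the C++ check described above; `same_strides` is a hypothetical helper, not ForeachUtils.h itself:

```python
import torch

# Strides of size-1 dimensions carry no layout information, so they must be
# skipped when comparing tensor strides.
def same_strides(a: torch.Tensor, b: torch.Tensor) -> bool:
    return all(sa == sb
               for size, sa, sb in zip(a.shape, a.stride(), b.stride())
               if size != 1)

x = torch.randn(4, 1, 8)                 # strides (8, 8, 1)
y = x.as_strided(x.size(), (8, 999, 1))  # bogus stride on the size-1 dim
print(same_strides(x, y))                # True: size-1 stride is ignored
```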
* [ROCm] skip test_fp8_cast_and_t on non-MI300 machines (pytorch#135917)

Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#135917
Approved by: https://github.com/malfet

(cherry picked from commit 6cdc70b)

* Skip pointwise associative scan tests due to regression (changes based on PR pytorch#135995)

* Cherry-pick fix from pytorch#135702

---------

Co-authored-by: Prachi Gupta <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>
Fix hardcoded ROCm paths in `Caffe2Targets.cmake` (pytorch#136283)

Fixes pytorch#131701

Use CMake imported targets more consistently to eliminate hardcoded paths.

Here are the new relevant sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```

```
set_target_properties(torch_hip PROPERTIES
  INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
  INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```

The HIPCUB dependency was not actually used, which is why it is removed here; the imported target had undesirable side effects.

Pull Request resolved: pytorch#136283
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman

(cherry picked from commit e8f1dd6)

Co-authored-by: Nichols A. Romero <[email protected]>
Don't push to https://ghcr.io/ on the release branch: we don't need it, and it fails with "unauthorized: unauthenticated: User cannot be authenticated with the token provided".
Disable iOS workflow (pytorch#136571)

See pytorch#136284
It's been broken for more than a week, and it does not seem like anyone cares about fixing it.
Once this lands, I'll reassign the issue to `oncall: mobile`.
Pull Request resolved: pytorch#136571
Approved by: https://github.com/huydhn, https://github.com/kit1980

(cherry picked from commit 5340feb)

Co-authored-by: Nikita Shulga <[email protected]>
Make test_skip_data_serialization regex more flexible (pytorch#136580)

Some CI machines seem to throw "Can't get local object" rather than
"Can't pickle local object".
Pull Request resolved: pytorch#136580
Approved by: https://github.com/mikaylagawarecki

(cherry picked from commit a0c76ea)

Co-authored-by: Jez Ng <[email protected]>
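
A hedged sketch of the relaxed pattern; illustrative, not the test's exact regex:

```python
import re

# Match both error spellings observed on CI machines.
pattern = re.compile(r"Can't (?:pickle|get) local object")
print(bool(pattern.search("Can't pickle local object 'f.<locals>.g'")))  # True
print(bool(pattern.search("Can't get local object 'f.<locals>.g'")))     # True
```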
dnikolaev-amd and others added 30 commits January 27, 2025 11:30
…OCm#1847)

Navi passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but that is required only for MI300. Fix the condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300.

(cherry picked from commit d4d0b07)
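
A hedged sketch of the fixed detection, assuming the ROCm `gcnArchName` device property is available on your build:

```python
import torch

# Match the gfx94 architecture string instead of the (9, 4) capability
# tuple, which Navi can also satisfy.
def is_mi300() -> bool:
    if not torch.cuda.is_available():
        return False
    gcn_arch = torch.cuda.get_device_properties(0).gcnArchName
    return "gfx94" in gcn_arch
```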
)

The default benchmark setting is now false. Under the new MIOpen behavior, when benchmarking is disabled, any shape that doesn't have a find hit triggers a quick search (the same behavior as the prior default), and that result is used. When benchmark is enabled, MIOpen performs an exhaustive search and updates any find DBs. MIOpen immediate mode is still available and is used when deterministic is true and benchmark is false.
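
A hedged usage sketch of the knobs described above (on ROCm the cudnn backend flags control MIOpen; behavior as described in this commit, so verify on your build):

```python
import torch

torch.backends.cudnn.benchmark = False     # find miss -> quick search
torch.backends.cudnn.deterministic = True  # with benchmark=False: immediate mode

# Setting torch.backends.cudnn.benchmark = True would instead trigger an
# exhaustive search and update the MIOpen find DBs.
```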
… analysis (ROCm#141… (ROCm#1768)

Fixes pytorch#140800.

On AMD, backend-specific args like `matrix_instr_nonkdim`, `waves_per_eu` and `kpack` are passed either directly to the kernel or via `triton.Config`, even though they don't exist as kernel parameters. Native Triton code handles those extra args [here](https://github.com/triton-lang/triton/blob/a6bb57d6285e723c58e87dd7cba263db6efff789/python/triton/runtime/jit.py#L594-L596). In this PR, we add similar handling to the TTIR analysis code to avoid bailing out.

Pull Request resolved: pytorch#141062
Approved by: https://github.com/oulgen

(cherry picked from commit b740a1b)

Fixes #ISSUE_NUMBER

Co-authored-by: Adnan Akhundov <[email protected]>
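
A hedged sketch mirroring the handling described above; `filter_kernel_kwargs` is a hypothetical helper, and the linked jit.py holds Triton's real code:

```python
# AMD-only launch args are not kernel parameters, so the analysis drops
# them instead of bailing out.
AMD_SPECIFIC_ARGS = {"matrix_instr_nonkdim", "waves_per_eu", "kpack"}

def filter_kernel_kwargs(kwargs: dict, kernel_params: set) -> dict:
    return {k: v for k, v in kwargs.items()
            if k in kernel_params or k not in AMD_SPECIFIC_ARGS}

print(filter_kernel_kwargs({"BLOCK": 128, "waves_per_eu": 2},
                           kernel_params={"BLOCK"}))
# {'BLOCK': 128} -- the backend-only arg is filtered out
```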
…orch#144865) (ROCm#1869)

Fixes pytorch#144855

Follows the approach in pytorch#141923, using int64 types to increase the INT_MAX limits.

Pull Request resolved: pytorch#144865
Approved by: https://github.com/eqy

(cherry picked from commit 082fab0)
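
A hedged illustration of the limit being lifted:

```python
# 32-bit index arithmetic overflows past INT_MAX (2**31 - 1); int64 does not.
INT_MAX = 2**31 - 1
numel = 3_000_000_000          # a tensor with more than INT_MAX elements
print(numel > INT_MAX)         # True: int32 indexing would overflow
print(numel <= 2**63 - 1)      # True: int64 indexing stays in range
```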
* Unroll loops manually to hide memory access latency
* Strided access for coalesced memory accesses

Co-authors: @akadutta @doru1004 @amd-hhashemi @carlobertolli
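
A hedged plain-Python model of the kernel pattern above; the real code is a GPU kernel, so this only illustrates the indexing scheme:

```python
# Each lane reads elements num_lanes apart, so adjacent lanes touch adjacent
# memory (coalesced); a manual 4x unroll keeps several loads in flight to
# hide memory latency.
def strided_sum(data, lane, num_lanes):
    total, i, n = 0.0, lane, len(data)
    while i + 3 * num_lanes < n:          # body unrolled by 4
        total += (data[i] + data[i + num_lanes]
                  + data[i + 2 * num_lanes] + data[i + 3 * num_lanes])
        i += 4 * num_lanes
    while i < n:                          # remainder loop
        total += data[i]
        i += num_lanes
    return total

data = list(range(100))
print(sum(strided_sum(data, lane, 8) for lane in range(8)))  # 4950.0
```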
Tune 3D tensor sums when not using the fastest dimension.
…Cm#1894)

Note that I had to copy and paste this one-line change because the file has changed so much that the cherry-pick command fails.

TunableOp UT will fail because the regular expression in the test will
not work for future versions of ROCm.

Pull Request resolved: pytorch#146548
Approved by: https://github.com/jeffdaily
Update the Apex 1.5.0 commit id for the fused gradient accumulator feature.
…covery (pytorch#144026) (ROCm#1895)

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292)

Reland of pytorch#140320 after a failing test on trunk. Fixes potential environment clobbering in the test and makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes pytorch#140318

Pull Request resolved: pytorch#142292
Approved by: https://github.com/jataylo, https://github.com/huydhn,
https://github.com/jeffdaily

Co-authored-by: Jack Taylor <[email protected]>
Co-authored-by: Jeff Daily <[email protected]>
(cherry picked from commit c0d7106)

Co-authored-by: Tal Ben-Nun <[email protected]>
(cherry picked from commit 23e390c)

Fixes #ISSUE_NUMBER

Co-authored-by: pytorchbot <[email protected]>
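
A hedged usage sketch: with this change, device discovery honors `ROCR_VISIBLE_DEVICES` (the ROCm runtime's device filter), so setting it before importing torch restricts which GPUs PyTorch enumerates.

```python
import os

os.environ["ROCR_VISIBLE_DEVICES"] = "0"   # expose only the first GPU

import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count())       # 1 on a multi-GPU ROCm machine
```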
All Inductor tests are completely disabled for Navi4.

- test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_*_cuda_bfloat16 - skipped on Navi due to a triton bf16-to-fp16 conversion error
- test_jit.py::TestFrozenOptimizations::test_freeze_conv_relu_fusion* - skipped on Navi4, not supported by MIOpen
- nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_relu_cuda_float32 - skipped on Navi4, not supported by MIOpen
- profiler/test_cpp_thread.py::CppThreadTest::test_without_enable_profiler_in_child_thread - flaky on Navi
- test_ops.py::TestCommonCUDA::test_non_standard_bool_values_*
- dynamo/test_activation_checkpointing.py::ActivationCheckpointingViaTagsTests::test_pattern_matcher

A sketch of an arch-conditional skip helper follows the list.
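
The sketch below uses a hypothetical `skip_on_navi4` helper and assumes Navi4 reports a gfx12-family `gcnArchName`; verify on your hardware:

```python
import unittest

import torch

def skip_on_navi4(reason):
    # Skip when the current GPU reports a Navi4 (gfx12-family) architecture.
    gcn = (torch.cuda.get_device_properties(0).gcnArchName
           if torch.cuda.is_available() else "")
    return unittest.skipIf("gfx12" in gcn, reason)

@skip_on_navi4("not supported by MIOpen")
def test_freeze_conv_relu_fusion(self):
    ...
```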
…se/OneDupOpt (ROCm#1897)" (ROCm#1918)

This reverts commit a6f1375. We have a better fix that is being validated (pytorch#146420).
TopK on ROCm performs better on the test suite with the default config.

Pull Request resolved: pytorch#146387
Approved by: https://github.com/malfet, https://github.com/ngimel
In this approach, we catch any lane within a wave that is doing fast atomics to the same destination address and compute the sum on the CU. This leads to a 3x end-to-end improvement in scatter_add performance.
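
A hedged plain-Python model of the wave-level aggregation; the real code is a GPU kernel, so this only illustrates the reduction:

```python
from collections import defaultdict

# Values from lanes that target the same destination are combined once on
# the CU, so only one atomic add per unique address is issued.
def wave_scatter_add(out, dests, vals):
    partial = defaultdict(float)
    for d, v in zip(dests, vals):     # combine within the wave
        partial[d] += v
    for d, v in partial.items():      # one atomicAdd per unique destination
        out[d] += v

out = [0.0] * 4
wave_scatter_add(out, dests=[2, 2, 2, 1], vals=[1.0, 1.0, 1.0, 5.0])
print(out)  # [0.0, 5.0, 3.0, 0.0] -- three lanes to index 2 became one add
```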