forked from ROCm/pytorch
Pg scatter add dup fix 2.5 #21
Open
pragupta
wants to merge 160 commits into rocm7.1_internal_testing from pg-scatter-add-dup-fix-2.5
Conversation
* [RELEASE-ONLY CHANGES] Branch Cut for Release 2.5 * fix_lint
…t RC (pytorch#135517) [Release only] Temp changes to build triton from pin
Use [email protected] for create_release.yml (pytorch#135528) Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007, due to a broken sync between ``` actions/upload-artifact@v2 and actions/[email protected] ``` Pull Request resolved: pytorch#135528 Approved by: https://github.com/kit1980, https://github.com/malfet (cherry picked from commit 9b76449) Co-authored-by: atalman <[email protected]>
…hanges to build from pin (pytorch#135613)
* Revert "[RELEASE-ONLY CHANGES] Temp changes to build triton from pin for first RC (pytorch#135517)". This reverts commit 4a3dabd.
* Build triton from release branch
* triton_pin
* fix
* Bump triton xpu pin and release version (pytorch#135638). Similar to pytorch#135627. Pull Request resolved: pytorch#135638 Approved by: https://github.com/atalman
Co-authored-by: chuanqiw <[email protected]>
The PR updates the documentation to reflect the changes introduced in PyTorch 2.5 related to the ONNX exporter. Pull Request resolved: pytorch#135372 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <[email protected]> (cherry picked from commit 5e14586) Co-authored-by: Xavier Dupré <[email protected]>
…) Fixes pytorch#125158 Pull Request resolved: pytorch#135594 Approved by: https://github.com/justinchuby (cherry picked from commit e48ee2c)
…36276) Revert "[Release only] Temporary disable triton xpu build (pytorch#136206)" This reverts commit 6b14e6c.
…)" (#… (pytorch#135625) Revert "[fx] Bypass custom __setattr__ in Node.__init__ (pytorch#135079)" (pytorch#135562) This reverts commit 66da3b3. pytorch#135079 breaks internal tests and needs to be reverted. Reverting with mergebot doesn't work since this PR is technically part of a stack, but, according to @jansel, it should be possible to revert it individually. Pull Request resolved: pytorch#135562 Approved by: https://github.com/jansel, https://github.com/seemethere Co-authored-by: Ivan Zaitsev <[email protected]>
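For context, the reverted change relies on a standard Python idiom: `object.__setattr__` skips a class's custom `__setattr__` hook, which can be expensive when it runs for every attribute assigned in `__init__`. A minimal sketch of the idiom (the class here is illustrative, not the actual `fx.Node` code):

```python
class Node:
    def __setattr__(self, name, value):
        # stand-in for expensive bookkeeping on every attribute write
        super().__setattr__(name, value)

n = Node.__new__(Node)              # allocate without running __init__
object.__setattr__(n, "name", "x")  # bypasses Node.__setattr__ entirely
assert n.name == "x"
```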
…ytorch#135868)
1. Remove `__eq__` to make `SymbolicTensor` hashable, and test for that
2. Update the `__array__` method so that it works for tensors on GPU
Fixes pytorch#135700 Pull Request resolved: pytorch#135786 Approved by: https://github.com/titaiwangms
…ch#135574) (pytorch#135935) This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`. Pull Request resolved: pytorch#135574 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
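To illustrate the kind of simplification range information enables (a toy sketch in plain sympy, not the actual Inductor code): once a variable is known to satisfy `0 <= x < 8`, `floor(x / 8)` collapses to 0 and `x % 8` collapses to `x`.

```python
import sympy

x = sympy.Symbol("x", integer=True, nonnegative=True)

def simplify_with_range(expr, var, upper):
    """Toy simplifier: given 0 <= var < upper, rewrite
    floor(var/d) -> 0 and Mod(var, d) -> var whenever upper <= d."""
    def rule(e):
        if isinstance(e, sympy.floor):
            num, den = e.args[0].as_numer_denom()
            if num == var and den.is_Integer and upper <= den:
                return sympy.Integer(0)
        if isinstance(e, sympy.Mod):
            a, d = e.args
            if a == var and d.is_Integer and upper <= d:
                return var
        return e
    return expr.replace(lambda e: isinstance(e, (sympy.floor, sympy.Mod)), rule)

idx = sympy.floor(x / 8) * 8 + sympy.Mod(x, 8)
print(simplify_with_range(idx, x, 8))  # x
```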
…rch#135561) [inductor] [cpp] fix the input contiguous check in max-autotune (pytorch#134982)
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and the BF16 accuracy failure of `volo_d1_224` in timm. In this PR, we check whether the input is contiguous as follows: if it has a `FixedLayout`, we know the accurate strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can get the fill order of the buffer to decide whether it's contiguous. In all other cases, we won't use the GEMM template, since we can't infer whether the input is contiguous.
## Additional context
The current GEMM template only supports the case `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input, which is a `View` IR, the `convert_to_reinterpret_view` call fails: https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L4712-L4715 and it finally runs into `copy_input` and returns a `FlexibleLayout`: https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L4722 When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1`, but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing the accuracy issue in this model. The `FlexibleLayout` is converted to a `FixedLayout` during [CppPackedGemmTemplate.add_choices](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/mkldnn_lowerings.py#L1051), which calls [slice_nd](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/codegen/cpp_template_kernel.py#L150) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L2288) invokes [decide_layout](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L2135) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.
Pull Request resolved: pytorch#134982 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
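A simplified sketch of the decision described above (the layout classes here are minimal stand-ins for Inductor's `FixedLayout`/`FlexibleLayout`, not the actual implementation):

```python
from dataclasses import dataclass

# minimal stand-ins for Inductor's layout classes (illustrative only)
@dataclass
class FixedLayout:
    size: tuple
    stride: tuple

@dataclass
class FlexibleLayout:
    size: tuple

def is_contiguous_for_gemm(layout, fill_order=None):
    """Only trust strides we can actually know:
    FixedLayout -> strides are accurate, require innermost stride 1;
    FlexibleLayout with a known fill order -> infer contiguity from it;
    otherwise -> unknown, so the GEMM template is skipped."""
    if isinstance(layout, FixedLayout):
        return layout.stride[-1] == 1
    if isinstance(layout, FlexibleLayout) and fill_order is not None:
        # contiguous iff the innermost dimension is filled fastest
        return fill_order[-1] == len(layout.size) - 1
    return False

# the problematic layout from the commit message is correctly rejected
print(is_contiguous_for_gemm(FixedLayout((3072, 196), (1, 3072))))  # False
```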
pytorch#136000) [DCP] Fixes the stateless optimizer issue of distributed state_dict (pytorch#135535) Some optimizers don't have states, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issue. fixes: pytorch#133415 Pull Request resolved: pytorch#135535 Approved by: https://github.com/wz337 Co-authored-by: Chien-Chin Huang <[email protected]>
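The "stateless" case is easy to reproduce with public APIs: plain SGD carries no per-parameter state, so its state dict stays empty even after stepping (a minimal repro of the situation the fix has to handle):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # no momentum -> stateless

print(opt.state_dict()["state"])  # {} -- nothing stored per parameter
model(torch.randn(2, 4)).sum().backward()
opt.step()
print(opt.state_dict()["state"])  # still {} for vanilla SGD
```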
…) and Drop final None values as inputs for nodes in exporter graph (pytorch#135520) (pytorch#136005)
* [ONNX] Update fake mode usage in onnx docs (pytorch#135512) Pull Request resolved: pytorch#135512 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <[email protected]> (cherry picked from commit 66db61f)
* [ONNX] Drop final None values as inputs for nodes in exporter graph (pytorch#135520) When the value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the ONNX graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them in the graph. Pull Request resolved: pytorch#135520 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <[email protected]> (cherry picked from commit e2f9a83)
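The trailing-`None` stripping itself amounts to a small list transform (a sketch of the idea, not the exporter's actual code):

```python
def drop_trailing_nones(inputs):
    """Strip only trailing Nones; interior Nones must stay, since they
    still need an empty-string placeholder in the ONNX graph."""
    end = len(inputs)
    while end > 0 and inputs[end - 1] is None:
        end -= 1
    return inputs[:end]

print(drop_trailing_nones(["x", None, "w", None, None]))  # ['x', None, 'w']
```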
Update documentation for autocast on CPU
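For reference, CPU autocast uses the standard public API, with BF16 as the supported low-precision dtype on CPU:

```python
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16 -- linear layers run in BF16 under autocast
```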
Update torch-xpu-ops pin (ATen XPU implementation) (pytorch#135647) Release cycle for PyTorch 2.5: 1. Fixes a runtime error on Windows where torch_xpu_ops_unary_binary_kernels.dll fails to load because the binary size is too large. Pull Request resolved: pytorch#135647 Approved by: https://github.com/EikanWang
…ytorch#136203) [ONNX] Fix numpy method to return the correct type (pytorch#136162) The previous implementation of the `numpy()` method returned `fp64` when the tensor is `fp32`. This is unexpected and seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to implement the `numpy()` method explicitly and added tests to guard the behavior. This needs to be cherry-picked into torch 2.5. Pull Request resolved: pytorch#136162 Approved by: https://github.com/gramalingam, https://github.com/xadupre (cherry picked from commit 67b14ce)
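The invariant the new tests guard matches plain `torch.Tensor.numpy()`, which preserves dtype (a minimal illustration, not the exporter's test code):

```python
import numpy as np
import torch

t = torch.tensor([1.0, 2.0], dtype=torch.float32)
a = t.numpy()
assert a.dtype == np.float32, a.dtype  # must stay fp32, not widen to fp64
```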
…orrectness check (pytorch#135932) (pytorch#136262) Fixes pytorch#135657. Aligned with AMP BF16 by using a multiplier of 3 for the Inductor AMP FP16 benchmark correctness check. Pull Request resolved: pytorch#135932 Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
…rch#135793) Fixes pytorch#132380; adjusts the torchbench and huggingface skip-model lists so that `--no-skip` can be removed when running benchmarks on the 3 suites. Pull Request resolved: pytorch#135193 Approved by: https://github.com/chuanqi129, https://github.com/jansel (cherry picked from commit 7ec17b4)
# Motivation
Fix pytorch#135726. After merging two free blocks, I made a mistake with the size used to decrease the active memory size: it should be the original block size, not the merged block size.
# Additional Context
Add a UT to guard this scenario.
Pull Request resolved: pytorch#135818 Approved by: https://github.com/EikanWang (cherry picked from commit e6b6835)
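A toy sketch of the bug shape (hypothetical allocator bookkeeping, not the actual XPU caching-allocator code): the stat must be decremented by the freed block's own size, captured before coalescing.

```python
from dataclasses import dataclass

@dataclass
class Block:
    size: int
    free: bool = True

def free_block(block, adjacent_free, stats):
    freed = block.size                    # capture BEFORE merging
    if adjacent_free is not None:
        block.size += adjacent_free.size  # coalesce into one larger free block
    stats["active_bytes"] -= freed        # the bug subtracted the merged size

stats = {"active_bytes": 100}
free_block(Block(100), Block(200), stats)
print(stats["active_bytes"])  # 0; the buggy version would print -200
```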
…#133852) (pytorch#136139) Small bug fix: pytorch#124592 replaced torch.version.hip with device_props but made a mistake in porting the original logic. The original code was: `if torch.version.hip is not None:` which was incorrectly replaced by: `if self.device_props.type != "hip":` Another occurrence of pytorch#130617. Pull Request resolved: pytorch#133852 Approved by: https://github.com/masnesral, https://github.com/malfet (cherry picked from commit da587de)
…s.h (pytorch#136426) Fix stride-comparison failure when a size value equals one in ForeachUtils.h (pytorch#134546). When a size value equals one, that dimension's stride must be skipped in the comparison. @ezyang Pull Request resolved: pytorch#134546 Approved by: https://github.com/janeyx99
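The rule being fixed, sketched in Python rather than the actual ForeachUtils.h code: a size-1 dimension contributes nothing to memory layout, so its stride may legitimately differ between otherwise matching tensors.

```python
import torch

def strides_match(a, b):
    for size, sa, sb in zip(a.shape, a.stride(), b.stride()):
        if size == 1:      # a size-1 dim's stride is meaningless; skip it
            continue
        if sa != sb:
            return False
    return True

x = torch.empty(1, 4)           # stride (4, 1)
y = torch.empty(1, 8)[:, :4]    # shape (1, 4), stride (8, 1)
print(x.stride(), y.stride())   # strides differ only in the size-1 dim
print(strides_match(x, y))      # True once size-1 dims are skipped
```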
* [ROCm] skip test_fp8_cast_and_t on non-MI300 machines (pytorch#135917) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#135917 Approved by: https://github.com/malfet (cherry picked from commit 6cdc70b) * Skip pointwise associative scan tests due to regression (changes based on PR pytorch#135995) * Cherry-pick fix from pytorch#135702 --------- Co-authored-by: Prachi Gupta <[email protected]> Co-authored-by: Jithun Nair <[email protected]>
Pull Request resolved: pytorch#132555 Approved by: https://github.com/pruthvistony, https://github.com/malfet
…) Upgrade ROCm CI builds to py3.10 Pull Request resolved: pytorch#134108 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/atalman Co-authored-by: Jack Taylor <[email protected]>
Fix hardcoded ROCm paths in `Caffe2Targets.cmake` (pytorch#136283) Fixes pytorch#131701. Uses CMake imported targets more consistently to eliminate hardcoded paths. Here are the new relevant sections of Caffe2Targets.cmake: ``` set_target_properties(c10_hip PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include" INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64" ) ``` ``` set_target_properties(torch_hip PROPERTIES INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL" INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS" INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include" INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver" ) ``` The HIPCUB dependency was not actually used, which is why it is removed here; the imported target had undesirable side effects. Pull Request resolved: pytorch#136283 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman (cherry picked from commit e8f1dd6) Co-authored-by: Nichols A. Romero <[email protected]>
Don't push to https://ghcr.io/ on the release branch: we don't need it, and it fails with "unauthorized: unauthenticated: User cannot be authenticated with the token provided".
Disable iOS workflow (pytorch#136571) See pytorch#136284. It's been broken for more than a week and it does not seem like anyone cares about fixing it. Once this lands, I'll reassign the issue to `oncall: mobile`. Pull Request resolved: pytorch#136571 Approved by: https://github.com/huydhn, https://github.com/kit1980 (cherry picked from commit 5340feb) Co-authored-by: Nikita Shulga <[email protected]>
Make test_skip_data_serialization regex more flexible (pytorch#136580) Some CI machines seem to throw "Can't get local object" rather than "Can't pickle local object". Pull Request resolved: pytorch#136580 Approved by: https://github.com/mikaylagawarecki (cherry picked from commit a0c76ea) Co-authored-by: Jez Ng <[email protected]>
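A pattern of that shape accepts either message variant (illustrative; the test's actual regex may differ):

```python
import re

# match either error-message variant seen across CI machines
pat = re.compile(r"Can't (?:get|pickle) local object")

assert pat.search("AttributeError: Can't pickle local object 'f.<locals>.g'")
assert pat.search("AttributeError: Can't get local object 'f.<locals>.g'")
```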
…) The default benchmark setting is now false. With the new MIOpen behavior, when benchmarking is disabled, any shape that doesn't have a find hit triggers a quick search (the same behavior as the prior default) and uses that result. When benchmarking is enabled, MIOpen performs an exhaustive search and updates any DBs. MIOpen immediate mode is still available and is used when deterministic is true and benchmark is false.
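These modes are driven by the usual cuDNN-namespaced flags, which map to MIOpen on ROCm (standard public API; the comments restate the behavior described above):

```python
import torch

# benchmark off (the new default): a find miss triggers a quick search
torch.backends.cudnn.benchmark = False

# benchmark on: exhaustive search, and the results update the find DBs
torch.backends.cudnn.benchmark = True

# immediate mode: deterministic on with benchmark off
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```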
Follow-up to ROCm#1851.
… analysis (ROCm#141… (ROCm#1768) Fixes pytorch#140800. On AMD, backend-specific args like `matrix_instr_nonkdim`, `waves_per_eu` and `kpack` are passed either directly to the kernel or via `triton.Config`, even though they don't exist as kernel parameters. Native Triton code handles those extra args [here](https://github.com/triton-lang/triton/blob/a6bb57d6285e723c58e87dd7cba263db6efff789/python/triton/runtime/jit.py#L594-L596). In this PR, we add similar handling to the TTIR analysis code to avoid bailing out. Pull Request resolved: pytorch#141062 Approved by: https://github.com/oulgen (cherry picked from commit b740a1b) Fixes #ISSUE_NUMBER Co-authored-by: Adnan Akhundov <[email protected]>
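The handling amounts to diverting launch-only options away from the kernel's parameter list instead of erroring on them (a hedged sketch with illustrative names, not the actual analysis code):

```python
# AMD backend launch options that are not kernel parameters
AMD_LAUNCH_OPTIONS = {"matrix_instr_nonkdim", "waves_per_eu", "kpack"}

def split_kernel_kwargs(kernel_param_names, kwargs):
    """Keep real kernel args; divert backend-only options to launch config."""
    kernel_args, launch_opts = {}, {}
    for name, value in kwargs.items():
        if name in kernel_param_names:
            kernel_args[name] = value
        elif name in AMD_LAUNCH_OPTIONS:
            launch_opts[name] = value
        else:
            raise TypeError(f"unexpected kernel argument: {name}")
    return kernel_args, launch_opts

args, opts = split_kernel_kwargs(
    {"x_ptr", "n"}, {"x_ptr": 0xDEAD, "n": 128, "waves_per_eu": 2}
)
print(args, opts)  # {'x_ptr': 57005, 'n': 128} {'waves_per_eu': 2}
```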
…orch#144865) (ROCm#1869) Fixes pytorch#144855 Follows approach in pytorch#141923 to use int64 types to increase INT_MAX limits Pull Request resolved: pytorch#144865 Approved by: https://github.com/eqy (cherry picked from commit 082fab0)
* Unroll loops manually to hide memory access latency
* Strided access for coalesced memory accesses
Co-authors: @akadutta @doru1004 @amd-hhashemi @carlobertolli
Tune 3D tensor sums when not using the fastest dimension.
Fixes SWDEV-501618
…Cm#1894) Note that I had to copy and paste this one-line change because the file has changed so much that the cherry-pick command fails. The TunableOp UT will fail because the regular expression in the test will not work for future versions of ROCm. Pull Request resolved: pytorch#146548 Approved by: https://github.com/jeffdaily
…pOpt (ROCm#1897) cherry-pick of pytorch#146448 Co-author: @amd-hhashemi
…ame (ROCm#1904) Fixes #ISSUE_NUMBER
Update the Apex 1.5.0 commit id for the fused gradient accumulator feature.
…covery (pytorch#144026) (ROCm#1895) Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292) Reland of pytorch#140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes pytorch#140318 Pull Request resolved: pytorch#142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <[email protected]> Co-authored-by: Jeff Daily <[email protected]> (cherry picked from commit c0d7106) Co-authored-by: Tal Ben-Nun <[email protected]> (cherry picked from commit 23e390c) Fixes #ISSUE_NUMBER Co-authored-by: pytorchbot <[email protected]>
- all Inductor tests completely disabled for Navi4
- test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_*_cuda_bfloat16 - skipped on Navi due to a triton bf16-to-fp16 conversion error
- test_jit.py::TestFrozenOptimizations::test_freeze_conv_relu_fusion* - skipped on Navi4, not supported by MIOpen
- nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_relu_cuda_float32 - skipped on Navi4, not supported by MIOpen
- profiler/test_cpp_thread.py::CppThreadTest::test_without_enable_profiler_in_child_thread - flaky on Navi
- test_ops.py::TestCommonCUDA::test_non_standard_bool_values_*
- dynamo/test_activation_checkpointing.py::ActivationCheckpointingViaTagsTests::test_pattern_matcher
…se/OneDupOpt (ROCm#1897)" (ROCm#1918) This reverts commit a6f1375. We have a better fix that is being validated (pytorch#146420)
…rch#145536) Also marginally improves CUDA perf. Pull Request resolved: pytorch#145536 Approved by: https://github.com/eqy
TopK on ROCm performs better on the test suite with the default config. Pull Request resolved: pytorch#146387 Approved by: https://github.com/malfet, https://github.com/ngimel
In this approach, we catch any lanes within a wave that are doing fast atomics to the same destination address and compute the sum on the CU. This leads to a 3x end-to-end improvement in scatter_add performance.
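The duplicate-destination pattern this targets shows up directly in `scatter_add_` (standard public API; the wave-level aggregation itself happens inside the kernel): many lanes writing to one index produce contended atomics, which the per-wave sum collapses into a single update.

```python
import torch

out = torch.zeros(4)
index = torch.tensor([0, 0, 0, 2])  # three lanes target the same slot
src = torch.ones(4)

out.scatter_add_(0, index, src)
print(out)  # tensor([3., 0., 1., 0.]) -- the three index-0 adds collapse to +3
```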