forked from ROCm/pytorch
Pg scatter add dup fix 2.5 #21
Open
pragupta
wants to merge 160 commits into rocm7.1_internal_testing from pg-scatter-add-dup-fix-2.5
Conversation
* [RELEASE-ONLY CHANGES] Branch Cut for Release 2.5 * fix_lint
…t RC (pytorch#135517) [Release only] Temp changes to build triton from pin
Use [email protected] for create_release.yml (pytorch#135528) Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007, due to a broken sync between ``` actions/upload-artifact@v2 and actions/[email protected] ``` Pull Request resolved: pytorch#135528 Approved by: https://github.com/kit1980, https://github.com/malfet (cherry picked from commit 9b76449) Co-authored-by: atalman <[email protected]>
…hanges to build from pin (pytorch#135613)
* Revert "[RELEASE-ONLY CHANGES] Temp changes to build triton from pin for first RC (pytorch#135517)". This reverts commit 4a3dabd.
* Build triton from release branch
* triton_pin
* fix
* Bump triton xpu pin and release version (pytorch#135638). Similar to pytorch#135627. Pull Request resolved: pytorch#135638 Approved by: https://github.com/atalman
Co-authored-by: chuanqiw <[email protected]>
The PR updates the documentation to reflect the changes introduced in PyTorch 2.5 related to the ONNX exporter. Pull Request resolved: pytorch#135372 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <[email protected]> (cherry picked from commit 5e14586) Co-authored-by: Xavier Dupré <[email protected]>
…) Fixes pytorch#125158 Pull Request resolved: pytorch#135594 Approved by: https://github.com/justinchuby (cherry picked from commit e48ee2c)
…36276) Revert "[Release only] Temporary disable triton xpu build (pytorch#136206)" This reverts commit 6b14e6c.
…)" (#… (pytorch#135625) Revert "[fx] Bypass custom __setattr__ in Node.__init__ (pytorch#135079)" (pytorch#135562) This reverts commit 66da3b3. pytorch#135079 breaks internal tests and needs to be reverted. Reverting with mergebot doesn't work since this PR is technically part of a stack, but, according to @jansel, it should be possible to revert it individually. Pull Request resolved: pytorch#135562 Approved by: https://github.com/jansel, https://github.com/seemethere Co-authored-by: Ivan Zaitsev <[email protected]>
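For context, the reverted change relies on a standard Python idiom: `object.__setattr__` skips a class's custom `__setattr__` hook, which can be expensive when it runs for every attribute assigned in `__init__`. A minimal sketch of the idiom (the class here is illustrative, not the actual `fx.Node` code):

```python
class Node:
    def __setattr__(self, name, value):
        # stand-in for expensive bookkeeping on every attribute write
        super().__setattr__(name, value)

n = Node.__new__(Node)              # allocate without running __init__
object.__setattr__(n, "name", "x")  # bypasses Node.__setattr__ entirely
assert n.name == "x"
```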
…ytorch#135868)
1. Remove `__eq__` to make `SymbolicTensor` hashable, and test for that
2. Update the `__array__` method so that it works for tensors on GPU
Fixes pytorch#135700 Pull Request resolved: pytorch#135786 Approved by: https://github.com/titaiwangms
…ch#135574) (pytorch#135935) This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`. Pull Request resolved: pytorch#135574 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
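To illustrate the kind of simplification range information enables (a toy sketch in plain sympy, not the actual Inductor code): once a variable is known to satisfy `0 <= x < 8`, `floor(x / 8)` collapses to 0 and `x % 8` collapses to `x`.

```python
import sympy

x = sympy.Symbol("x", integer=True, nonnegative=True)

def simplify_with_range(expr, var, upper):
    """Toy simplifier: given 0 <= var < upper, rewrite
    floor(var/d) -> 0 and Mod(var, d) -> var whenever upper <= d."""
    def rule(e):
        if isinstance(e, sympy.floor):
            num, den = e.args[0].as_numer_denom()
            if num == var and den.is_Integer and upper <= den:
                return sympy.Integer(0)
        if isinstance(e, sympy.Mod):
            a, d = e.args
            if a == var and d.is_Integer and upper <= d:
                return var
        return e
    return expr.replace(lambda e: isinstance(e, (sympy.floor, sympy.Mod)), rule)

idx = sympy.floor(x / 8) * 8 + sympy.Mod(x, 8)
print(simplify_with_range(idx, x, 8))  # x
```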
…rch#135561) [inductor] [cpp] fix the input contiguous check in max-autotune (pytorch#134982)
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and the BF16 accuracy failure of `volo_d1_224` in timm. In this PR, we check whether the input is contiguous as follows: if it has a `FixedLayout`, we know the accurate strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can get the fill order of the buffer to decide whether it's contiguous. In all other cases, we won't use the GEMM template, since we can't infer whether the input is contiguous.
## Additional context
The current GEMM template only supports the case `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input, which is a `View` IR, the `convert_to_reinterpret_view` call fails: https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L4712-L4715 and it finally runs into `copy_input` and returns a `FlexibleLayout`: https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L4722 When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1`, but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing the accuracy issue in this model. The `FlexibleLayout` is converted to a `FixedLayout` during [CppPackedGemmTemplate.add_choices](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/mkldnn_lowerings.py#L1051), which calls [slice_nd](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/codegen/cpp_template_kernel.py#L150) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L2288) invokes [decide_layout](https://github.com/pytorch/pytorch/blob/d14fe3ffeddff743af09ce7c8d91127940ddf7ed/torch/_inductor/ir.py#L2135) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.
Pull Request resolved: pytorch#134982 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
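A simplified sketch of the decision described above (the layout classes here are minimal stand-ins for Inductor's `FixedLayout`/`FlexibleLayout`, not the actual implementation):

```python
from dataclasses import dataclass

# minimal stand-ins for Inductor's layout classes (illustrative only)
@dataclass
class FixedLayout:
    size: tuple
    stride: tuple

@dataclass
class FlexibleLayout:
    size: tuple

def is_contiguous_for_gemm(layout, fill_order=None):
    """Only trust strides we can actually know:
    FixedLayout -> strides are accurate, require innermost stride 1;
    FlexibleLayout with a known fill order -> infer contiguity from it;
    otherwise -> unknown, so the GEMM template is skipped."""
    if isinstance(layout, FixedLayout):
        return layout.stride[-1] == 1
    if isinstance(layout, FlexibleLayout) and fill_order is not None:
        # contiguous iff the innermost dimension is filled fastest
        return fill_order[-1] == len(layout.size) - 1
    return False

# the problematic layout from the commit message is correctly rejected
print(is_contiguous_for_gemm(FixedLayout((3072, 196), (1, 3072))))  # False
```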
pytorch#136000) [DCP] Fixes the stateless optimizer issue of distributed state_dict (pytorch#135535) Some optimizers don't have states, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issue. fixes: pytorch#133415 Pull Request resolved: pytorch#135535 Approved by: https://github.com/wz337 Co-authored-by: Chien-Chin Huang <[email protected]>
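The "stateless" case is easy to reproduce with public APIs: plain SGD carries no per-parameter state, so its state dict stays empty even after stepping (a minimal repro of the situation the fix has to handle):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # no momentum -> stateless

print(opt.state_dict()["state"])  # {} -- nothing stored per parameter
model(torch.randn(2, 4)).sum().backward()
opt.step()
print(opt.state_dict()["state"])  # still {} for vanilla SGD
```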
…) and Drop final None values as inputs for nodes in exporter graph (pytorch#135520) (pytorch#136005)
* [ONNX] Update fake mode usage in onnx docs (pytorch#135512) Pull Request resolved: pytorch#135512 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <[email protected]> (cherry picked from commit 66db61f)
* [ONNX] Drop final None values as inputs for nodes in exporter graph (pytorch#135520) When the value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the ONNX graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them in the graph. Pull Request resolved: pytorch#135520 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <[email protected]> (cherry picked from commit e2f9a83)
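The trailing-`None` stripping itself amounts to a small list transform (a sketch of the idea, not the exporter's actual code):

```python
def drop_trailing_nones(inputs):
    """Strip only trailing Nones; interior Nones must stay, since they
    still need an empty-string placeholder in the ONNX graph."""
    end = len(inputs)
    while end > 0 and inputs[end - 1] is None:
        end -= 1
    return inputs[:end]

print(drop_trailing_nones(["x", None, "w", None, None]))  # ['x', None, 'w']
```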
Update documentation for autocast on CPU
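For reference, CPU autocast uses the standard public API, with BF16 as the supported low-precision dtype on CPU:

```python
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16 -- linear layers run in BF16 under autocast
```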
Update torch-xpu-ops pin (ATen XPU implementation) (pytorch#135647) Release cycle for PyTorch 2.5: 1. Fixes a runtime error on Windows where torch_xpu_ops_unary_binary_kernels.dll fails to load because the binary size is too large. Pull Request resolved: pytorch#135647 Approved by: https://github.com/EikanWang
…ytorch#136203) [ONNX] Fix numpy method to return the correct type (pytorch#136162) The previous implementation of the `numpy()` method returned `fp64` when the tensor is `fp32`. This is unexpected and seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to implement the `numpy()` method explicitly and added tests to guard the behavior. This needs to be cherry-picked into torch 2.5. Pull Request resolved: pytorch#136162 Approved by: https://github.com/gramalingam, https://github.com/xadupre (cherry picked from commit 67b14ce)
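The invariant the new tests guard matches plain `torch.Tensor.numpy()`, which preserves dtype (a minimal illustration, not the exporter's test code):

```python
import numpy as np
import torch

t = torch.tensor([1.0, 2.0], dtype=torch.float32)
a = t.numpy()
assert a.dtype == np.float32, a.dtype  # must stay fp32, not widen to fp64
```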
…orrectness check (pytorch#135932) (pytorch#136262) Fixes pytorch#135657. Aligned with AMP BF16 by using a multiplier of 3 for the Inductor AMP FP16 benchmark correctness check. Pull Request resolved: pytorch#135932 Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
…rch#135793) Fixes pytorch#132380; adjusts the torchbench and huggingface skip-model lists so that `--no-skip` can be removed when running benchmarks on the 3 suites. Pull Request resolved: pytorch#135193 Approved by: https://github.com/chuanqi129, https://github.com/jansel (cherry picked from commit 7ec17b4)
# Motivation
Fix pytorch#135726. After merging two free blocks, I made a mistake with the size used to decrease the active memory size: it should be the original block size, not the merged block size.
# Additional Context
Add a UT to guard this scenario.
Pull Request resolved: pytorch#135818 Approved by: https://github.com/EikanWang (cherry picked from commit e6b6835)
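A toy sketch of the bug shape (hypothetical allocator bookkeeping, not the actual XPU caching-allocator code): the stat must be decremented by the freed block's own size, captured before coalescing.

```python
from dataclasses import dataclass

@dataclass
class Block:
    size: int
    free: bool = True

def free_block(block, adjacent_free, stats):
    freed = block.size                    # capture BEFORE merging
    if adjacent_free is not None:
        block.size += adjacent_free.size  # coalesce into one larger free block
    stats["active_bytes"] -= freed        # the bug subtracted the merged size

stats = {"active_bytes": 100}
free_block(Block(100), Block(200), stats)
print(stats["active_bytes"])  # 0; the buggy version would print -200
```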
…#133852) (pytorch#136139) Small bug fix: pytorch#124592 replaced torch.version.hip with device_props but made a mistake in porting the original logic. The original code was: `if torch.version.hip is not None:` which was incorrectly replaced by: `if self.device_props.type != "hip":` Another occurrence of pytorch#130617. Pull Request resolved: pytorch#133852 Approved by: https://github.com/masnesral, https://github.com/malfet (cherry picked from commit da587de)
…s.h (pytorch#136426) Fix stride-comparison failure when a size value equals one in ForeachUtils.h (pytorch#134546). When a size value equals one, that dimension's stride must be skipped in the comparison. @ezyang Pull Request resolved: pytorch#134546 Approved by: https://github.com/janeyx99
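The rule being fixed, sketched in Python rather than the actual ForeachUtils.h code: a size-1 dimension contributes nothing to memory layout, so its stride may legitimately differ between otherwise matching tensors.

```python
import torch

def strides_match(a, b):
    for size, sa, sb in zip(a.shape, a.stride(), b.stride()):
        if size == 1:      # a size-1 dim's stride is meaningless; skip it
            continue
        if sa != sb:
            return False
    return True

x = torch.empty(1, 4)           # stride (4, 1)
y = torch.empty(1, 8)[:, :4]    # shape (1, 4), stride (8, 1)
print(x.stride(), y.stride())   # strides differ only in the size-1 dim
print(strides_match(x, y))      # True once size-1 dims are skipped
```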
* [ROCm] skip test_fp8_cast_and_t on non-MI300 machines (pytorch#135917) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#135917 Approved by: https://github.com/malfet (cherry picked from commit 6cdc70b) * Skip pointwise associative scan tests due to regression (changes based on PR pytorch#135995) * Cherry-pick fix from pytorch#135702 --------- Co-authored-by: Prachi Gupta <[email protected]> Co-authored-by: Jithun Nair <[email protected]>
Pull Request resolved: pytorch#132555 Approved by: https://github.com/pruthvistony, https://github.com/malfet
…) Upgrade ROCm CI builds to py3.10 Pull Request resolved: pytorch#134108 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/atalman Co-authored-by: Jack Taylor <[email protected]>
Fix hardcoded ROCm paths in `Caffe2Targets.cmake` (pytorch#136283) Fixes pytorch#131701. Uses CMake imported targets more consistently to eliminate hardcoded paths. Here are the new relevant sections of Caffe2Targets.cmake: ``` set_target_properties(c10_hip PROPERTIES INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include" INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64" ) ``` ``` set_target_properties(torch_hip PROPERTIES INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL" INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS" INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include" INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver" ) ``` The HIPCUB dependency was not actually used, which is why it is removed here; the imported target had undesirable side effects. Pull Request resolved: pytorch#136283 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman (cherry picked from commit e8f1dd6) Co-authored-by: Nichols A. Romero <[email protected]>
Don't push to https://ghcr.io/ on the release branch: we don't need it, and it fails with "unauthorized: unauthenticated: User cannot be authenticated with the token provided".
Disable iOS workflow (pytorch#136571) See pytorch#136284. It's been broken for more than a week and it does not seem like anyone cares about fixing it. Once this lands, I'll reassign the issue to `oncall: mobile`. Pull Request resolved: pytorch#136571 Approved by: https://github.com/huydhn, https://github.com/kit1980 (cherry picked from commit 5340feb) Co-authored-by: Nikita Shulga <[email protected]>
Make test_skip_data_serialization regex more flexible (pytorch#136580) Some CI machines seem to throw "Can't get local object" rather than "Can't pickle local object". Pull Request resolved: pytorch#136580 Approved by: https://github.com/mikaylagawarecki (cherry picked from commit a0c76ea) Co-authored-by: Jez Ng <[email protected]>
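A pattern of that shape accepts either message variant (illustrative; the test's actual regex may differ):

```python
import re

# match either error-message variant seen across CI machines
pat = re.compile(r"Can't (?:get|pickle) local object")

assert pat.search("AttributeError: Can't pickle local object 'f.<locals>.g'")
assert pat.search("AttributeError: Can't get local object 'f.<locals>.g'")
```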
…) The default benchmark setting is now false. With the new MIOpen behavior, when benchmarking is disabled, any shape that doesn't have a find hit triggers a quick search (the same behavior as the prior default) and uses that result. When benchmarking is enabled, MIOpen performs an exhaustive search and updates any DBs. MIOpen immediate mode is still available and is used when deterministic is true and benchmark is false.
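These modes are driven by the usual cuDNN-namespaced flags, which map to MIOpen on ROCm (standard public API; the comments restate the behavior described above):

```python
import torch

# benchmark off (the new default): a find miss triggers a quick search
torch.backends.cudnn.benchmark = False

# benchmark on: exhaustive search, and the results update the find DBs
torch.backends.cudnn.benchmark = True

# immediate mode: deterministic on with benchmark off
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```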
Follow-up to ROCm#1851.
… analysis (ROCm#141… (ROCm#1768) Fixes pytorch#140800. On AMD, backend-specific args like `matrix_instr_nonkdim`, `waves_per_eu` and `kpack` are passed either directly to the kernel or via `triton.Config`, even though they don't exist as kernel parameters. Native Triton code handles those extra args [here](https://github.com/triton-lang/triton/blob/a6bb57d6285e723c58e87dd7cba263db6efff789/python/triton/runtime/jit.py#L594-L596). In this PR, we add similar handling to the TTIR analysis code to avoid bailing out. Pull Request resolved: pytorch#141062 Approved by: https://github.com/oulgen (cherry picked from commit b740a1b) Fixes #ISSUE_NUMBER Co-authored-by: Adnan Akhundov <[email protected]>
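The handling amounts to diverting launch-only options away from the kernel's parameter list instead of erroring on them (a hedged sketch with illustrative names, not the actual analysis code):

```python
# AMD backend launch options that are not kernel parameters
AMD_LAUNCH_OPTIONS = {"matrix_instr_nonkdim", "waves_per_eu", "kpack"}

def split_kernel_kwargs(kernel_param_names, kwargs):
    """Keep real kernel args; divert backend-only options to launch config."""
    kernel_args, launch_opts = {}, {}
    for name, value in kwargs.items():
        if name in kernel_param_names:
            kernel_args[name] = value
        elif name in AMD_LAUNCH_OPTIONS:
            launch_opts[name] = value
        else:
            raise TypeError(f"unexpected kernel argument: {name}")
    return kernel_args, launch_opts

args, opts = split_kernel_kwargs(
    {"x_ptr", "n"}, {"x_ptr": 0xDEAD, "n": 128, "waves_per_eu": 2}
)
print(args, opts)  # {'x_ptr': 57005, 'n': 128} {'waves_per_eu': 2}
```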
…orch#144865) (ROCm#1869) Fixes pytorch#144855 Follows approach in pytorch#141923 to use int64 types to increase INT_MAX limits Pull Request resolved: pytorch#144865 Approved by: https://github.com/eqy (cherry picked from commit 082fab0)
* Unroll loops manually to hide memory access latency
* Strided access for coalesced memory accesses
Co-authors: @akadutta @doru1004 @amd-hhashemi @carlobertolli
Tune 3D tensor sums when not using the fastest dimension.
Fixes SWDEV-501618
…Cm#1894) Note that I had to copy and paste this one-line change because the file has changed so much that the cherry-pick command fails. The TunableOp UT will fail because the regular expression in the test will not work for future versions of ROCm. Pull Request resolved: pytorch#146548 Approved by: https://github.com/jeffdaily
…pOpt (ROCm#1897) cherry-pick of pytorch#146448 Co-author: @amd-hhashemi
…ame (ROCm#1904) Fixes #ISSUE_NUMBER
Update the Apex 1.5.0 commit id for the fused gradient accumulator feature.
…covery (pytorch#144026) (ROCm#1895) Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292) Reland of pytorch#140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes pytorch#140318 Pull Request resolved: pytorch#142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <[email protected]> Co-authored-by: Jeff Daily <[email protected]> (cherry picked from commit c0d7106) Co-authored-by: Tal Ben-Nun <[email protected]> (cherry picked from commit 23e390c) Fixes #ISSUE_NUMBER Co-authored-by: pytorchbot <[email protected]>
- all Inductor tests completely disabled for Navi4
- test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_*_cuda_bfloat16 - skipped on Navi due to a triton bf16-to-fp16 conversion error
- test_jit.py::TestFrozenOptimizations::test_freeze_conv_relu_fusion* - skipped on Navi4, not supported by MIOpen
- nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_relu_cuda_float32 - skipped on Navi4, not supported by MIOpen
- profiler/test_cpp_thread.py::CppThreadTest::test_without_enable_profiler_in_child_thread - flaky on Navi
- test_ops.py::TestCommonCUDA::test_non_standard_bool_values_*
- dynamo/test_activation_checkpointing.py::ActivationCheckpointingViaTagsTests::test_pattern_matcher
…se/OneDupOpt (ROCm#1897)" (ROCm#1918) This reverts commit a6f1375. We have a better fix that is being validated (pytorch#146420)
…rch#145536) Also marginally improves CUDA perf. Pull Request resolved: pytorch#145536 Approved by: https://github.com/eqy
TopK on ROCm performs better on the test suite with the default config. Pull Request resolved: pytorch#146387 Approved by: https://github.com/malfet, https://github.com/ngimel
In this approach, we catch any lanes within a wave that are doing fast atomics to the same destination address and compute the sum on the CU. This leads to a 3x end-to-end improvement in scatter_add performance.
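The duplicate-destination pattern this targets shows up directly in `scatter_add_` (standard public API; the wave-level aggregation itself happens inside the kernel): many lanes writing to one index produce contended atomics, which the per-wave sum collapses into a single update.

```python
import torch

out = torch.zeros(4)
index = torch.tensor([0, 0, 0, 2])  # three lanes target the same slot
src = torch.ones(4)

out.scatter_add_(0, index, src)
print(out)  # tensor([3., 0., 1., 0.]) -- the three index-0 adds collapse to +3
```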