forked from ROCm/pytorch
[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-08 #4
Closed: pragupta wants to merge 1,421 commits into rocm7.1_internal_testing from rocm7.1_internal_testing_IFU_2025-09-08
Conversation
Adds fallback commands for the following:

* python setup.py install
* python setup.py develop

Ideally these should just work and provide backwards compatibility. The thought process here is that multiple people rely on these commands, and just because setuptools wants to drop support for them, I don't think a lot of our downstream users who build from source expect them to be gone. This should give developers some room to move away from these commands until we have a unified frontend that abstracts most of this away.

Signed-off-by: Eli Uriegas <[email protected]>
Pull Request resolved: pytorch#162009 Approved by: https://github.com/clee2000, https://github.com/atalman
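As a rough, hedged illustration of the fallback idea (this is a sketch, not the actual pytorch#162009 change; the pip redirect targets and helper name are assumptions):

```python
# Hypothetical sketch: intercept the deprecated `python setup.py install` /
# `python setup.py develop` commands and fall back to an equivalent pip
# invocation. Not PyTorch's real implementation.
import subprocess
import sys

LEGACY_FALLBACKS = {
    "install": [sys.executable, "-m", "pip", "install", "--no-build-isolation", "."],
    "develop": [sys.executable, "-m", "pip", "install", "--no-build-isolation", "-e", "."],
}

def maybe_run_legacy_fallback(argv):
    """If argv asks for a deprecated setup.py command, run the pip equivalent and exit."""
    if len(argv) >= 2 and argv[1] in LEGACY_FALLBACKS:
        print(f"'setup.py {argv[1]}' is deprecated; falling back to pip.", file=sys.stderr)
        raise SystemExit(subprocess.call(LEGACY_FALLBACKS[argv[1]]))

if __name__ == "__main__":
    maybe_run_legacy_fallback(sys.argv)  # would sit near the top of setup.py
```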
…accelerator (pytorch#161845) As the title stated. As the document grows, its content keeps expanding, so to make it easier for users to read and easier for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator". Pull Request resolved: pytorch#161845 Approved by: https://github.com/albanD
…h#161903) As the title stated. Pull Request resolved: pytorch#161903 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161845
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#161928 Approved by: https://github.com/pytorchbot
… the hardware limit. (pytorch#161996) Summary: This is a re-land of [PR161040](pytorch#161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs. This diff removes configurations that exceed the hardware shared memory limit, which causes the following compilation error:

```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm
Required: 327680 Hardware limit: 232448
Reducing block sizes or `num_stages` may help.
```

Test Plan:

```
pytest test/inductor/test_max_autotune.py
pytest test/inductor/test_triton_heuristics.py
```

Pull Request resolved: pytorch#161996 Approved by: https://github.com/coconutruben
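Illustrative sketch of the pruning idea only (not Inductor's heuristics code); `Config`, `estimate_shared_mem`, and the tile-size arithmetic are hypothetical stand-ins:

```python
# Drop autotune candidates whose estimated shared-memory footprint exceeds the
# device limit, which is what produces the "Required: 327680 / Hardware limit:
# 232448" failure quoted above.
from dataclasses import dataclass

@dataclass
class Config:
    block_m: int
    block_n: int
    block_k: int
    num_stages: int

def estimate_shared_mem(cfg: Config, dtype_bytes: int = 2) -> int:
    # Rough pipelined-tile estimate: stages * (A tile + B tile) in bytes.
    return cfg.num_stages * (cfg.block_m + cfg.block_n) * cfg.block_k * dtype_bytes

def prune_configs(configs, hardware_limit):
    kept = [c for c in configs if estimate_shared_mem(c) <= hardware_limit]
    if not kept:
        raise RuntimeError("No valid triton configs: all exceed the shared-memory limit")
    return kept

# The 256x256x64, 4-stage candidate (262144 bytes) is pruned; the smaller one survives.
print(prune_configs([Config(256, 256, 64, 4), Config(128, 128, 64, 3)], hardware_limit=232448))
```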
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#161929 Approved by: https://github.com/pytorchbot
…ble (pytorch#161950) This PR is a follow-up to pytorch#149764. That PR only forbids illegal views due to `Flatten`; this PR also forbids illegal views caused by `Split`. This PR also updates the error message to be less about internal implementation details, which users may find confusing. Pull Request resolved: pytorch#161950 Approved by: https://github.com/ezyang
Fixes pytorch#161483 When the whole `test/test_transformers.py` file is run, the case `test_default_priority_order` passes because other xpu cases call SDPA, so the priority order is set by https://github.com/pytorch/pytorch/blob/eec876deb659fe667aac2d97a48d7451c3e88dee/aten/src/ATen/native/mkldnn/xpu/Attention.cpp#L98-L112 However, when `test_default_priority_order` is run separately, the priority order is unset and the case fails. This PR fixes that case. Pull Request resolved: pytorch#161690 Approved by: https://github.com/guangyey, https://github.com/drisspg
…61998) This function has come up in DTensor perf work, and I had a nitpick on pytorch#160256 so here it is. I have neither compiled nor measured this, but am reasonably confident it's better nonetheless. Pull Request resolved: pytorch#161998 Approved by: https://github.com/ezyang
Enable cat op for sparse on MPS Pull Request resolved: pytorch#162007 Approved by: https://github.com/malfet
…torch#161947) As the title stated. Pull Request resolved: pytorch#161947 Approved by: https://github.com/albanD ghstack dependencies: pytorch#161845, pytorch#161903
Keeps SymInt::maybe_as_int small enough to inline. Differential Revision: [D81530097](https://our.internmc.facebook.com/intern/diff/D81530097) Pull Request resolved: pytorch#161466 Approved by: https://github.com/ezyang
If SymInt::maybe_as_int() returns non-empty, then we get an inline fast path. The philosophy here (as with the previous PR) is to preserve performance in the "plain old ints" case. Profiling detach() with Linux perf, using code similar to the benchmark from pytorch#160580, I observed the time spent in SymInt functions in computeStorageNBytes drop after this change (without the cost shifting elsewhere in the function). Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107) Pull Request resolved: pytorch#161586 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466
Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci
…rch#159889) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <[email protected]> Pull Request resolved: pytorch#159889 Approved by: https://github.com/wconstab ghstack dependencies: pytorch#160449
Summary: we have

```
std::vector<size_t> constants_internal_offset(
    num_constants - num_folded_constants);
```

but the for loop does not account for it:

```
for (size_t i = 0; i < num_constants; i++) {
  ...
  constants_internal_offset[i]
  ...
```

Even within the loop it does

```
bool from_folded = this->constant_from_folded(i);
if (from_folded) {
  continue;
}
```

but `i` could still be wrong.

Rollback Plan:

Differential Revision: D81425007

Pull Request resolved: pytorch#161887 Approved by: https://github.com/angelayi
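A minimal Python illustration of the indexing bug described above (not the C++ AOTI runtime code; names are made up for the example):

```python
# The output array is sized by the number of non-folded constants, so indexing
# it with the loop variable over *all* constants can go out of range or write
# to the wrong slot. Keeping a separate cursor for the filtered array fixes it.
def compute_offsets(constant_sizes, is_folded):
    num_unfolded = sum(1 for folded in is_folded if not folded)
    offsets = [0] * num_unfolded

    # Buggy pattern: offsets[i] = ...   (i ranges over all constants)
    # Correct pattern: advance a dedicated index only for unfolded constants.
    j = 0
    running = 0
    for i, size in enumerate(constant_sizes):
        if is_folded[i]:
            continue
        offsets[j] = running
        running += size
        j += 1
    return offsets

print(compute_offsets([16, 8, 32], is_folded=[False, True, False]))  # [0, 16]
```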
Summary: Save the config args that Inductor burns into `inductor_metadata` so we can optionally pass them to any Jit Hooks that are set. This allows us to pass them to Tritonparse. Reviewed By: davidberard98, FindHao Differential Revision: D80994791 Pull Request resolved: pytorch#161953 Approved by: https://github.com/FindHao
Followup after pytorch#154012 Fixes CPU part of pytorch#160841 Pull Request resolved: pytorch#161999 Approved by: https://github.com/drisspg
Followup after pytorch#154012 Since the introduction of `gemm_no_downcast_stub` it's no longer necessary to allocate temporary array and then manually implement the `beta` logic in the codebase Pull Request resolved: pytorch#162001 Approved by: https://github.com/drisspg ghstack dependencies: pytorch#161999
…0943)" This reverts commit bbedc71. Reverted pytorch#160943 on behalf of https://github.com/jeanschmidt due to See [D81486248](https://www.internalfb.com/diff/D81486248) for details on broken test ([comment](pytorch#160943 (comment)))
…h#161862)" This reverts commit 204697f. Reverted pytorch#161862 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D81522732 for more details ([comment](pytorch#161862 (comment)))
… nodes (pytorch#161339)" This reverts commit 90f50f7. Reverted pytorch#161339 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, check D81486248 for more details ([comment](pytorch#161339 (comment)))
…1261) In this PR, we port 4 test files under test/distributed/parallel and 1 test file under test/distributed/debug to Intel GPU. We enable Intel GPU with the following methods while trying our best to keep the original code style:
1. Use torch.accelerator for generic GPU support.
2. Skip a case when running on XPU if it has known issues.

Pull Request resolved: pytorch#161261 Approved by: https://github.com/guangyey, https://github.com/d4l3k
Summary: Inductor has the following configurations:

config.comprehensive_padding
config.padding_alignment_bytes
config.padding_stride_threshold

In the static-shape case, enabling these three options makes Inductor generate code for flexible-layout tensors that tries to pad every stride dimension above config.padding_stride_threshold up to a multiple of config.padding_alignment_bytes. In the dynamic-shape case, no padding is done today.

This PR introduces the following configuration, which lets the user request padded strides even for dynamic-shape operations. It is opt-in mainly so we don't break the previous behaviour of not padding dynamic-shape use cases. config.padding_stride_threshold does not apply here since the stride values are dynamic.

config.pad_dynamic_shapes

In addition, a new mode "python_slow" has been added for launch-grid calculation, which achieves the same ceildiv behaviour that is generally applicable to integer division. This is done to prevent test regressions and make wrapper_fxir codegen more generic.

Test Plan: CI

Rollback Plan:

Differential Revision: D80468808

Pull Request resolved: pytorch#160997 Approved by: https://github.com/blaine-rister, https://github.com/jansel
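A hedged sketch of the two ideas above (not Inductor's real code; the helper names are assumptions): padding a stride up to a multiple of the alignment, and a ceildiv that matches integer ceiling division for launch grids.

```python
def pad_stride(stride_elems: int, itemsize: int, alignment_bytes: int = 128) -> int:
    """Round a stride (in elements) up so stride * itemsize is a multiple of alignment_bytes."""
    align_elems = max(1, alignment_bytes // itemsize)
    return -(-stride_elems // align_elems) * align_elems

def ceildiv(numel: int, block: int) -> int:
    """Ceiling division used for launch grids: ceil(numel / block)."""
    return -(-numel // block)

print(pad_stride(stride_elems=100, itemsize=4, alignment_bytes=128))  # 128 elements = 512 bytes, a multiple of 128
print(ceildiv(1000, 128))  # 8 blocks cover 1000 elements
```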
…2058) fix split_aot_inductor_output_path on Windows. Pull Request resolved: pytorch#162058 Approved by: https://github.com/angelayi
Pull Request resolved: pytorch#161799 Approved by: https://github.com/anijain2305
Pull Request resolved: pytorch#161800 Approved by: https://github.com/anijain2305 ghstack dependencies: pytorch#161799
## Summary

Adds a subgraph decomposition for addmm and mm that performs well on large `K` compared to `M` and `N`, and functions well on AMD as an alternative to `split-k` (transposed only), which does not currently support AMD.

## Background

On AMD (MI300x), for a matmul A * B, if B is non-contiguous, the resulting matmul is quite a bit slower. For example:

```
args[0]: TensorBox(StorageBox(
  InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
  InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[1, 178176]))
))
```

is a lot slower than:

```
args[0]: TensorBox(StorageBox(
  InputBuffer(name='arg0_1', layout=FixedLayout('cuda:0', torch.float16, size=[1024, 178176], stride=[178176, 1]))
))
args[1]: TensorBox(StorageBox(
  InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float16, size=[178176, 6144], stride=[6144, 1]))
))
```

This PR adds a subgraph decomposition to test out whether making B contiguous is faster than just using the normal kernels.

## Data

I ran this on unique non-contiguous shapes from torchbench/huggingface and got these speedups:

```
Parsed 420 unique shapes from benchmark output
addmm improvements when best:
  addmm_16448x512x2048: +0.14%
  addmm_128x2048x2048: +0.01%
  addmm_128x768x1000: +0.75%
  addmm_12672x3072x768: +1.08%
  addmm_512x768x32000: +0.62%
  addmm_12608x384x384: +0.00%
  addmm_4160x1024x4096: +0.90%
  addmm_16x768x2: +0.56%
  addmm_12608x3072x768: +0.09%
  addmm_64x4096x1000: +2.77%
  addmm_256x1024x512: +1.99%
  addmm_30x256x256: +1.12%
  addmm_100480x128x384: +0.91%
  addmm_6400x2048x512: +0.25%
  addmm_61568x1024x256: +0.08%
  addmm_1x768x768: +0.93%
  addmm_12544x384x384: +0.19%
  addmm_128x512x1000: +0.77%
  addmm_2048x128x128: +1.32%
  addmm_128x3072x1000: +0.24%
  addmm_7936x512x2048: +0.07%
  addmm_8192x512x2048: +0.33%
  addmm_64x1024x1000: +1.43%
  addmm_128x2304x1000: +0.01%
  addmm_32768x256x2: +0.75%
  addmm_64x384x1152: +0.79%
  addmm_64x640x1000: +0.01%
  addmm_100480x128x128: +0.87%
  addmm_1152x3072x768: +1.13%
  addmm_8192x256x2048: +1.40%
  addmm_4096x128x768: +0.01%
  addmm_128x2560x1000: +0.01%
  addmm_12544x2048x512: +0.43%
  addmm_200704x24x96: +0.14%
  addmm_8448x512x2048: +0.96%
  addmm_50176x256x1024: +0.62%
  addmm_4160x4096x1024: +0.22%
  addmm_4096x768x768: +0.32%
  addmm_220x2048x512: +0.56%
  addmm_8x2048x1000: +1.12%
  addmm_256x197951x512: +26.99%
  addmm_401536x64x192: +0.60%
  addmm_2040x2048x512: +0.47%
  addmm_512x1024x256: +1.32%
  addmm_128x4096x1000: +1.67%
  addmm_12672x768x768: +0.34%
  addmm_128x368x1000: +0.77%
  addmm_96x1280x1000: +0.01%
  addmm_12544x512x2048: +0.41%
  addmm_6272x320x1280: +0.76%
  addmm_12544x3072x768: +0.09%
  addmm_64x384x1000: +0.39%
mm improvements when best:
  mm_200704x128x512: +1.29%
  mm_663552x16x16: +0.80%
  mm_4096x768x768: +0.51%
  mm_131072x64x31: +0.24%
  mm_12544x1152x384: +0.11%
  mm_128x2048x2: +0.46%
  mm_262144x16x23: +0.62%
  mm_50176x576x192: +0.37%
  mm_131072x16x31: +0.26%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 247
Average Subgraph placement: 3.38
Median Subgraph placement: 2.0
Subgraph is best choice: 52/247 shapes (21.1%)
Average improvement when best: 1.15%
Median improvement when best: 0.58%
Largest improvement when best: +26.99%

Operation: bmm
----------------------------------------
Total shapes analyzed: 85
Average Subgraph placement: 24.00
Median Subgraph placement: 21.0
Subgraph is best choice: 0/85 shapes (0.0%)
Average improvement when best: N/A (never best)
Median improvement when best: N/A (never best)
Largest improvement when best: N/A (never best)

Operation: mm
----------------------------------------
Total shapes analyzed: 88
Average Subgraph placement: 15.08
Median Subgraph placement: 4.0
Subgraph is best choice: 9/88 shapes (10.2%)
Average improvement when best: 0.52%
Median improvement when best: 0.46%
Largest improvement when best: +1.29%
```

## Results

The largest shape gain, `256,197951,512`, seemed to be driven by a case where the extern kernel is way faster than the best triton configs on the recursive autotune:

```
addmm,Extern,extern_kernels.addmm,256,197951,512,0.38024500012397766
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.005444049835205
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.04189395904541
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.1911399364471436
addmm,Triton,256,197951,512,64,128,32,2,4,8,2.496040105819702
addmm,Triton,256,197951,512,64,128,64,2,8,16,2.9306790828704834
addmm,Triton,256,197951,512,64,64,32,2,4,8,3.0347819328308105
...
```

Compared to the non-transposed autotune:

```
addmm,Subgraph,contiguous_addmm_1384,256,197951,512,0.5024129748344421
addmm,Extern,extern_kernels.addmm,256,197951,512,0.6881489753723145
addmm,Triton,256,197951,512,32,256,16,2,2,4,2.5115010738372803
addmm,Triton,256,197951,512,32,128,32,2,4,8,2.5167479515075684
addmm,Triton,256,197951,512,64,128,16,2,4,8,2.9507460594177246
addmm,Triton,256,197951,512,64,256,64,2,8,4,2.9673290252685547
addmm,Triton,256,197951,512,64,128,64,2,8,16,3.3906331062316895
addmm,Triton,256,197951,512,64,128,32,2,4,8,3.496859073638916
```

It seems to perform really well for high values of `K` vs `N` and `M`. Testing this hypothesis with some custom shapes:

```
Parsed 64 unique shapes from benchmark output
addmm improvements when best:
  addmm_128x16384x128: +0.18%
  addmm_128x262144x256: +38.24%
  addmm_128x200000x512: +14.76%
  addmm_256x800000x128: +0.06%
  addmm_131072x128x256: +0.27%
  addmm_128x256x131072: +0.25%
  addmm_2048x200000x64: +12.45%
mm improvements when best:
  mm_128x16384x128: +0.18%
  mm_128x262144x256: +38.05%
  mm_128x200000x512: +9.47%
  mm_256x800000x128: +0.99%
  mm_512x6400000x256: +3.17%
  mm_524288x64x64: +0.29%
  mm_2048x200000x64: +11.19%
  mm_8192x1000000x256: +34.14%
  mm_128x4096x100000: +0.40%
  mm_128x3072x150000: +0.27%
================================================================================
BENCHMARK ANALYSIS RESULTS
================================================================================
Operation: addmm
----------------------------------------
Total shapes analyzed: 33
Average Subgraph placement: 4.39
Median Subgraph placement: 2.0
Subgraph is best choice: 7/33 shapes (21.2%)
Average improvement when best: 9.46%
Median improvement when best: 0.27%
Largest improvement when best: +38.24%

Operation: mm
----------------------------------------
Total shapes analyzed: 30
Average Subgraph placement: 7.63
Median Subgraph placement: 2.0
Subgraph is best choice: 10/30 shapes (33.3%)
Average improvement when best: 9.81%
Median improvement when best: 2.08%
Largest improvement when best: +38.05%
```

## Conclusion

Contiguous Subgraph Decomposition seems worthwhile for `mm` and `addmm`, but not `bmm`, and has a very large improvement on low `M`, low `N`, and high `K` shapes.

Data gathering scripts: https://gist.github.com/exclamaforte/4a896c064d301b27bf5ca0a4f8fc3866

## Test Plan:

New unit tests.
Differential Revision: D80771648 Pull Request resolved: pytorch#161241 Approved by: https://github.com/eellison
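A rough, hedged illustration of the decomposition being autotuned here (shapes echo the winning 256x197951x512 case but are shrunk to run quickly on CPU; timings are illustrative, not the PR's benchmark):

```python
# Compare mm on a non-contiguous (transposed) B against the decomposition that
# first materializes B contiguously and then calls mm.
import torch
import torch.utils.benchmark as benchmark

M, K, N = 256, 19795, 512
A = torch.randn(M, K)
B = torch.randn(N, K).t()  # stride (1, K): the slow, non-contiguous layout

direct = benchmark.Timer(
    stmt="torch.mm(A, B)",
    globals={"torch": torch, "A": A, "B": B},
)
decomposed = benchmark.Timer(
    stmt="torch.mm(A, B.contiguous())",  # subgraph: contiguous copy, then mm
    globals={"torch": torch, "A": A, "B": B},
)
print(direct.timeit(10))
print(decomposed.timeit(10))
```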
Enable python 3.13t, 3.14 and 3.14t on s390x for nightly binaries Fixes pytorch#161515 Pull Request resolved: pytorch#161920 Approved by: https://github.com/malfet
…ytorch#160510) Pull Request resolved: pytorch#160510 Approved by: https://github.com/ezyang
…eCallers (pytorch#161348)" This reverts commit c321111. Reverted pytorch#161348 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, see [D81520569](https://www.internalfb.com/diff/D81520569) ([comment](pytorch#161347 (comment)))
This reverts commit 9a8d454. Reverted pytorch#161347 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, see [D81520569](https://www.internalfb.com/diff/D81520569) ([comment](pytorch#161347 (comment)))
This reverts commit 6087ef4. Reverted pytorch#162001 on behalf of https://github.com/jeanschmidt due to breaks internal ads signal, see [D81845017](https://www.internalfb.com/diff/D81845017) ([comment](pytorch#162001 (comment)))
… GPU (pytorch#159118)" This reverts commit 5c473e9. Reverted pytorch#159118 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert pytorch#159473 ([comment](pytorch#159118 (comment)))
This reverts commit b9ba612. Reverted pytorch#161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert pytorch#159473, feel free to merge it back once conflicts are cleared ([comment](pytorch#161715 (comment)))
…GPU (pytorch#159473)" This reverts commit 040d00a. Reverted pytorch#159473 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal signals, @d4l3k please help the author to have this change landed. [D81718444](https://www.internalfb.com/diff/D81718444) ([comment](pytorch#159473 (comment)))
…60483) Pull Request resolved: pytorch#160483 Approved by: https://github.com/zou3519
) The goal of this PR stack is to be able to implement `aot_compile_module`, which AOT precompiles a torch.nn.Module. Step 1 is a simple refactor to make CompileArtifacts itself the callable, which makes it easier to use directly. Pull Request resolved: pytorch#162169 Approved by: https://github.com/zhxchen17
This PR hooks up the python wrapper inductor backend to aot_compile. This is *not* the best way for us to grab the output of AOTAutograd; that involves a refactor to make AOTAutograd itself return a serializable callable. I'll do that refactor soon, but I want a basic interface to test with for now. In the medium term, we'll want aot_compile to call AOTAutograd directly, instead of using the TorchInductorWrapper's callback through compile_fx. Pull Request resolved: pytorch#162170 Approved by: https://github.com/zhxchen17 ghstack dependencies: pytorch#162169
This PR implements the Autograd feature of the associative_scan. Pull Request resolved: pytorch#139939 Approved by: https://github.com/ydwu4
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#162315 Approved by: https://github.com/pytorchbot
…lt (pytorch#159889)" This reverts commit 01edcd4. Reverted pytorch#159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](pytorch#160449 (comment)))
This reverts commit de893e9. Reverted pytorch#160449 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](pytorch#160449 (comment)))
Follow-up to pytorch#161768. Context: ProcessPool pickles the outputs before sending them back to the main process. Triton kernels have some un-pickleable fields, so `prepare_for_pickle()` is used to strip out those fields. Previously, in the standard case (without triton_bundler.py), `prepare_for_pickle()` would strip out the un-pickleable fields and they would never be added back after unpickling, because those fields were not actually needed after compilation finished. pytorch#161768 updated `prepare_for_pickle` to also strip out `fn._hash_lock`, a newly added field on JITCallable instances which is a `threading.RLock()` and therefore not pickleable. It turns out that we do need to restore the `fn._hash_lock` field even in the non-triton_bundler case: the MultiKernel case uses the hash lock. To do this, we add `restore_after_unpickle()`, which restores the fields (or, if the old fields are not provided, initializes just the hash lock). Compile-time benchmarks look good, maybe a very minor regression (see the comment below on the PR). Pull Request resolved: pytorch#162244 Approved by: https://github.com/atalman
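A standalone illustration of why the strip/restore pair is needed (this is a sketch, not the Inductor implementation; the `Kernel` class is made up):

```python
# A threading.RLock attribute makes an object unpicklable, so it is stripped
# before crossing the process-pool boundary and re-created after unpickling.
import pickle
import threading

class Kernel:
    def __init__(self, src: str):
        self.src = src
        self._hash_lock = threading.RLock()  # not pickleable

def prepare_for_pickle(k: Kernel):
    old_lock, k._hash_lock = k._hash_lock, None
    return old_lock

def restore_after_unpickle(k: Kernel, old_lock=None):
    k._hash_lock = old_lock if old_lock is not None else threading.RLock()

k = Kernel("triton src")
prepare_for_pickle(k)
k2 = pickle.loads(pickle.dumps(k))  # would raise TypeError without the strip
restore_after_unpickle(k2)
assert isinstance(k2._hash_lock, type(threading.RLock()))
```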
…ytorch#162309) Fixes static cuda launcher after triton-lang/triton#7866. Static cuda launcher checks to make sure that no hook knobs are set (and if they are, it throws an error). But Triton has changed the semantics of hooks so that "empty hooks" are now represented by empty `HookChain`s instead of being represented by `None`. This PR changes the way we define "empty hooks" to account for HookChains. Pull Request resolved: pytorch#162309 Approved by: https://github.com/aakhundov ghstack dependencies: pytorch#162244
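A hedged sketch of the compatibility check described above; `HookChain` and the knob layout here are simplified stand-ins, not Triton's real API.

```python
class HookChain:
    def __init__(self, hooks=None):
        self.hooks = list(hooks or [])

def hook_is_empty(hook) -> bool:
    # Old Triton: unset hooks are None. New Triton: unset hooks are an empty
    # HookChain. Treat both as "no hook installed" so the static launcher
    # does not raise spuriously.
    if hook is None:
        return True
    if isinstance(hook, HookChain):
        return len(hook.hooks) == 0
    return False

assert hook_is_empty(None)
assert hook_is_empty(HookChain())
assert not hook_is_empty(HookChain([lambda *a: None]))
```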
The original implementation set beta to 1, which caused the out tensor (C) to be added to the output. Thus, if the output was not initialized to zero beforehand, the result could be incorrect. Removing the alpha and beta fixes the issue. Thanks to @ngimel for figuring out the root cause. Pull Request resolved: pytorch#162040 Approved by: https://github.com/danielvegamyhre
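A small worked example of the failure mode described above, using `torch.addmm` (which computes `out = beta * C + alpha * (A @ B)`): with beta=1, whatever is already in C leaks into the result, which is wrong when C stands in for uninitialized memory.

```python
import torch

A, B = torch.randn(4, 8), torch.randn(8, 3)
C = torch.full((4, 3), 100.0)  # stands in for uninitialized/garbage output memory

with_beta1 = torch.addmm(C, A, B, beta=1.0, alpha=1.0)  # polluted by C
with_beta0 = torch.addmm(C, A, B, beta=0.0, alpha=1.0)  # ignores C's contents
print(torch.allclose(with_beta0, A @ B))  # True
print(torch.allclose(with_beta1, A @ B))  # False
```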
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: pytorch#162372 Approved by: https://github.com/pytorchbot
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: pytorch#161395 Approved by: https://github.com/pytorchbot
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton). Notably:

* this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR)
* sam_fast is still failing, so we've disabled it temporarily pytorch#162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch.

Pull Request resolved: pytorch#162278 Approved by: https://github.com/atalman ghstack dependencies: pytorch#162244, pytorch#162309
… mode (pytorch#161405) During comms reordering, sink-wait-iterative observed that the previous runtime estimations were pretty far off for collectives and mms. Adding optional usage of:

- c10d.time_estimator for collectives, which is based on the NCCL estimator
- Benchmark mode, only for matmuls, as they are highly dependent on the mm backend

The logic is mostly copied from Ruisi's PRs for inductor simple_fsdp (pytorch#157572). These estimation corrections are in the default `BaseSchedulerNode.estimate_runtime()`. Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294) Pull Request resolved: pytorch#161405 Approved by: https://github.com/eellison
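A hedged sketch of what "benchmark mode" for matmul estimation can look like: instead of an analytic model, time the actual mm once and cache the result. This is an illustration of the idea, not the scheduler's implementation; the cache and function names are assumptions.

```python
import torch
import torch.utils.benchmark as benchmark

_mm_time_cache: dict[tuple, float] = {}

def estimate_mm_runtime_ms(m: int, k: int, n: int, dtype=torch.float32) -> float:
    """Benchmark-based runtime estimate for an (m, k) x (k, n) matmul, cached per shape."""
    key = (m, k, n, dtype)
    if key not in _mm_time_cache:
        a, b = torch.randn(m, k, dtype=dtype), torch.randn(k, n, dtype=dtype)
        t = benchmark.Timer(stmt="torch.mm(a, b)", globals={"torch": torch, "a": a, "b": b})
        _mm_time_cache[key] = t.timeit(20).median * 1e3  # seconds -> milliseconds
    return _mm_time_cache[key]

print(f"{estimate_mm_runtime_ms(512, 1024, 512):.3f} ms")
```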
)" This reverts commit c9ac8c2. Reverted pytorch#162315 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if this introduced the failure https://github.com/pytorch/pytorch/actions/runs/17539536914/job/49810513700 ([comment](pytorch#162315 (comment)))
This reverts commit 3770337. Reverted pytorch#161649 on behalf of https://github.com/ngimel due to reverted internally ([comment](pytorch#161649 (comment)))
…#162041) The async-TP result and the regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error comes from the numerical error of low precision. Pull Request resolved: pytorch#162041 Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel ghstack dependencies: pytorch#162040
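A small example of the kind of tolerance adjustment described above: comparing a low-precision result against a matmul reference needs looser rtol/atol than the float32 defaults. The specific tolerances here are illustrative, not the values used in the test.

```python
import torch

a = torch.randn(64, 128, dtype=torch.bfloat16)
b = torch.randn(128, 32, dtype=torch.bfloat16)

ref = (a.float() @ b.float()).bfloat16()  # higher-precision reference, rounded back
out = a @ b                               # low-precision path under test

# bfloat16 has ~3 decimal digits of precision, so compare with loose tolerances.
torch.testing.assert_close(out, ref, rtol=1.6e-2, atol=1e-2)
```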
…lready know (pytorch#161591) We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap. Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099) Pull Request resolved: pytorch#161591 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590
…ytorch#161595) This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better. Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095) Pull Request resolved: pytorch#161595 Approved by: https://github.com/ezyang, https://github.com/bdhirsh ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590, pytorch#161591
…rch#161058) This PR is the first split PR of pytorch#156272, only contains the OneDNN code. Please help review. Pending on OneDNN v3.9 commit update. Don't merge. Pull Request resolved: pytorch#161058 Approved by: https://github.com/guangyey, https://github.com/EikanWang
LOAF could previously skip these fusion opportunities and cause some tests to fail. Test:

- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: pytorch#162311 Approved by: https://github.com/jansel
…sting_IFU_2025-09-08

# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/requirements-ci.txt
#	aten/src/ATen/Context.cpp
#	aten/src/ATen/cuda/tunable/GemmHipblaslt.h
#	aten/src/ATen/native/ConvUtils.h
#	aten/src/ATen/native/Convolution.cpp
#	aten/src/ATen/native/Normalization.cpp
#	aten/src/ATen/native/cuda/Blas.cpp
#	aten/src/ATen/native/miopen/Conv_miopen.cpp
#	requirements.txt
#	test/distributed/_tools/test_fsdp2_mem_tracker.py
#	test/distributed/tensor/parallel/test_tp_examples.py
#	test/dynamo/test_activation_checkpointing.py
#	test/dynamo/test_structured_trace.py
#	test/inductor/test_aot_inductor.py
#	test/inductor/test_combo_kernels.py
#	test/test_matmul_cuda.py
#	test/test_sparse.py
#	torch/_higher_order_ops/triton_kernel_wrap.py
#	torch/_inductor/choices.py
#	torch/_inductor/codegen/triton.py
#	torch/testing/_internal/common_cuda.py
Merged latest changes from upstream/main into rocm7.1_internal_testing on 2025-09-08