forked from ROCm/pytorch
[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-24 #12
Merged
Conversation
… LAMBDA_GUARD (pytorch#162525)" This reverts commit 5f630d2. Reverted pytorch#162525 on behalf of https://github.com/anijain2305 due to internal tests fail ([comment](pytorch#162525 (comment)))
…rsion (pytorch#162695)" This reverts commit a8432bc. Reverted pytorch#162695 on behalf of https://github.com/anijain2305 due to internal failure at https://fburl.com/workplace/qiitdlp6 ([comment](pytorch#162695 (comment)))
Summary: This PR is extracted from pytorch#162542, to make the original PR easier to review. This PR only contains cosmetic changes. Pull Request resolved: pytorch#163115 Approved by: https://github.com/tianyu-l ghstack dependencies: pytorch#162539, pytorch#162540, pytorch#162541
Summary: This issue proposes implementing an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU and CUDA. Motivation: Same as pytorch#159325. Pull Request resolved: pytorch#160938 Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
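Below is a hedged usage sketch of the op this kernel implements, based on the existing CPU/CUDA calling convention (activation `[M, K]`, int8 weight `[N, K]`, per-output-channel scales `[N]`); the XPU path is assumed to follow the same signature, and the shapes and dtypes here are illustrative only.
```python
import torch

M, K, N = 4, 64, 8
x = torch.randn(M, K, dtype=torch.bfloat16)               # activation
w = torch.randint(-128, 127, (N, K), dtype=torch.int8)    # int8 weight
scales = torch.rand(N, dtype=torch.bfloat16)              # per-output-channel scales

out = torch._weight_int8pack_mm(x, w, scales)             # [M, N], bfloat16

# Reference: dequantize the weight and run a regular matmul.
ref = x @ (w.to(torch.bfloat16) * scales.unsqueeze(1)).t()
print(torch.allclose(out, ref, atol=1e-1, rtol=1e-1))
```
On XPU the call would presumably be the same with the tensors moved via `.to("xpu")`.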
… /.ci/docker/ci_commit_pins (pytorch#162063)

* [Dependabot] Update(deps): Bump transformers
  Bumps [transformers](https://github.com/huggingface/transformers) from 4.54.0 to 4.56.0.
  - [Release notes](https://github.com/huggingface/transformers/releases)
  - [Commits](huggingface/transformers@v4.54.0...v4.56.0)
  updated-dependencies:
  - dependency-name: transformers
    dependency-version: 4.56.0
    dependency-type: direct:production
    update-type: version-update:semver-minor
* Refresh results
* Another round of updates
* Another round of update
* Hopefully the last round of update
* Plz

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Huy Do <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Huy Do <[email protected]>
…torch#163205) It seems `TEST_CUDA` is set to true even for ROCm (MI200) jobs. Change `if TEST_CUDA` to an `else` condition to avoid running symmetric-memory UTs on MI200. For other non-ROCm architectures, it should return true, and tests can be skipped using other skip decorators. Pull Request resolved: pytorch#163205 Approved by: https://github.com/ezyang Co-authored-by: Jeff Daily <[email protected]>
…ch#163127) PR pytorch#151360 added mx fp8 and fp4 support on ROCm. 1. On recent upstream, the scaling function in Blas.cpp, along with the test_matmul_cuda changes, triggered failures; this patch corrects the is_blockwise_1x32_scaling function. 2. Fixes the m, n, k dimensions for the ROCm mx case. 3. Modifies FP4E2M1FN_LARGEST_POW2 (largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2, which results in a higher SQNR value for the mx fp4 test. Testing result on gfx950 w/ ROCm 7.0: `PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v` ran 452 tests in 22.698s, OK, 111 passed. This is the same as before (when PR 151360 was merged). Pull Request resolved: pytorch#163127 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>
…n H100 (pytorch#162022) Only cuBLAS supports float32 output, and cuBLAS only supports rowwise scaling on SM 9.0. Intended to land after pytorch#161305. Pull Request resolved: pytorch#162022 Approved by: https://github.com/ngimel
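As a hedged sketch of the configuration being gated here (not taken from the PR itself): a rowwise-scaled FP8 matmul requesting a float32 output via `torch._scaled_mm`, which per the note above is only served by cuBLAS and only on SM 9.0 (H100-class) hardware.
```python
import torch

# Assumes an SM 9.0 GPU with FP8 support; shapes are illustrative.
M, K, N = 32, 64, 16
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)        # row-major
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()    # column-major, as required
scale_a = torch.rand(M, 1, device="cuda", dtype=torch.float32)      # rowwise scales for a
scale_b = torch.rand(1, N, device="cuda", dtype=torch.float32)      # rowwise scales for b

out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.float32)
print(out.shape, out.dtype)  # torch.Size([32, 16]) torch.float32
```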
…onfig (pytorch#163318)

Up to 4x perf boost.

🔝 Top 5 Performance Differences (by absolute %), shape: (5, 7):

| attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops BWD (base) | TFlops BWD (better_configs) | better_configs_speedup_over_ba… | pct_delta |
|---|---|---|---|---|---|---|
| noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 124.775035 | 532.580435 | 4.268325 | 326.832527 |
| noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 124.494557 | 519.798488 | 4.175271 | 317.527078 |
| causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 123.984189 | 512.877391 | 4.136635 | 313.663544 |
| noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 122.827725 | 496.195958 | 4.039772 | 303.977164 |
| causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 123.826738 | 484.244647 | 3.910663 | 291.066303 |

🔺 Top 5 Cases Where better_configs (change) is Faster than base (baseline), shape: (5, 7): same five rows as the table above.

🔻 Top 5 Cases Where better_configs (change) is Slower than base (baseline), shape: (5, 7):

| attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops BWD (base) | TFlops BWD (better_configs) | better_configs_speedup_over_ba… | pct_delta |
|---|---|---|---|---|---|---|
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 267.502004 | 250.728732 | 0.937297 | -6.270335 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 248.510516 | 235.210874 | 0.946483 | -5.351742 |
| document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 282.856295 | 271.806926 | 0.960936 | -3.906354 |
| document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 282.212695 | 280.519092 | 0.993999 | -0.600116 |
| document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 295.864073 | 294.477894 | 0.995315 | -0.468519 |

📊 Performance Summary:
- Baseline: base
- Change: better_configs
- Geometric Mean Speedup (change over baseline): 1.9954x
- Geometric Mean % Change: +99.54%
- Median Speedup (change over baseline): 2.1590x
- Speedup Std Dev: 0.9800
- Valid Comparisons: 60/60

Pull Request resolved: pytorch#163318 Approved by: https://github.com/BoyuanFeng
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1  # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both buf3 and buf4 are no longer used. Note that buf3 and buf4 hold references to the data, so `del buf1` does not prevent their usage. However, when there are mutating args, we don't see `del buf1` immediately:
```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />

Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias its output. https://github.com/pytorch/pytorch/blob/72fedf05752069c9e8b97c64397aedf6ee2bf5ec/torch/_inductor/ir.py#L7976-L7982 According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's outputs should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of FallbackKernel. Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064) Pull Request resolved: pytorch#163227 Approved by: https://github.com/zou3519
…TMA template for GEMMs (pytorch#163147) Summary: X-link: meta-pytorch/tritonbench#432 Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template. This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`. Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which apply scaling prior to the GEMM computation. In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff. Test Plan: Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel. Benchmarking command: ``` TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log ``` Rollback Plan: Differential Revision: D82597111 Pull Request resolved: pytorch#163147 Approved by: https://github.com/njriasan
As in title. The auto pin update was merged without running the vllm workflow. Pull Request resolved: pytorch#163353 Approved by: https://github.com/malfet, https://github.com/wdvr
…ytorch#162772)" This reverts commit 49d30f9. Reverted pytorch#162772 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#162772 (comment)))
This reverts commit c9b80c4. Reverted pytorch#162590 on behalf of https://github.com/malfet due to This breaks CUDA 13 builds ([comment](pytorch#162590 (comment)))
pytorch#155989) …ght and kernel_width that overflow to exactly 0. Fixes [pytorch#155981](pytorch#155981) Pull Request resolved: pytorch#155989 Approved by: https://github.com/malfet
Undo the changes introduced in pytorch#160956, as the driver has been updated to 580 for both fleets. Fixes pytorch#163342 Pull Request resolved: pytorch#163349 Approved by: https://github.com/seemethere
This code is delicious spaghetti: sometimes the Python version is defined in a Jinja template (see pytorch#162297), sometimes in a shell script (see pytorch#162877), but this time around it's in a Python file (and there is another one, `generate_binary_build_matrix.py`, that defines `FULL_PYTHON_VERSIONS`). Pull Request resolved: pytorch#163339 Approved by: https://github.com/clee2000
Fixes pytorch#156740 Adds explicit `Any` typing to `*args` and `**kwargs` in `nn.Module.__init__()` to fix type checker errors in strict mode. Pull Request resolved: pytorch#157389 Approved by: https://github.com/Skylion007, https://github.com/Raman-RH
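A minimal illustration of the annotation this PR adds: with `*args`/`**kwargs` typed as `Any`, strict type checkers accept subclasses that forward arbitrary arguments through `super().__init__()`.
```python
from typing import Any

import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self, hidden: int, *args: Any, **kwargs: Any) -> None:
        # Forwarding *args/**kwargs no longer trips strict-mode type checkers.
        super().__init__(*args, **kwargs)
        self.linear = nn.Linear(hidden, hidden)

m = MyModule(16)
```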
Improves error message reported on pytorch#163321 Pull Request resolved: pytorch#163350 Approved by: https://github.com/Skylion007, https://github.com/xmfan
…e_format in compile (pytorch#163017) Fixes pytorch#161010 by making `clone_meta` match the stride semantics of eager mode:
* Case 1: the tensor is_non_overlapping_and_dense; in this case, the output stride should match the input tensor stride.
* Case 2: otherwise, the output stride should be contiguous, computed from the input tensor using `compute_elementwise_output_strides`.
Pull Request resolved: pytorch#163017 Approved by: https://github.com/williamwen42, https://github.com/xmfan Co-authored-by: morrison-turnansky <[email protected]>
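A small eager-mode illustration of the two cases above, i.e. the stride semantics `clone_meta` is being aligned to (assuming the default `memory_format=torch.preserve_format`):
```python
import torch

# Case 1: non-overlapping and dense (e.g. a transposed tensor):
# clone preserves the input strides.
x = torch.randn(4, 6).t()
print(x.stride(), x.clone().stride())  # (1, 6) (1, 6)

# Case 2: not dense (e.g. a strided slice): clone falls back to
# contiguous output strides.
y = torch.randn(10)[::2]
print(y.stride(), y.clone().stride())  # (2,) (1,)
```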
Which equals `%CONDA_PARENT_DIR%/Miniconda3`; replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase. Pull Request resolved: pytorch#163341 Approved by: https://github.com/clee2000 ghstack dependencies: pytorch#163339
This change may also resolve pytorch#161789, though verification is still needed. PR pytorch#130472 introduced a problem of freeing the same address without cleaning up its metadata; per the discussion below, it has been reverted. Pull Request resolved: pytorch#162950 Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed
…d on hardware libraries (pytorch#162245)" This reverts commit 35d7b32. Reverted pytorch#162245 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#162245 (comment)))
As titled. Avoids a potential hang when running dispatch and combine in subgroups. The rest is just a rearrangement of the tests to create a sub-group test class (no substantial change). Pull Request resolved: pytorch#163298 Approved by: https://github.com/fegin
Problem: without MemPool, it looks like the nvshmem backend never deallocates memory.
Cause: handles in `symm_mems_` (a map) keep references to memory allocations.
Solution:
- Remove the reference to the allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap the allocation, ensuring single ownership.
Pull Request resolved: pytorch#162680 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#163298
The issue cannot be reproduced using the original repro code provided in the issue description. However, the underlying issue mentioned by the maintainer (missing functions in `builder.py` and `trace_rules.py`) was never addressed and can still be reproduced with this test case:
```python
import torch
from torch.nn.attention import _cur_sdpa_kernel_backends

@torch.compile(fullgraph=True)
def test_function_that_triggers_error():
    return _cur_sdpa_kernel_backends()

print("Calling torch.compile function...")
try:
    result = test_function_that_triggers_error()
    print(f"Success: {result}")
except Exception as e:
    print(f"ERROR: {e}")
    print(f"Error type: {type(e)}")
```
The original repro likely no longer triggers the issue due to code path changes in the SDPA implementation, while the direct call to `_cur_sdpa_kernel_backends()` exposes the underlying problem: certain torch._C functions returning non-Tensor values aren't properly handled by dynamo tracing. I have implemented the changes by adding the missing functions to both `builder.py` and `trace_rules.py` to properly handle these cases during compilation. @guilhermeleobas Pull Request resolved: pytorch#161169 Approved by: https://github.com/guilhermeleobas, https://github.com/StrongerXi
Previously in merge_loops, we had to construct LoopBody twice to make sure we could use the same symbol prefix as before. This PR changes it to create LoopBody only once, by allowing the new LoopBody to use the same symbol prefix. It looks like it's OK to have duplicate symbols in a sympy replacement:
```
>>> x, y = sympy.symbols("x y")
>>> (x + y).xreplace({x: 0, y: x + 1})
x + 1
>>> (x + y).xreplace({x: y * y, y: x + 1})
x + y**2 + 1
>>> (x + y + x * x).xreplace({x: 0, y: x})
x
```
UPDATE: add the same optimization for LoopBody.reorder_iter_loops. Pull Request resolved: pytorch#162101 Approved by: https://github.com/jansel, https://github.com/eellison
I see torch.compile spending 2% of its time in sympy_str when compiling the bwd graph for MobileBertForQuestionAnswering. Most of the time, sympy_str is called when extracting read/write dependencies. But when we extract read/write deps, the result of sympy_str is just discarded (correct me if I'm wrong). To keep things simple, I just remove those calls. But if people think it may be useful for debugging, I can add a flag and only call sympy_str when it's explicitly set. <img width="667" height="409" alt="Screenshot 2025-09-03 at 6 21 52 PM" src="https://github.com/user-attachments/assets/a5929473-873d-4540-8f1e-c29f92be7125" /> (scuba link: https://fburl.com/scuba/pyperf_experimental/on_demand/3k2rduh9 ) Pull Request resolved: pytorch#162126 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: pytorch#162101
The previous LOAF after-fusion algorithm is not guaranteed to create more fusion opportunities even if loop reordering happens. I cannot find an example where LOAF reduces the amount of fusion, but here is an example where reordering loops does not add more fusions: https://github.com/pytorch/pytorch/blob/a1f7639922ee0470bd7109bab6fe62989cf5000d/test/inductor/test_loop_ordering.py#L612-L641 Move LOAF to a separate final round of fusion so that we are guaranteed not to reduce the amount of fusion. Hopefully this also helps compilation time, since LOAF kicks in when there are fewer nodes. Pull Request resolved: pytorch#162355 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: pytorch#162101, pytorch#162126
Fixes pytorch#161014. This introduces a fix that is consistent with the existing exception handling. As outlined in issue pytorch#161014, there is an edge case where negative padding does not make the tensor size negative but still triggers the exception that the size is negative. The fix is simply adding `new_dim >= 0` to include the zero dim and letting the operator return an empty tensor. In the PR I have added an edge case where the test now checks negative padding that reduces a dimension to zero, but the sample is only for the `constant` type of padding; I would like some feedback on whether it is necessary to add the same sample for the `reduce` type as well. This is my first PR contributing to PyTorch and any help/feedback is welcome! Thank you! @malfet @manuelcandales @janeyx99 @ezyang Pull Request resolved: pytorch#161639 Approved by: https://github.com/manuelcandales
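A hedged illustration of the edge case described above, assuming constant padding: negative padding that shrinks a dimension to exactly zero should now return an empty tensor rather than raising a "negative size" error.
```python
import torch
import torch.nn.functional as F

x = torch.ones(2, 3)
# Last dimension: 3 - 1 - 2 == 0, so the result should be an empty tensor.
y = F.pad(x, (-1, -2), mode="constant")
print(y.shape)  # torch.Size([2, 0])
```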
This reverts commit a8cd437. See pytorch#163481 (comment) This PR might also cause issues with cudagraphs. Pull Request resolved: pytorch#163737 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#163386, pytorch#163398, pytorch#163387, pytorch#163414, pytorch#163415, pytorch#163419, pytorch#163434, pytorch#163393, pytorch#163412, pytorch#163422, pytorch#163481, pytorch#163520, pytorch#163482
…pytorch#163740) Summary: Sets the default configs for the Blackwell Matmul Templates. Test Plan: NFC Differential Revision: D83116342 Pull Request resolved: pytorch#163740 Approved by: https://github.com/jananisriram
TestMemoryProfilerE2E.test_memory_timeline is failing on AArch64; this fixes it and enables it in the opt-in list of tests for AArch64. Fixes pytorch#142371 Pull Request resolved: pytorch#145260 Approved by: https://github.com/fadara01, https://github.com/sraikund16
…#163661) Preload logic no longer works with CUDA 13.0. See the installation path:
```
ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/cu13/lib/
libcheckpoint.so   libcudadevrt.a      libcufft.so.12   libcufile_rdma.so.1  libcusolver.so.12    libnvJitLink.so.13  libnvperf_target.so            libnvrtc.alt.so.13  libpcsamplingutil.so
libcublas.so.13    libcudart.so.13     libcufftw.so.12  libcupti.so.13       libcusolverMg.so.12  libnvblas.so.13     libnvrtc-builtins.alt.so.13.0  libnvrtc.so.13
libcublasLt.so.13  libcudart_static.a  libcufile.so.0   libcurand.so.10      libcusparse.so.12    libnvperf_host.so   libnvrtc-builtins.so.13.0      libnvtx3interop.so.1

ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/
cu13  cudnn  cusparselt  nccl  nvshmem
```
Test using the script from pytorch#162367:
```
Kernel test passed!
```
Pull Request resolved: pytorch#163661 Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/Camyll
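A hedged sketch (not the actual torch loader code) of why the preload breaks: with CUDA 13 wheels the shared libraries live under the consolidated `nvidia/cu13/lib` directory instead of per-package `nvidia/<package>/lib` directories, so any preload loop keyed to the old layout finds nothing. The paths and glob pattern below are illustrative.
```python
import ctypes
import glob
import os
import site

# Best-effort preload of NVIDIA shared libraries. The wildcard covers both the
# old per-package layout (nvidia/cublas/lib, nvidia/cudnn/lib, ...) and the new
# consolidated CUDA 13 layout (nvidia/cu13/lib).
for base in site.getsitepackages():
    for lib in glob.glob(os.path.join(base, "nvidia", "*", "lib", "lib*.so*")):
        try:
            ctypes.CDLL(lib, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            pass  # skip libraries that fail to load
```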
Fixes pytorch#162854 Pull Request resolved: pytorch#163077 Approved by: https://github.com/huydhn
…63642) Fixes pytorch#162367 Pull Request resolved: pytorch#163642 Approved by: https://github.com/msaroufim
…capture (pytorch#163242) Many extensions (including pybind helpers) call `Tensor.__dlpack__()` without a stream argument. Before pytorch#150217, `stream=None` behaved like "no cross-stream sync" and was safe inside CUDA Graph capture. After pytorch#150217, `stream=None` maps to the legacy default stream, adding a cross-stream wait that invalidates capture when running on a non-default stream. See this example:
```python
import torch

s = torch.cuda.Stream()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()

with torch.cuda.stream(s):
    with torch.cuda.graph(g):
        _ = x + 1
        cap = x.__dlpack__()
        _ = torch.utils.dlpack.from_dlpack(cap)
```
This PR partially reverts pytorch#150217 so that `stream=None` defaults to no sync. Pull Request resolved: pytorch#163242 Approved by: https://github.com/ngimel
An explicit redistribute_local_tensor API call could also result in communication; record it! Pull Request resolved: pytorch#163704 Approved by: https://github.com/ezyang
…dynamic (pytorch#163639) Differential Revision: D83053287 Pull Request resolved: pytorch#163639 Approved by: https://github.com/blaine-rister
Use fewer warps to ensure proper vectorization + memory coalescing for inner reductions; prefer more work per thread. <img width="1717" height="731" alt="Screenshot 2025-09-17 at 10 03 25 AM" src="https://github.com/user-attachments/assets/7b1f4a30-62f2-4bee-bb9c-122501bde63e" /> Pull Request resolved: pytorch#162447 Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
…#163461) Summary: What: Unskip the CUDA path for test_int8_weight_only_quant in test_torchinductor.py as the kernel was added by pytorch#159325. Why: Confirm CUDA backend for _weight_int8pack_mm is registered. Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda ``` https://www.internalfb.com/intern/testinfra/testrun/2533275104869494 Differential Revision: D82926440 Pull Request resolved: pytorch#163461 Approved by: https://github.com/jerryzh168
This PR optimizes the `extract_file` functions:
1. `normalize_path_separator` the dest path for Windows.
2. Add verbose error messages:
   a. On Linux, add the mz_zip error string.
   b. On Windows, add the mz_zip error string and the Windows error code.

For the UT `test_package_user_managed_weight`:
<img width="1910" height="442" alt="image" src="https://github.com/user-attachments/assets/6a63eda1-70ce-40fb-9681-adc955463884" />

It still has an issue with error code `32`; checking https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499- shows the error is `ERROR_SHARING_VIOLATION`. It is a little complex to debug, so I will continue working on it in a further PR. Pull Request resolved: pytorch#163718 Approved by: https://github.com/desertfire
…63712) Fixes pytorch#163483 Pull Request resolved: pytorch#163712 Approved by: https://github.com/ezyang, https://github.com/kwen2501
…torch#163783) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#163783 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>
…rch#163619) Fixes pytorch#162923

## Test Result

### Before
<img width="985" height="889" alt="image" src="https://github.com/user-attachments/assets/41de5cfa-7b25-4ba4-ade8-a6df745dcb30" />

### After
<img width="913" height="977" alt="image" src="https://github.com/user-attachments/assets/b6c06860-8db3-4b5d-9d46-31ece01fb04d" />

Pull Request resolved: pytorch#163619 Approved by: https://github.com/jbschlosser
Related to pytorch#161167 Pull Request resolved: pytorch#163778 Approved by: https://github.com/malfet
…025-09-09 rocm7.1_internal_testing_IFU_2025-09-09
…nelFunction (pytorch#160764)" This reverts commit 30384ab. (cherry picked from commit cd45fe7)
Moving the triton commit pin to the ToT of triton's pytorch/rocm7.1_internal_testing branch. The newer commits update the triton_kernels package, which has helped the vllm team see a perf boost in their models. (cherry picked from commit b5abe5e)
…m_magma.sh (ROCm#2651) Fixes #ISSUE_NUMBER --------- Co-authored-by: AMD <[email protected]> (cherry picked from commit 7ea3967)
…2674) * cherry-pick of pytorch/pytorch@2aadcea (cherry picked from commit a7dc2b0)
…sting_IFU_2025-09-24

# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/common/install_rocm.sh
#	.ci/docker/requirements-ci.txt
#	CMakeLists.txt
#	aten/src/ATen/native/Normalization.cpp
#	aten/src/ATen/native/miopen/BatchNorm_miopen.cpp
#	requirements-build.txt
#	test/nn/test_convolution.py
#	test/test_binary_ufuncs.py
#	test/test_nn.py
#	torch/_inductor/runtime/triton_heuristics.py
#	torch/testing/_internal/common_utils.py