Conversation

@pragupta pragupta commented Oct 1, 2025

Fixes #ISSUE_NUMBER

swolchok and others added 30 commits September 19, 2025 04:07
… C++ (pytorch#161695)

I initially didn't see good results porting this, but it was apparently because of pybind11 function calling overhead. (pybind11's object-handling primitives seem fine enough.) I'm interested in setting up nanobind, but this demonstrates it's not blocking.

Differential Revision: [D81530102](https://our.internmc.facebook.com/intern/diff/D81530102)

Pull Request resolved: pytorch#161695
Approved by: https://github.com/ezyang
)

Benchmark script:

```python
import time
import numpy as np
import torch

def main() -> None:
    for i in range(10):
        block_indices = np.arange(16384, dtype=np.int32)
        block_indices = block_indices.reshape(-1).clip(max=255)
        batch_indices = np.zeros(16384, dtype=np.int64)
        virtual_batches = 32
        block_table = torch.randn(32, 256)
        start = time.perf_counter()
        block_table[batch_indices, block_indices].view(virtual_batches, -1)
        end = time.perf_counter()
        time_elapsed_ms = (end - start) * 1000
        print(f"Function execution time: {time_elapsed_ms:.1f}ms")

if __name__ == "__main__":
    main()
```

Before:

```
(a) [[email protected] ~/local/b/pytorch] python ben.py
Function execution time: 28.5ms
Function execution time: 12.9ms
Function execution time: 12.6ms
Function execution time: 13.5ms
Function execution time: 12.0ms
Function execution time: 13.4ms
Function execution time: 12.9ms
Function execution time: 12.9ms
Function execution time: 13.1ms
Function execution time: 13.0ms
```

After:

```
Function execution time: 17.8ms
Function execution time: 2.5ms
Function execution time: 1.3ms
Function execution time: 2.5ms
Function execution time: 2.3ms
Function execution time: 1.3ms
Function execution time: 2.4ms
Function execution time: 2.5ms
Function execution time: 2.5ms
Function execution time: 2.4ms
```

Signed-off-by: Edward Z. Yang <[email protected]>
Pull Request resolved: pytorch#163280
Approved by: https://github.com/SherlockNoMad, https://github.com/cyyever
This reverts commit 3016616.

Reverted pytorch#162310 on behalf of https://github.com/malfet due to Breaks some windows tests ([comment](pytorch#162862 (comment)))
…k) (pytorch#161571)

Summary: dispatch MTIA to function foreach_tensor_maximum_scalar_kernel_mtia_
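
For reference, a minimal sketch of the foreach scalar-maximum op this kernel backs (shown here on CPU tensors for illustration; the diff only adds the MTIA dispatch):

```python
import torch

# Clamp each tensor in the list elementwise at 0.0, in place, via the foreach
# scalar-maximum op (the MTIA path dispatches to
# foreach_tensor_maximum_scalar_kernel_mtia_).
tensors = [torch.randn(4) for _ in range(3)]
torch._foreach_maximum_(tensors, 0.0)
```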

Test Plan:
CI

Rollback Plan:

Differential Revision: D81086607

Pull Request resolved: pytorch#161571
Approved by: https://github.com/malfet
Summary:
This PR is extracted from pytorch#162542, to make the original PR
easier to review. This PR only contains cosmetic changes.

Pull Request resolved: pytorch#163115
Approved by: https://github.com/tianyu-l
ghstack dependencies: pytorch#162539, pytorch#162540, pytorch#162541
Summary:
This issue proposes implementing an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU and CUDA.

Motivation:
Same as pytorch#159325.
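
For context, a minimal usage sketch of the op being added for XPU (the shapes and the `"xpu"` device string are illustrative assumptions; the op already exists on CPU and CUDA):

```python
import torch

M, K, N = 8, 64, 32
device = "xpu"  # assumes an XPU build/device; use "cpu" or "cuda" where already supported

x = torch.randn(M, K, dtype=torch.bfloat16, device=device)                   # activations
w_int8 = torch.randint(-128, 127, (N, K), dtype=torch.int8, device=device)   # quantized weight
scales = torch.rand(N, dtype=torch.bfloat16, device=device)                  # per-output-channel scales

# Weight-only quantized linear: roughly (x @ w_int8.t().to(x.dtype)) * scales
y = torch._weight_int8pack_mm(x, w_int8, scales)
```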

Pull Request resolved: pytorch#160938
Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
… /.ci/docker/ci_commit_pins (pytorch#162063)

* [Dependabot] Update(deps): Bump transformers

Bumps [transformers](https://github.com/huggingface/transformers) from 4.54.0 to 4.56.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.54.0...v4.56.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.56.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Refresh results

Signed-off-by: Huy Do <[email protected]>

* Another round of updates

Signed-off-by: Huy Do <[email protected]>

* Another round of update

Signed-off-by: Huy Do <[email protected]>

* Hopefully the last round of update

Signed-off-by: Huy Do <[email protected]>

* Plz

Signed-off-by: Huy Do <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Huy Do <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Huy Do <[email protected]>
…torch#163205)

It seems `TEST_CUDA` is set to true even for ROCm (MI200) jobs. Changing the `if TEST_CUDA` check to an `else` condition avoids running the symmetric memory UTs on MI200. For other non-ROCm architectures it should still return true, and the tests can be skipped using other skip decorators.
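
A minimal sketch of the intended guard ordering (illustrative only, not the actual test-file diff; it assumes the standard `TEST_WITH_ROCM` flag from `torch.testing._internal.common_utils`, and the flag name below is hypothetical):

```python
import torch
from torch.testing._internal.common_utils import TEST_WITH_ROCM

# Check the ROCm case first so that MI200 jobs (where TEST_CUDA is also true)
# do not enable the symmetric-memory unit tests; other architectures keep the
# CUDA-based check and can still be skipped via the usual skip decorators.
if TEST_WITH_ROCM:
    enable_symm_mem_tests = False
else:
    enable_symm_mem_tests = torch.cuda.is_available()
```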

Pull Request resolved: pytorch#163205
Approved by: https://github.com/ezyang

Co-authored-by: Jeff Daily <[email protected]>
…ch#163127)

PR pytorch#151360 added mx fp8 and fp4 support on ROCm.
1. However, on recent upstream, the scaling function in Blas.cpp together with the test_matmul_cuda changes triggered failures.
This patch corrects the `is_blockwise_1x32_scaling` function.

2. Fixes the m, n, k dimensions for the ROCm mx case.

3. Modifies FP4E2M1FN_LARGEST_POW2 (the largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2.
This results in a higher SQNR value for the mx fp4 test.

Testing result on gfx950 with ROCm 7.0:

PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v
Ran 452 tests in 22.698s; OK, 111 passed.
This is the same as before (when PR 151360 was merged).

Pull Request resolved: pytorch#163127
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
…n H100 (pytorch#162022)

Only cuBLAS supports float32 output, and cuBLAS only supports rowwise scaling for SM 9.0.

Intended to land after pytorch#161305
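
For reference, a minimal sketch of the combination this refers to, rowwise scales with a float32 output on SM 9.0 (shapes are illustrative; assumes an H100-class GPU with fp8 support):

```python
import torch

M, K, N = 256, 512, 128
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major operand

scale_a = torch.rand(M, 1, device="cuda", dtype=torch.float32)  # rowwise scale for a
scale_b = torch.rand(1, N, device="cuda", dtype=torch.float32)  # rowwise scale for b

# float32 output: only the cuBLAS backend supports this, and cuBLAS only
# supports rowwise scaling on SM 9.0.
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.float32)
```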

Pull Request resolved: pytorch#162022
Approved by: https://github.com/ngimel
…onfig (pytorch#163318)

```Shell
Up to 4x perf boost

🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
┌───────────┬────────────────┬────────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬────────────┐
│ attn_type ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)          ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta  │
│ ---       ┆ ---            ┆ ---                            ┆ ---               ┆ ---                         ┆ ---                             ┆ ---        │
│ str       ┆ str            ┆ str                            ┆ f64               ┆ f64                         ┆ f64                             ┆ f64        │
╞═══════════╪════════════════╪════════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪════════════╡
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 124.775035        ┆ 532.580435                  ┆ 4.268325                        ┆ 326.832527 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 124.494557        ┆ 519.798488                  ┆ 4.175271                        ┆ 317.527078 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 123.984189        ┆ 512.877391                  ┆ 4.136635                        ┆ 313.663544 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)   ┆ 122.827725        ┆ 496.195958                  ┆ 4.039772                        ┆ 303.977164 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 123.826738        ┆ 484.244647                  ┆ 3.910663                        ┆ 291.066303 │
└───────────┴────────────────┴────────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴────────────┘

🔺 Top 5 Cases Where better_configs (change) is Faster than base (baseline):
shape: (5, 7)
┌───────────┬────────────────┬────────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬────────────┐
│ attn_type ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)          ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta  │
│ ---       ┆ ---            ┆ ---                            ┆ ---               ┆ ---                         ┆ ---                             ┆ ---        │
│ str       ┆ str            ┆ str                            ┆ f64               ┆ f64                         ┆ f64                             ┆ f64        │
╞═══════════╪════════════════╪════════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪════════════╡
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 124.775035        ┆ 532.580435                  ┆ 4.268325                        ┆ 326.832527 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 124.494557        ┆ 519.798488                  ┆ 4.175271                        ┆ 317.527078 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 123.984189        ┆ 512.877391                  ┆ 4.136635                        ┆ 313.663544 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)   ┆ 122.827725        ┆ 496.195958                  ┆ 4.039772                        ┆ 303.977164 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 123.826738        ┆ 484.244647                  ┆ 3.910663                        ┆ 291.066303 │
└───────────┴────────────────┴────────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴────────────┘

🔻 Top 5 Cases Where better_configs (change) is Slower than base (baseline):
shape: (5, 7)
┌───────────────┬────────────────┬───────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬───────────┐
│ attn_type     ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)         ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta │
│ ---           ┆ ---            ┆ ---                           ┆ ---               ┆ ---                         ┆ ---                             ┆ ---       │
│ str           ┆ str            ┆ str                           ┆ f64               ┆ f64                         ┆ f64                             ┆ f64       │
╞═══════════════╪════════════════╪═══════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪═══════════╡
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)  ┆ 267.502004        ┆ 250.728732                  ┆ 0.937297                        ┆ -6.270335 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 4, 8192, 128)   ┆ 248.510516        ┆ 235.210874                  ┆ 0.946483                        ┆ -5.351742 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, 16384, 128) ┆ 282.856295        ┆ 271.806926                  ┆ 0.960936                        ┆ -3.906354 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 64)   ┆ 282.212695        ┆ 280.519092                  ┆ 0.993999                        ┆ -0.600116 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 32768, 4, 32768, 128) ┆ 295.864073        ┆ 294.477894                  ┆ 0.995315                        ┆ -0.468519 │
└───────────────┴────────────────┴───────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴───────────┘

📊 Performance Summary:
============================================================
Baseline: base
Change:   better_configs
Geometric Mean Speedup (change over baseline): 1.9954x
Geometric Mean % Change: +99.54%
Median Speedup (change over baseline): 2.1590x
Speedup Std Dev: 0.9800
Valid Comparisons: 60/60

```

Pull Request resolved: pytorch#163318
Approved by: https://github.com/BoyuanFeng
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both buf3 and buf4 are no longer used. Note that buf3 and buf4 hold references to the data, so `del buf1` does not prevent their use.

However, when there are mutating args, we don't see `del buf1` immediately.

```python
import torch


@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)?  x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```

<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes that `buf1` (an output of the FallbackKernel for op1) has inputs that alias its output.
https://github.com/pytorch/pytorch/blob/72fedf05752069c9e8b97c64397aedf6ee2bf5ec/torch/_inductor/ir.py#L7976-L7982

According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's outputs should NOT alias any of the inputs. This PR improves `get_inputs_that_alias_output` of FallbackKernel.

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

Pull Request resolved: pytorch#163227
Approved by: https://github.com/zou3519
…TMA template for GEMMs (pytorch#163147)

Summary:
X-link: meta-pytorch/tritonbench#432

Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template.

This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`.

Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which applies scaling prior to the GEMM computation.
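
For reference, a sketch of what the scaling epilogue computes for per-row scales (per-tensor scaling is the special case where $s^{A}$ and $s^{B}$ collapse to scalars):

```math
C_{ij} = s^{A}_{i}\, s^{B}_{j} \sum_{k} A_{ik} B_{kj}
```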

In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff.

Test Plan:
Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel.

Benchmarking command:
```
TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log
```

Rollback Plan:

Differential Revision: D82597111

Pull Request resolved: pytorch#163147
Approved by: https://github.com/njriasan
As in title

The auto pin update was merged without running the vLLM workflow.
Pull Request resolved: pytorch#163353
Approved by: https://github.com/malfet, https://github.com/wdvr
Undo changes introduced in pytorch#160956, as the driver has been updated to 580 for both fleets.

Fixes pytorch#163342
Pull Request resolved: pytorch#163349
Approved by: https://github.com/seemethere
This code is delicious spaghetti: sometimes the Python version is defined in a Jinja template (see pytorch#162297), sometimes in a shell script (see pytorch#162877), but this time around it's in a Python file (and there is another one, `generate_binary_build_matrix.py`, that defines `FULL_PYTHON_VERSIONS`).

Pull Request resolved: pytorch#163339
Approved by: https://github.com/clee2000
Fixes pytorch#156740

Adds explicit `Any` typing to `*args` and `**kwargs` in `nn.Module.__init__()` to fix type checker errors in strict mode.
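
A small sketch of the annotated signature and why it helps under strict type checking (the `MyLayer` subclass is an illustrative example, not from the PR):

```python
from typing import Any

import torch.nn as nn

class Module:
    # Sketch of the annotated signature this PR adds to nn.Module.__init__
    # (illustrative, not the verbatim upstream source):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        ...

# With the annotation in place, subclasses calling super().__init__() no longer
# trip strict-mode type checkers:
class MyLayer(nn.Module):
    def __init__(self, hidden: int) -> None:
        super().__init__()
        self.linear = nn.Linear(hidden, hidden)
```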
Pull Request resolved: pytorch#157389
Approved by: https://github.com/Skylion007, https://github.com/Raman-RH
…e_format in compile (pytorch#163017)

Fixes pytorch#161010 by making `clone_meta` match the semantics of strides for eager mode.

This is:
  * Case 1: the tensor is_non_overlapping_and_dense; in this case, the strides should match the input tensor strides
  * Case 2: otherwise, the strides should be the contiguous strides computed from the input tensor using `compute_elementwise_output_strides`
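
A small repro-style sketch of the behavior being aligned (device-agnostic and illustrative; under the fix the compiled strides should match eager):

```python
import torch

def f(x):
    return x.clone(memory_format=torch.preserve_format)

# Case 1: a non-overlapping-and-dense (here channels_last) input should keep
# its input strides through clone, in eager and in compile alike.
x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
print(f(x).stride())                  # eager strides
print(torch.compile(f)(x).stride())   # should match the eager strides
```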

Pull Request resolved: pytorch#163017
Approved by: https://github.com/williamwen42, https://github.com/xmfan

Co-authored-by: morrison-turnansky <[email protected]>
Which is equal to `%CONDA_PARENT_DIR%/Miniconda3`; replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase.
Pull Request resolved: pytorch#163341
Approved by: https://github.com/clee2000
ghstack dependencies: pytorch#163339
This change may also resolve pytorch#161789, though verification is still needed.

PR pytorch#130472 introduced the problem of freeing the same address without cleaning up its metadata; per the discussion below, it is reverted.
Pull Request resolved: pytorch#162950
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed
kwen2501 and others added 29 commits September 24, 2025 06:58
Fixes pytorch#161324 by adding an `is_non_overlapping_and_dense` check.

Pull Request resolved: pytorch#163719
Approved by: https://github.com/ngimel
…pytorch#163740)

Summary: Sets the default configs for the Blackwell Matmul Templates.

Test Plan: NFC

Differential Revision: D83116342

Pull Request resolved: pytorch#163740
Approved by: https://github.com/jananisriram
TestMemoryProfilerE2E.test_memory_timeline is failing on AArch64; this fixes it and enables it in the opt-in list of tests for AArch64.

Fixes pytorch#142371

Pull Request resolved: pytorch#145260
Approved by: https://github.com/fadara01, https://github.com/sraikund16
…#163661)

Preload logic no longer works with CUDA 13.0
See the installation path:
```
ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/cu13/lib/
libcheckpoint.so   libcudadevrt.a      libcufft.so.12   libcufile_rdma.so.1  libcusolver.so.12    libnvJitLink.so.13  libnvperf_target.so            libnvrtc.alt.so.13    libpcsamplingutil.so
libcublas.so.13    libcudart.so.13     libcufftw.so.12  libcupti.so.13       libcusolverMg.so.12  libnvblas.so.13     libnvrtc-builtins.alt.so.13.0  libnvrtc.so.13
libcublasLt.so.13  libcudart_static.a  libcufile.so.0   libcurand.so.10      libcusparse.so.12    libnvperf_host.so   libnvrtc-builtins.so.13.0      libnvtx3interop.so.1

ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/
cu13  cudnn  cusparselt  nccl  nvshmem
```
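
A minimal sketch (not the actual `torch/__init__.py` logic) of probing both the old per-library wheel layout and the new consolidated `nvidia/cu13/lib` layout shown above when preloading:

```python
import ctypes
import glob
import os
import site

def preload_nvidia_lib(pattern: str) -> None:
    # Probe both wheel layouts: CUDA 12-style nvidia/<pkg>/lib and the
    # CUDA 13 consolidated nvidia/cu13/lib directory.
    for base in site.getsitepackages():
        for candidate in (
            os.path.join(base, "nvidia", "cublas", "lib", pattern),  # CUDA <= 12 wheels
            os.path.join(base, "nvidia", "cu13", "lib", pattern),    # CUDA 13 wheels
        ):
            for path in glob.glob(candidate):
                ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)

preload_nvidia_lib("libcublas.so.*")
```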

Test using the script from pytorch#162367:
```
Kernel test passed!
```
Pull Request resolved: pytorch#163661
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/Camyll
…capture (pytorch#163242)

Many extensions (including pybind helpers) call `Tensor.__dlpack__()` without a stream argument. Before pytorch#150217, `stream=None` behaved like “no cross-stream sync” and was safe inside CUDA Graph capture. After pytorch#150217, `stream=None` maps to the legacy default stream, adding a cross-stream wait that invalidates capture when running on a non-default stream.

See this example

```python
import torch
s = torch.cuda.Stream()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()

with torch.cuda.stream(s):
    with torch.cuda.graph(g):
        _ = x + 1
        cap = x.__dlpack__()
        _ = torch.utils.dlpack.from_dlpack(cap)
```

This PR partially reverts pytorch#150217 so that `stream=None` again defaults to no sync.
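
Callers that do want producer/consumer synchronization can still request it by passing the consumer stream explicitly; a sketch, reusing `s` and `x` from the example above:

```python
# Explicit stream handoff: the producer synchronizes against the given consumer
# stream instead of relying on the stream=None default.
cap = x.__dlpack__(stream=s.cuda_stream)
y = torch.utils.dlpack.from_dlpack(cap)
```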

Pull Request resolved: pytorch#163242
Approved by: https://github.com/ngimel
An explicit `redistribute_local_tensor` API call can also result in communication; record it!

Pull Request resolved: pytorch#163704
Approved by: https://github.com/ezyang
Add fewer warps to ensure proper vectorization and memory coalescing for inner reductions, and prefer more work per thread.

<img width="1717" height="731" alt="Screenshot 2025-09-17 at 10 03 25 AM" src="https://github.com/user-attachments/assets/7b1f4a30-62f2-4bee-bb9c-122501bde63e" />

Pull Request resolved: pytorch#162447
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
…#163461)

Summary:
What: Unskip the CUDA path for test_int8_weight_only_quant in test_torchinductor.py as the kernel was added by pytorch#159325.

Why: Confirm CUDA backend for _weight_int8pack_mm is registered.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda
```
https://www.internalfb.com/intern/testinfra/testrun/2533275104869494

Differential Revision: D82926440

Pull Request resolved: pytorch#163461
Approved by: https://github.com/jerryzh168
This PR optimizes the `extract_file` functions:
1. Apply `normalize_path_separator` to the dest path on Windows.
2. Add verbose error messages:
a. On Linux, add the mz_zip error string.
b. On Windows, add the mz_zip error string and the Windows error code.

For the UT `test_package_user_managed_weight`:
<img width="1910" height="442" alt="image" src="https://github.com/user-attachments/assets/6a63eda1-70ce-40fb-9681-adc955463884" />

It still has an issue with error code `32`; checking https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499- shows that it corresponds to `ERROR_SHARING_VIOLATION`.

It is a little complex to debug; I will continue working on it in a further PR.

Pull Request resolved: pytorch#163718
Approved by: https://github.com/desertfire
…sting_IFU_2025-09-24

# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/common/install_rocm.sh
#	.ci/docker/requirements-ci.txt
#	CMakeLists.txt
#	aten/src/ATen/native/Normalization.cpp
#	aten/src/ATen/native/miopen/BatchNorm_miopen.cpp
#	requirements-build.txt
#	test/nn/test_convolution.py
#	test/test_binary_ufuncs.py
#	test/test_nn.py
#	torch/_inductor/runtime/triton_heuristics.py
#	torch/testing/_internal/common_utils.py
@pragupta pragupta merged commit 9717f4d into rocm7.1_internal_testing Oct 1, 2025
213 checks passed