Conversation

@pragupta pragupta commented Oct 1, 2025

Fixes #ISSUE_NUMBER

swolchok and others added 30 commits September 19, 2025 04:07
… C++ (pytorch#161695)

I initially didn't see good results porting this, but it was apparently because of pybind11 function calling overhead. (pybind11's object-handling primitives seem fine enough.) I'm interested in setting up nanobind, but this demonstrates it's not blocking.

Differential Revision: [D81530102](https://our.internmc.facebook.com/intern/diff/D81530102)

Pull Request resolved: pytorch#161695
Approved by: https://github.com/ezyang
)

Benchmark script:

```python
import time
import numpy as np
import torch

def main() -> None:
    for i in range(10):
        block_indices = np.arange(16384, dtype=np.int32)
        block_indices = block_indices.reshape(-1).clip(max=255)
        batch_indices = np.zeros(16384, dtype=np.int64)
        virtual_batches = 32
        block_table = torch.randn(32, 256)
        start = time.perf_counter()
        block_table[batch_indices, block_indices].view(virtual_batches, -1)
        end = time.perf_counter()
        time_elapsed_ms = (end - start) * 1000
        print(f"Function execution time: {time_elapsed_ms:.1f}ms")

if __name__ == "__main__":
    main()
```

Before:

```
(a) [[email protected] ~/local/b/pytorch] python ben.py
Function execution time: 28.5ms
Function execution time: 12.9ms
Function execution time: 12.6ms
Function execution time: 13.5ms
Function execution time: 12.0ms
Function execution time: 13.4ms
Function execution time: 12.9ms
Function execution time: 12.9ms
Function execution time: 13.1ms
Function execution time: 13.0ms
```

After:

```
Function execution time: 17.8ms
Function execution time: 2.5ms
Function execution time: 1.3ms
Function execution time: 2.5ms
Function execution time: 2.3ms
Function execution time: 1.3ms
Function execution time: 2.4ms
Function execution time: 2.5ms
Function execution time: 2.5ms
Function execution time: 2.4ms
```

Signed-off-by: Edward Z. Yang <[email protected]>
Pull Request resolved: pytorch#163280
Approved by: https://github.com/SherlockNoMad, https://github.com/cyyever
This reverts commit 3016616.

Reverted pytorch#162310 on behalf of https://github.com/malfet due to Breaks some windows tests ([comment](pytorch#162862 (comment)))
…k) (pytorch#161571)

Summary: dispatch MTIA to function foreach_tensor_maximum_scalar_kernel_mtia_
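
For reference, a minimal sketch of the foreach scalar-maximum op this kernel backs (shown here on CPU tensors for illustration; the diff only adds the MTIA dispatch):

```python
import torch

# Clamp each tensor in the list elementwise at 0.0, in place, via the foreach
# scalar-maximum op (the MTIA path dispatches to
# foreach_tensor_maximum_scalar_kernel_mtia_).
tensors = [torch.randn(4) for _ in range(3)]
torch._foreach_maximum_(tensors, 0.0)
```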

Test Plan:
CI

Rollback Plan:

Differential Revision: D81086607

Pull Request resolved: pytorch#161571
Approved by: https://github.com/malfet
Summary:
This PR is extracted from pytorch#162542, to make the original PR
easier to review. This PR only contains cosmetic changes.

Pull Request resolved: pytorch#163115
Approved by: https://github.com/tianyu-l
ghstack dependencies: pytorch#162539, pytorch#162540, pytorch#162541
Summary:
This issue proposes implementing an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU and CUDA.

Motivation:
Same as pytorch#159325.
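
For context, a minimal usage sketch of the op being added for XPU (the shapes and the `"xpu"` device string are illustrative assumptions; the op already exists on CPU and CUDA):

```python
import torch

M, K, N = 8, 64, 32
device = "xpu"  # assumes an XPU build/device; use "cpu" or "cuda" where already supported

x = torch.randn(M, K, dtype=torch.bfloat16, device=device)                   # activations
w_int8 = torch.randint(-128, 127, (N, K), dtype=torch.int8, device=device)   # quantized weight
scales = torch.rand(N, dtype=torch.bfloat16, device=device)                  # per-output-channel scales

# Weight-only quantized linear: roughly (x @ w_int8.t().to(x.dtype)) * scales
y = torch._weight_int8pack_mm(x, w_int8, scales)
```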

Pull Request resolved: pytorch#160938
Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
… /.ci/docker/ci_commit_pins (pytorch#162063)

* [Dependabot] Update(deps): Bump transformers

Bumps [transformers](https://github.com/huggingface/transformers) from 4.54.0 to 4.56.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](huggingface/transformers@v4.54.0...v4.56.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.56.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Refresh results

Signed-off-by: Huy Do <[email protected]>

* Another round of updates

Signed-off-by: Huy Do <[email protected]>

* Another round of update

Signed-off-by: Huy Do <[email protected]>

* Hopefully the last round of update

Signed-off-by: Huy Do <[email protected]>

* Plz

Signed-off-by: Huy Do <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Huy Do <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Huy Do <[email protected]>
…torch#163205)

It seems `TEST_CUDA` is set to true even for ROCm (MI200) jobs. Changing the `if TEST_CUDA` check to an `else` condition avoids running the symmetric memory UTs on MI200. For other non-ROCm architectures it should still return true, and the tests can be skipped using other skip decorators.
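
A minimal sketch of the intended guard ordering (illustrative only, not the actual test-file diff; it assumes the standard `TEST_WITH_ROCM` flag from `torch.testing._internal.common_utils`, and the flag name below is hypothetical):

```python
import torch
from torch.testing._internal.common_utils import TEST_WITH_ROCM

# Check the ROCm case first so that MI200 jobs (where TEST_CUDA is also true)
# do not enable the symmetric-memory unit tests; other architectures keep the
# CUDA-based check and can still be skipped via the usual skip decorators.
if TEST_WITH_ROCM:
    enable_symm_mem_tests = False
else:
    enable_symm_mem_tests = torch.cuda.is_available()
```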

Pull Request resolved: pytorch#163205
Approved by: https://github.com/ezyang

Co-authored-by: Jeff Daily <[email protected]>
…ch#163127)

PR pytorch#151360 added mx fp8 and fp4 support on ROCm.
1. However, on recent upstream, the scaling function in Blas.cpp together with the test_matmul_cuda changes triggered failures.
This patch corrects the `is_blockwise_1x32_scaling` function.

2. Fixes the m, n, k dimensions for the ROCm mx case.

3. Modifies FP4E2M1FN_LARGEST_POW2 (the largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2.
This results in a higher SQNR value for the mx fp4 test.

Testing result on gfx950 with ROCm 7.0:

PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v
Ran 452 tests in 22.698s; OK, 111 passed.
This is the same as before (when PR 151360 was merged).

Pull Request resolved: pytorch#163127
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <[email protected]>
…n H100 (pytorch#162022)

Only cuBLAS supports float32 output, and cuBLAS only supports rowwise scaling for SM 9.0.

Intended to land after pytorch#161305
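
For reference, a minimal sketch of the combination this refers to, rowwise scales with a float32 output on SM 9.0 (shapes are illustrative; assumes an H100-class GPU with fp8 support):

```python
import torch

M, K, N = 256, 512, 128
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major operand

scale_a = torch.rand(M, 1, device="cuda", dtype=torch.float32)  # rowwise scale for a
scale_b = torch.rand(1, N, device="cuda", dtype=torch.float32)  # rowwise scale for b

# float32 output: only the cuBLAS backend supports this, and cuBLAS only
# supports rowwise scaling on SM 9.0.
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.float32)
```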

Pull Request resolved: pytorch#162022
Approved by: https://github.com/ngimel
…onfig (pytorch#163318)

```Shell
Up to 4x perf boost

🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
┌───────────┬────────────────┬────────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬────────────┐
│ attn_type ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)          ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta  │
│ ---       ┆ ---            ┆ ---                            ┆ ---               ┆ ---                         ┆ ---                             ┆ ---        │
│ str       ┆ str            ┆ str                            ┆ f64               ┆ f64                         ┆ f64                             ┆ f64        │
╞═══════════╪════════════════╪════════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪════════════╡
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 124.775035        ┆ 532.580435                  ┆ 4.268325                        ┆ 326.832527 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 124.494557        ┆ 519.798488                  ┆ 4.175271                        ┆ 317.527078 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 123.984189        ┆ 512.877391                  ┆ 4.136635                        ┆ 313.663544 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)   ┆ 122.827725        ┆ 496.195958                  ┆ 4.039772                        ┆ 303.977164 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 123.826738        ┆ 484.244647                  ┆ 3.910663                        ┆ 291.066303 │
└───────────┴────────────────┴────────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴────────────┘

🔺 Top 5 Cases Where better_configs (change) is Faster than base (baseline):
shape: (5, 7)
┌───────────┬────────────────┬────────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬────────────┐
│ attn_type ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)          ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta  │
│ ---       ┆ ---            ┆ ---                            ┆ ---               ┆ ---                         ┆ ---                             ┆ ---        │
│ str       ┆ str            ┆ str                            ┆ f64               ┆ f64                         ┆ f64                             ┆ f64        │
╞═══════════╪════════════════╪════════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪════════════╡
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 124.775035        ┆ 532.580435                  ┆ 4.268325                        ┆ 326.832527 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 124.494557        ┆ 519.798488                  ┆ 4.175271                        ┆ 317.527078 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 123.984189        ┆ 512.877391                  ┆ 4.136635                        ┆ 313.663544 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)   ┆ 122.827725        ┆ 496.195958                  ┆ 4.039772                        ┆ 303.977164 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 123.826738        ┆ 484.244647                  ┆ 3.910663                        ┆ 291.066303 │
└───────────┴────────────────┴────────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴────────────┘

🔻 Top 5 Cases Where better_configs (change) is Slower than base (baseline):
shape: (5, 7)
┌───────────────┬────────────────┬───────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬───────────┐
│ attn_type     ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)         ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta │
│ ---           ┆ ---            ┆ ---                           ┆ ---               ┆ ---                         ┆ ---                             ┆ ---       │
│ str           ┆ str            ┆ str                           ┆ f64               ┆ f64                         ┆ f64                             ┆ f64       │
╞═══════════════╪════════════════╪═══════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪═══════════╡
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)  ┆ 267.502004        ┆ 250.728732                  ┆ 0.937297                        ┆ -6.270335 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 4, 8192, 128)   ┆ 248.510516        ┆ 235.210874                  ┆ 0.946483                        ┆ -5.351742 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, 16384, 128) ┆ 282.856295        ┆ 271.806926                  ┆ 0.960936                        ┆ -3.906354 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 64)   ┆ 282.212695        ┆ 280.519092                  ┆ 0.993999                        ┆ -0.600116 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 32768, 4, 32768, 128) ┆ 295.864073        ┆ 294.477894                  ┆ 0.995315                        ┆ -0.468519 │
└───────────────┴────────────────┴───────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴───────────┘

📊 Performance Summary:
============================================================
Baseline: base
Change:   better_configs
Geometric Mean Speedup (change over baseline): 1.9954x
Geometric Mean % Change: +99.54%
Median Speedup (change over baseline): 2.1590x
Speedup Std Dev: 0.9800
Valid Comparisons: 60/60

```

Pull Request resolved: pytorch#163318
Approved by: https://github.com/BoyuanFeng
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both buf3 and buf4 are no longer used. Note that buf3 and buf4 hold references to the data, so `del buf1` does not prevent their use.

However, when there are mutating args, we don't see `del buf1` immediately.

```python
import torch


@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)?  x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```

<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes that `buf1` (an output of the FallbackKernel for op1) has inputs that alias its output.
https://github.com/pytorch/pytorch/blob/72fedf05752069c9e8b97c64397aedf6ee2bf5ec/torch/_inductor/ir.py#L7976-L7982

According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's outputs should NOT alias any of the inputs. This PR improves `get_inputs_that_alias_output` of FallbackKernel.

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

Pull Request resolved: pytorch#163227
Approved by: https://github.com/zou3519
…TMA template for GEMMs (pytorch#163147)

Summary:
X-link: meta-pytorch/tritonbench#432

Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template.

This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`.

Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which applies scaling prior to the GEMM computation.
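
For reference, a sketch of what the scaling epilogue computes for per-row scales (per-tensor scaling is the special case where $s^{A}$ and $s^{B}$ collapse to scalars):

```math
C_{ij} = s^{A}_{i}\, s^{B}_{j} \sum_{k} A_{ik} B_{kj}
```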

In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff.

Test Plan:
Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel.

Benchmarking command:
```
TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log
```

Rollback Plan:

Differential Revision: D82597111

Pull Request resolved: pytorch#163147
Approved by: https://github.com/njriasan
As in title

The auto pin update was merged without running the vLLM workflow.
Pull Request resolved: pytorch#163353
Approved by: https://github.com/malfet, https://github.com/wdvr
Undo changes introduced in pytorch#160956, as the driver has been updated to 580 for both fleets.

Fixes pytorch#163342
Pull Request resolved: pytorch#163349
Approved by: https://github.com/seemethere
This code is delicious spaghetti: sometimes the Python version is defined in a Jinja template (see pytorch#162297), sometimes in a shell script (see pytorch#162877), but this time around it's in a Python file (and there is another one, `generate_binary_build_matrix.py`, that defines `FULL_PYTHON_VERSIONS`).

Pull Request resolved: pytorch#163339
Approved by: https://github.com/clee2000
Fixes pytorch#156740

Adds explicit `Any` typing to `*args` and `**kwargs` in `nn.Module.__init__()` to fix type checker errors in strict mode.
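
A small sketch of the annotated signature and why it helps under strict type checking (the `MyLayer` subclass is an illustrative example, not from the PR):

```python
from typing import Any

import torch.nn as nn

class Module:
    # Sketch of the annotated signature this PR adds to nn.Module.__init__
    # (illustrative, not the verbatim upstream source):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        ...

# With the annotation in place, subclasses calling super().__init__() no longer
# trip strict-mode type checkers:
class MyLayer(nn.Module):
    def __init__(self, hidden: int) -> None:
        super().__init__()
        self.linear = nn.Linear(hidden, hidden)
```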
Pull Request resolved: pytorch#157389
Approved by: https://github.com/Skylion007, https://github.com/Raman-RH
…e_format in compile (pytorch#163017)

Fixes pytorch#161010 by making `clone_meta` match the semantics of strides for eager mode.

This is:
  * Case 1: the tensor is_non_overlapping_and_dense; in this case, the strides should match the input tensor strides
  * Case 2: otherwise, the strides should be the contiguous strides computed from the input tensor using `compute_elementwise_output_strides`
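
A small repro-style sketch of the behavior being aligned (device-agnostic and illustrative; under the fix the compiled strides should match eager):

```python
import torch

def f(x):
    return x.clone(memory_format=torch.preserve_format)

# Case 1: a non-overlapping-and-dense (here channels_last) input should keep
# its input strides through clone, in eager and in compile alike.
x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
print(f(x).stride())                  # eager strides
print(torch.compile(f)(x).stride())   # should match the eager strides
```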

Pull Request resolved: pytorch#163017
Approved by: https://github.com/williamwen42, https://github.com/xmfan

Co-authored-by: morrison-turnansky <[email protected]>
Which is equal to `%CONDA_PARENT_DIR%/Miniconda3`; replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase.
Pull Request resolved: pytorch#163341
Approved by: https://github.com/clee2000
ghstack dependencies: pytorch#163339
This change may also resolve pytorch#161789, though verification is still needed.

PR pytorch#130472 introduced the problem of freeing the same address without cleaning up its metadata; per the discussion below, it is reverted.
Pull Request resolved: pytorch#162950
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed
kwen2501 and others added 29 commits September 24, 2025 06:58
Fixes pytorch#161324 by adding an `is_non_overlapping_and_dense` check.

Pull Request resolved: pytorch#163719
Approved by: https://github.com/ngimel
…pytorch#163740)

Summary: Sets the default configs for the Blackwell Matmul Templates.

Test Plan: NFC

Differential Revision: D83116342

Pull Request resolved: pytorch#163740
Approved by: https://github.com/jananisriram
TestMemoryProfilerE2E.test_memory_timeline is failing on AArch64; this fixes it and enables it in the opt-in list of tests for AArch64.

Fixes pytorch#142371

Pull Request resolved: pytorch#145260
Approved by: https://github.com/fadara01, https://github.com/sraikund16
…#163661)

Preload logic no longer works with CUDA 13.0
See the installation path:
```
ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/cu13/lib/
libcheckpoint.so   libcudadevrt.a      libcufft.so.12   libcufile_rdma.so.1  libcusolver.so.12    libnvJitLink.so.13  libnvperf_target.so            libnvrtc.alt.so.13    libpcsamplingutil.so
libcublas.so.13    libcudart.so.13     libcufftw.so.12  libcupti.so.13       libcusolverMg.so.12  libnvblas.so.13     libnvrtc-builtins.alt.so.13.0  libnvrtc.so.13
libcublasLt.so.13  libcudart_static.a  libcufile.so.0   libcurand.so.10      libcusparse.so.12    libnvperf_host.so   libnvrtc-builtins.so.13.0      libnvtx3interop.so.1

ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/
cu13  cudnn  cusparselt  nccl  nvshmem
```
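
A minimal sketch (not the actual `torch/__init__.py` logic) of probing both the old per-library wheel layout and the new consolidated `nvidia/cu13/lib` layout shown above when preloading:

```python
import ctypes
import glob
import os
import site

def preload_nvidia_lib(pattern: str) -> None:
    # Probe both wheel layouts: CUDA 12-style nvidia/<pkg>/lib and the
    # CUDA 13 consolidated nvidia/cu13/lib directory.
    for base in site.getsitepackages():
        for candidate in (
            os.path.join(base, "nvidia", "cublas", "lib", pattern),  # CUDA <= 12 wheels
            os.path.join(base, "nvidia", "cu13", "lib", pattern),    # CUDA 13 wheels
        ):
            for path in glob.glob(candidate):
                ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)

preload_nvidia_lib("libcublas.so.*")
```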

Test using the script from pytorch#162367:
```
Kernel test passed!
```
Pull Request resolved: pytorch#163661
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/Camyll
…capture (pytorch#163242)

Many extensions (including pybind helpers) call `Tensor.__dlpack__()` without a stream argument. Before pytorch#150217, `stream=None` behaved like “no cross-stream sync” and was safe inside CUDA Graph capture. After pytorch#150217, `stream=None` maps to the legacy default stream, adding a cross-stream wait that invalidates capture when running on a non-default stream.

See this example

```python
import torch
s = torch.cuda.Stream()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()

with torch.cuda.stream(s):
    with torch.cuda.graph(g):
        _ = x + 1
        cap = x.__dlpack__()
        _ = torch.utils.dlpack.from_dlpack(cap)
```

This PR partially reverts pytorch#150217 so that `stream=None` again defaults to no sync.
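
Callers that do want producer/consumer synchronization can still request it by passing the consumer stream explicitly; a sketch, reusing `s` and `x` from the example above:

```python
# Explicit stream handoff: the producer synchronizes against the given consumer
# stream instead of relying on the stream=None default.
cap = x.__dlpack__(stream=s.cuda_stream)
y = torch.utils.dlpack.from_dlpack(cap)
```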

Pull Request resolved: pytorch#163242
Approved by: https://github.com/ngimel
An explicit `redistribute_local_tensor` API call can also result in communication; record it!

Pull Request resolved: pytorch#163704
Approved by: https://github.com/ezyang
Add fewer warps to ensure proper vectorization and memory coalescing for inner reductions, and prefer more work per thread.

<img width="1717" height="731" alt="Screenshot 2025-09-17 at 10 03 25 AM" src="https://github.com/user-attachments/assets/7b1f4a30-62f2-4bee-bb9c-122501bde63e" />

Pull Request resolved: pytorch#162447
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
…#163461)

Summary:
What: Unskip the CUDA path for test_int8_weight_only_quant in test_torchinductor.py as the kernel was added by pytorch#159325.

Why: Confirm CUDA backend for _weight_int8pack_mm is registered.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda
```
https://www.internalfb.com/intern/testinfra/testrun/2533275104869494

Differential Revision: D82926440

Pull Request resolved: pytorch#163461
Approved by: https://github.com/jerryzh168
This PR optimizes the `extract_file` functions:
1. Apply `normalize_path_separator` to the dest path on Windows.
2. Add verbose error messages:
a. On Linux, add the mz_zip error string.
b. On Windows, add the mz_zip error string and the Windows error code.

For the UT `test_package_user_managed_weight`:
<img width="1910" height="442" alt="image" src="https://github.com/user-attachments/assets/6a63eda1-70ce-40fb-9681-adc955463884" />

It still has an issue with error code `32`; checking https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499- shows that it corresponds to `ERROR_SHARING_VIOLATION`.

It is a little complex to debug; I will continue working on it in a further PR.

Pull Request resolved: pytorch#163718
Approved by: https://github.com/desertfire
…sting_IFU_2025-09-24

# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/common/install_rocm.sh
#	.ci/docker/requirements-ci.txt
#	CMakeLists.txt
#	aten/src/ATen/native/Normalization.cpp
#	aten/src/ATen/native/miopen/BatchNorm_miopen.cpp
#	requirements-build.txt
#	test/nn/test_convolution.py
#	test/test_binary_ufuncs.py
#	test/test_nn.py
#	torch/_inductor/runtime/triton_heuristics.py
#	torch/testing/_internal/common_utils.py
@pragupta pragupta merged commit 9717f4d into rocm7.1_internal_testing Oct 1, 2025
213 checks passed