
Conversation

@timmoon10 (Collaborator) commented on Jul 15, 2025

Description

This PR makes three changes to the quantizer infrastructure in the transformer_engine_torch extensions (a rough sketch of the resulting interface follows this list):

  1. Consolidate recipe-specific quantization logic in Quantizer::quantize. Previously this logic was duplicated across the quantization, activation, and normalization extension functions.
  2. Force quantized tensors to match the quantizer's usages in Quantizer::convert_and_update_tensor, similar to #1836 ("Make quantize_ respect the usages of the quantizer").
  3. Change Quantizer::create_tensor to always return an uninitialized tensor, removing an often-unnecessary scale-reciprocal computation. For backward compatibility, some quantizer subclasses provide functions for creating initialized tensors.
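
Below is a minimal, hypothetical C++ sketch of this division of labor, not the actual TransformerEngine code: QuantizedTensor, the boolean usage flags, and the simplified signatures are stand-ins for the real TensorWrapper/py::object-based API.

```cpp
// Illustrative sketch only: create_tensor allocates without initializing,
// quantize owns the recipe-specific logic, and convert_and_update_tensor
// reconciles an existing tensor with the quantizer's usages.
#include <cstddef>
#include <memory>
#include <vector>

struct QuantizedTensor {          // stand-in for the real quantized-tensor object
  bool has_rowwise_data = false;  // "usages" reduced to simple flags here
  bool has_columnwise_data = false;
};

class Quantizer {
 public:
  virtual ~Quantizer() = default;

  // Always returns an *uninitialized* tensor: buffers are allocated, but no
  // scales or scale reciprocals are computed up front.
  virtual std::unique_ptr<QuantizedTensor> create_tensor(
      const std::vector<size_t> &shape) const = 0;

  // Recipe-specific quantization lives here instead of being duplicated in
  // the quantize/activation/normalization extension functions.
  virtual void quantize(const float *input, size_t numel,
                        QuantizedTensor &out) const = 0;

  // Force an existing tensor to match this quantizer's usages (cf. #1836),
  // allocating or dropping row-wise/column-wise buffers as needed.
  void convert_and_update_tensor(QuantizedTensor &tensor) const {
    tensor.has_rowwise_data = rowwise_usage_;
    tensor.has_columnwise_data = columnwise_usage_;
  }

 protected:
  bool rowwise_usage_ = true;
  bool columnwise_usage_ = false;
};
```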
Arguments for removing the `rowwise_data` arg from `Quantizer::create_tensor`

rowwise_data gives callers the option to pass in an already-initialized data buffer. It was implemented to support some attention use cases involving QKV fusion and the Userbuffers buffer (no longer needed after #1711). However, this design has numerous problems:

  • There is no option to provide column-wise data.
  • There is no option to provide scales. For FP8 delayed scaling, this means we need to compute the reciprocal of the quantizer's scale, which adds CPU and GPU overhead that is often unnecessary. For FP8 current scaling, MXFP8, and DSv3, it makes the API unusable.
  • Scales are not consistent between recipes. FP8 requires a per-tensor scale that is shared between the data and the transpose. MXFP8 and DSv3 require separate row-wise and column-wise scales. Hierarchical scaling recipes involve multiple scale tensors for both row-wise and column-wise data. Maintaining an API that generically covers all cases will be... not fun.
  • The required buffers are recipe-specific and machine-specific, so anyone using this API must already know the recipe. They can just dynamic_cast the quantizer to the specific concrete class, so there is not much benefit in a generic API (see the sketch after the next paragraph).

This PR removes the rowwise_data argument entirely from the base class, so calling create_tensor creates a tensor with uninitialized buffers. NoneQuantizer and Float8Quantizer still expose variants of create_tensor that accept pre-initialized buffers, with better recipe-specific logic.
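
As a rough illustration of the dynamic_cast point above (reusing the hypothetical types from the earlier sketch; Float8QuantizerSketch and create_tensor_with_data are invented names, not the real subclass API):

```cpp
// A caller that already owns a data buffer necessarily knows the recipe, so it
// can downcast to the concrete quantizer and use a recipe-specific variant
// rather than a generic rowwise_data argument on the base class.
class Float8QuantizerSketch : public Quantizer {
 public:
  std::unique_ptr<QuantizedTensor> create_tensor(
      const std::vector<size_t> &shape) const override {
    (void)shape;
    return std::make_unique<QuantizedTensor>();  // uninitialized buffers only
  }

  // Subclass-only variant that accepts a pre-initialized row-wise buffer.
  std::unique_ptr<QuantizedTensor> create_tensor_with_data(
      const std::vector<size_t> &shape, float *rowwise_data) const {
    (void)shape;
    auto tensor = std::make_unique<QuantizedTensor>();
    tensor->has_rowwise_data = (rowwise_data != nullptr);
    return tensor;
  }

  void quantize(const float *, size_t, QuantizedTensor &) const override {}
};

void make_tensor(Quantizer &quantizer, const std::vector<size_t> &shape,
                 float *existing_buffer) {
  if (auto *fp8 = dynamic_cast<Float8QuantizerSketch *>(&quantizer)) {
    fp8->create_tensor_with_data(shape, existing_buffer);  // recipe-specific path
  } else {
    quantizer.create_tensor(shape);  // generic path: uninitialized buffers
  }
}
```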

#1950 is an alternative attempt to avoid the problems of the rowwise_data API in the FP8 current-scaling quantizer. #1836 adds an optional out argument to Quantizer::create_tensor and forces any provided tensor to match the quantizer's usages.

Closes #1836. Closes #1950.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Make Quantizer::create_tensor construct uninitialized tensors, with subclass variants for constructing initialized tensors
  • tex.quantize forces quantized tensors to match the quantizer's usages
  • Consolidate quantization logic in Quantizer::quantize
  • Support all quantization schemes in activation forward and backward

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 force-pushed the refactor-quantizer-create-tensor-func branch from baaef38 to bd5e1dd on July 16, 2025 at 01:54
@timmoon10 (Collaborator, Author)

/te-ci pytorch

@timmoon10 (Collaborator, Author)

/te-ci pytorch

@timmoon10 changed the title from "[PyTorch] Refactor Quantizer::create_tensor function" to "[PyTorch] Refactor C++ quantizer infrastructure" on Jul 17, 2025
@timmoon10 (Collaborator, Author)

/te-ci pytorch

@timmoon10 (Collaborator, Author)

/te-ci pytorch

@timmoon10 (Collaborator, Author)

/te-ci pytorch L1

Signed-off-by: Tim Moon <[email protected]>
@timmoon10 requested a review from ptrendx on July 21, 2025 at 23:15
@timmoon10 marked this pull request as ready for review on July 21, 2025 at 23:15
@timmoon10 (Collaborator, Author)

/te-ci pytorch L1

Avoid problems with in-place ops after quantizer usages are changed externally.

Signed-off-by: Tim Moon <[email protected]>
@@ -59,6 +59,7 @@ class Kernel {
   template <typename... ArgTs>
   void launch(int device_id, const dim3 grid_dim, const dim3 block_dim,
               unsigned int shared_mem_bytes, cudaStream_t stream, ArgTs &&...args) {
+    cuda_driver::ensure_context_exists();
@timmoon10 (Collaborator, Author) commented on Jul 24, 2025:
This PR exposed a bug in our NVRTC infrastructure. Three facts:

  1. The CUDA driver maintains a thread-local stack of CUDA contexts.
  2. PyTorch will initialize the CUDA context if needed for jitting.
  3. PyTorch performs autograd on a separate thread.

By removing unnecessary at::reciprocal calls from create_tensor, I hit cases where the backward pass launched an NVRTC kernel before launching any PyTorch ops (namely in the FP8 linear op with UB). Since the autograd thread's context stack was empty, this resulted in "invalid device context" errors.
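
For context, here is a minimal sketch of what such a guard can look like with the CUDA driver API. This illustrates the idea only and is not TE's actual cuda_driver::ensure_context_exists: if the calling thread has no current context, retain and bind the device's primary context before any driver-API launch.

```cpp
// Hypothetical guard: make sure the calling thread has a current CUDA context
// before launching a driver-API (e.g. NVRTC-compiled) kernel.
#include <cuda.h>
#include <stdexcept>

inline void ensure_context_exists_sketch(int device_ordinal = 0) {
  CUcontext ctx = nullptr;
  if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS || ctx == nullptr) {
    // The thread-local context stack is empty (e.g. PyTorch's autograd thread
    // before any PyTorch op has run), so bind the device's primary context.
    CUdevice dev;
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, device_ordinal) != CUDA_SUCCESS) {
      throw std::runtime_error("Failed to query CUDA device");
    }
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS ||
        cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
      throw std::runtime_error("Failed to bind CUDA primary context");
    }
  }
}
```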

A collaborator replied:
This is interesting, thanks!

@timmoon10 (Collaborator, Author)

/te-ci core

@zhongbozhu (Collaborator) left a comment:
LGTM.

Will save us a lot of work for NVFP4 if we rebase on this PR.

@ksivaman (Member)

/te-ci pytorch L0 L1

@ksivaman merged commit cb5013b into NVIDIA:main on Jul 29, 2025
35 of 37 checks passed
janekb04 added a commit to janekb04/TransformerEngine that referenced this pull request Jul 29, 2025
timmoon10 pushed a commit that referenced this pull request Jul 30, 2025
…#2006)

Refactor normalization.cpp to use quantizer logic introduced in #1952 instead of manual quantization

Signed-off-by: Jan Bielak <[email protected]>