
Conversation

@timmoon10 (Collaborator) commented on Jul 15, 2025

Description

This PR makes three changes to the quantizer infrastructure in the transformer_engine_torch extensions (a rough sketch of the resulting interface follows this list):

  1. Consolidate recipe-specific quantization logic in Quantizer::quantize. Previously this logic was duplicated across the quantization, activation, and normalization extension functions.
  2. Force quantized tensors to match the quantizer's usages in Quantizer::convert_and_update_tensor, similar to #1836 ("Make quantize_ respect the usages of the quantizer").
  3. Change Quantizer::create_tensor to always return an uninitialized tensor, removing an often-unnecessary scale-reciprocal computation. For backward compatibility, some quantizer subclasses provide functions for creating initialized tensors.
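
Below is a minimal, hypothetical C++ sketch of this division of labor, not the actual TransformerEngine code: QuantizedTensor, the boolean usage flags, and the simplified signatures are stand-ins for the real TensorWrapper/py::object-based API.

```cpp
// Illustrative sketch only: create_tensor allocates without initializing,
// quantize owns the recipe-specific logic, and convert_and_update_tensor
// reconciles an existing tensor with the quantizer's usages.
#include <cstddef>
#include <memory>
#include <vector>

struct QuantizedTensor {          // stand-in for the real quantized-tensor object
  bool has_rowwise_data = false;  // "usages" reduced to simple flags here
  bool has_columnwise_data = false;
};

class Quantizer {
 public:
  virtual ~Quantizer() = default;

  // Always returns an *uninitialized* tensor: buffers are allocated, but no
  // scales or scale reciprocals are computed up front.
  virtual std::unique_ptr<QuantizedTensor> create_tensor(
      const std::vector<size_t> &shape) const = 0;

  // Recipe-specific quantization lives here instead of being duplicated in
  // the quantize/activation/normalization extension functions.
  virtual void quantize(const float *input, size_t numel,
                        QuantizedTensor &out) const = 0;

  // Force an existing tensor to match this quantizer's usages (cf. #1836),
  // allocating or dropping row-wise/column-wise buffers as needed.
  void convert_and_update_tensor(QuantizedTensor &tensor) const {
    tensor.has_rowwise_data = rowwise_usage_;
    tensor.has_columnwise_data = columnwise_usage_;
  }

 protected:
  bool rowwise_usage_ = true;
  bool columnwise_usage_ = false;
};
```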
Arguments for removing the `rowwise_data` arg from `Quantizer::create_tensor`

rowwise_data gives callers the option to pass in an already-initialized data buffer. It was implemented to support some attention use cases involving QKV fusion and the Userbuffers buffer (no longer needed after #1711). However, this design has numerous problems:

  • There is no option to provide column-wise data.
  • There is no option to provide scales. For FP8 delayed scaling, this means we need to compute the reciprocal of the quantizer's scale, which adds CPU and GPU overhead that is often unnecessary. For FP8 current scaling, MXFP8, and DSv3, it makes the API unusable.
  • Scales are not consistent between recipes. FP8 requires a per-tensor scale that is shared between the data and the transpose. MXFP8 and DSv3 require separate row-wise and column-wise scales. Hierarchical scaling recipes involve multiple scale tensors for both row-wise and column-wise data. Maintaining an API that generically covers all cases will be... not fun.
  • The required buffers are recipe-specific and machine-specific, so anyone using this API must already know the recipe. They can just dynamic_cast the quantizer to the specific concrete class, so there is not much benefit in a generic API (see the sketch after the next paragraph).

This PR removes the rowwise_data argument entirely from the base class, so calling create_tensor creates a tensor with uninitialized buffers. NoneQuantizer and Float8Quantizer still expose variants of create_tensor that accept pre-initialized buffers, with better recipe-specific logic.
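
As a rough illustration of the dynamic_cast point above (reusing the hypothetical types from the earlier sketch; Float8QuantizerSketch and create_tensor_with_data are invented names, not the real subclass API):

```cpp
// A caller that already owns a data buffer necessarily knows the recipe, so it
// can downcast to the concrete quantizer and use a recipe-specific variant
// rather than a generic rowwise_data argument on the base class.
class Float8QuantizerSketch : public Quantizer {
 public:
  std::unique_ptr<QuantizedTensor> create_tensor(
      const std::vector<size_t> &shape) const override {
    (void)shape;
    return std::make_unique<QuantizedTensor>();  // uninitialized buffers only
  }

  // Subclass-only variant that accepts a pre-initialized row-wise buffer.
  std::unique_ptr<QuantizedTensor> create_tensor_with_data(
      const std::vector<size_t> &shape, float *rowwise_data) const {
    (void)shape;
    auto tensor = std::make_unique<QuantizedTensor>();
    tensor->has_rowwise_data = (rowwise_data != nullptr);
    return tensor;
  }

  void quantize(const float *, size_t, QuantizedTensor &) const override {}
};

void make_tensor(Quantizer &quantizer, const std::vector<size_t> &shape,
                 float *existing_buffer) {
  if (auto *fp8 = dynamic_cast<Float8QuantizerSketch *>(&quantizer)) {
    fp8->create_tensor_with_data(shape, existing_buffer);  // recipe-specific path
  } else {
    quantizer.create_tensor(shape);  // generic path: uninitialized buffers
  }
}
```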

#1950 is an alternative attempt to avoid the problems of the rowwise_data API in the FP8 current-scaling quantizer. #1836 adds an optional out argument to Quantizer::create_tensor and forces any provided tensor to match the quantizer's usages.

Closes #1836. Closes #1950.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Make Quantizer::create_tensor construct uninitialized tensors, with subclass variants for constructing initialized tensors
  • tex.quantize forces quantized tensors to match the quantizer's usages
  • Consolidate quantization logic in Quantizer::quantize
  • Support all quantization schemes in activation forward and backward

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 force-pushed the refactor-quantizer-create-tensor-func branch from baaef38 to bd5e1dd on July 16, 2025 at 01:54
@timmoon10 (Collaborator, Author)

/te-ci pytorch

@timmoon10 (Collaborator, Author)

/te-ci pytorch

@timmoon10 changed the title from "[PyTorch] Refactor Quantizer::create_tensor function" to "[PyTorch] Refactor C++ quantizer infrastructure" on Jul 17, 2025
@timmoon10 (Collaborator, Author)

/te-ci pytorch

@timmoon10 (Collaborator, Author)

/te-ci pytorch

@timmoon10 (Collaborator, Author)

/te-ci pytorch L1

Signed-off-by: Tim Moon <[email protected]>
@timmoon10 requested a review from ptrendx on July 21, 2025 at 23:15
@timmoon10 marked this pull request as ready for review on July 21, 2025 at 23:15
@timmoon10 (Collaborator, Author)

/te-ci pytorch L1

Avoid problems with in-place ops after quantizer usages are changed externally.

Signed-off-by: Tim Moon <[email protected]>
@@ -59,6 +59,7 @@ class Kernel {
   template <typename... ArgTs>
   void launch(int device_id, const dim3 grid_dim, const dim3 block_dim,
               unsigned int shared_mem_bytes, cudaStream_t stream, ArgTs &&...args) {
+    cuda_driver::ensure_context_exists();
@timmoon10 (Collaborator, Author) commented on Jul 24, 2025:
This PR exposed a bug in our NVRTC infrastructure. Three facts:

  1. The CUDA driver maintains a thread-local stack of CUDA contexts.
  2. PyTorch will initialize the CUDA context if needed for jitting.
  3. PyTorch performs autograd on a separate thread.

By removing unnecessary at::reciprocal calls from create_tensor, I hit cases where the backward pass launched an NVRTC kernel before launching any PyTorch ops (namely in the FP8 linear op with UB). Since the autograd thread's context stack was empty, this resulted in "invalid device context" errors.
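
For context, here is a minimal sketch of what such a guard can look like with the CUDA driver API. This illustrates the idea only and is not TE's actual cuda_driver::ensure_context_exists: if the calling thread has no current context, retain and bind the device's primary context before any driver-API launch.

```cpp
// Hypothetical guard: make sure the calling thread has a current CUDA context
// before launching a driver-API (e.g. NVRTC-compiled) kernel.
#include <cuda.h>
#include <stdexcept>

inline void ensure_context_exists_sketch(int device_ordinal = 0) {
  CUcontext ctx = nullptr;
  if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS || ctx == nullptr) {
    // The thread-local context stack is empty (e.g. PyTorch's autograd thread
    // before any PyTorch op has run), so bind the device's primary context.
    CUdevice dev;
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, device_ordinal) != CUDA_SUCCESS) {
      throw std::runtime_error("Failed to query CUDA device");
    }
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS ||
        cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
      throw std::runtime_error("Failed to bind CUDA primary context");
    }
  }
}
```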

A collaborator replied:
This is interesting, thanks!

@timmoon10 (Collaborator, Author)

/te-ci core

@zhongbozhu (Collaborator) left a comment:
LGTM.

Will save us a lot of work for NVFP4 if we rebase on this PR.

@ksivaman (Member)

/te-ci pytorch L0 L1

@ksivaman merged commit cb5013b into NVIDIA:main on Jul 29, 2025
35 of 37 checks passed
janekb04 added a commit to janekb04/TransformerEngine that referenced this pull request Jul 29, 2025
timmoon10 pushed a commit that referenced this pull request Jul 30, 2025
…#2006)

Refactor normalization.cpp to use quantizer logic introduced in #1952 instead of manual quantization

Signed-off-by: Jan Bielak <[email protected]>