
[PyTorch] Quantizer as API #2039

Open

wants to merge 2 commits into main
Conversation

negvet (Collaborator) commented Aug 7, 2025

Description

Expose the quantizer as a public API.
The main objective is to let users build custom quantizers.

Currently, Quantizer contains TE-specific logic (quantize() with an autograd function, calibrate(), _get_compatible_recipe()).
I propose to extract the most generic interfaces/implementations into QuantizerBase and expose it as a first-class API.

Usage example:

from transformer_engine.pytorch import QuantizerBase

class MyCustomQuantizer(QuantizerBase):
    def quantize(self, tensor, **kwargs):
        # Custom quantization logic, e.g. Python-based or targeting custom silicon
        pass

Custom quantizers can then be used:

  • Externally (see the sketch below)
  • In TE, after providing some TE-required implementations (such as update_quantized())
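For illustration, an external quantizer might look like the following. This is a hypothetical sketch: the int8 fake-quantization logic is an assumption made for the example, and only the quantize() signature comes from this PR.

    import torch

    from transformer_engine.pytorch import QuantizerBase

    class Int8FakeQuantizer(QuantizerBase):
        """Hypothetical pure-Python quantizer for experimentation."""

        def quantize(self, tensor: torch.Tensor, **kwargs) -> torch.Tensor:
            # Symmetric per-tensor int8 fake-quantization (illustrative only).
            scale = tensor.abs().max().clamp(min=1e-12) / 127.0
            q = torch.clamp(torch.round(tensor / scale), -127, 127)
            return q * scale  # dequantize back to the original dtype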

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or new content)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Infra/Build change
  • [ ] Code refactoring

Changes

Please list the changes introduced in this PR:

  • Introduced QuantizerBase
  • Exposed QuantizerBase as an API

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

negvet (Collaborator, Author) commented Aug 7, 2025

/te-ci pytorch

negvet requested review from timmoon10 and ptrendx · August 7, 2025 10:14
negvet (Collaborator, Author) commented Aug 7, 2025

On the inconsistencies around rowwise_usage and columnwise_usage:

Naming: the parameters are named rowwise and columnwise, but the instance attributes are rowwise_usage and columnwise_usage. In addition, the meaning of "usage" is unclear.
Semantic ambiguity: rowwise_usage and columnwise_usage are currently independent boolean flags, even though they describe two aspects of the same tensor layout and are semantically close.

I propose to reconsider this design.

Renaming alone would already improve the situation.
Another option is to move toward the JAX implementation (enum-based):

from enum import Enum

class QuantizeLayout(Enum):
    ROWWISE = "rowwise"
    COLUMNWISE = "columnwise"
    ROWWISE_COLWISE = "both"

Although this introduces a new class for users to learn.
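For illustration, a call site might then take a single layout argument instead of the two booleans (the quantizer name and the layout parameter are hypothetical):

    # Hypothetical: one layout parameter instead of rowwise/columnwise booleans.
    quantizer = MyQuantizer(layout=QuantizeLayout.ROWWISE_COLWISE)
    needs_columnwise = quantizer.layout in (
        QuantizeLayout.COLUMNWISE,
        QuantizeLayout.ROWWISE_COLWISE,
    )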

negvet (Collaborator, Author) commented Aug 7, 2025

@ptrendx I propose to keep update_quantized in the concrete Quantizer class, but not in QuantizerBase.

This is a TE-specific optimization (weight workspace caching, CUDA graph support, etc.).
Custom quantizers for other use cases might not need in-place updates (see the sketch after this list):

  • Research quantizers might only need one-time quantization
  • Inference-only quantizers might not need parameter updates
  • etc.
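A minimal sketch of that split (the method signatures here are illustrative assumptions):

    from abc import ABC, abstractmethod

    class QuantizerBase(ABC):
        """Generic interface exposed as a first-class API."""

        @abstractmethod
        def quantize(self, tensor, **kwargs):
            ...

    class Quantizer(QuantizerBase):
        """Concrete TE quantizer; keeps TE-specific machinery."""

        def update_quantized(self, src, dst):
            # TE-specific: weight workspace caching, CUDA graph support, etc.
            ...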

negvet marked this pull request as ready for review · August 7, 2025 10:39
ksivaman self-requested a review · August 7, 2025 15:57
timmoon10 (Collaborator) commented Aug 7, 2025

  • I agree "usages" is not the best term for the concept we're describing, but we use it consistently throughout the codebase. If we find something better, we should properly document it and change it everywhere.
  • There's a semantic difference between "row-wise/column-wise usage" and "row-wise/column-wise data". "Row-wise/column-wise usages" indicates intent, and each usage is completely orthogonal since you may or may not use the same tensor for multiple operations. "Row-wise/column-wise data" is not orthogonal and is highly recipe-dependent, since some buffers can be used for multiple usages, e.g. FP8 data on Blackwell.
  • We should keep in mind that we might add more usages in the future. We are seriously considering adding usages for communication. For example, the FP8 wgrad GEMM currently does an all-gather followed by transpose. Instead of quantizing with row-wise usage, it would be more natural to quantize with all-gather-column-wise usage. We have similar considerations if we want to support MXFP8 with pre-swizzled scales.
  • The enum approach does not scale, since the number of enum members grows as 2^n with the number of usages (see the illustrative aside after this list).
  • Who knows what GEMM assumptions future architectures will have? Row-wise and column-wise currently rely on the fact that Hopper/Blackwell Tensor Cores use the same data format for A and B in a TN GEMM. Could we require 4 usages in the future (A, A^T, B, B^T)? What if we need to support convolutions?
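(An illustrative aside, not something proposed in this thread: Python's enum.Flag can represent n independent usages, and their combinations, without enumerating all 2^n members, which is essentially what the current boolean flags already do.)

    from enum import Flag, auto

    class Usage(Flag):
        ROWWISE = auto()
        COLUMNWISE = auto()
        # Future usages would compose freely, e.g. ALLGATHER_COLUMNWISE = auto()

    usage = Usage.ROWWISE | Usage.COLUMNWISE
    assert Usage.ROWWISE in usage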

timmoon10 (Collaborator) left a comment

Exposing a pure Python interface will be quite nice. Ideally we'd design the quantizers in such a way that a C++ or Python impl is an implementation detail, and we don't need any special logic in the modules.

One question is how to handle the C++ quantizer infrastructure, e.g. in tex.quantize. Options:

  • Only use the C++ quantizer as a perf optimization. We'll need to add checks to avoid passing a pure-Python quantizer into tex functions (e.g. for norms or activations); see the guard sketch below.
  • Add a C++ quantizer that calls a Python function (or modify the C++ quantizer base class). The C++ quantizer has to deal with both Python tensor classes (as pybind objects) and NVTETensor, and handling NVTETensors from Python will be challenging.

I think the pure Python approach in this PR is more straightforward.
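A sketch of the guard from the first option (the helper name is hypothetical, and the tex.quantize call is simplified):

    def _quantize(inp, quantizer):
        # Fast path: existing C++ quantizer infrastructure.
        if isinstance(quantizer, Quantizer):
            return tex.quantize(inp, quantizer)
        # Fallback: pure-Python QuantizerBase implementation.
        return quantizer.quantize(inp)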

self.columnwise_usage = columnwise


class Quantizer(QuantizerBase):
timmoon10 (Collaborator) commented:

I don't think the distinction between Quantizer and QuantizerBase is logical. This PR is trying to distinguish between quantizers that call tex.quantize and those that call a Python impl, but that's a quantizer-specific implementation detail. QuantizerBase also haphazardly removes parts of the quantizer API, like the ability to construct empty tensors, PyTorch autograd support, etc.

I think the right design is not to add an unnecessary QuantizerBase class, but to decouple Quantizer from tex.quantize. We can add an abstract quantize_impl function that is called within quantize. The existing quantizers should call tex.quantize, but future quantizers could use a pure Python impl.
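A rough sketch of that design (simplified: the real quantize() also handles autograd integration and usage bookkeeping, and the tex.quantize call signature is abbreviated):

    from abc import ABC, abstractmethod

    class Quantizer(ABC):
        def quantize(self, tensor, **kwargs):
            # Shared logic (autograd hooks, usage checks, ...) lives here.
            return self.quantize_impl(tensor, **kwargs)

        @abstractmethod
        def quantize_impl(self, tensor, **kwargs):
            ...

    class Float8Quantizer(Quantizer):  # e.g. an existing TE quantizer
        def quantize_impl(self, tensor, **kwargs):
            return tex.quantize(tensor, self)  # existing C++ path

    class MyPythonQuantizer(Quantizer):
        def quantize_impl(self, tensor, **kwargs):
            ...  # pure-Python implementation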

timmoon10 (Collaborator) commented Aug 7, 2025

> @ptrendx I propose to keep update_quantized in the concrete Quantizer class, but not in QuantizerBase.
>
> This is a TE-specific optimization (weight workspace caching, CUDA graph support, etc.). Custom quantizers for other use cases might not need in-place updates:
>
>   • Research quantizers might only need one-time quantization
>   • Inference-only quantizers might not need parameter updates
>   • etc.

I don't think this is a good reason to change the API. If we want to cut corners by not implementing things (fair enough for experimentation), we can raise NotImplementedError. The problem with this approach is that we would need isinstance(quantizer, Quantizer) checks wherever we use APIs that are not in QuantizerBase, which leaks implementation details out of the class. And if it's a QuantizerBase, we'll probably end up raising an exception anyway.
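For example (hypothetical class, with the update_quantized signature assumed from the existing API):

    class OneShotResearchQuantizer(Quantizer):
        def update_quantized(self, src, dst):
            raise NotImplementedError(
                "This quantizer only supports one-time quantization."
            )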
