
[PyTorch] Quantizer as API #2039

Open

wants to merge 2 commits into main
Conversation

negvet (Collaborator) commented Aug 7, 2025

Description

Expose the quantizer as a public API.
The main objective is to let users build custom quantizers.

Currently, Quantizer contains TE-specific logic (quantize() with an autograd function, calibrate(), _get_compatible_recipe()).
I propose to extract the most generic interfaces/implementations into QuantizerBase and expose it as a first-class API.

Usage example:

from transformer_engine.pytorch import QuantizerBase

class MyCustomQuantizer(QuantizerBase):
    def quantize(self, tensor, **kwargs):
        # Custom quantization logic, e.g. Python-based or targeting custom silicon
        pass

Custom quantizers can then be used:

  • Externally (see the sketch below)
  • In TE, after providing some TE-required implementations (such as update_quantized())
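For illustration, an external quantizer might look like the following. This is a hypothetical sketch: the int8 fake-quantization logic is an assumption made for the example, and only the quantize() signature comes from this PR.

    import torch

    from transformer_engine.pytorch import QuantizerBase

    class Int8FakeQuantizer(QuantizerBase):
        """Hypothetical pure-Python quantizer for experimentation."""

        def quantize(self, tensor: torch.Tensor, **kwargs) -> torch.Tensor:
            # Symmetric per-tensor int8 fake-quantization (illustrative only).
            scale = tensor.abs().max().clamp(min=1e-12) / 127.0
            q = torch.clamp(torch.round(tensor / scale), -127, 127)
            return q * scale  # dequantize back to the original dtype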

Type of change

  • [ ] Documentation change (change only to the documentation, either a fix or new content)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Infra/Build change
  • [ ] Code refactoring

Changes

Please list the changes introduced in this PR:

  • Introduced QuantizerBase
  • Exposed QuantizerBase as an API

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

negvet (Collaborator, Author) commented Aug 7, 2025

/te-ci pytorch

negvet requested review from timmoon10 and ptrendx · August 7, 2025 10:14
negvet (Collaborator, Author) commented Aug 7, 2025

On the inconsistencies around rowwise_usage and columnwise_usage:

Naming: the parameters are named rowwise and columnwise, but the instance attributes are rowwise_usage and columnwise_usage. In addition, the meaning of "usage" is unclear.
Semantic ambiguity: rowwise_usage and columnwise_usage are currently independent boolean flags, even though they describe two aspects of the same tensor layout and are semantically close.

I propose to reconsider this design.

Renaming alone would already improve the situation.
Another option is to move toward the JAX implementation (enum-based):

from enum import Enum

class QuantizeLayout(Enum):
    ROWWISE = "rowwise"
    COLUMNWISE = "columnwise"
    ROWWISE_COLWISE = "both"

Although this introduces a new class for users to learn.
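For illustration, a call site might then take a single layout argument instead of the two booleans (the quantizer name and the layout parameter are hypothetical):

    # Hypothetical: one layout parameter instead of rowwise/columnwise booleans.
    quantizer = MyQuantizer(layout=QuantizeLayout.ROWWISE_COLWISE)
    needs_columnwise = quantizer.layout in (
        QuantizeLayout.COLUMNWISE,
        QuantizeLayout.ROWWISE_COLWISE,
    )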

negvet (Collaborator, Author) commented Aug 7, 2025

@ptrendx I propose to keep update_quantized in the concrete Quantizer class, but not in QuantizerBase.

This is a TE-specific optimization (weight workspace caching, CUDA graph support, etc.).
Custom quantizers for other use cases might not need in-place updates (see the sketch after this list):

  • Research quantizers might only need one-time quantization
  • Inference-only quantizers might not need parameter updates
  • etc.
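A minimal sketch of that split (the method signatures here are illustrative assumptions):

    from abc import ABC, abstractmethod

    class QuantizerBase(ABC):
        """Generic interface exposed as a first-class API."""

        @abstractmethod
        def quantize(self, tensor, **kwargs):
            ...

    class Quantizer(QuantizerBase):
        """Concrete TE quantizer; keeps TE-specific machinery."""

        def update_quantized(self, src, dst):
            # TE-specific: weight workspace caching, CUDA graph support, etc.
            ...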

negvet marked this pull request as ready for review · August 7, 2025 10:39
ksivaman self-requested a review · August 7, 2025 15:57
timmoon10 (Collaborator) commented Aug 7, 2025

  • I agree "usages" is not the best term for the concept we're describing, but we use it consistently throughout the codebase. If we find something better, we should properly document it and change it everywhere.
  • There's a semantic difference between "row-wise/column-wise usage" and "row-wise/column-wise data". "Row-wise/column-wise usages" indicates intent, and each usage is completely orthogonal since you may or may not use the same tensor for multiple operations. "Row-wise/column-wise data" is not orthogonal and is highly recipe-dependent, since some buffers can be used for multiple usages, e.g. FP8 data on Blackwell.
  • We should keep in mind that we might add more usages in the future. We are seriously considering adding usages for communication. For example, the FP8 wgrad GEMM currently does an all-gather followed by transpose. Instead of quantizing with row-wise usage, it would be more natural to quantize with all-gather-column-wise usage. We have similar considerations if we want to support MXFP8 with pre-swizzled scales.
  • The enum approach does not scale, since the number of enum members grows as 2^n with the number of usages (see the illustrative aside after this list).
  • Who knows what GEMM assumptions future architectures will have? Row-wise and column-wise currently rely on the fact that Hopper/Blackwell Tensor Cores use the same data format for A and B in a TN GEMM. Could we require 4 usages in the future (A, A^T, B, B^T)? What if we need to support convolutions?
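(An illustrative aside, not something proposed in this thread: Python's enum.Flag can represent n independent usages, and their combinations, without enumerating all 2^n members, which is essentially what the current boolean flags already do.)

    from enum import Flag, auto

    class Usage(Flag):
        ROWWISE = auto()
        COLUMNWISE = auto()
        # Future usages would compose freely, e.g. ALLGATHER_COLUMNWISE = auto()

    usage = Usage.ROWWISE | Usage.COLUMNWISE
    assert Usage.ROWWISE in usage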

timmoon10 (Collaborator) left a comment

Exposing a pure Python interface will be quite nice. Ideally we'd design the quantizers in such a way that a C++ or Python impl is an implementation detail, and we don't need any special logic in the modules.

One question is how to handle the C++ quantizer infrastructure, e.g. in tex.quantize. Options:

  • Only use the C++ quantizer as a perf optimization. We'll need to add checks to avoid passing a pure-Python quantizer into tex functions (e.g. for norms or activations); see the guard sketch below.
  • Add a C++ quantizer that calls a Python function (or modify the C++ quantizer base class). The C++ quantizer has to deal with both Python tensor classes (as pybind objects) and NVTETensor, and handling NVTETensors from Python will be challenging.

I think the pure Python approach in this PR is more straightforward.
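A sketch of the guard from the first option (the helper name is hypothetical, and the tex.quantize call is simplified):

    def _quantize(inp, quantizer):
        # Fast path: existing C++ quantizer infrastructure.
        if isinstance(quantizer, Quantizer):
            return tex.quantize(inp, quantizer)
        # Fallback: pure-Python QuantizerBase implementation.
        return quantizer.quantize(inp)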

self.columnwise_usage = columnwise


class Quantizer(QuantizerBase):
timmoon10 (Collaborator) commented:

I don't think the distinction between Quantizer and QuantizerBase is logical. This PR is trying to distinguish between quantizers that call tex.quantize and those that call a Python impl, but that's a quantizer-specific implementation detail. QuantizerBase also haphazardly removes parts of the quantizer API, like the ability to construct empty tensors, PyTorch autograd support, etc.

I think the right design is not to add an unnecessary QuantizerBase class, but to decouple Quantizer from tex.quantize. We can add an abstract quantize_impl function that is called within quantize. The existing quantizers should call tex.quantize, but future quantizers could use a pure Python impl.
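A rough sketch of that design (simplified: the real quantize() also handles autograd integration and usage bookkeeping, and the tex.quantize call signature is abbreviated):

    from abc import ABC, abstractmethod

    class Quantizer(ABC):
        def quantize(self, tensor, **kwargs):
            # Shared logic (autograd hooks, usage checks, ...) lives here.
            return self.quantize_impl(tensor, **kwargs)

        @abstractmethod
        def quantize_impl(self, tensor, **kwargs):
            ...

    class Float8Quantizer(Quantizer):  # e.g. an existing TE quantizer
        def quantize_impl(self, tensor, **kwargs):
            return tex.quantize(tensor, self)  # existing C++ path

    class MyPythonQuantizer(Quantizer):
        def quantize_impl(self, tensor, **kwargs):
            ...  # pure-Python implementation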

timmoon10 (Collaborator) commented Aug 7, 2025

> @ptrendx I propose to keep update_quantized in the concrete Quantizer class, but not in QuantizerBase.
>
> This is a TE-specific optimization (weight workspace caching, CUDA graph support, etc.). Custom quantizers for other use cases might not need in-place updates:
>
>   • Research quantizers might only need one-time quantization
>   • Inference-only quantizers might not need parameter updates
>   • etc.

I don't think this is a good reason to change the API. If we want to cut corners by not implementing things (fair enough for experimentation), we can raise NotImplementedError. The problem with this approach is that we would need isinstance(quantizer, Quantizer) checks wherever we use APIs that are not in QuantizerBase, which leaks implementation details out of the class. And if it's a QuantizerBase, we'll probably end up raising an exception anyway.
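For example (hypothetical class, with the update_quantized signature assumed from the existing API):

    class OneShotResearchQuantizer(Quantizer):
        def update_quantized(self, src, dst):
            raise NotImplementedError(
                "This quantizer only supports one-time quantization."
            )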
