[PyTorch] Refactor C++ quantizer infrastructure #1952
Conversation
@@ -59,6 +59,7 @@ class Kernel {
 template <typename... ArgTs>
 void launch(int device_id, const dim3 grid_dim, const dim3 block_dim,
             unsigned int shared_mem_bytes, cudaStream_t stream, ArgTs &&...args) {
+  cuda_driver::ensure_context_exists();
This PR exposed a bug in our NVRTC infrastructure. Three facts:
- The CUDA driver maintains a thread-local stack of CUDA contexts.
- PyTorch will initialize the CUDA context if needed for jitting.
- PyTorch performs autograd on a separate thread.
By removing unnecessary `at::reciprocal`s from `create_tensor`, I experienced some cases where the backward pass launched an NVRTC kernel before launching any PyTorch ops (namely in the FP8 linear op with UB). Since the autograd thread's context stack was empty, this resulted in "invalid device context" errors.
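For reference, here is a minimal sketch of what such a context-ensuring helper can do with the CUDA driver API. This is an assumption for illustration, not the actual `cuda_driver::ensure_context_exists` implementation in Transformer Engine:

```cpp
// Hypothetical sketch: bind the device's primary context on threads whose
// context stack is empty (e.g. PyTorch's autograd thread), so that
// driver-API launches of NVRTC-compiled kernels have a valid context.
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdexcept>

namespace cuda_driver {

inline void ensure_context_exists() {
  // cuInit is cheap after the first call and required before other driver calls.
  if (cuInit(0) != CUDA_SUCCESS) {
    throw std::runtime_error("cuInit failed");
  }
  CUcontext ctx = nullptr;
  if (cuCtxGetCurrent(&ctx) != CUDA_SUCCESS) {
    throw std::runtime_error("cuCtxGetCurrent failed");
  }
  if (ctx == nullptr) {
    // No context on this thread's stack: retain and bind the primary context
    // of the current device, which is the context the CUDA runtime (and hence
    // PyTorch) uses.
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) {
      throw std::runtime_error("cudaGetDevice failed");
    }
    CUdevice cu_device = 0;
    if (cuDeviceGet(&cu_device, device) != CUDA_SUCCESS) {
      throw std::runtime_error("cuDeviceGet failed");
    }
    if (cuDevicePrimaryCtxRetain(&ctx, cu_device) != CUDA_SUCCESS) {
      throw std::runtime_error("cuDevicePrimaryCtxRetain failed");
    }
    if (cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
      throw std::runtime_error("cuCtxSetCurrent failed");
    }
  }
}

}  // namespace cuda_driver
```

Binding the primary context keeps driver-API launches on the autograd thread in the same context that PyTorch's runtime-API calls use.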
This is interesting, thanks!
LGTM.
Will save us a lot of work for NVFP4 if we rebase on this PR.
Refactor normalization.cpp to use quantizer logic introduced in #1952 instead of manual quantization (#2006)
Description
This PR makes three changes to the quantizer infrastructure in the `transformer_engine_torch` extensions (a rough interface sketch follows the list):
- Consolidate quantization logic in a `Quantizer::quantize` function. Previously this was duplicated in functions for quantization, activations, normalization, etc.
- Add a `Quantizer::convert_and_update_tensor` function, similar to "Make quantize_ respect the usages of the quantizer" (#1836).
- Change `Quantizer::create_tensor` to always return an uninitialized tensor, removing the need for an unnecessary scale reciprocal. For backward compatibility, some quantizer subclasses provide functions for creating initialized tensors.
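As a rough sketch, the base-class surface implied by these bullets might look like the following. Signatures and types here are illustrative assumptions, not copied from the actual `transformer_engine_torch` sources:

```cpp
// Illustrative sketch of a quantizer base class; the real class and its
// exact signatures live in the transformer_engine_torch C++ extension.
#include <ATen/ATen.h>
#include <vector>

class Quantizer {
 public:
  virtual ~Quantizer() = default;

  // Return an uninitialized quantized tensor for the given shape and dtype.
  // No data is written and no scale reciprocal is computed here.
  virtual at::Tensor create_tensor(const std::vector<int64_t> &shape,
                                   at::ScalarType dtype) const = 0;

  // Quantize `input` into `out`: the single code path that replaces the
  // logic previously duplicated across quantize, activation, and
  // normalization functions.
  virtual void quantize(const at::Tensor &input, at::Tensor &out) const = 0;

  // Convert an existing tensor so that it matches this quantizer's usages
  // (rowwise/columnwise), in the spirit of #1836.
  virtual at::Tensor convert_and_update_tensor(const at::Tensor &tensor) const = 0;
};
```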
Arguments for removing the `rowwise_data` arg from `Quantizer::create_tensor`
`rowwise_data` provides an option to pass an already-initialized data buffer. This was implemented to support some use-cases with attention involving QKV fusing and with the Userbuffers buffer (no longer needed after #1711). However, this design has numerous problems, e.g. the caller typically has to `dynamic_cast` the quantizer to a specific concrete class, so there is not much benefit in a generic API.

This PR removes the `rowwise_data` arg entirely from the base class, so calling `create_tensor` will create a tensor with purely uninitialized buffers. `NoneQuantizer` and `Float8Quantizer` still expose variants of `create_tensor` for providing pre-initialized buffers, with better recipe-specific logic.
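To illustrate that pattern, a hypothetical subclass could keep the generic `create_tensor` purely uninitialized while adding a recipe-specific variant for pre-initialized buffers. The class and method names below are assumptions for illustration, not the actual `NoneQuantizer`/`Float8Quantizer` code:

```cpp
// Hypothetical sketch: generic allocation stays uninitialized; a concrete
// quantizer exposes its own variant that wraps caller-owned buffers
// (e.g. a Userbuffers workspace or a slice of a fused QKV allocation).
#include <ATen/ATen.h>
#include <c10/util/Exception.h>
#include <vector>

class Float8QuantizerSketch {
 public:
  // Generic path: allocate fresh, uninitialized CUDA storage.
  at::Tensor create_tensor(const std::vector<int64_t> &shape,
                           at::ScalarType dtype) const {
    return at::empty(shape, at::TensorOptions().dtype(dtype).device(at::kCUDA));
  }

  // Recipe-specific path: reuse a pre-initialized rowwise data buffer and its
  // scale-inverse instead of a generic rowwise_data argument on the base
  // class. A real implementation would package these into an FP8 tensor
  // object; returning the data buffer keeps the sketch self-contained.
  at::Tensor create_tensor_with_data(const at::Tensor &rowwise_data,
                                     const at::Tensor &scale_inv) const {
    TORCH_CHECK(rowwise_data.is_cuda(), "expected a CUDA data buffer");
    TORCH_CHECK(scale_inv.numel() == 1, "expected a scalar scale-inverse");
    return rowwise_data;
  }
};
```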
#1950 is an alternative attempt to avoid the problems of the `rowwise_data` API in the FP8 current-scaling quantizer. #1836 adds an optional `out` arg to `Quantizer::create_tensor` and will force any provided tensor to match the quantizer's usages.

Closes #1836. Closes #1950.
Changes
- `Quantizer::create_tensor` constructs uninitialized tensors, with some sub-class variants for constructing initialized tensors
- `tex.quantize` forces quantized tensors to match the quantizer's usages
- Quantization logic is deduplicated in `Quantizer::quantize`