FP-Quant support #38696
Conversation
cc @MekkCyber
Hi @BlackSamorez ! Thanks a lot for this addition 🤗 ! Left a few comments !
@@ -0,0 +1,49 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
Suggested change:
- # Copyright 2024 The HuggingFace Team. All rights reserved.
+ # Copyright 2025 The HuggingFace Team. All rights reserved.
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"HIGGS through FLUTE (Flexible Lookup Table Engine for LUT-quantized LLMs) integration file"
"HIGGS through FLUTE (Flexible Lookup Table Engine for LUT-quantized LLMs) integration file" | |
"Quartet QAT integration file" |
if is_torch_available():
    pass
we don't need this
@@ -0,0 +1,164 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
Suggested change:
- # Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+ # Copyright 2025 The HuggingFace Inc. team. All rights reserved.
Quantizer of the HIGGS method. Enables the loading of prequantized models and in-flight quantization of full-precision models.
"""
to be updated
def is_qutlass_available():
    return _qutlass_available
I can't find a distribution for qutlass, is it not released yet ?
It has just been released: https://github.com/IST-DASLab/qutlass
for name, module in tqdm(quartet_qat_modules.items(), desc="Pre-processing Quartet QAT modules", leave=False):
    pass
    # module.pre_forward()
What’s meant to happen here exactly ?
if isinstance(module, QuartetLinear) and tensor_name == "weight":
    # Only quantize weights of QuartetLinear modules that are not already quantized
    return True
else:
    return False
is the bias quantized too ?
assert isinstance(module, QuartetLinear), f"Module {param_name} is not a QuartetLinear somehow..."
no need for assert here, or we can just raise an error instead
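For instance, a minimal sketch of that suggestion, reusing the names from the diff above:

```python
if not isinstance(module, QuartetLinear):
    raise ValueError(
        f"Module {param_name} is expected to be a QuartetLinear, got {type(module).__name__}"
    )
```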
module.pre_forward()
what's happening here ?
- Hadamard transform matrix initialization on the correct devices.
- Since it's a QAT method, we might or might not want to keep a full-precision weight copy. If we don't need the full-precision weight copy, this function also deletes the `.weight` parameter after quantizing it. Here's the code.
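The linked code isn't reproduced here; below is a rough illustrative sketch of what such a `pre_forward` could look like. The helper logic and attribute names are assumptions, not the actual fp_quant implementation, and absmax pseudo-quantization stands in for the real MXFP4 kernels:

```python
import torch
from scipy.linalg import hadamard  # power-of-two sizes only

class QuartetLinear(torch.nn.Linear):
    def pre_forward(self, group_size: int = 32, store_master_weights: bool = False):
        # Hadamard transform matrix, built directly on the weight's device so it
        # lands on the right GPU under device_map loading.
        h = torch.tensor(hadamard(group_size), dtype=self.weight.dtype, device=self.weight.device)
        self.register_buffer("hadamard_matrix", h / group_size**0.5)

        # One-off weight quantization (absmax pseudo-quantization as a stand-in).
        w = self.weight.detach()
        self.scales = w.abs().amax(dim=-1, keepdim=True)
        self.qweight = (w / self.scales).to(torch.float8_e4m3fn)

        # It's a QAT method: keep the full-precision copy only if training needs it.
        if not store_master_weights:
            del self.weight
```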
Hi @BlackSamorez, I'm really looking forward to experimenting with this. When can we expect to have the kernels public so we can begin testing, even if they are still WIP?
@MekkCyber Hi, thanks for reviewing this!
Thanks for iterating on the PR and congrats on the release ! The only major thing missing before we merge this is some documentation for this new method ! Please ping me when it's done and I'll merge the PR !
store_master_weights (`bool`, *optional*, defaults to `False`):
    Whether to store the master weights.
in which context storing master weights could be useful ?
In the context of QAT, which we'll add in a later release: we're still working on the quantized backward pass kernels. But I thought it would make sense to include this option right away to not have to edit the config later.
Added this to docstring
forward_method (`str`, *optional*, defaults to `"abs_max"`):
    The method to use for the forward pass.
we have `absmax` and `quest` for this arg. can you explain a bit what quest does ?
Added docstring explanation
hadamard_group_size (`int`, *optional*, defaults to 32):
    The group size for the hadamard transform.
explain a bit what this does
Improved docstring
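Putting the three arguments discussed above together, a hedged construction example (values shown are the documented defaults):

```python
from transformers import FPQuantConfig

config = FPQuantConfig(
    store_master_weights=False,  # keep bf16 master weights only for (future) QAT
    forward_method="abs_max",    # or "quest" for QuEST-style activation scaling
    hadamard_group_size=32,      # group size the Hadamard transform is applied over
)
```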
if is_torch_available():
    pass

if is_accelerate_available():
    pass
remove
Done
    return

module.weight = torch.nn.Parameter(param_value.to(target_device))
module.pre_forward()
really nice to put all the quantization logic there
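For context, a hedged sketch of how that hook fits together in a `create_quantized_param`-style method (signature simplified; `get_module_from_name` is the transformers helper, the rest mirrors the diff above rather than the exact fp_quant code):

```python
import torch
from transformers.quantizers.quantizers_utils import get_module_from_name

def create_quantized_param(self, model, param_value, param_name, target_device, *args):
    module, tensor_name = get_module_from_name(model, param_name)
    if not isinstance(module, QuartetLinear):
        raise ValueError(f"{param_name} does not belong to a QuartetLinear module")
    # Move the full-precision weight to its target device, then let the module
    # quantize itself (Hadamard init + weight quantization in pre_forward).
    module.weight = torch.nn.Parameter(param_value.to(target_device))
    module.pre_forward()
```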
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@SunMarc added docs, improved docstring, cleaned the code where you asked. Should be good.
LGTM ! Thanks for iterating !
One last nit, the build PR documentation is not passing:
Head branch was pushed to by a user without write access
Added it to toctree
@SunMarc it hit job cancellation somehow. Might need a restart. It should be good.
[For maintainers] Suggested jobs to run (before merge): run-slow: fp_quant_integration
Merged ! Thanks for your work
Hey @BlackSamorez, is there a way to make fp_quant compatible with py3.9 ? Our CI runs on this version but fp_quant requires 3.11
I guess I'll have to remove match-case constructions and it'll work.
We want to make sure that the minimum version of Python that is maintained runs transformers correctly. When it reaches EOL, we switch to the next version.
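For illustration, the kind of rewrite that implies (a generic example, not actual fp_quant code): Python 3.10+ match statements don't even parse on 3.9, so they become if/elif chains:

```python
# Python 3.10+ only -- a file containing this fails to parse on 3.9
def format_bits(fmt: str) -> int:
    match fmt:
        case "mxfp4" | "nvfp4":
            return 4
        case "bf16":
            return 16
        case _:
            raise ValueError(f"unknown format: {fmt}")

# Python 3.9-compatible equivalent
def format_bits_py39(fmt: str) -> int:
    if fmt in ("mxfp4", "nvfp4"):
        return 4
    if fmt == "bf16":
        return 16
    raise ValueError(f"unknown format: {fmt}")
```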
* quartet
* quartet qat -> quartet
* format
* bf16 backward
* interfaces
* forward_method
* quartet -> fp_quant
* style
* List -> list
* list typing
* fixed format and annotations
* test_fp_quant
* docstrings and default dtypes
* better docstring and removed noop checks
* docs
* pseudoquantization support to test on non-blackwell
* pseudoquant
* Pseudoquant docs
* Update docs/source/en/quantization/fp_quant.md (Co-authored-by: Marc Sun <[email protected]>)
* Update docs/source/en/quantization/fp_quant.md
* Update docs/source/en/quantization/fp_quant.md
* Update src/transformers/utils/quantization_config.py (Co-authored-by: Mohamed Mekkouri <[email protected]>)
* Update tests/quantization/fp_quant_integration/test_fp_quant.py (Co-authored-by: Mohamed Mekkouri <[email protected]>)
* Update tests/quantization/fp_quant_integration/test_fp_quant.py (Co-authored-by: Marc Sun <[email protected]>)
* small test fixes
* dockerfile update
* spec link
* removed `_process_model_after_weight_loading`
* toctree
---------
Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Mohamed Mekkouri <[email protected]>
This PR adds support for the FP-Quant method.
The goal of this PR is to integrate inference and training support for the FP-Quant method, which utilizes the Hadamard Transform for efficient weights+activations quantization. When used with MXFP4 and MSE-based scaling, it implements the Quartet forward pass. We're also working on adding NVFP4 support and backward pass support.
Currently, we're working on the kernels in qutlass, and the integration in the fp_quant package.
Installation:
qutlass: git clone https://github.com/IST-DASLab/qutlass.git && cd qutlass && pip install --no-build-isolation .
fp_quant: pip install fp_quant
Usage:
quantization_config=FPQuantConfig()
--real_quant
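A minimal usage sketch, assuming a generic causal-LM checkpoint (the model name is a placeholder, not part of this PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",            # placeholder checkpoint
    quantization_config=FPQuantConfig(),  # in-flight MXFP4 quantization
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
inputs = tokenizer("Hadamard transforms are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```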
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.