Conversation

@BlackSamorez (Contributor) commented Jun 9, 2025:

This PR adds support for the FP-Quant method.

The goal of this PR is to integrate inference and training support for the FP-Quant method, which uses the Hadamard transform for efficient weights+activations quantization. When used with MXFP4 and MSE-based scaling, it implements the Quartet forward pass. We're also working on adding NVFP4 support and backward-pass support.

Currently, we're working on the kernels in the qutlass repo and on the integration in the fp_quant package (see the installation steps below).
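For intuition, the core idea is to rotate each weight group with a Hadamard matrix (spreading outliers across the group) before quantizing to a low-precision grid with a per-group scale. Below is a rough numpy sketch of that idea, not the qutlass kernels: real MXFP4 packs values with shared power-of-two exponents, which this fake-quantization sketch omits.

```python
import numpy as np
from scipy.linalg import hadamard

# FP4 (E2M1) magnitudes; the full grid is symmetric around zero.
FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4[:0:-1], FP4])

def hadamard_fake_quantize(w: np.ndarray, group_size: int = 32) -> np.ndarray:
    """Rotate groups of `w` with an orthonormal Hadamard matrix, then
    fake-quantize each group to the FP4 grid with a per-group abs-max
    scale. Assumes w.size is divisible by group_size (a power of two)."""
    h = hadamard(group_size) / np.sqrt(group_size)      # orthonormal rotation
    groups = w.reshape(-1, group_size) @ h.T            # spread outliers
    scales = np.abs(groups).max(axis=1, keepdims=True) / GRID.max()
    scales = np.maximum(scales, 1e-12)                  # guard all-zero groups
    idx = np.abs(groups[..., None] / scales[..., None] - GRID).argmin(axis=-1)
    return GRID[idx] * scales                           # dequantized groups
```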

Installation:

  1. Install qutlass: `git clone https://github.com/IST-DASLab/qutlass.git && cd qutlass && pip install --no-build-isolation .`
  2. Install fp_quant: `pip install fp_quant`

Usage:

  1. Use as JIT quantization from any BF16 model by passing `quantization_config=FPQuantConfig()` (see the sketch after this list).
  2. Calibrate with GPTQ using the FP-Quant repo with the `--real_quant` flag.
  3. Use pre-quantized models from hub: coming soon...
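For option 1, a minimal sketch of in-flight quantization; the model name is just a placeholder, and exact kwargs may differ from the merged API:

```python
import torch
from transformers import AutoModelForCausalLM, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",            # any BF16 model (placeholder)
    quantization_config=FPQuantConfig(),  # quantize on load
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```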

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1 (Member): cc @MekkCyber

@MekkCyber (Contributor) left a comment:

Hi @BlackSamorez! Thanks a lot for this addition 🤗! Left a few comments!

@@ -0,0 +1,49 @@

```python
# Copyright 2024 The HuggingFace Team. All rights reserved.
```

@MekkCyber (Contributor) suggested change:

```diff
-# Copyright 2024 The HuggingFace Team. All rights reserved.
+# Copyright 2025 The HuggingFace Team. All rights reserved.
```

```python
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"HIGGS through FLUTE (Flexible Lookup Table Engine for LUT-quantized LLMs) integration file"
```

@MekkCyber (Contributor) suggested change:

```diff
-"HIGGS through FLUTE (Flexible Lookup Table Engine for LUT-quantized LLMs) integration file"
+"Quartet QAT integration file"
```

Comment on lines 22 to 24:

```python
if is_torch_available():
    pass
```

@MekkCyber (Contributor): we don't need this

@@ -0,0 +1,164 @@

```python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
```

@MekkCyber (Contributor) suggested change:

```diff
-# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
```

Comment on lines 36 to 38:

```python
Quantizer of the HIGGS method. Enables the loading of prequantized models and in-flight quantization of full-precision models.
"""
```

@MekkCyber (Contributor): to be updated

Comment on lines +1163 to +1165:

```python
def is_qutlass_available():
    return _qutlass_available
```

@MekkCyber (Contributor): I can't find a distribution for qutlass. Is it not released yet?

@BlackSamorez (Contributor, Author): It has just been released: https://github.com/IST-DASLab/qutlass

Comment on lines 125 to 128:

```python
for name, module in tqdm(quartet_qat_modules.items(), desc="Pre-processing Quartet QAT modules", leave=False):
    pass
    # module.pre_forward()
```

@MekkCyber (Contributor): What's meant to happen here exactly?

Comment on lines 160 to 164:

```python
if isinstance(module, QuartetLinear) and tensor_name == "weight":
    # Only quantize weights of QuartetLinear modules that are not already quantized
    return True
else:
    return False
```

@MekkCyber (Contributor): Is the bias quantized too?

Comment on lines 96 to 97:

```python
assert isinstance(module, QuartetLinear), f"Module {param_name} is not a QuartetLinear somehow..."
```

@MekkCyber (Contributor): No need for an assert here; we can just raise an error instead.

Comment on lines 99 to 100:

```python
module.pre_forward()
```

@MekkCyber (Contributor): What's happening here?

@BlackSamorez (Contributor, Author):

  1. Hadamard transform matrix initialization on the correct devices.
  2. Since it's a QAT method, we might or might not want to keep a full-precision weight copy. If we don't need it, this function also deletes the `.weight` parameter after quantizing it. Here's the code (roughly sketched below).
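Based on that description, a hypothetical sketch of what `pre_forward` does; the real implementation lives in fp_quant, and the attribute names below are assumptions for illustration:

```python
import torch
from scipy.linalg import hadamard

class QuartetLinearSketch(torch.nn.Linear):
    """Hypothetical stand-in for fp_quant's QuartetLinear."""
    hadamard_group_size = 32       # must divide in_features
    store_master_weights = False   # True would keep the BF16 copy for QAT

    def pre_forward(self):
        # 1. Hadamard transform matrix initialization on the weight's device.
        h = torch.as_tensor(hadamard(self.hadamard_group_size), dtype=torch.float32)
        self.register_buffer("hadamard_matrix",
                             (h / self.hadamard_group_size ** 0.5).to(self.weight.device))
        # 2. Quantize the weight (plain rounding stands in for MXFP4 packing),
        #    then drop the full-precision master copy unless QAT needs it.
        w = self.weight.data.float().reshape(-1, self.hadamard_group_size)
        self.register_buffer("qweight", torch.round(w @ self.hadamard_matrix))
        if not self.store_master_weights:
            del self.weight  # frees the `.weight` parameter, as described above
```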

@SunMarc self-requested a review, June 12, 2025 15:31
@kooshi (Contributor) commented Jun 30, 2025:

Hi @BlackSamorez, I'm really looking forward to experimenting with this.

When can we expect to have the kernels public so we can begin testing, even if they are still WIP?

@BlackSamorez changed the title from "[WIP] Quartet QAT support" to "[WIP] FP-Quant support", Jul 13, 2025
@BlackSamorez (Contributor, Author):

@MekkCyber Hi, thanks for reviewing this!
It took us a while, but all the kernels necessary for inference have been published: I've updated the PR description.
May I ask you to do another pass? Your previous comments mostly don't apply anymore because of refactoring.

@SunMarc (Member) left a comment:

Thanks for iterating on the PR and congrats on the release! The only major thing missing before we merge this is some documentation for this new method! Please ping me when it's done and I'll merge the PR!

Comment on lines 1566 to 1567:

```python
store_master_weights (`bool`, *optional*, defaults to `False`):
    Whether to store the master weights.
```

@SunMarc (Member): In which context could storing master weights be useful?

@BlackSamorez (Contributor, Author): In the context of QAT, which we'll add in a later release: we're still working on the quantized backward-pass kernels. But I thought it would make sense to include this option right away so we don't have to edit the config later.
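As a hedged illustration, once QAT lands, keeping the BF16 master copy would presumably be requested through the config; the kwarg name comes from the docstring above:

```python
from transformers import FPQuantConfig

# Keep the full-precision master weights around for a future QAT run.
config = FPQuantConfig(store_master_weights=True)
```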

@BlackSamorez (Contributor, Author): Added this to the docstring.

Comment on lines 1562 to 1563:

```python
forward_method (`str`, *optional*, defaults to `"abs_max"`):
    The method to use for the forward pass.
```

@SunMarc (Member): We have `abs_max` and `quest` for this arg. Can you explain a bit what `quest` does?

@BlackSamorez (Contributor, Author): Added a docstring explanation.
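For contrast between the two options: `abs_max` maps the largest value in a group to the largest grid point, while an MSE-based rule (the family `quest` belongs to) picks the scale that minimizes round-trip error. A generic illustration of the difference, not the exact QuEST formula (see the fp_quant docstring for that):

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4[:0:-1], FP4])   # symmetric FP4 (E2M1) grid

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    idx = np.abs(x[:, None] / scale - GRID).argmin(axis=1)
    return GRID[idx] * scale

def absmax_scale(x: np.ndarray) -> float:
    # Largest value lands exactly on the largest grid point.
    return np.abs(x).max() / GRID.max()

def mse_scale(x: np.ndarray) -> float:
    # Search shrunken (clipping) scales; keep the one with the lowest MSE.
    base = absmax_scale(x)
    shrink = np.linspace(0.3, 1.0, 50)
    errs = [np.mean((x - quantize(x, s * base)) ** 2) for s in shrink]
    return base * shrink[int(np.argmin(errs))]
```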

Comment on lines 1568 to 1569:

```python
hadamard_group_size (`int`, *optional*, defaults to 32):
    The group size for the hadamard transform.
```

@SunMarc (Member): Explain a bit what this does.

@BlackSamorez (Contributor, Author): Improved the docstring.

Comment on lines 32 to 36:

```python
if is_torch_available():
    pass

if is_accelerate_available():
    pass
```

@SunMarc (Member): remove

@BlackSamorez (Contributor, Author): Done

```python
return

module.weight = torch.nn.Parameter(param_value.to(target_device))
module.pre_forward()
```

@SunMarc (Member): Really nice to put all the quantization logic there.

@HuggingFaceDocBuilderDev: The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BlackSamorez (Contributor, Author): @SunMarc added docs, improved the docstring, and cleaned the code where you asked.

@BlackSamorez (Contributor, Author): Should be good.

@SunMarc (Member) left a comment:

LGTM! Thanks for iterating!

@SunMarc enabled auto-merge (squash), July 22, 2025 15:01
@SunMarc (Member) commented Jul 22, 2025:

One last nit: the build PR documentation check is not passing:

```
    raise RuntimeError(
RuntimeError: The following files are not present in the table of contents:
- quantization/fp_quant
Add them to ../transformers/docs/source/en/_toctree.yml.
```

auto-merge was automatically disabled July 22, 2025 15:23

Head branch was pushed to by a user without write access

@BlackSamorez (Contributor, Author): Added it to the toctree.

@BlackSamorez (Contributor, Author): @SunMarc it hit a job cancellation somehow; might need a restart. It should be good.

[For maintainers] Suggested jobs to run (before merge): `run-slow: fp_quant_integration`

@SunMarc merged commit 623ab01 into huggingface:main, Jul 23, 2025. 25 checks passed.
@SunMarc (Member) commented Jul 23, 2025:

Merged! Thanks for your work!

@SunMarc (Member) commented Jul 24, 2025:

Hey @BlackSamorez, is there a way to make fp_quant compatible with py3.9? Our CI runs on this version, but fp_quant requires 3.11.

@BlackSamorez (Contributor, Author): I guess I'll have to remove the match-case constructions and it'll work. Why run on 3.9 in 2025, though?
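For reference, that kind of rewrite is mechanical; a generic illustration (not fp_quant's actual code):

```python
def describe(method: str) -> str:
    # py3.10+ version this replaces:
    #     match method:
    #         case "abs_max": return "abs-max scaling"
    #         case "quest":   return "MSE-based scaling"
    #         case _:         raise ValueError(method)
    if method == "abs_max":
        return "abs-max scaling"
    elif method == "quest":
        return "MSE-based scaling"
    raise ValueError(f"unknown method: {method}")
```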

@SunMarc (Member) commented Jul 24, 2025:

We want to make sure that the minimum maintained Python version runs transformers correctly. When it reaches EOL, we switch to the next version.

zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
* quartet

* quartet qat -> quartet

* format

* bf16 backward

* interfaces

* forward_method

* quartet -> fp_quant

* style

* List -> list

* list typing

* fixed format and annotations

* test_fp_quant

* docstrings and default dtypes

* better docstring and removed noop checks

* docs

* pseudoquantization support to test on non-blackwell

* pseudoquant

* Pseudoquant docs

* Update docs/source/en/quantization/fp_quant.md

Co-authored-by: Marc Sun <[email protected]>

* Update docs/source/en/quantization/fp_quant.md

* Update docs/source/en/quantization/fp_quant.md

* Update src/transformers/utils/quantization_config.py

Co-authored-by: Mohamed Mekkouri <[email protected]>

* Update tests/quantization/fp_quant_integration/test_fp_quant.py

Co-authored-by: Mohamed Mekkouri <[email protected]>

* Update tests/quantization/fp_quant_integration/test_fp_quant.py

Co-authored-by: Marc Sun <[email protected]>

* small test fixes

* dockerfile update

* spec link

* removed `_process_model_after_weight_loading`

* toctree

---------

Co-authored-by: Marc Sun <[email protected]>
Co-authored-by: Mohamed Mekkouri <[email protected]>