Conversation

@guyueh1 (Contributor) commented Mar 19, 2025

Fix mxfp8 columnwise data missing when switching from validation to training

Description

When we use mxfp8 to first run a few validation steps and then run a training step, the columnwise weight data in the row-wise linear layer is missing. This PR fixes it.
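For context, here is a minimal, hypothetical repro sketch of the scenario (illustrative only, not code from this PR; it assumes the MXFP8 recipe class is recipe.MXFP8BlockScaling and uses Transformer Engine's weight caching via is_first_microbatch):

```python
# Hypothetical repro sketch (not code from this PR): a validation (no-grad)
# step followed by a training step, with MXFP8 and weight caching enabled via
# is_first_microbatch. Before this fix, the training step could fail because
# the cached MXFP8 weight lacked column-wise data. Requires hardware with
# MXFP8 support; recipe.MXFP8BlockScaling is an assumed recipe class name.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.MXFP8BlockScaling()
layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(128, 1024, device="cuda")

# "Validation": forward only -> only row-wise weight data is needed and cached.
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    layer(inp, is_first_microbatch=True)

# "Training": forward + backward -> column-wise weight data is now required,
# but the cached workspace tensor from the validation step does not have it.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp, is_first_microbatch=True)
out.sum().backward()
```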


Type of change

  • Bug fix (non-breaking change which fixes an issue)


Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ksivaman (Member) left a comment


LGTM

@ksivaman (Member)

/te-ci pytorch

Guyue Huang and others added 3 commits March 19, 2025 17:05
@guyueh1 (Contributor, Author) commented Mar 20, 2025

The previous change didn't cover all the buggy cases, so I had to change it to the current approach.
Basically, fp8_workspaces is now updated whenever it finds that, for the same name, the cached tensor's usage differs from what the quantizer specifies.
I think this needs more careful review, cc @ksivaman @ptrendx

Guyue Huang added 2 commits March 19, 2025 17:16
Signed-off-by: Guyue Huang <[email protected]>
…1/TransformerEngine into fix_mxfp8_columnwise_data_missing

Signed-off-by: Guyue Huang <[email protected]>
@pggPL (Collaborator) commented Mar 20, 2025

The problem occurs when we change the quantizer usage/recipe during training. For example, in your case validation does not need gradients, but training does. We plan to handle switching the recipe, but it is not done yet.

But I'm quite confused about how you get this error. In the first training step you should run with is_first_microbatch set to True or None, and that should update the cache with the correct tensor.

@guyueh1 (Contributor, Author) commented Mar 20, 2025

> The problem occurs when we change the quantizer usage/recipe during training. For example, in your case validation does not need gradients, but training does. We plan to handle switching the recipe, but it is not done yet.
>
> But I'm quite confused about how you get this error. In the first training step you should run with is_first_microbatch set to True or None, and that should update the cache with the correct tensor.

To give an example that illustrates the bug:

We first run 1 microbatch of validation, then we run 1 microbatch of training:

  1. 1st validation iteration: is_first_microbatch = True, update_workspace = True, cache_name = "weight", quantizer.rowwise_usage = True, quantizer.columnwise_usage = False
  2. 1st training iteration: is_first_microbatch = True, update_workspace = True, cache_name = "weight", quantizer.rowwise_usage = True, quantizer.columnwise_usage = True

After step 1, self._fp8_workspace['weight'] holds a quantized mxfp8 tensor with no columnwise data. After step 2, we should in theory update the workspace object in self._fp8_workspace['weight'] in place at line 1000, so that out gets both rowwise and columnwise data. In practice this didn't happen: after tex.quantize(...), out still doesn't have columnwise data, which raised the error (a schematic sketch follows below).
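To make the failure mode concrete, here is a schematic, self-contained model of the caching logic described above (a sketch only; Quantizer and QuantizedTensor are stand-ins, not the actual Transformer Engine classes, and quantize() only mimics the old in-place-update behaviour):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Quantizer:                          # stand-in for the MXFP8 quantizer
    rowwise_usage: bool = True
    columnwise_usage: bool = True

@dataclass
class QuantizedTensor:                    # stand-in for an MXFP8 tensor
    rowwise_data: Optional[str] = None
    columnwise_data: Optional[str] = None

def quantize(weight, quantizer, out=None):
    """Model of the old behaviour: when `out` is given, only refresh the
    fields it already has and ignore what the quantizer now asks for."""
    if out is not None:
        if out.rowwise_data is not None:
            out.rowwise_data = f"q({weight})"
        if out.columnwise_data is not None:
            out.columnwise_data = f"q_t({weight})"
        return out
    return QuantizedTensor(
        rowwise_data=f"q({weight})" if quantizer.rowwise_usage else None,
        columnwise_data=f"q_t({weight})" if quantizer.columnwise_usage else None,
    )

fp8_workspace = {}

# Step 1 -- validation iteration: row-wise usage only.
q_val = Quantizer(rowwise_usage=True, columnwise_usage=False)
fp8_workspace["weight"] = quantize("W", q_val)

# Step 2 -- training iteration: column-wise data is now required, but the
# in-place update into the cached tensor never creates it.
q_train = Quantizer(rowwise_usage=True, columnwise_usage=True)
out = quantize("W", q_train, out=fp8_workspace["weight"])
assert out.columnwise_data is None        # the missing-data error described above
```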

@ptrendx (Member) commented Mar 20, 2025

Right, the problem at that line 1000 is that it passes the out parameter to the tex.quantize function, which then just assumes it is the right output and does not check what the quantizer said. There are 2 possibilities here: tex.quantize should realize that the provided output is not actually correct and

  • either add the missing fields
  • or error out

The second option would require the caller to provide the right output (probably by creating a new tensor), which would incur some slight overhead. The first option would require us to actually go over the functions (since tex.quantize is not the only one that outputs quantized tensors) and modify them all to properly handle this. I'm inclined to go with the first option for perf reasons (and will implement it; see the sketch after this comment), but wanted to hear your opinions as well @guyueh1 @ksivaman @timmoon10 @pggPL.
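A hedged sketch of what option 1 could look like, reusing the stand-in Quantizer/QuantizedTensor model from the sketch above (not the actual tex.quantize implementation): the function consults the quantizer and allocates only the pieces the provided output is missing.

```python
def quantize_fill_missing(weight, quantizer, out=None):
    """Option 1 (sketch): consult the quantizer and allocate only the pieces
    that the provided output is missing, instead of trusting `out` blindly."""
    if out is None:
        out = QuantizedTensor()
    if quantizer.rowwise_usage and out.rowwise_data is None:
        out.rowwise_data = f"q({weight})"
    if quantizer.columnwise_usage and out.columnwise_data is None:
        out.columnwise_data = f"q_t({weight})"
    return out

# With this behaviour, step 2 from the previous sketch would repair the cache:
out = quantize_fill_missing("W", q_train, out=fp8_workspace["weight"])
assert out.columnwise_data is not None
```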

@guyueh1 (Contributor, Author) commented Mar 20, 2025

For option 1, you would still update the tex.quantize(...) function to add/remove the rowwise or columnwise data fields of the output according to the quantizer, is that right?

I agree it's the right thing to do.

My PR is an alternative: whenever it finds that the saved tensor in _fp8_workspace doesn't match the quantizer (we can expand the "not matching" condition of course; right now it only checks the row-/column-wise usage), it deletes the saved tensor, and the subsequent code will create a new one and cache it (a sketch follows below). I think the overhead is the same as if you modified tex.quantize(...).
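A sketch of this eviction approach, again in terms of the stand-in model above (illustrative only; the real check presumably lives in transformer_engine/pytorch/module/base.py, which this PR touches):

```python
def usage_matches(cached, quantizer):
    """Does the cached tensor provide exactly the usages the quantizer wants?"""
    return ((cached.rowwise_data is not None) == quantizer.rowwise_usage and
            (cached.columnwise_data is not None) == quantizer.columnwise_usage)

# Recreate the stale cache entry from the validation-only step, then apply
# the eviction check before the training-step quantization.
fp8_workspace["weight"] = quantize("W", q_val)

name = "weight"
if name in fp8_workspace and not usage_matches(fp8_workspace[name], q_train):
    del fp8_workspace[name]               # evict the stale entry
if name not in fp8_workspace:
    # Re-quantize from scratch with the current quantizer and cache the result.
    fp8_workspace[name] = quantize("W", q_train)
assert fp8_workspace[name].columnwise_data is not None
```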

@ptrendx (Member) commented Mar 20, 2025

Right, your PR would have similar overhead to option 2, since that way you need to allocate the full tensor again. I'm fine with merging it as is, and I will follow up with option 1, which just allocates the missing pieces.

@guyueh1 (Contributor, Author) commented Mar 21, 2025

Can we merge this? @ksivaman

@ksivaman (Member) commented

I agree that functions like tex.quantize that produce fp8 output should give precedence to the quantizer rather than to what exists in the provided output tensor, so that's better for the longer term. For now, I will merge if CI is clean.

@ksivaman (Member)

/te-ci pytorch

Signed-off-by: Tim Moon <[email protected]>
@timmoon10 (Collaborator)

/te-ci pytorch

@guyueh1 (Contributor, Author) commented Mar 25, 2025

The L0 test seems to be failing for unrelated reasons (paged attention); how should we fix it? @timmoon10

@timmoon10 timmoon10 merged commit abbdd76 into NVIDIA:main Mar 25, 2025
10 of 11 checks passed
KshitijLakhani pushed a commit that referenced this pull request Mar 26, 2025
* Fix mxfp8 columnwise data missing when switching from validation to training

Signed-off-by: Guyue Huang <[email protected]>

* Fix when you interleave training and inference

Signed-off-by: Guyue Huang <[email protected]>

* refact

Signed-off-by: Guyue Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rm useless code

Signed-off-by: Guyue Huang <[email protected]>

* Update transformer_engine/pytorch/module/base.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: guyueh1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linter warnings

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Guyue Huang <[email protected]>
Signed-off-by: guyueh1 <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Guyue Huang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
lhb8125 pushed a commit to lhb8125/TransformerEngine that referenced this pull request Apr 8, 2025