@pggPL pggPL commented Aug 25, 2025

Description

A negative underflow percentage was observed. The cause is that we counted only 0 in the fp8_tensor._data tensor, but -0 must be counted as well; it is represented by 128 (10000000 in binary, only the sign bit is set).
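
To illustrate why byte 128 has to be counted, here is a minimal pure-Python sketch that decodes raw FP8 bytes. It assumes the E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits); `decode_e4m3` is a hypothetical helper written for this note and is not part of Transformer Engine.

```python
import math


def decode_e4m3(byte: int) -> float:
    """Decode one raw byte as FP8 E4M3 (assumed layout: 1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN encoding in this assumed variant
    if exponent == 0:
        # Subnormal range: no implicit leading 1, fixed scale of 2**-6.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


# Hand-picked raw bytes: +0, -0, smallest positive subnormal, its negative, and 1.0.
raw = [0, 128, 1, 129, 56]
decoded = [decode_e4m3(b) for b in raw]

zeros_by_byte = sum(1 for b in raw if b == 0)         # misses the -0 encoding
zeros_by_value = sum(1 for v in decoded if v == 0.0)  # -0.0 == 0.0 is True

print(zeros_by_byte, zeros_by_value)  # prints "1 2"
```

Comparing the raw byte against 0 finds one zero, while comparing decoded values finds two, which is the gap described above.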

This PR also fixes the computation of the underflow percentage on a single device, where I forgot to include the - (x == 0).sum() term.
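
The effect of both fixes on the statistic can be sketched as follows. This is an illustration under assumptions, not the code in stats_computation.py: `underflow_pct`, its arguments, and the hand-written byte values are hypothetical, and the formula (zeros after quantization minus zeros before, divided by the element count) is inferred from the description above.

```python
def underflow_pct(original, fp8_bytes, zero_codes):
    """Percentage of elements that were nonzero before quantization but zero after.

    original   -- high-precision values (hypothetical input)
    fp8_bytes  -- raw FP8 bytes of the quantized tensor (hypothetical input)
    zero_codes -- set of byte values treated as zero
    """
    zeros_after = sum(1 for b in fp8_bytes if b in zero_codes)
    # Float comparison treats +0.0 and -0.0 alike, so both are subtracted here.
    zeros_before = sum(1 for v in original if v == 0.0)
    return 100.0 * (zeros_after - zeros_before) / len(original)


# Two exact negative zeros, two tiny values that underflow, four normal values.
original = [-0.0, -0.0, 1e-9, -1e-9, 1.0, -1.0, 0.5, 2.0]
# Illustrative raw bytes: 128 is -0, 0 is +0, the rest are nonzero encodings.
fp8_bytes = [128, 128, 0, 128, 56, 184, 48, 64]

buggy = underflow_pct(original, fp8_bytes, {0})       # counts only +0
fixed = underflow_pct(original, fp8_bytes, {0, 128})  # counts +0 and -0

print(buggy)  # -12.5
print(fixed)  # 25.0
```

With only byte 0 counted, the two original negative zeros outnumber the detected zeros and the result goes negative; counting both encodings gives 25%, matching the two elements that actually underflowed.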

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 2 commits August 25, 2025 08:11
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL
Collaborator Author

pggPL commented Aug 25, 2025

/te-ci pytorch

Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL
Collaborator Author

pggPL commented Aug 25, 2025

/te-ci pytorch

@ptrendx ptrendx requested a review from Copilot August 25, 2025 18:44

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes an issue with negative underflow percentage calculations in PyTorch FP8 debugging functionality. The problem was that the code only counted 0 values but missed -0 values (represented as 128 in FP8 format). Additionally, it corrects a missing subtraction in the underflow percentage computation.

Key changes:

  • Updated underflow detection to include both 0 and -0 values using torch.isin with [0, 128]
  • Fixed missing - (x == 0).sum() in percentage calculation
  • Updated tests to use random tensors and adjusted tolerance values

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| transformer_engine/debug/features/utils/stats_computation.py | Updated underflow detection logic to handle both positive and negative zero values |
| tests/pytorch/debug/test_log.py | Modified test to use random tensors and updated MSE tolerance |
| tests/pytorch/debug/test_api_features.py | Fixed test to use dequantized tensor for underflow calculation |

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Collaborator

@timmoon10 timmoon10 left a comment

LGTM

@pggPL
Collaborator Author

pggPL commented Sep 2, 2025

/te-ci pytorch

@pggPL pggPL merged commit 405d474 into NVIDIA:main Sep 15, 2025
21 of 23 checks passed
vthumbe1503 pushed a commit to vthumbe1503/TransformerEngine that referenced this pull request Sep 19, 2025

[PyTorch Debug] Fix issue with negative underflow% stat. (NVIDIA#2107)

* fix underflows log issue

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
phu0ngng pushed a commit to phu0ngng/TransformerEngine that referenced this pull request Sep 22, 2025
3 participants