[REQUEST] Support using nvidia-dlfw-inspect when the fp8_primary_weight option is enabled. #2140

@cailun01

Description

Hello, Transformer Engine team!
During MXFP8 training, I encountered an abnormal gradient norm. I would like to use nvidia-dlfw-inspect to troubleshoot it, but I ran into the following error:

RuntimeError: FP8 weights are not supported in debug mode.

Given that fp8_primary_weight is a commonly used option, could you please enable compatibility between fp8_primary_weight and nvidia-dlfw-inspect so they can be used together?

The error is raised at the following line:
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/module/base.py#L1533
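
To make the conflict concrete, here is a minimal, hypothetical model of the behavior described above. It is not the Transformer Engine code at the linked line: the names `Fp8WeightStub` and `check_debug_compat` are invented for illustration, and only the error message is taken from the error reported above.

```python
class Fp8WeightStub:
    """Hypothetical stand-in for a weight stored directly in FP8."""


def check_debug_compat(weight, debug_enabled):
    """Toy guard: reject FP8 primary weights when debug mode is on."""
    if debug_enabled and isinstance(weight, Fp8WeightStub):
        raise RuntimeError("FP8 weights are not supported in debug mode.")
    return weight


# High-precision weights pass through, even with debug mode enabled.
check_debug_compat([0.1, 0.2], debug_enabled=True)

# FP8 primary weights combined with debug mode reproduce the reported error.
try:
    check_debug_compat(Fp8WeightStub(), debug_enabled=True)
except RuntimeError as err:
    message = str(err)
```

In this sketch the guard only fires when both conditions hold, which matches what I observe: the run fails only when FP8 primary weights and the debug tool are enabled together.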

FP8 arguments used:

  --fp8-param-gather
  --fp8-recipe mxfp8
  --fp8-format e4m3
  --reuse-grad-buf-for-mxfp8-param-ag
