Conversation

HuiGao-NV
Collaborator

@HuiGao-NV HuiGao-NV commented Jun 13, 2025

Add a debug hook that supports dumping tensor data and makes it easy to add new debug functions.

To enable tensor dumping:
from tensorrt_llm._torch.debug.debug_hook import debugger_addon, register_tensor_dump_hook

with debugger_addon(model, DATA_FOLDER):
    register_tensor_dump_hook()
    model.forward()

The dumped data is written under DATA_FOLDER/rank[ID]/....
Each data file name follows the pattern:
[LOOP_COUNT].[model_name]-[OPIDX_IN_MODEL].[OPNAME]-[OPIDX_IN_PRE_OP].[OPNAME]-[input|output].[PARA_NAME].pt
such as 1.LlamaModel-24.LlamaDecoderLayer-2.LlamaAttention-2.Linear-1.AllReduce-input.input.pt.
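
When comparing dumps across runs or ranks, it can help to split a file name back into its parts. The following is a minimal sketch of such a parser, not part of this PR: `DumpName` and `parse_dump_name` are hypothetical names, and the code assumes that model and op names contain neither `-` nor `.` and that the parameter name contains no `.`. If real dumps violate those assumptions, the splitting logic would need to change.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DumpName:
    """Parsed pieces of a dump file name (hypothetical helper type)."""
    loop_count: int
    model_name: str
    op_path: List[Tuple[int, str]]  # (index within parent op, op name), outermost first
    direction: str                  # "input" or "output"
    param_name: str


def parse_dump_name(file_name: str) -> DumpName:
    """Split a name of the form
    [LOOP_COUNT].[model_name]-[IDX].[OPNAME]-...-[input|output].[PARA_NAME].pt
    Assumes names contain no '-' or '.', which the PR does not guarantee.
    """
    if not file_name.endswith(".pt"):
        raise ValueError(f"expected a .pt file, got {file_name!r}")
    stem = file_name[: -len(".pt")]

    # Leading loop counter, then everything else.
    loop_str, rest = stem.split(".", 1)
    # Trailing parameter name.
    rest, param_name = rest.rsplit(".", 1)
    # Trailing direction marker.
    rest, direction = rest.rsplit("-", 1)
    if direction not in ("input", "output"):
        raise ValueError(f"unexpected direction {direction!r} in {file_name!r}")

    # What remains is "model-idx.op-idx.op-...".
    model_name, *segments = rest.split("-")
    op_path = []
    for segment in segments:
        idx_str, op_name = segment.split(".", 1)
        op_path.append((int(idx_str), op_name))

    return DumpName(int(loop_str), model_name, op_path, direction, param_name)


example = "1.LlamaModel-24.LlamaDecoderLayer-2.LlamaAttention-2.Linear-1.AllReduce-input.input.pt"
parsed = parse_dump_name(example)
print(parsed)
```

With the example name from the description, `parsed.op_path` ends in `(1, "AllReduce")` and `parsed.direction` is `"input"`, so dumps can be grouped or sorted by loop count and op position.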

@HuiGao-NV HuiGao-NV requested review from hlu1 and QiJune June 13, 2025 03:35
@HuiGao-NV HuiGao-NV requested a review from a team as a code owner June 13, 2025 03:35
@HuiGao-NV HuiGao-NV force-pushed the debug_hook branch 2 times, most recently from 8af04fa to 8dbc872 Compare June 13, 2025 05:44
@HuiGao-NV HuiGao-NV requested a review from juney-nvidia June 14, 2025 12:23
@HuiGao-NV HuiGao-NV force-pushed the debug_hook branch 3 times, most recently from c6f2d38 to 0c533ab Compare June 19, 2025 08:36
@HuiGao-NV HuiGao-NV requested a review from QiJune June 20, 2025 00:41
HuiGao-NV added 11 commits June 24, 2025 09:26
Add context manager method to enable debugger

Signed-off-by: Hui Gao <[email protected]>
@HuiGao-NV
Collaborator Author

/bot skip --comment="New code and has no impact to existing code"

@tensorrt-cicd
Collaborator

PR_Github #9678 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9678 [ skip ] completed with state SUCCESS
Skipping testing for commit 2780ca3

@HuiGao-NV HuiGao-NV merged commit 35a92f6 into NVIDIA:main Jun 24, 2025
3 checks passed
@HuiGao-NV HuiGao-NV deleted the debug_hook branch June 24, 2025 09:45
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 9, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025
3 participants