
NCCL watchdog thread terminated with exception partway through training #1817

@Wuyingwen

Description

Describe the bug
When fine-tuning the MiniCPM-V-2.6 model with the swift sft command, training suddenly fails partway through with:
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
As I understand it, the error means the process kept waiting for one GPU to finish computing so the all_reduce could run, but that GPU never completed its computation and the collective eventually timed out. If the problem were bad data, it should be possible to skip the problematic samples at loading time; how do I deal with a case where a GPU simply hangs during computation?
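For reference, the 1800000 ms in the log is PyTorch's default 30-minute NCCL collective timeout. A minimal sketch, assuming a plain torch.distributed setup rather than the swift launcher (which creates the process group itself): raising the timeout and enabling NCCL logging would at least distinguish a genuinely hung rank from a step that is merely slow (e.g. a long evaluation or checkpoint save).

```python
# Hedged sketch (plain PyTorch DDP, not the swift CLI): raise the NCCL collective
# timeout above the 30-minute default seen in the log, and enable per-rank NCCL
# logging so the rank that never reaches the all_reduce can be identified.
import datetime
import os

import torch.distributed as dist

# NCCL reads this when the communicator is created; INFO prints per-rank logs.
os.environ.setdefault("NCCL_DEBUG", "INFO")

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default is 30 minutes (1800000 ms)
)
```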
My run command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
    --model_type minicpm-v-v2_6-chat \
    --model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
    --sft_type lora \
    --dataset xxx.json \
    --save_steps 50 \
    --val_dataset xxx.json \
    --deepspeed default-zero2

torch version: 2.1.2+cu118
Mid-training state: [screenshot omitted]
