
NCCL watchdog thread terminated with exception partway through training #1817

@Wuyingwen

Description

Describe the bug
When fine-tuning the MiniCPM-V-2.6 model with the swift sft command, training suddenly fails partway through with:
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
As I understand it, the error means the process kept waiting for one GPU to finish computing so the all_reduce could run, but that GPU never completed its computation and the collective eventually timed out. If the problem were bad data, it should be possible to skip the problematic samples at loading time; how do I deal with a case where a GPU simply hangs during computation?
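For reference, the 1800000 ms in the log is PyTorch's default 30-minute NCCL collective timeout. A minimal sketch, assuming a plain torch.distributed setup rather than the swift launcher (which creates the process group itself): raising the timeout and enabling NCCL logging would at least distinguish a genuinely hung rank from a step that is merely slow (e.g. a long evaluation or checkpoint save).

```python
# Hedged sketch (plain PyTorch DDP, not the swift CLI): raise the NCCL collective
# timeout above the 30-minute default seen in the log, and enable per-rank NCCL
# logging so the rank that never reaches the all_reduce can be identified.
import datetime
import os

import torch.distributed as dist

# NCCL reads this when the communicator is created; INFO prints per-rank logs.
os.environ.setdefault("NCCL_DEBUG", "INFO")

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default is 30 minutes (1800000 ms)
)
```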
My run command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
    --model_type minicpm-v-v2_6-chat \
    --model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
    --sft_type lora \
    --dataset xxx.json \
    --save_steps 50 \
    --val_dataset xxx.json \
    --deepspeed default-zero2

torch version: 2.1.2+cu118
Mid-training state: [screenshot omitted]
