Describe the bug
While fine-tuning the MiniCPM-V-2.6 model with the swift sft command, training suddenly failed partway through with the following error:
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
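To make the watchdog message above easier to read, here is a minimal sketch that extracts its fields with a regular expression. The helper name parse_watchdog_timeout and the pattern are illustrative assumptions written against this one log line; they are not part of swift, PyTorch, or NCCL, and other PyTorch versions may format the message differently.

```python
import re

# The watchdog line copied verbatim from the error output above.
LOG_LINE = (
    "[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: "
    "[Rank 3] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, "
    "Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out."
)

# Hypothetical pattern, written only against the message format shown above.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the fields of a watchdog timeout line as a dict, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    info = {k: (v if k == "op" else int(v)) for k, v in match.groupdict().items()}
    info["timeout_minutes"] = info["timeout_ms"] / 60000
    return info


info = parse_watchdog_timeout(LOG_LINE)
print(info)
```

Read this way, the message says that the ALLREDUCE with sequence number 1250 waited out the full 1800000 ms (30 minute) timeout as observed from rank 3. The rank that reports the timeout is not necessarily the rank that stalled.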
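As I understand it, this error means that the ranks were waiting for every GPU to finish its computation so that all_reduce could run, but one GPU got stuck (its computation never completed), and the operation eventually timed out. If the data were at fault, though, the problematic samples should simply have been skipped at the loading stage. How can this kind of hang, where a GPU gets stuck and never finishes computing, be resolved?
My run command: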
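The waiting behaviour described above can be reproduced in a toy simulation that needs no GPUs. This is only a sketch under stated assumptions: Python threads stand in for ranks, a threading.Barrier stands in for the all_reduce, and WORLD_SIZE, STUCK_RANK, and TIMEOUT_S are made-up values. It does not use NCCL and says nothing about why a real rank stalls.

```python
import threading

WORLD_SIZE = 4    # hypothetical number of ranks (threads here, not GPUs)
STUCK_RANK = 2    # hypothetical rank that never reaches the collective
TIMEOUT_S = 0.5   # stand-in for the 1800000 ms collective timeout


def simulate():
    """Run one simulated collective in which a single rank never arrives."""
    barrier = threading.Barrier(WORLD_SIZE)  # plays the role of all_reduce
    release = threading.Event()              # lets the stuck rank exit afterwards
    results = {}

    def worker(rank):
        if rank == STUCK_RANK:
            # This rank is busy elsewhere and never joins the collective.
            release.wait()
            results[rank] = "stuck"
            return
        try:
            # Healthy ranks block here until every rank arrives or the timeout fires.
            barrier.wait(timeout=TIMEOUT_S)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # Raised on every waiting rank once one of them times out.
            results[rank] = "timeout"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    # Wait for the healthy ranks to give up, then let the stuck rank finish.
    for rank, t in enumerate(threads):
        if rank != STUCK_RANK:
            t.join()
    release.set()
    threads[STUCK_RANK].join()
    return results


results = simulate()
print(sorted(results.items()))
```

In this sketch every healthy rank ends in the timeout state even though only one rank stalled, which mirrors why the timeout message by itself does not identify the stalled GPU.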
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
    --model_type minicpm-v-v2_6-chat \
    --model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
    --sft_type lora \
    --dataset xxx.json \
    --save_steps 50 \
    --val_dataset xxx.json \
    --deepspeed default-zero2