Unable to reproduce classification accuracy using the reference scripts #4238

Closed
netw0rkf10w opened this issue Aug 1, 2021 · 16 comments

@netw0rkf10w
Contributor

netw0rkf10w commented Aug 1, 2021

🐛 Bug

I have been trying to reproduce the reported 79.312% accuracy on ImageNet of resnext101_32x8d using the reference scripts, but I could only obtain 75%-76%. I tried two different training setups on 64 GPUs:

  • 16 nodes of 4 V100 GPUs
  • 8 nodes of 8 V100 GPUs

and obtained similar results in both cases.

To Reproduce

Clone the master branch of torchvision, then cd vision/references/classification and submit a training to 64 GPUs with arguments --model resnext101_32x8d --epochs 100.

The training logs (including std logs) are attached for your information: log.txt and
resnext101_32x8d_reproduced.log

Expected behavior

Final top-1 accuracy should be around 79%.

Environment

  • PyTorch / torchvision Version (e.g., 1.0 / 0.4.0): 1.8.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch / torchvision (conda, pip, source): pip
  • Python version: 3.8
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: V100

cc @vfdev-5

@netw0rkf10w
Contributor Author

Update: I tried ResNet-50 and observed the same issue: 73.3% final accuracy instead of 76.130%. Log file FYI: log.txt.

@netw0rkf10w
Contributor Author

Update: I was hypothesizing that the issue was caused by recent changes in the code, so I tried reverting to commit 115d2eb, but again obtained 73.3% for ResNet-50. Std logs: resnet50_115d2eb.log.

@fmassa @datumbox Would it be possible for you to check this please?

@fmassa
Member

fmassa commented Aug 12, 2021

Hi @netw0rkf10w

Sorry for the delay in replying.

IIRC the accuracies for ResNet-50 were obtained after training the model and recomputing the batch norm statistics.
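
For context, recomputing the batch-norm statistics typically means resetting the running mean/variance and re-estimating them with forward passes over training data, without updating any weights. A minimal sketch of that idea (not necessarily the exact procedure used for the released weights; the data loader and batch count are placeholders):

    import torch

    def recompute_bn_stats(model, data_loader, device, num_batches=200):
        # Reset running mean/var and switch BN to a cumulative moving average.
        for m in model.modules():
            if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
                m.reset_running_stats()
                m.momentum = None
        model.train()  # BN layers only update running stats in train mode
        with torch.no_grad():  # no gradient updates, so the weights stay untouched
            for i, (images, _) in enumerate(data_loader):
                if i >= num_batches:
                    break
                model(images.to(device))
        model.eval()
        return model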

For ResNeXt, we just report the numbers after training. FYI, here are the training logs for the resnext101_32x8d weights that we provide in torchvision: https://gist.github.com/fmassa/4ce4a8146dbbdbf6e1f9a3e0ec49e3d8 (we report results for the checkpoint at epoch 96).

@datumbox once you are back, can you try kicking off the runs for ResNet-50 and ResNeXt to double-check?

@netw0rkf10w
Contributor Author

Thanks for your reply, @fmassa!

I can already spot something from the logs: you used lr=0.4 while in the documentation (and also in the original paper) it is recommended to use lr=0.1. Maybe this is what makes the difference in the final results. Do you remember the reason you chose lr=0.4 for training?

@fmassa
Member

fmassa commented Aug 17, 2021

@netw0rkf10w I've used lr=0.4 because I trained ResNeXt on 4 nodes (for a total of 32 GPUs), so I multiplied the LR by 4 as the batch size was multiplied by 4.
As you can see, the command line used in the documentation is for a single node only, so we kept lr=0.1 there to be consistent with that setup.

@netw0rkf10w
Contributor Author

@fmassa Thanks, but that doesn't seem to be consistent with the documentation:

ResNext-101 32x8d

On 8 nodes, each with 8 GPUs (for a total of 64 GPUS)

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
    --model resnext101_32x8d --epochs 100

By the way, if I read your logs correctly, you actually trained the model on 64 GPUs, but with a batch size of 16 (instead of 32 as in the documentation), which means the effective batch size was 64*16 = 1024.

In the ResNeXt paper, they used lr=0.1 for an effective batch size of 256:

We use SGD with a mini-batch size of 256 on 8 GPUs (32 per GPU). The weight decay is 0.0001 and the momentum is 0.9. We start from a learning rate of 0.1, and divide it by 10 for three times using the schedule in [11].

Thus for a batch size of 1024 one would need to scale the learning rate to (1024/256)*0.1 = 0.4, which probably explains the value that you used. For 8 nodes, each with 8 GPUs (for a total of 64 GPUs) as stated in the documentation, it would have to be lr=0.8. I'll try this value to see if we can obtain 79% (but in any case I think the documentation would need to be updated).
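
To make the arithmetic explicit, the linear scaling rule used above can be sanity-checked with a few lines (the 0.1/256 base comes from the ResNeXt paper; the GPU counts and per-GPU batch sizes are the ones discussed in this thread):

    # Linear LR scaling rule: lr = base_lr * effective_batch / base_batch
    BASE_LR, BASE_BATCH = 0.1, 256

    def scaled_lr(num_gpus, batch_per_gpu):
        effective_batch = num_gpus * batch_per_gpu
        return BASE_LR * effective_batch / BASE_BATCH

    print(scaled_lr(8, 32))   # 256  -> 0.1 (single node, as in the documentation)
    print(scaled_lr(64, 16))  # 1024 -> 0.4 (the setup visible in the shared logs)
    print(scaled_lr(64, 32))  # 2048 -> 0.8 (64 GPUs with batch size 32 per GPU)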

@fmassa
Member

fmassa commented Aug 18, 2021

Yeah sorry, I might have mixed a few things up in my explanation. I still think we should improve the documentation, but as of now there are no easy scripts in torchvision to launch jobs on multiple nodes, which makes things a bit harder.

@prabhat00155
Contributor

prabhat00155 commented Sep 10, 2021

I was able to reproduce the results:

Acc@1 79.314 Acc@5 94.566 

Here is my output log:
resnext101_32x8d_logs.txt

Comparing the logs shared above with mine, I see that the number of workers and the world size are different:
workers=16, world_size=8 vs. workers=10, world_size=64

What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.

This is the command I ran:

srun -p train --cpus-per-task=16 -t 110:00:00 --gpus-per-node=8 python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model resnext101_32x8d --epochs 100 --output-dir logs/run2 > logs/run2/resnext101_32x8d_logs.txt 2>&1

@datumbox
Contributor

@prabhat00155 you did better than that. You can get higher accuracy by keeping the checkpoint at epoch 67:
* Acc@1 79.462 Acc@5 94.614
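
For anyone who wants to evaluate a specific epoch, here is a rough sketch of loading such a checkpoint offline (the path and the 'model' key are assumptions about how the reference script names and structures its checkpoint files):

    import torch
    import torchvision

    model = torchvision.models.resnext101_32x8d()
    checkpoint = torch.load("logs/run2/model_67.pth", map_location="cpu")
    model.load_state_dict(checkpoint["model"])  # assumes the weights are stored under a 'model' key
    model.eval()
    # ...then run the usual ImageNet validation loop on this model.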

@prabhat00155
Contributor

@prabhat00155 you did better than that. You can get higher accuracy by keeping the checkpoint at epoch 67:
* Acc@1 79.462 Acc@5 94.614

Yeah, that's right!

@netw0rkf10w
Contributor Author

I was able to reproduce the results:

Acc@1 79.314 Acc@5 94.566 

Here is my output log:
resnext101_32x8d_logs.txt

Comparing the logs shared above with mine, I see that the number of workers and the world size are different:
workers=16, world_size=8 vs. workers=10, world_size=64

What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.

This is the command I ran:

srun -p train --cpus-per-task=16 -t 110:00:00 --gpus-per-node=8 python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model resnext101_32x8d --epochs 100 --output-dir logs/run2 > logs/run2/resnext101_32x8d_logs.txt 2>&1

Thanks for the results. This confirms that 0.1 is the correct learning rate for 8 GPUs (with batch size 32). If we train on 64 GPUs (as stated in the documentation) without scaling the learning rate, we obtain results similar to those I posted above. I will soon create a PR to clarify the documentation.

@netw0rkf10w
Contributor Author

I still think we should improve the documentation, but as of now there are no easy scripts in torchvision to launch jobs on multiple nodes, which makes things a bit harder.

I think something like this would already make things clearer for users: #4390.

@NicolasHug
Member

I'll close this one since it was resolved in #4390

Resolves #4238.

@netw0rkf10w
Contributor Author

@datumbox @fmassa @NicolasHug Congratulations and thank you for your awesome work on the TorchVision with Batteries Included project!

Would it be possible for you to share the training logs for the new weights? I would like to reproduce the results using the exact same configurations as yours. Thanks a lot!

@datumbox
Contributor

@netw0rkf10w Glad you found it useful! :)

You are right, at the moment we don't provide the logs. For full transparency, here are some of the reasons:

  1. We currently don't have a good place to put them.
  2. We don't have the logs for very old models (like AlexNet) and some of those that were trained by the community (like Shufflenet v2).
  3. The raw logs can be noisy: they might be split across multiple files due to job restarts (caused by hardware issues during training), and they can lack features that made it into the final released models (such as inference-time optimizations for boosting speed or accuracy, checkpoint averaging, etc.). They might also contain sensitive info (such as usernames) that we might not want to make public.

Here are things we did in Q4 to address some of the above:

  1. We retrieved and centralized information on how past models were trained.
  2. We made this info available in the metadata of all model weights. This is exposed through the new prototype model API: example.

Using the above, you should have links to the exact commands we used to train every model refreshed as part of Batteries Included (see here). If we missed one, that's a bug and we should fix it.
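
As an illustration, the recipe link can be looked up programmatically through the weights metadata (a sketch based on the multi-weight API; at the time of writing it lived under the prototype namespace, and the exact enum and key names may differ between versions):

    from torchvision.models import ResNeXt101_32X8D_Weights

    weights = ResNeXt101_32X8D_Weights.IMAGENET1K_V2
    print(weights.meta["recipe"])  # URL of the training recipe/command used for these weights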

Here are some things we plan to do in Q1 to make the situation even better:

  1. Improve our documentation for the model Zoo and the recipes.
  2. Restructure and make public the info we have for all pretrained models.
  3. Update our model contribution guidelines to ensure we can accept pre-trained models from the community.

We certainly haven't figured out everything yet, so bear with us while we try to do so. Feedback is very welcome. :)

@netw0rkf10w
Contributor Author

@datumbox Thank you so much for the detailed response and for your transparency! In the issue that you mentioned, there appears to be enough information to reproduce the results (except maybe for one detail; let me post a question there).
