Unable to reproduce classification accuracy using the reference scripts #4238

Closed
netw0rkf10w opened this issue Aug 1, 2021 · 16 comments

@netw0rkf10w
Contributor

netw0rkf10w commented Aug 1, 2021

🐛 Bug

I have been trying to reproduce the reported 79.312% accuracy on ImageNet of resnext101_32x8d using the reference scripts, but I could only obtain 75%-76%. I tried two different training setups on 64 GPUs:

  • 16 nodes of 4 V100 GPUs
  • 8 nodes of 8 V100 GPUs

and obtained similar results in both cases.

To Reproduce

Clone the master branch of torchvision, then cd vision/references/classification and submit a training to 64 GPUs with arguments --model resnext101_32x8d --epochs 100.

The training logs (including std logs) are attached for your information: log.txt and
resnext101_32x8d_reproduced.log

Expected behavior

Final top-1 accuracy should be around 79%.

Environment

  • PyTorch / torchvision Version (e.g., 1.0 / 0.4.0): 1.8.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch / torchvision (conda, pip, source): pip
  • Python version: 3.8
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: V100

cc @vfdev-5

@netw0rkf10w
Contributor Author

Update: I tried ResNet-50 and observed the same issue: 73.3% final accuracy instead of 76.130%. Log file FYI: log.txt.

@netw0rkf10w
Contributor Author

Update: I was hypothesizing that the issue was caused by recent changes in the code, so I tried reverting to commit 115d2eb, but again obtained 73.3% for ResNet-50. Std logs: resnet50_115d2eb.log.

@fmassa @datumbox Would it be possible for you to check this please?

@fmassa
Member

fmassa commented Aug 12, 2021

Hi @netw0rkf10w

Sorry for the delay in replying.

IIRC the accuracies for ResNet-50 were obtained after training the model and recomputing the batch norm statistics.
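
For context, recomputing the batch-norm statistics typically means resetting the running mean/variance and re-estimating them with forward passes over training data, without updating any weights. A minimal sketch of that idea (not necessarily the exact procedure used for the released weights; the data loader and batch count are placeholders):

    import torch

    def recompute_bn_stats(model, data_loader, device, num_batches=200):
        # Reset running mean/var and switch BN to a cumulative moving average.
        for m in model.modules():
            if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
                m.reset_running_stats()
                m.momentum = None
        model.train()  # BN layers only update running stats in train mode
        with torch.no_grad():  # no gradient updates, so the weights stay untouched
            for i, (images, _) in enumerate(data_loader):
                if i >= num_batches:
                    break
                model(images.to(device))
        model.eval()
        return model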

For ResNeXt, we just report the numbers after training. FYI, here are the training logs for the resnext101_32x8d weights that we provide in torchvision: https://gist.github.com/fmassa/4ce4a8146dbbdbf6e1f9a3e0ec49e3d8 (we report results for the checkpoint at epoch 96).

@datumbox once you are back, can you try kicking off the runs for ResNet-50 and ResNeXt to double-check?

@netw0rkf10w
Contributor Author

Thanks for your reply, @fmassa!

I can already spot something from the logs: you used lr=0.4 while in the documentation (and also in the original paper) it is recommended to use lr=0.1. Maybe this is what makes the difference in the final results. Do you remember the reason you chose lr=0.4 for training?

@fmassa
Member

fmassa commented Aug 17, 2021

@netw0rkf10w I've used lr=0.4 because I trained ResNeXt on 4 nodes (for a total of 32 GPUs), so I multiplied the LR by 4 as the batch size was multiplied by 4.
As you can see, the command line used in the documentation is for a single node only, so we kept lr=0.1 there to be consistent with that setup.

@netw0rkf10w
Contributor Author

@fmassa Thanks, but that doesn't seem to be consistent with the documentation:

ResNext-101 32x8d

On 8 nodes, each with 8 GPUs (for a total of 64 GPUS)

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py\
    --model resnext101_32x8d --epochs 100

By the way, if I read your logs correctly, you actually trained the model on 64 GPUs, but with a batch size of 16 (instead of 32 as in the documentation), which means the effective batch size was 64*16 = 1024.

In the ResNeXt paper, they used lr=0.1 for an effective batch size of 256:

We use SGD with a mini-batch size of 256 on 8 GPUs (32 per GPU). The weight decay is 0.0001 and the momentum is 0.9. We start from a learning rate of 0.1, and divide it by 10 for three times using the schedule in [11].

Thus for a batch size of 1024 one would need to scale the learning rate to (1024/256)*0.1 = 0.4, which probably explains the value that you used. For 8 nodes, each with 8 GPUs (for a total of 64 GPUs) as stated in the documentation, it would have to be lr=0.8. I'll try this value to see if we can obtain 79% (but in any case I think the documentation would need to be updated).
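
To make the arithmetic explicit, the linear scaling rule used above can be sanity-checked with a few lines (the 0.1/256 base comes from the ResNeXt paper; the GPU counts and per-GPU batch sizes are the ones discussed in this thread):

    # Linear LR scaling rule: lr = base_lr * effective_batch / base_batch
    BASE_LR, BASE_BATCH = 0.1, 256

    def scaled_lr(num_gpus, batch_per_gpu):
        effective_batch = num_gpus * batch_per_gpu
        return BASE_LR * effective_batch / BASE_BATCH

    print(scaled_lr(8, 32))   # 256  -> 0.1 (single node, as in the documentation)
    print(scaled_lr(64, 16))  # 1024 -> 0.4 (the setup visible in the shared logs)
    print(scaled_lr(64, 32))  # 2048 -> 0.8 (64 GPUs with batch size 32 per GPU)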

@fmassa
Member

fmassa commented Aug 18, 2021

Yeah sorry, I might have mixed a few things up in my explanation. I still think we should improve the documentation, but as of now there are no easy scripts in torchvision to launch jobs on multiple nodes, which makes things a bit harder.

@prabhat00155
Contributor

prabhat00155 commented Sep 10, 2021

I was able to reproduce the results:

Acc@1 79.314 Acc@5 94.566 

Here is my output log:
resnext101_32x8d_logs.txt

Comparing the logs shared above with mine, I see that the number of workers and the world size are different:
workers=16, world_size=8 vs. workers=10, world_size=64

What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.

This is the command I ran:

srun -p train --cpus-per-task=16 -t 110:00:00 --gpus-per-node=8 python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model resnext101_32x8d --epochs 100 --output-dir logs/run2 > logs/run2/resnext101_32x8d_logs.txt 2>&1

@datumbox
Contributor

@prabhat00155 you did better than that. You can get higher accuracy by keeping the checkpoint at epoch 67:
* Acc@1 79.462 Acc@5 94.614
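
For anyone who wants to evaluate a specific epoch, here is a rough sketch of loading such a checkpoint offline (the path and the 'model' key are assumptions about how the reference script names and structures its checkpoint files):

    import torch
    import torchvision

    model = torchvision.models.resnext101_32x8d()
    checkpoint = torch.load("logs/run2/model_67.pth", map_location="cpu")
    model.load_state_dict(checkpoint["model"])  # assumes the weights are stored under a 'model' key
    model.eval()
    # ...then run the usual ImageNet validation loop on this model.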

@prabhat00155
Contributor

@prabhat00155 you did better than that. You can get higher accuracy by keeping the checkpoint at epoch 67:
* Acc@1 79.462 Acc@5 94.614

Yeah, that's right!

@netw0rkf10w
Contributor Author

I was able to reproduce the results:

Acc@1 79.314 Acc@5 94.566 

Here is my output log:
resnext101_32x8d_logs.txt

Comparing the logs shared above with mine, I see that the number of workers and the world size are different:
workers=16, world_size=8 vs. workers=10, world_size=64

What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.

This is the command I ran:

srun -p train --cpus-per-task=16 -t 110:00:00 --gpus-per-node=8 python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model resnext101_32x8d --epochs 100 --output-dir logs/run2 > logs/run2/resnext101_32x8d_logs.txt 2>&1

Thanks for the results. This confirms that 0.1 is the correct learning rate for 8 GPUs (with batch size 32). If we train on 64 GPUs (as stated in the documentation) without scaling the learning rate, we obtain results similar to those I posted above. I will soon create a PR to clarify the documentation.

@netw0rkf10w
Contributor Author

I still think we should improve the documentation, but as of now there are no easy scripts in torchvision to launch jobs on multiple nodes, which makes things a bit harder.

I think something like this would already make things clearer for users: #4390.

@NicolasHug
Member

I'll close this one since it was resolved in #4390

Resolves #4238.

@netw0rkf10w
Contributor Author

@datumbox @fmassa @NicolasHug Congratulations and thank you for your awesome work on the TorchVision with Batteries Included project!

Would it be possible for you to share the training logs for the new weights? I would like to reproduce the results using the exact same configurations as yours. Thanks a lot!

@datumbox
Contributor

@netw0rkf10w Glad you found it useful! :)

You are right, at the moment we don't provide the logs. For full transparency, here are some of the reasons:

  1. We currently don't have a good place to put them.
  2. We don't have the logs for very old models (like AlexNet) and some of those that were trained by the community (like Shufflenet v2).
  3. The raw logs can be noisy: they might be split across multiple files due to job restarts (caused by hardware issues during training), and they can lack features that made it into the final released models (such as inference-time optimizations for boosting speed or accuracy, checkpoint averaging, etc.). They might also contain sensitive info (such as usernames) that we might not want to make public.

Here are things we did in Q4 to address some of the above:

  1. We retrieved and centralized information on how past models were trained.
  2. We made this info available in the metadata of all model weights. This is exposed through the new prototype model API: example.

Using the above, you should have links to the exact commands we used to train every model refreshed as part of Batteries Included (see here). If we missed one, that's a bug and we should fix it.
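
As an illustration, the recipe link can be looked up programmatically through the weights metadata (a sketch based on the multi-weight API; at the time of writing it lived under the prototype namespace, and the exact enum and key names may differ between versions):

    from torchvision.models import ResNeXt101_32X8D_Weights

    weights = ResNeXt101_32X8D_Weights.IMAGENET1K_V2
    print(weights.meta["recipe"])  # URL of the training recipe/command used for these weights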

Here are some things we plan to do in Q1 to make the situation even better:

  1. Improve our documentation for the model Zoo and the recipes.
  2. Restructure and make public the info we have for all pretrained models.
  3. Update our model contribution guidelines to ensure we can accept pre-trained models from the community.

We certainly haven't figured out everything yet, so bear with us while we try to do so. Feedback is very welcome. :)

@netw0rkf10w
Contributor Author

@datumbox Thank you so much for the detailed response and for your transparency! In the issue that you mentioned, there appears to be enough information to reproduce the results (except maybe for one detail; let me post a question there).
