Unable to reproduce classification accuracy using the reference scripts #4238
Update: I was hypothesizing that the issue was caused by recent changes in the code, so I tried reverting to version 115d2eb, but then again obtained 73.3% for ResNet-50. Std logs: resnet50_115d2eb.log. @fmassa @datumbox Would it be possible for you to check this, please? |
Hi @netw0rkf10w, sorry for the delay in replying. IIRC, the accuracies for ResNet-50 were obtained after training the model and recomputing the batch norm statistics. For ResNeXt, we just report the numbers after training. FYI, here are the training logs for resnext101_32x8d that we provide in torchvision: https://gist.github.com/fmassa/4ce4a8146dbbdbf6e1f9a3e0ec49e3d8 (we report results for the checkpoint at epoch 96). @datumbox, once you are back, can you try kicking off the runs for ResNet-50 and ResNeXt to double-check? |
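For readers unfamiliar with the batch-norm recomputation step mentioned above, here is a minimal sketch of the general idea: put the model in train mode, reset the BatchNorm running statistics, and run forward-only passes over the data so the statistics are re-estimated. The helper name and details are assumptions for illustration, not torchvision's actual code:

```python
import torch
import torch.nn as nn

def recompute_bn_stats(model, loader, device="cpu"):
    """Re-estimate BatchNorm running statistics with forward-only passes.

    Setting momentum to None makes BatchNorm use a cumulative moving
    average, so the running stats converge to the dataset statistics.
    """
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None => cumulative moving average
    model.train()  # BN updates running stats only in train mode
    with torch.no_grad():  # no gradients needed, only the forward pass
        for images in loader:
            model(images.to(device))
    model.eval()
    return model

# Tiny demonstration with random batches standing in for ImageNet data.
torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
loader = [torch.randn(4, 3, 16, 16) for _ in range(5)]
recompute_bn_stats(model, loader)
```

Note that BatchNorm updates its running statistics during a train-mode forward pass even under `torch.no_grad()`, since those updates do not go through autograd.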
Thanks for your reply, @fmassa! I can already spot something from the logs: you used |
@netw0rkf10w I've used |
@fmassa Thanks, but that doesn't seem to be consistent with the documentation though:
By the way, if I read your logs correctly, you actually trained the model on 64 GPUs, but with a batch size of 16 (instead of 32 as in the documentation), which means the effective batch size was 64*16 = 1024. In the ResNeXt paper, they used
Thus for a batch size of 1024 one would need to scale the learning rate to (1024/256)*0.1 = 0.4, which probably explains the value that you used. For |
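The linear scaling rule applied in that arithmetic can be sketched as a one-liner, with the base values (learning rate 0.1 at batch size 256) taken from the discussion above:

```python
def scale_lr(effective_batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: the learning rate grows proportionally with batch size."""
    return base_lr * effective_batch_size / base_batch_size

# 64 GPUs x batch size 16 per GPU -> effective batch size 1024
print(scale_lr(64 * 16))  # 0.4
# 8 GPUs x batch size 32 per GPU -> effective batch size 256
print(scale_lr(8 * 32))   # 0.1
```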
Yeah, sorry, I might have mixed a few things up in my explanation. I still think we should improve the documentation, but as of now there are no easy scripts in torchvision to launch jobs on multiple nodes, which makes things a bit harder. |
I was able to reproduce the results:
Here is my output log: Comparing the logs shared above with mine, I see the number of workers and the world size being different. What did you pass for
This is the command I ran:
|
@prabhat00155 you did better than that. You can get higher accuracy by keeping the checkpoint at epoch 67: |
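Selecting the best checkpoint rather than the final one, as suggested here, can be sketched like this; the dictionary of per-epoch accuracies is hypothetical, standing in for values parsed from a training log:

```python
def best_epoch(top1_by_epoch):
    """Return the (epoch, top-1 accuracy) pair with the highest accuracy."""
    epoch = max(top1_by_epoch, key=top1_by_epoch.get)
    return epoch, top1_by_epoch[epoch]

# Hypothetical per-epoch top-1 accuracies parsed from a log file.
accs = {66: 79.12, 67: 79.43, 68: 79.21, 99: 79.05}
print(best_epoch(accs))  # (67, 79.43)
```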
Yeah, that's right! |
Thanks for the results. This confirms that 0.1 is the correct learning rate for 8 GPUs (with a batch size of 32). If we train on 64 GPUs (as stated in the documentation) without scaling the learning rate, then we obtain results similar to those I posted above. I will soon create a PR to clarify the documentation. |
I think something like this would already make it clearer for the users: #4390. |
@datumbox @fmassa @NicolasHug Congratulations and thank you for your awesome work on the TorchVision with Batteries Included project! Would it be possible for you to share the training logs for the new weights? I would like to reproduce the results using the exact same configurations as yours. Thanks a lot! |
@netw0rkf10w Glad you find it useful! :) You are right, at the moment we don't provide the logs. For full transparency, here are some reasons: |
Here are things we did in Q4 to address some of the above:
Using the above you should have links to the exact commands we used to train the models for every model refreshed using Batteries Included (see here). If we missed one, that's a bug and we should fix it. Here are some things we plan to do in Q1 to make the situation even better:
We certainly haven't figured out everything yet, so bear with us while we try to do so. Feedback is very welcome. :) |
🐛 Bug
I have been trying to reproduce the reported 79.312% accuracy on ImageNet of `resnext101_32x8d` using the reference scripts, but I could obtain only 75%-76%. I tried two different trainings on 64 GPUs, but obtained similar results both times.
To Reproduce
Clone the master branch of `torchvision`, then `cd vision/references/classification` and submit a training to 64 GPUs with the arguments `--model resnext101_32x8d --epochs 100`. The training logs (including std logs) are attached for your information: log.txt and resnext101_32x8d_reproduced.log
Expected behavior
Final top-1 accuracy should be around 79%.
Environment
How you installed torchvision (`conda`, `pip`, source): `pip`
cc @vfdev-5