-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Investigate if model_without_ddp is needed #4385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
While running a fairly limited training experiment for #4381, it was observed that |
@fmassa Prabhat did investigations where he used the |
Hi, The reason why we have |
@fmassa Thanks for the info. Prabhat found that using vision/references/classification/train.py Line 208 in 12fd3a6
We did end up changing it to |
I ran training on resnet18 for 100 epochs, where I cancelled the training around epoch 34 and resumed it again, with and without
I didn't see any significant difference in the result. |
@prabhat00155 what you would have needed would be to try to load the serialized checkpoint from a model that hasn't been wrapped up in DDP yet. Something as simple as model = torchvision.models.resnet50()
model.load_state_dict(path_to_checkpoint) would fail due to the added I would be ok removing the current |
@fmassa Thanks for providing background on why this was added. So basically, this workaround increases user-friendliness on how the weights are handled after the training is completed (hence outside of the train.py script). Two thoughts on eliminating the non-parallelized version:
I don't have a very strong opinion over this, but I'm leaning towards keeping it for the time being. Yes it's a bit annoying to keep the non-parallelized version around but it does eliminate potential frustration for new users of the library. Thoughts? |
Yes, that's the main reason why I'd lean on keeping it. The use-case I had was indeed for a user to run evaluation on a checkpoint in a different machine with a single GPU (or without GPUs), and the current setup allows them to get this seamlessly. |
🐛 Describe the bug
Investigate if we need
model_without_ddp
in the training script.vision/references/classification/train.py
Line 201 in 12fd3a6
Versions
N/A
cc @datumbox
The text was updated successfully, but these errors were encountered: