Add MobileNet V2 #818
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master     #818     +/-  ##
=========================================
+ Coverage    51.58%   51.80%   +0.22%
=========================================
  Files           34       35       +1
  Lines         3342     3401      +59
  Branches       536      545       +9
=========================================
+ Hits          1724     1762      +38
- Misses        1486     1497      +11
- Partials       132      142      +10
Continue to review full report at Codecov.
|
A first run of training gave 70.8% top-1 accuracy, which is ~1% lower than the 71.8% reported in the paper. |
Hi, I'm trying to understand the MobileNetV2 implementation. However, I'm confused: when t (expand_ratio) equals 1, as in the first bottleneck, it seems we don't add any expansion layer for it. Also, does x.mean([2, 3]) in the forward pass equal the avgpool(7*7)? |
@hsparrow Hi,
Note that
Yes, this is equivalent to an `adaptive_avg_pool2d(1)`. |
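A quick sketch of that equivalence (using NumPy as a stand-in for torch tensors; shapes are illustrative): taking the mean over the spatial dimensions is exactly a global average pool down to 1x1.

```python
import numpy as np

# NCHW feature map, e.g. the 7x7 output of the last conv stage
x = np.random.rand(2, 1280, 7, 7)

# x.mean([2, 3]) in the model: average over the spatial dims H and W
pooled_mean = x.mean(axis=(2, 3))  # shape (2, 1280)

# "adaptive average pool to 1x1" done by hand: sum the whole HxW window
# and divide by its area
pooled_avg = x.sum(axis=(2, 3)) / (x.shape[2] * x.shape[3])

assert pooled_mean.shape == (2, 1280)
assert np.allclose(pooled_mean, pooled_avg)
```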
@fmassa Thanks for your reply!
But I'm still confused about this. I mean, in the InvertedResidual module we only add the expansion block to the layers when the expand_ratio doesn't equal 1. So in the first bottleneck, where t=1, is self.conv supposed to be empty? |
@hsparrow this part of the code is not indented at the same level; see vision/torchvision/models/mobilenet.py, lines 27 to 33 at commit 742fd13.
|
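A minimal, hypothetical sketch of that layer-list logic (function and layer names here are illustrative, not the actual torchvision code): only the 1x1 expansion conv is skipped when expand_ratio is 1, so self.conv is never empty, it just starts directly with the depthwise conv.

```python
# Illustrative sketch (not the actual torchvision code): the 1x1 expansion
# conv is appended only when expand_ratio != 1, so the first bottleneck
# (t=1) starts directly with the depthwise convolution.
def inverted_residual_layers(inp, oup, expand_ratio):
    hidden_dim = inp * expand_ratio
    layers = []
    if expand_ratio != 1:
        # pointwise expansion: inp -> hidden_dim channels
        layers.append(f"1x1 conv {inp}->{hidden_dim} + BN + ReLU6")
    # depthwise conv operates per-channel on hidden_dim channels
    layers.append(f"3x3 dwise conv {hidden_dim} + BN + ReLU6")
    # linear (no activation) pointwise projection back down to oup
    layers.append(f"1x1 conv {hidden_dim}->{oup} + BN")
    return layers

print(inverted_residual_layers(32, 16, expand_ratio=1))  # 2 layers, no expansion
print(inverted_residual_layers(16, 24, expand_ratio=6))  # 3 layers
```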
@fmassa Got it! My bad! |
@fmassa do you have an estimate when you'll be able to submit the subsequent PR with the pre-trained ImageNet weights? |
@jeremyjordan I need to retrain the models again after changing a few hyperparameters. My target date to upload the pre-trained models (which match the reported accuracies) is the end of the month, so in ~1 week. |
@fmassa Would you mind letting me know the hyperparameters you used to train MobileNetV2, such as the optimizer and weight decay? |
For reference, there exist pre-trained MobileNetV2 models trained with a cosine LR decay strategy at https://github.com/d-li14/mobilenetv2.pytorch, reaching 72.2% accuracy with an identical network definition. |
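For context, a cosine LR decay schedule can be sketched in a few lines (the base LR and epoch count below are illustrative, not the exact values used in that repo):

```python
import math

# Cosine LR decay over the whole run: starts at base_lr, decays smoothly to 0
def cosine_lr(base_lr, epoch, total_epochs):
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

base_lr, total_epochs = 0.05, 150  # illustrative hyperparameters
print(cosine_lr(base_lr, 0, total_epochs))    # starts at base_lr
print(cosine_lr(base_lr, 75, total_epochs))   # half-way: base_lr / 2
print(cosine_lr(base_lr, 150, total_epochs))  # decays to 0
```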
@d-li14 Thank you, I will try it. |
@D-X-Y once I get all the models giving the expected results, the hyperparameters for training the models will be available in https://github.com/pytorch/vision/tree/master/references/classification |
@fmassa Hi, thank you for implementing MobileNetV2. Is allowing users to experiment with the inverted_residual_setting part of torchvision's future plans? |
@matthewygf if you send a PR letting the user specify the `inverted_residual_setting`, that would be welcome. |
@Mxbonn sorry for not being clear, I trained on 8 GPUs, each GPU having a batch size of 32.
|
Please note the reference implementation does not apply weight decay to the depthwise convolutions: |
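A hypothetical sketch of how such an exclusion could be wired up as optimizer parameter groups (names and the 4e-5 value are illustrative, not the reference implementation itself): depthwise conv weights go into a group with weight_decay=0, everything else keeps the usual decay.

```python
# Illustrative sketch: split parameters into "decay" / "no decay" optimizer
# groups so depthwise convolutions skip weight decay. `named_params` is a
# stand-in for model.named_parameters(); a real version would inspect
# conv.groups on the owning module rather than matching names.
def split_params(named_params, is_depthwise, weight_decay=4e-5):
    decay, no_decay = [], []
    for name, p in named_params:
        (no_decay if is_depthwise(name) else decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Toy usage with fake parameter names (hypothetical, for illustration only)
fake = [("features.1.conv.0.weight", "w0"), ("features.1.conv.3.weight", "w1")]
groups = split_params(fake, is_depthwise=lambda n: n == "features.1.conv.0.weight")
print([len(g["params"]) for g in groups])  # [1, 1]
```

Lists of per-group dicts like this can be passed directly to a PyTorch optimizer constructor in place of a flat parameter list.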
Also see this URL for the full list of options used by the reference implementation to achieve 72% accuracy. Edit: while the training preprocessing is different, you are already scaling the validation images the same way. |
@andravin Applying those changes could potentially improve the performance, indeed. The current model that I uploaded yields 71.9% top-1 accuracy, which I believe is close enough, but it keeps the training code / transforms the same as for the other models, so that we can more easily factor out improvements in performance that are orthogonal to the model itself. |
You previously reported 70.8% accuracy. What did you change to get 71.9%? |
@andravin a few things:
Also, here is the PR that uploaded the new model #917 |
Thanks @fmassa. That leads me to the next question: why such a small batch size (32 x 8 GPUs)? The MobileNetV2 paper uses 96 x 16. BTW, I ran your script using:
I had to modify train.py in order to add support for the
Using an AWS EC2 p3.16xlarge instance (8 x V100), the projected running time is 137.7 hours. Is that what you expect? Torch master checked out yesterday, CUDA 10.1, cuDNN 7.6. |
I use the flag
The total time per epoch on my case (using 8 V100 with CUDA 10, not sure which cudnn anymore) was 9 min 15 seconds on average, and the total training time (including periodic evaluation) lasted for 2 days, 0:33:33, so your projected runtime is quite different from what I had.
No good reason other than I didn't want to use 16GPUs with async updates (like they mentioned in the paper). |
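Back-of-the-envelope arithmetic on these timings (assuming a ~300-epoch run, consistent with the epoch numbers quoted later in the thread) confirms the discrepancy:

```python
# 9 min 15 s per epoch, as reported, over an assumed ~300 epochs
epoch_s = 9 * 60 + 15
train_h = epoch_s * 300 / 3600
print(f"pure training: {train_h} h")      # 46.25 h, i.e. just under 2 days
print(f"ratio: {137.7 / train_h:.1f}x")   # the 137.7 h projection is ~3x slower
```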
Thanks for the tip about
With
Any advice on how to track down the large performance difference we are reporting?
|
@andravin wow, I've just kicked off the same training again with a newer version of PyTorch, and training times are indeed 3x slower. After some digging, the issue was the same as pytorch/pytorch#20311. Here are some results, on an 80-core machine:
So I'd say that the problem is on the PyTorch side, and we should just tune
I'm using
|
That's great @fmassa, with
Is your 14:56 (batch) ETA actually worse than the 9 min 15 second epochs you reported earlier, or do your recent measurements include a warmup penalty? |
@andravin I haven't tried other |
I get the best performance with
With
All batch sizes benefit from
Note: my pytorch build was |
@andravin very useful information, thanks for sharing! |
@fmassa I was not able to reproduce your training result. The best test accuracy was 71.536% after 290 epochs.
Testing the pretrained model, I get 71.878%. |
@andravin I could try kicking off some more trainings to verify whether this is just due to random noise in model training, or if something changed since the last time I trained it. But I won't have time to test it for a while, though. |
@fmassa It might have been a mistake for me to use a development version of pytorch for this experiment. I doubt the large accuracy difference can be explained by random variance alone. Edit: Re-running now using pytorch 1.1 (stable), CUDA 10.0, cuDNN 7.5, torchvision 0.3. Please advise if this configuration is most likely to reproduce your results. The single epoch training time is still, remarkably, 6:34, using |
@andravin this is indeed the configuration that I used, my log dates from May 8th. If you still can't reproduce the results please do let me know. |
@fmassa using the aforementioned stable build, I get 71.830% accuracy after last epoch, 71.888% after epoch 288. By the way, did you report last epoch accuracy or best epoch accuracy? Anyway, success! Training time 1 day, 10:13:30. |
@andravin I reported best accuracy, which was obtained for me at epoch 285. So our results match fairly closely, which is great! And it's good to know that there might be some regression that happened between PyTorch 1.1, TorchVision 0.3 and now. What remains to be seen is if it's in PyTorch, TorchVision or both :-) |
@fmassa Would this have broken weight initialization: pytorch/pytorch#22529 ? |
@andravin it's a good hypothesis, that you were using a version that had the rng bugged for CUDA, but from my understanding the model initialization is first done on the CPU, and then the model is moved to the GPU, so we actually exercise the CPU codepath for the rng |
More mobilenet_v2 training speed tests for torch 1.1, torchvision 0.3, CUDA 10.0, cuDNN 7.5, libjpeg-turbo8: Using pillow-simd:
Using regular pillow:
Benchmarking gotchas: 1) ignore first epoch time due to substantial warmup cost, 2) make sure to kill all python processes after interrupting with Ctrl-c.
Again, the machine is an AWS EC2 p3.16xlarge instance, 8 x V100, 488 GB RAM.
|
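Gotcha (1) above can be handled with a simple timing harness: measure every epoch, then drop the first before averaging. A toy sketch (the dummy workload is illustrative):

```python
import time

# Time n runs of an epoch function and discard the first measurement,
# which pays one-time warmup costs (allocator, cudnn autotuning,
# dataloader worker spin-up) and would skew the average.
def timed_epochs(run_epoch, n):
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        run_epoch()
        times.append(time.perf_counter() - t0)
    return times[1:]  # drop the warmup epoch

# Dummy workload standing in for one training epoch
steady = timed_epochs(lambda: sum(range(10000)), 4)
print(len(steady))  # 3 steady-state timings remain
```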
@fmassa I ran into a problem when adding the
But mobilenet_v2 training is slower with mixed precision than without:
|
@andravin See NVIDIA/apex#76 for a discussion on this topic. Mixed precision training can only improve training times if your GPU has tensor cores. It can, however, still reduce your memory usage (your results with a batch size of 256 show this). This article covers the topic quite well. |
@dakshjotwani V100 GPUs, as documented above, have tensor cores. |
@andravin my bad. I wasn't sure about the GPUs being used. I remembered facing a similar issue before and thought it might be useful to the issue. |
@dakshjotwani Understandable, because we would not expect a tensor core enabled GPU to run slower with mixed precision. The throughput per GPU for batch size 256 is actually less than 1 TFLOP, which is less than 1% utilization of the tensor cores. I get a 2.3x speedup by eliminating the data loader, but that still yields less than 2% utilization. |
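The utilization math above is easy to check (125 TFLOPS is NVIDIA's quoted mixed-precision tensor-core peak for a single V100; the achieved throughput is the rough per-GPU figure reported in this thread):

```python
peak_tflops = 125.0  # V100 mixed-precision tensor-core peak (NVIDIA spec)
achieved = 1.0       # < 1 TFLOP per GPU observed at batch size 256

print(f"{achieved / peak_tflops:.1%}")        # 0.8% utilization
print(f"{2.3 * achieved / peak_tflops:.1%}")  # ~1.8% even with the 2.3x speedup
```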
This PR adds support for MobileNetV2.
It is heavily based on the implementation from #625 by @tonylins.
I'm currently training the model from scratch with a custom script, and I'll be uploading the weights (together with the training hyperparameters) once training finishes and I match reported accuracies.