[FEAT] Add MobileViT v1 & v2 #6404

Open

yassineAlouini opened this issue Aug 12, 2022 · 26 comments

@yassineAlouini
Contributor

yassineAlouini commented Aug 12, 2022

🚀 The feature

As described in the RFC "Batteries Included, phase 3", I am working on adding MobileViT v1 and v2, inspired by the following code repos/snippets:

The original paper can be found here.

Motivation, pitch

This has been decided in the RFC.

Alternatives

No response

Additional context

No response

cc @datumbox

@datumbox
Contributor

Looks great @yassineAlouini. It would be great to have this implementation.

Please have a read of #5319, where we document some best practices for model authoring. Also, to avoid licensing problems, let's do a from-scratch implementation.

@yassineAlouini
Contributor Author

yassineAlouini commented Aug 12, 2022

Perfect, I will work on this today but mostly next week and the week after. Will let you know how my progress goes. 👌

@yassineAlouini
Contributor Author

yassineAlouini commented Aug 18, 2022

I have started the implementation. Seems like a big chunk but excited to do it. 👌

I have found this Hugging Face implementation; it could be useful as another source of inspiration: https://huggingface.co/docs/transformers/main/model_doc/mobilevit.

[EDIT] It looks like this is a wrapper around the https://github.com/apple/ml-cvnets implementation. 👌

@datumbox
Contributor

Hi @yassineAlouini. Just wanted to touch base on the implementation. Any blockers or need help?

@yassineAlouini
Contributor Author

yassineAlouini commented Sep 15, 2022

Hello @datumbox, thanks for checking. So far, so good. It is taking a bit longer since I have only had one day to work on it, and it is paused for now, but I might work on it during weekends and evenings.

Do you have a target date for finishing? 🤔

@datumbox
Contributor

hey @yassineAlouini, sounds good. Thanks for the work. There are absolutely no deadlines on our side; just checking that everything goes smoothly and that you don't have a blocker. Let me know if you need anything :)

@yassineAlouini
Contributor Author

Some update @datumbox: I will have some free time for the upcoming few days and should make some progress. Will let you know how it goes. 👌

@yassineAlouini
Contributor Author

By the way, what are the PyTorch and TorchVision policies on the usage of einops? 🤔

@datumbox
Contributor

@yassineAlouini So far we don't have a model using this. Is there a specific use-case in MobileViT that can't be done otherwise?

@yassineAlouini
Contributor Author

I don't think it is irreplaceable; I just wanted to check what the best practice is in torchvision. 👌
I will code everything using PyTorch and existing TorchVision code.
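For context, the patch rearrangement at the heart of MobileViT (the step einops `rearrange` would express in one line) can indeed be written with plain `reshape`/`permute` calls. A minimal sketch, with the function name, patch sizes, and tensor dimensions purely illustrative (not from the actual PR):

```python
import torch

def unfold_patches(x: torch.Tensor, patch_h: int = 2, patch_w: int = 2) -> torch.Tensor:
    # x: (B, C, H, W) feature map; H and W assumed divisible by the patch size.
    # Plain-PyTorch equivalent of the einops pattern
    #   rearrange(x, "b c (nh ph) (nw pw) -> b (ph pw) (nh nw) c")
    b, c, h, w = x.shape
    nh, nw = h // patch_h, w // patch_w
    x = x.reshape(b, c, nh, patch_h, nw, patch_w)
    x = x.permute(0, 3, 5, 2, 4, 1)                 # (b, ph, pw, nh, nw, c)
    return x.reshape(b, patch_h * patch_w, nh * nw, c)

x = torch.randn(1, 16, 8, 8)
patches = unfold_patches(x)
print(patches.shape)  # torch.Size([1, 4, 16, 16])
```

The inverse "fold" step is the same sequence of `reshape`/`permute` run backwards, so no extra dependency is needed.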

@yassineAlouini
Contributor Author

One additional question regarding the TransformerEncoder: should I reimplement it or should I re-use the one from vision_transformer.py (i.e. EncoderBlock)? I was planning to copy-paste the code first, adapt it and then maybe later refactor. What do you think @datumbox?

@datumbox
Contributor

@yassineAlouini Makes sense. Let's start by copy-pasting and modifying and see what changes are needed. Then we can decide whether sharing components is worth it. :)

@yassineAlouini
Contributor Author

Some more progress @datumbox: I finally made the V1 work (I think), I am cleaning the code a bit and then will push it for a first round of reviews (to make sure I am on the right track).
I will then focus on training the model to get the weights.
Will let you know once it is pushed. Thanks in advance for your help!

@yassineAlouini
Contributor Author

yassineAlouini commented Nov 20, 2022

Alright, MobileViT (the v1 version) finally runs 🎉 and I have pushed the code. If you have some time @datumbox, I would love to get a few first comments. Thanks. 🙏

The PR is here: #6965

I am starting the training step now and next will move to V2.

@yassineAlouini
Contributor Author

Alright, I have tried running torchrun --nproc_per_node=8 train.py --model mobile_vit_xxs to train a model on my Windows laptop, but it doesn't seem promising. I will try this on a cloud instance or on Colab. @datumbox is there some available torchvision infra to do this, or should I do it on my own? Thanks. :)

@datumbox
Contributor

@yassineAlouini thanks! I've responded on the PR, let's continue the discussion there. :)

@yassineAlouini
Contributor Author

Thanks @datumbox (et al) for the code review, I am checking now. 👌

@yassineAlouini
Contributor Author

@datumbox @pmeier I am trying to make progress on this PR again. I need the ImageNet dataset to train the model and get the weights. I sent an access request more than 10 days ago and still have no dataset. Do you have another way to get the whole dataset? Thanks for your help!

@datumbox
Contributor

@yassineAlouini Given the license of ImageNet, there is no way for us to redistribute it. So I think we might have to wait for them to respond. :(

@pmeier
Collaborator

pmeier commented Apr 11, 2023

Unfortunately, I wouldn't get my hopes up:

[Screenshot from 2023-04-11 showing ImageNet access requests left unanswered]

Messaged them multiple times without a response ...

@yassineAlouini
Contributor Author

Thanks for the feedback @pmeier. I thought 10 days was a long time. 😄
Alright, I will try to finish the other points of the PR review, and maybe ask someone on the torchvision team who has the data to do the training; then I can check the performance once the weights have been trained. 👍

@datumbox
Contributor

datumbox commented Apr 11, 2023

@yassineAlouini We'll try to work something out with @pmeier. He will ping you by email. I also pinged two of the people involved with ImageNet on Twitter to see if they can help. We'll work something out. 🤞

@gau-nernst
Contributor

There is a copy on Kaggle. https://www.kaggle.com/c/imagenet-object-localization-challenge/

@yassineAlouini
Contributor Author

Thanks for the link @gau-nernst, but it is a smaller dataset, if I am not wrong.

@gau-nernst
Contributor

@yassineAlouini I believe it is the ImageNet-1k split that most people commonly refer to as "the ImageNet dataset" (used in ILSVRC). It should be the correct one.

Otherwise, HuggingFace is also hosting ImageNet-1k here: https://huggingface.co/datasets/imagenet-1k

@yassineAlouini
Contributor Author

Thanks for the feedback and the link @gau-nernst. Isn't the "real" dataset the 22k one? Anyway, I will give the smaller one a try once I have time.

4 participants