Adding Vision Transformer to torchvision/models #4593


Closed
yiwen-song opened this issue Oct 12, 2021 · 6 comments · Fixed by #5051, #5025, #4824, #5085 or #5086


yiwen-song commented Oct 12, 2021

🚀 The feature

  1. Adding ViT architecture from this paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
  2. Adding DeiT architecture from this paper: "Training data-efficient image transformers & distillation through attention"
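For context on the "16x16 words" in the ViT paper's title: the image is split into non-overlapping patches, each patch becomes one token, and a class token is prepended to the sequence. A tiny sketch of the resulting sequence length (plain Python; names are illustrative, not torchvision's API):

```python
def vit_sequence_length(image_size, patch_size, extra_tokens=1):
    # ViT splits the image into non-overlapping patches, so the image
    # size must be divisible by the patch size (as in the paper)
    assert image_size % patch_size == 0
    num_patches = (image_size // patch_size) ** 2
    # extra_tokens accounts for the prepended class token
    return num_patches + extra_tokens

# ViT-B/16 on 224x224 inputs: a 14x14 grid = 196 patches, plus 1 class token
print(vit_sequence_length(224, 16))  # -> 197
```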

@fmassa @datumbox @mannatsingh @kazhang

Motivation, pitch

Vision Transformer models should exist in torchvision repo because they are good models :)

I'm currently working on this project.

Additional context

We can also consider adding some techniques from the following papers ^^
For example, adding a convolutional stem for ViT; see "Early Convolutions Help Transformers See Better" for details.
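As a shape check for the conv-stem idea: that paper replaces the single stride-16 patchify convolution with a stack of stride-2 3x3 convolutions, and four stride-2 layers reproduce the same total stride of 16. A quick sketch of the arithmetic (illustrative only, not the paper's exact stem configuration):

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    # output spatial size of a single conv layer
    return (size + 2 * padding - kernel) // stride + 1

def conv_stem_grid(size, n_convs=4):
    # a stack of stride-2 3x3 convs has total stride 2**n_convs
    for _ in range(n_convs):
        size = conv_out(size)
    return size

# four stride-2 convs give total stride 16, so a 224x224 input yields the
# same 14x14 token grid as a single 16x16 patchify convolution
print(conv_stem_grid(224))  # -> 14
```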

References:
https://github.com/google-research/vision_transformer
https://github.com/facebookresearch/deit
https://github.com/facebookresearch/ClassyVision/blob/main/classy_vision/models/vision_transformer.py

cc @datumbox

@mannatsingh

Note that DeiT has the same architecture as ViT - it's the same model, only the training setup is different!

@yiwen-song

> Note that DeiT has the same architecture as ViT - it's the same model, only the training setup is different!

Interesting! I was looking at the implementation here
https://github.com/facebookresearch/deit/blob/main/models.py#L20
and found that the DeiT model actually inherits from the ViT model class.

Does it make sense if I also do this in torchvision?
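To visualize the inheritance route being discussed: baseline DeiT reuses the ViT backbone unchanged, and the distilled variant adds one extra distillation token next to the class token. A minimal, hypothetical sketch of that subclass pattern (plain Python; these are not torchvision's actual classes):

```python
class VisionTransformer:
    """Shape-level sketch only: tracks sequence length, not real layers."""
    num_extra_tokens = 1  # the class token

    def __init__(self, image_size=224, patch_size=16):
        assert image_size % patch_size == 0
        num_patches = (image_size // patch_size) ** 2
        self.seq_len = num_patches + self.num_extra_tokens

class DistilledDeiT(VisionTransformer):
    # same backbone as ViT; adds a distillation token alongside the class token
    num_extra_tokens = 2

print(VisionTransformer().seq_len)  # -> 197
print(DistilledDeiT().seq_len)      # -> 198
```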

@datumbox

@sallysyw First of all, thanks for adding this. This is awesome.

Concerning DeiT, we have a proposal to support distillation tasks in the future, and there is an ongoing RFC that will allow you to have different pre-trained weights for the same architecture. So if you want to pursue that route, we should be able to accommodate it.

@mannatsingh

Oh yeah, I meant the baseline (no-distillation) DeiT is the same as ViT. Supporting the distillation token + workflow is your call :)

@take2rohit

I believe that having ViT/DeiT in torchvision library would be really useful!
So is anyone else working on the implementation, or should I go ahead and create a PR?

@datumbox

@take2rohit Thanks. @sallysyw is working on it at PR #4594.
