Adding Vision Transformer to torchvision/models #4593
Comments
Note that DeiT has the same architecture as ViT: it's the same model, only the training setup is different!
Interesting! I was looking at the implementation here. Does it make sense if I also do this in torchvision?
@sallysyw First of all, thanks for adding this. This is awesome. Concerning DeiT, we have a proposal to support distillation tasks in the future, and there is an ongoing RFC that will allow different pre-trained weights for the same architecture. So if you want to pursue that route, we should be able to accommodate it.
Oh yeah, I meant the baseline (no-distillation) DeiT is the same as ViT. Supporting the distillation token + workflow is your call :)
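To illustrate the point being made above: architecturally, the only thing DeiT's distillation setup adds to a plain ViT is one extra learned token appended next to the class token before the transformer encoder. A minimal PyTorch sketch of that token-concatenation step (class and parameter names here are hypothetical, not torchvision's API):

```python
import torch
from torch import nn

class TokenConcat(nn.Module):
    """Prepend learned tokens to the patch sequence: a class token for
    ViT, plus an optional distillation token for the DeiT variant."""

    def __init__(self, embed_dim=192, with_dist_token=False):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = (
            nn.Parameter(torch.zeros(1, 1, embed_dim)) if with_dist_token else None
        )

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) sequence from the patch-embedding stem
        b = patch_tokens.shape[0]
        extras = [self.cls_token.expand(b, -1, -1)]
        if self.dist_token is not None:
            extras.append(self.dist_token.expand(b, -1, -1))
        return torch.cat(extras + [patch_tokens], dim=1)

tokens = torch.randn(2, 196, 192)
print(TokenConcat(with_dist_token=False)(tokens).shape)  # ViT:  (2, 197, 192)
print(TokenConcat(with_dist_token=True)(tokens).shape)   # DeiT: (2, 198, 192)
```

With the distillation token disabled, this reduces exactly to the ViT token layout, which is why the baseline DeiT checkpoint is architecturally interchangeable with ViT.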
I believe that having ViT/DeiT in the torchvision library would be really useful!
@take2rohit Thanks. @sallysyw is working on it in PR #4594.
🚀 The feature
@fmassa @datumbox @mannatsingh @kazhang
Motivation, pitch
Vision Transformer models should exist in the torchvision repo because they are strong, widely used models :)
I'm currently working on this project.
Additional context
We can also consider adding some techniques from follow-up papers ^^
For example, adding a Conv stem for ViT; see details in "Early Convolutions Help Transformers See Better".
References:
https://github.com/google-research/vision_transformer
https://github.com/facebookresearch/deit
https://github.com/facebookresearch/ClassyVision/blob/main/classy_vision/models/vision_transformer.py
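The Conv-stem idea mentioned above can be sketched next to the standard patchify stem. This is a minimal illustration in PyTorch, assuming a small stack of stride-2 3x3 convolutions reaching the same 16x downsampling; the channel widths are illustrative, not the paper's exact configuration:

```python
import torch
from torch import nn

class PatchifyStem(nn.Module):
    """Standard ViT stem: one stride-p conv splits the image into patches."""

    def __init__(self, in_ch=3, embed_dim=192, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, N, D) token sequence
        return self.proj(x).flatten(2).transpose(1, 2)

class ConvStem(nn.Module):
    """Conv stem in the spirit of "Early Convolutions Help Transformers
    See Better": four stride-2 3x3 convs instead of a single large-stride
    patchify conv, followed by a 1x1 projection to the embedding dim."""

    def __init__(self, in_ch=3, embed_dim=192):
        super().__init__()
        layers, c = [], in_ch
        for ch in (24, 48, 96, 192):  # illustrative widths
            layers += [nn.Conv2d(c, ch, 3, stride=2, padding=1),
                       nn.BatchNorm2d(ch), nn.ReLU()]
            c = ch
        layers.append(nn.Conv2d(c, embed_dim, 1))
        self.proj = nn.Sequential(*layers)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)

x = torch.randn(1, 3, 224, 224)
print(PatchifyStem()(x).shape)  # (1, 196, 192)
print(ConvStem()(x).shape)      # (1, 196, 192): same token grid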
cc @datumbox