Adding Vision Transformer to torchvision/models #4593


Closed
yiwen-song opened this issue Oct 12, 2021 · 6 comments · Fixed by #5051, #5025, #4824, #5085 or #5086


yiwen-song commented Oct 12, 2021

🚀 The feature

  1. Adding ViT architecture from this paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
  2. Adding DeiT architecture from this paper: "Training data-efficient image transformers & distillation through attention"
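For context on the "16x16 words" in the ViT paper's title: the image is split into non-overlapping patches, each patch becomes one token, and a class token is prepended to the sequence. A tiny sketch of the resulting sequence length (plain Python; names are illustrative, not torchvision's API):

```python
def vit_sequence_length(image_size, patch_size, extra_tokens=1):
    # ViT splits the image into non-overlapping patches, so the image
    # size must be divisible by the patch size (as in the paper)
    assert image_size % patch_size == 0
    num_patches = (image_size // patch_size) ** 2
    # extra_tokens accounts for the prepended class token
    return num_patches + extra_tokens

# ViT-B/16 on 224x224 inputs: a 14x14 grid = 196 patches, plus 1 class token
print(vit_sequence_length(224, 16))  # -> 197
```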

@fmassa @datumbox @mannatsingh @kazhang

Motivation, pitch

Vision Transformer models should exist in torchvision repo because they are good models :)

I'm currently working on this project.

Additional context

We can also consider adding some techniques from the following papers ^^
For example, adding a convolutional stem for ViT; see "Early Convolutions Help Transformers See Better" for details.
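As a shape check for the conv-stem idea: that paper replaces the single stride-16 patchify convolution with a stack of stride-2 3x3 convolutions, and four stride-2 layers reproduce the same total stride of 16. A quick sketch of the arithmetic (illustrative only, not the paper's exact stem configuration):

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    # output spatial size of a single conv layer
    return (size + 2 * padding - kernel) // stride + 1

def conv_stem_grid(size, n_convs=4):
    # a stack of stride-2 3x3 convs has total stride 2**n_convs
    for _ in range(n_convs):
        size = conv_out(size)
    return size

# four stride-2 convs give total stride 16, so a 224x224 input yields the
# same 14x14 token grid as a single 16x16 patchify convolution
print(conv_stem_grid(224))  # -> 14
```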

References:
https://github.com/google-research/vision_transformer
https://github.com/facebookresearch/deit
https://github.com/facebookresearch/ClassyVision/blob/main/classy_vision/models/vision_transformer.py

cc @datumbox

@mannatsingh

Note that DeiT has the same architecture as ViT - it's the same model, only the training setup is different!

@yiwen-song

> Note that DeiT has the same architecture as ViT - it's the same model, only the training setup is different!

Interesting! I was looking at the implementation here
https://github.com/facebookresearch/deit/blob/main/models.py#L20
and found that the DeiT model actually inherits from the ViT model class.

Does it make sense if I also do this in torchvision?
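To visualize the inheritance route being discussed: baseline DeiT reuses the ViT backbone unchanged, and the distilled variant adds one extra distillation token next to the class token. A minimal, hypothetical sketch of that subclass pattern (plain Python; these are not torchvision's actual classes):

```python
class VisionTransformer:
    """Shape-level sketch only: tracks sequence length, not real layers."""
    num_extra_tokens = 1  # the class token

    def __init__(self, image_size=224, patch_size=16):
        assert image_size % patch_size == 0
        num_patches = (image_size // patch_size) ** 2
        self.seq_len = num_patches + self.num_extra_tokens

class DistilledDeiT(VisionTransformer):
    # same backbone as ViT; adds a distillation token alongside the class token
    num_extra_tokens = 2

print(VisionTransformer().seq_len)  # -> 197
print(DistilledDeiT().seq_len)      # -> 198
```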

@datumbox

@sallysyw First of all, thanks for adding this. This is awesome.

Concerning DeiT, we have a proposal to support distillation tasks in the future, and there is an ongoing RFC that will allow you to have different pre-trained weights for the same architecture. So if you want to pursue that route, we should be able to accommodate it.

@mannatsingh

Oh yeah, I meant the baseline (no-distillation) DeiT is the same as ViT. Supporting the distillation token + workflow is your call :)

@take2rohit

I believe that having ViT/DeiT in torchvision library would be really useful!
So is anyone else working on the implementation, or should I go ahead and create a PR?

@datumbox

@take2rohit Thanks. @sallysyw is working on it at PR #4594.
