This repository contains a (non-exhaustive) overview of follow-up works based on the original Vision Transformer (ViT) by Google. Feel free to open a PR to add more papers!
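All of the papers listed below modify the same plain ViT baseline. For orientation only, here is a minimal PyTorch sketch of that baseline encoder (an illustrative reconstruction with roughly ViT-Base/16 hyperparameters, not code taken from any of the papers below):

```python
# Minimal, illustrative ViT sketch (roughly ViT-Base/16); for orientation only.
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Bare-bones ViT classifier: patchify -> Transformer encoder -> [CLS] head."""

    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting the image into
        # non-overlapping patches and linearly projecting each one.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth,
                                             norm=nn.LayerNorm(dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, pixel_values):                     # (B, 3, H, W)
        x = self.patch_embed(pixel_values)               # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)                 # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)   # one [CLS] token per image
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend [CLS], add positions
        x = self.encoder(x)                              # (B, N + 1, dim)
        return self.head(x[:, 0])                        # classify from the [CLS] token


logits = MinimalViT()(torch.randn(2, 3, 224, 224))       # -> shape (2, 1000)
```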
- DeiT (Data-efficient Image Transformers): https://arxiv.org/abs/2012.12877
- Efficient Vision Transformers via Fine-Grained Manifold Distillation: https://arxiv.org/abs/2107.01378
- NViT (Vision Transformer Compression and Parameter Redistribution): https://arxiv.org/abs/2110.04869
- SiT (Self-slimmed Vision Transformer): https://arxiv.org/abs/2111.12624
- Self-supervised:
  - DINO (Emerging Properties in Self-Supervised Vision Transformers): https://arxiv.org/abs/2104.14294
  - MoBY (Self-Supervised Learning with Swin Transformers): https://arxiv.org/abs/2105.04553
  - EsViT (Efficient self-supervised Vision Transformers): https://arxiv.org/abs/2106.09785
  - BEiT (BERT Pre-Training of Image Transformers): https://arxiv.org/abs/2106.08254
  - MAE (Masked Autoencoders Are Scalable Vision Learners): https://arxiv.org/abs/2111.06377
  - SiT (Self-supervised vIsion Transformer): https://arxiv.org/abs/2104.03602
  - SimMIM (A Simple Framework for Masked Image Modeling): https://arxiv.org/abs/2111.09886
- Supervised:
  - Token Labeling for Training Better Vision Transformers: https://arxiv.org/abs/2104.10858
  - Vision Transformers with Patch Diversification: https://arxiv.org/abs/2104.12753
  - Token Pooling in Vision Transformers: https://arxiv.org/abs/2110.03860
  - Scaling Vision Transformers: https://arxiv.org/abs/2106.04560
  - Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations: https://arxiv.org/abs/2108.05887
  - How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers: https://arxiv.org/abs/2106.10270
  - When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations (SAM optimizer): https://arxiv.org/abs/2106.01548
  - V-MoE (Scaling Vision with Sparse Mixture of Experts): https://arxiv.org/abs/2106.05974
- Combining convolution with self-attention:
  - CvT (Introducing Convolutions to Vision Transformers): https://arxiv.org/abs/2103.15808
  - ConViT (Improving Vision Transformers with Soft Convolutional Inductive Biases): https://arxiv.org/abs/2103.10697
  - CMT (Convolutional Neural Networks Meet Vision Transformers): https://arxiv.org/abs/2107.06263
  - LeViT (A Vision Transformer in ConvNet's Clothing for Faster Inference): https://arxiv.org/abs/2104.01136
  - CoaT (Co-Scale Conv-Attentional Image Transformers): https://arxiv.org/abs/2104.06399
  - Visformer (The Vision-friendly Transformer): https://arxiv.org/abs/2104.12533
  - CCT (Escaping the Big Data Paradigm with Compact Transformers): https://arxiv.org/abs/2104.05704
  - Refiner (Refining Self-attention for Vision Transformers): https://arxiv.org/abs/2106.03714
  - LVT (Lite Vision Transformer with Enhanced Self-Attention): https://arxiv.org/abs/2112.10809
- Others:
  - PiT (Rethinking Spatial Dimensions of Vision Transformers): https://arxiv.org/abs/2103.16302
  - XCiT (Cross-Covariance Image Transformers): https://arxiv.org/abs/2106.09681
  - T2T-ViT (Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet): https://arxiv.org/abs/2101.11986
  - DeepViT (Towards Deeper Vision Transformer): https://arxiv.org/abs/2103.11886
  - PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions): https://arxiv.org/abs/2102.12122
  - PVTv2 (Improved Baselines with Pyramid Vision Transformer): https://arxiv.org/abs/2106.13797
  - Wider Vision Transformer (Go Wider Instead of Deeper): https://arxiv.org/abs/2107.11817
  - CaiT (Going Deeper with Image Transformers): https://arxiv.org/abs/2103.17239
  - CrossViT (Cross-Attention Multi-Scale Vision Transformer for Image Classification): https://arxiv.org/abs/2103.14899
  - Twins (Revisiting the Design of Spatial Attention in Vision Transformers): https://arxiv.org/abs/2104.13840
  - LIT (Less is More: Pay Less Attention in Vision Transformers): https://arxiv.org/abs/2105.14217
  - TnT (Transformer-in-Transformer): https://arxiv.org/abs/2103.00112
  - Swin Transformer (Hierarchical Vision Transformer using Shifted Windows): https://arxiv.org/abs/2103.14030
  - Shuffle Transformer (Rethinking Spatial Shuffle for Vision Transformer): https://arxiv.org/abs/2106.03650
  - NesT (Aggregating Nested Transformers): https://arxiv.org/abs/2105.12723
  - Long-Short Transformer (Efficient Transformers for Language and Vision): https://arxiv.org/abs/2107.02192
  - DynamicViT (Efficient Vision Transformers with Dynamic Token Sparsification): https://arxiv.org/abs/2106.02034
  - Dynamic Transformer (Dynamic Vision Transformers with Adaptive Sequence Length): https://arxiv.org/abs/2105.15075
  - PS-ViT (Vision Transformer with Progressive Sampling): https://arxiv.org/abs/2108.01684
  - RegionViT (Regional-to-Local Attention for Vision Transformers): https://arxiv.org/abs/2106.02689
  - Focal Transformer (Focal Self-attention for Local-Global Interactions in Vision Transformers): https://arxiv.org/abs/2107.00641
  - KVT (k-NN Attention for Boosting Vision Transformers): https://arxiv.org/abs/2106.00515
  - Robust Vision Transformer: https://arxiv.org/abs/2105.07926
  - Glance-and-Gaze Vision Transformer: https://arxiv.org/abs/2106.02277
  - Augmented Shortcuts for Vision Transformers: https://arxiv.org/abs/2106.15941
  - CrossFormer (A Versatile Vision Transformer Based on Cross-scale Attention): https://arxiv.org/abs/2108.00154
  - CSWin Transformer (A General Vision Transformer Backbone with Cross-Shaped Windows): https://arxiv.org/abs/2107.00652
  - Evo-ViT (Slow-Fast Token Evolution for Dynamic Vision Transformer): https://arxiv.org/abs/2108.01390
  - PSViT (Better Vision Transformer via Token Pooling and Attention Sharing): https://t.co/OOnONItfnX?amp=1
  - iRPE (Rethinking and Improving Relative Position Encoding for Vision Transformer): https://arxiv.org/abs/2107.14222
  - Multi-Scale Vision Longformer: https://arxiv.org/abs/2103.15358
  - MetaFormer is Actually What You Need for Vision: https://arxiv.org/abs/2111.11418
  - Stochastic Layers in Vision Transformers: https://arxiv.org/abs/2112.15111
  - ViR (the Vision Reservoir): https://arxiv.org/abs/2112.13545
  - Blending Anti-Aliasing into Vision Transformer: https://arxiv.org/abs/2110.15156
  - ELSA (Enhanced Local Self-Attention for Vision Transformer): https://arxiv.org/abs/2112.12786
  - Swin Transformer V2 (Scaling Up Capacity and Resolution): https://arxiv.org/abs/2111.09883
- Are Convolutional Neural Networks or Transformers more like human vision? https://arxiv.org/abs/2105.07197
- Do Vision Transformers See Like Convolutional Neural Networks? https://arxiv.org/abs/2108.08810
- What makes for Hierarchical Vision Transformer? (Survey on Swin + ShuffleTransformer): https://arxiv.org/abs/2107.02174
- Intriguing Properties of Vision Transformers: https://arxiv.org/abs/2105.10497
- Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers: https://arxiv.org/abs/2106.13122
- Are Transformers More Robust Than CNNs? https://arxiv.org/abs/2111.05464
- YOLOS (object detection): https://arxiv.org/abs/2106.00666
- ViTGAN (GANs): https://arxiv.org/abs/2107.04589
- SegFormer (semantic segmentation): https://arxiv.org/abs/2105.15203
- Feature Fusion Vision Transformer (Fine-Grained Visual Categorization): https://arxiv.org/abs/2107.02341
- TrOCR (optical character recognition): https://arxiv.org/abs/2109.10282