A custom implementation of the vision transformer from the original paper, *An Image is Worth 16x16 Words*.
In this repository, I have built and annotated a simple, lightweight version of the vision transformer (ViT) from basic PyTorch components. I hope that it will be a useful resource for learning about the basics of this architecture, and will provide a helpful jumping-off point for more complex applications of transformers for computer vision.
I explain the architecture in greater detail in my accompanying blog post, but essentially the vision transformer consists of the following components (sketched in code after the list):
- A simple method for cutting images into patches, flattening them, and turning them into sequences
- A concatenated learnable class embedding
- An added learnable positional embedding
- A transformer consisting of a stack of encoder layers (and no decoders)
- A two-layer MLP for classification
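As a rough illustration of how these pieces fit together, here is a minimal sketch in plain PyTorch. The class names, hyperparameters, and the use of `nn.TransformerEncoder` are assumptions made for the example rather than the exact code in this repository (and the `batch_first`/`norm_first` arguments require a more recent PyTorch than the 1.8 release listed below).

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Cut an image into patches, flatten them, and project to the model dimension."""
    def __init__(self, image_size=32, patch_size=4, in_channels=3, embed_dim=256):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing patches and applying a shared linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) sequence of patch tokens


class MiniViT(nn.Module):
    """Patch embedding + class token + positional embedding + encoder stack + MLP head."""
    def __init__(self, num_classes=10, embed_dim=256, depth=6, num_heads=8):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)  # norm-first, as in the ViT paper
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Two-layer MLP classification head applied to the class token.
        self.head = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                                  nn.Linear(embed_dim, num_classes))

    def forward(self, x):
        tokens = self.patch_embed(x)                               # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)             # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # prepend class token, add positions
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                             # classify from the class token


logits = MiniViT(num_classes=10)(torch.randn(2, 3, 32, 32))        # shape (2, 10)
```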
For this project, I trained versions of the transformer on CIFAR-10 and CIFAR-100. PyTorch Lightning data modules for preparing these datasets are included.
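For readers who have not used Lightning data modules before, the sketch below shows roughly what one for CIFAR-10 looks like; the class name, transforms, and split sizes are illustrative assumptions, not the exact module shipped in this repository.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import CIFAR10

class CIFAR10DataModule(pl.LightningDataModule):
    """Downloads CIFAR-10, splits off a validation set, and builds dataloaders."""
    def __init__(self, data_dir="./data", batch_size=128):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            # Commonly used CIFAR-10 channel statistics (assumed here for illustration).
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
        ])

    def prepare_data(self):
        # Download once; Lightning calls this on a single process.
        CIFAR10(self.data_dir, train=True, download=True)
        CIFAR10(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        full = CIFAR10(self.data_dir, train=True, transform=self.transform)
        self.train_set, self.val_set = random_split(
            full, [45000, 5000], generator=torch.Generator().manual_seed(42))
        self.test_set = CIFAR10(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True, num_workers=4)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size, num_workers=4)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=self.batch_size, num_workers=4)
```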
This model was trained with the following packages:
- pytorch 1.8.2
- torchvision 0.9.2
- pytorch-lightning 1.6.1
- torchmetrics 0.8.0
- pl_bolts 0.5.0
Data modules for CIFAR-10 and CIFAR-100 are included; these can be used to download, transform, split, and load the data into dataloaders. The core code is organised into the following files:
- vit_encoder.py - My implementation of the norm-first ViT encoder (see the sketch below this list).
- vit_classifier.py - My implementation of the overall classifier architecture.
- vit_pl_train_module.py - The training loop, evaluation methods, and other PyTorch Lightning code.
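For reference, a norm-first (pre-norm) encoder block applies layer normalization before the attention and MLP sub-layers rather than after them, which tends to make training more stable. The block below is a minimal sketch of that idea in plain PyTorch, not a copy of vit_encoder.py.

```python
import torch
import torch.nn as nn

class PreNormEncoderBlock(nn.Module):
    """Norm-first transformer encoder block: LayerNorm precedes each sub-layer."""
    def __init__(self, embed_dim=256, num_heads=8, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):                          # x: (B, N, D)
        # Residual connections wrap the normalized sub-layers.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```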
To train the model on CIFAR-100, simply run `python train.py` (edit the script for different options), or work through the `vit_demo.ipynb` notebook.
This model was able to reach 78.8% accuracy on CIFAR-10 and 49.8% on CIFAR-100. This is far from SOTA, but it does demonstrate that the method works for computer vision. Generally, vision transformers only begin to exceed CNN performance when trained on enormous datasets (like JFT-300M), which is difficult for individual practitioners or smaller companies, but it is still quite useful to see how the mechanism works!