- ✔️ TensorFlow and PyTorch implementations
- ✔️ Pure `tf.keras.Model`s and `torch.nn.Module`s, as well as PyTorch Lightning modules ready for training pipelines
- ✔️ Automatic weight conversion between DeepVision models (train and fine-tune `.h5` and `.pt` checkpoints interchangeably in either framework)
- ✔️ Explainability and analysis modules
- ✔️ TensorFlow/PyTorch duality on multiple levels (both models and individual components are backend-agnostic, and weights are transferable at both the model and the component level)
- ✔️ Identical, readable implementations, with the same API, code structure and style
- ✔️ Layered API with exposed building blocks (`TransformerEncoder`, `MBConv`, etc.)
- ✔️ Image classification, semantic segmentation, NeRFs (object detection, instance/panoptic segmentation, etc. coming soon)
- ✔️ Mixed-precision, TPU and XLA training support
DeepVision is a (yet another) computer vision library, aimed at bringing Deep Learning to the hands of the masses. Why another library?
The computer vision engineering toolkit is segmented. Amazing libraries exist, but a practitioner oftentimes needs to decide which ones to use based on their compatibility with one another.
DeepVision tries to bridge these compatibility issues, allowing you to focus on what matters - engineering - while switching seamlessly between ecosystems and backends.
DeepVision loves:
- ❤️ KerasCV and how readable and well-structured it is.
- ❤️ `timm` and how up-to-date it is.
- ❤️ HuggingFace and how diverse it is.
- ❤️ Kornia and how practical it is.
To that end, DeepVision takes cues, API and structure inspiration from these libraries. A huge kudos and acknowledgement goes to every contributor in their respective repositories. At the same time, DeepVision provides the same API across the board, so you no longer have to switch between APIs and styles.
Different teams and projects use different tech stacks, and nobody likes switching from their preferred library for a new project. Furthermore, different libraries implement models in different ways - whether it's code conventions, code structure or model flavors. When it comes to foundational models like ResNets, some libraries default to flavors such as ResNet 1.5, some default to ResNet-B, etc.
With DeepVision, you don't need to switch libraries - you just change the backend with a single argument. Additionally, all implementations strive to be as equal as possible between supported backends, providing the same number of parameters through the same coding style and structure, to enhance readability.
DeepVision is deeply integrated with TensorFlow and PyTorch. You can switch between backends by specifying the backend during initialization:
```python
import deepvision

# TF-based ViTB16 operating on `tf.Tensor`s
tf_model = deepvision.models.ViTB16(include_top=True,
                                    classes=10,
                                    input_shape=(224, 224, 3),
                                    backend='tensorflow')

# PyTorch-based ViTB16 operating on `torch.Tensor`s
pt_model = deepvision.models.ViTB16(include_top=True,
                                    classes=10,
                                    input_shape=(3, 224, 224),
                                    backend='pytorch')
```
All models share the same API, regardless of the backend. With DeepVision, you can rest assured that any differences in training performance between the PyTorch and TensorFlow models aren't due to the specific implementation.
Any model returned as a TensorFlow model is a `tf.keras.Model`, making it fit for use out of the box, with straightforward compatibility with `tf.data` and training on `tf.data.Dataset`s:
```python
import deepvision
import tensorflow as tf
import tensorflow_datasets as tfds

(train_set, test_set), info = tfds.load("imagenette",
                                        split=["train", "validation"],
                                        as_supervised=True, with_info=True)
n_classes = info.features["label"].num_classes

def preprocess_img(img, label):
    img = tf.image.resize(img, (224, 224))
    return img, label

train_set = train_set.map(preprocess_img).batch(32).prefetch(tf.data.AUTOTUNE)
test_set = test_set.map(preprocess_img).batch(32).prefetch(tf.data.AUTOTUNE)

tf_model = deepvision.models.ResNet18V2(include_top=True,
                                        classes=n_classes,
                                        input_shape=(224, 224, 3),
                                        backend='tensorflow')

tf_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    metrics=['accuracy']
)

history = tf_model.fit(train_set, epochs=1, validation_data=test_set)
```
Any model returned as a PyTorch model is a `pl.LightningModule`, which is a `torch.nn.Module`. You may decide to use it manually, as you'd use any `torch.nn.Module`:
```python
pt_model = deepvision.models.ResNet50V2(include_top=True,
                                        classes=10,
                                        input_shape=(3, 224, 224),
                                        backend='pytorch')

# Optimizer, loss function, data loaders and epoch count defined as usual
for epoch in range(epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        inputs, labels = batch
        outputs = pt_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # ...
```
Or you may `compile()` a model and use the PyTorch Lightning `Trainer`, given a dataset:
```python
import deepvision
import torch
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
import pytorch_lightning as pl

device = 'cuda' if torch.cuda.is_available() else 'cpu'

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Resize([224, 224])])

cifar_train = CIFAR10('cifar10', train=True, download=True, transform=transform)
cifar_test = CIFAR10('cifar10', train=False, download=True, transform=transform)

train_dataloader = DataLoader(cifar_train, batch_size=32)
val_dataloader = DataLoader(cifar_test, batch_size=32)

pt_model = deepvision.models.ResNet18V2(include_top=True,
                                        classes=10,
                                        input_shape=(3, 224, 224),
                                        backend='pytorch')

loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(pt_model.parameters(), 1e-4)

pt_model.compile(loss=loss, optimizer=optimizer)

trainer = pl.Trainer(accelerator=device, max_epochs=1)
trainer.fit(pt_model, train_dataloader, val_dataloader)
```
As the PyTorch and TensorFlow implementations are equal, and to encourage cross-framework collaboration, DeepVision provides the option of porting weights between the frameworks. This means that Person 1 can train a model with a TensorFlow pipeline, and Person 2 can then take that checkpoint and fine-tune it with a PyTorch pipeline, and vice versa.
While still in beta, the feature will come for each model, and currently works for EfficientNets.
For end-to-end examples, take a look at the "Automatic Weight Conversion with DeepVision" notebook.
```python
import numpy as np
import tensorflow as tf
import torch
import deepvision

dummy_input_tf = tf.ones([1, 224, 224, 3])
dummy_input_torch = torch.ones(1, 3, 224, 224)

tf_model = deepvision.models.EfficientNetV2B0(include_top=False,
                                              pooling='avg',
                                              input_shape=(224, 224, 3),
                                              backend='tensorflow')
tf_model.save('effnet.h5')

from deepvision.models.classification.efficientnet import efficientnet_weight_mapper

pt_model = efficientnet_weight_mapper.load_tf_to_pt(filepath='effnet.h5', dummy_input=dummy_input_tf)

print(tf_model(dummy_input_tf)['output'].numpy())
print(pt_model(dummy_input_torch).detach().cpu().numpy())

np.allclose(tf_model(dummy_input_tf)['output'].numpy(), pt_model(dummy_input_torch).detach().cpu().numpy()) # True
```
```python
pt_model = deepvision.models.EfficientNetV2B0(include_top=False,
                                              pooling='avg',
                                              input_shape=(3, 224, 224),
                                              backend='pytorch')
torch.save(pt_model.state_dict(), 'effnet.pt')

from deepvision.models.classification.efficientnet import efficientnet_weight_mapper

kwargs = {'include_top': False, 'pooling': 'avg', 'input_shape': (3, 224, 224)}
tf_model = efficientnet_weight_mapper.load_pt_to_tf(filepath='effnet.pt',
                                                    architecture='EfficientNetV2B0',
                                                    kwargs=kwargs,
                                                    dummy_input=dummy_input_torch)
pt_model.eval()

print(pt_model(dummy_input_torch).detach().cpu().numpy())
print(tf_model(dummy_input_tf)['output'].numpy())

np.allclose(tf_model(dummy_input_tf)['output'].numpy(), pt_model(dummy_input_torch).detach().cpu().numpy()) # True
```
Each distinct block that offers a public API, such as the commonly used `MBConv` and `FusedMBConv` blocks, also offers weight porting between the backends:
```python
dummy_input_tf = tf.ones([1, 224, 224, 3])
dummy_input_torch = torch.ones(1, 3, 224, 224)

layer = deepvision.layers.FusedMBConv(3, 32, expand_ratio=2, se_ratio=0.25, backend='tensorflow')
layer(dummy_input_tf)

pt_layer = deepvision.layers.fused_mbconv.tf_to_pt(layer)
pt_layer.eval()

layer(dummy_input_tf).numpy()[0][0][0]
"""
array([ 0.07588673, -0.00770299, -0.03178375, -0.06809437, -0.02139765,
        0.06691956,  0.05638139, -0.00669611, -0.01785627,  0.08565219,
       -0.11967321,  0.01648926, -0.01665686, -0.07395031, -0.05677428,
       -0.13836852,  0.10357075,  0.00552578, -0.02682608,  0.10316402,
       -0.05773047,  0.08470275,  0.02989118, -0.11372866,  0.07361417,
        0.04321364, -0.06806802,  0.06685358,  0.10110974,  0.03804607,
        0.04943493, -0.03414273], dtype=float32)
"""

# Reshape so the outputs are easily comparable
pt_layer(dummy_input_torch).detach().cpu().numpy().transpose(0, 2, 3, 1)[0][0][0]
"""
array([ 0.07595398, -0.00769612, -0.03179125, -0.06815705, -0.021454  ,
        0.06697321,  0.05642046, -0.00668627, -0.01784784,  0.08573981,
       -0.11977906,  0.01648908, -0.01665735, -0.07405862, -0.05680554,
       -0.13849407,  0.10368796,  0.00552754, -0.02683712,  0.10324436,
       -0.0578215 ,  0.08479469,  0.0299269 , -0.11383523,  0.07365884,
        0.04328319, -0.06810313,  0.06690993,  0.10120884,  0.03805522,
        0.04951007, -0.03417065], dtype=float32)
"""
```
We want DeepVision to host a suite of visualization and explainability tools, from activation maps, to learned feature analysis through clustering algorithms:
- `FeatureAnalyzer` - a class used to analyze the learned features of a model and evaluate the predictions
- `ActivationMaps` - a class used to plot activation maps for Convolutional Neural Networks, based on the GradCam++ algorithm (a conceptual sketch of the idea follows the list)
- ...
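To make the idea behind activation maps concrete, here is a minimal sketch of vanilla Grad-CAM for a plain `torch.nn.Module` - this illustrates the underlying technique only, not DeepVision's GradCam++ implementation, and the `grad_cam` helper and `conv_layer` argument below are hypothetical:

```python
# Conceptual sketch only - vanilla Grad-CAM, not DeepVision's GradCam++ API
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, image, target_class):
    """Weigh a convolutional feature map by the gradients of the target class logit."""
    activations, gradients = {}, {}
    fwd = conv_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

    logits = model(image)                  # (1, num_classes)
    logits[0, target_class].backward()     # gradients w.r.t. the chosen class
    fwd.remove(); bwd.remove()

    weights = gradients['g'].mean(dim=(2, 3), keepdim=True)   # channel importance via global average pooling
    cam = F.relu((weights * activations['a']).sum(dim=1))     # weighted sum over channels
    return cam / (cam.max() + 1e-8)                           # normalize to [0, 1]
```

The resulting map can then be upsampled to the input resolution and overlaid on the image.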
Already trained a model and you want to evaluate it? Whether it's a DeepVision model or a model from another library - as long as it's a `tf.keras.Model` or `torch.nn.Module` that can produce an output vector, be it the fully connected top layers or exposed feature maps - you can explore the learned feature space using DeepVision:
```python
import deepvision

tf_model = deepvision.models.ViTTiny16(include_top=True,
                                       classes=10,
                                       input_shape=(224, 224, 3),
                                       backend='tensorflow')
# Train...

feature_analysis = deepvision.evaluation.FeatureAnalyzer(tf_model,               # DeepVision TF model
                                                         train_set,              # `tf.data.Dataset` returning (img, label)
                                                         limit_batches=500,      # Limit the number of batches to go over in the dataset
                                                         classnames=class_names, # Optionally supply classnames for plotting
                                                         backend='tensorflow')   # Specify backend
feature_analysis.extract_features()
feature_analysis.feature_analysis(components=2)
```
Note: All TensorFlow-based DeepVision models are Functional Subclassing models - i.e. they produce a dictionary output containing `1..n` keys, where the standard output is stored under an `output` key corresponding to the `tf.Tensor` output value. The `FeatureAnalyzer` accepts any TensorFlow-based model that either produces a `tf.Tensor` output directly, or produces a dictionary output with an `'output': tf.Tensor` key-value pair.
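For instance, reusing the `tf_model` defined above (and assuming `tensorflow` is imported as `tf`), the standard prediction tensor lives under the `output` key:

```python
preds = tf_model(tf.ones([1, 224, 224, 3]))  # dictionary output
logits = preds['output']                     # the standard `tf.Tensor` output, shape (batch, classes)
```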
The `FeatureAnalyzer` class iterates over the supplied dataset, extracting the features (outputs) of the supplied model when `extract_features()` is called. This expensive operation is performed only once, and all subsequent calls re-use the same features until a new `extract_features()` call. The `feature_analysis()` method performs Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) on the extracted features, and visualizes them using Matplotlib. The `components` parameter is the `n_components` used for the PCA and t-SNE transformations, and naturally has to be either 2 or 3, for 2D and 3D plots respectively.
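Conceptually, the visualization step boils down to projecting an `(N, D)` feature matrix down to 2 or 3 dimensions. A minimal, standalone sketch of that idea with scikit-learn and Matplotlib - illustrative only, with placeholder data, not DeepVision's implementation:

```python
# Standalone illustration of the PCA/t-SNE projection idea (assumes scikit-learn and matplotlib)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

features = np.random.rand(1000, 768)     # (N, D) extracted features - placeholder data
labels = np.random.randint(0, 10, 1000)  # class labels, used only for coloring the plot

pca_2d = PCA(n_components=2).fit_transform(features)
tsne_2d = TSNE(n_components=2).fit_transform(features)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels, s=3); ax1.set_title('PCA')
ax2.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels, s=3); ax2.set_title('t-SNE')
plt.show()
```

The same workflow applies with the PyTorch backend: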
```python
import deepvision

pt_model = deepvision.models.ResNet18V2(include_top=True,
                                        classes=10,
                                        input_shape=(3, 224, 224),
                                        backend='pytorch')
# Train...

classnames = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

feature_analysis = deepvision.evaluation.FeatureAnalyzer(pt_model,               # DeepVision PT model
                                                         train_dataloader,       # `torch.utils.data.DataLoader` returning (img, label)
                                                         limit_batches=500,      # Limit the number of batches to go over in the dataset
                                                         classnames=classnames,  # Optionally supply classnames for plotting
                                                         backend='pytorch')      # Specify backend
feature_analysis.extract_features()
feature_analysis.feature_analysis(components=3, figsize=(20, 20))
```
For more, take a look at the "DeepVision Training and Feature Analysis" notebook.
We want DeepVision to host a model zoo across a wide variety of domains:
- Image Classification and Backbones
- Object Detection
- Semantic, Instance and Panoptic Segmentation
- Object Tracking and MOT
- 3D Reconstruction
- Image Restoration
Currently, these models are supported (parameter counts are equal between backends - see the comparison sketch after the tables):
- EfficientNetV2 Family
| Architecture | Parameters | FLOPs | Size (MB) |
|---|---|---|---|
| EfficientNetV2B0 | 7,200,312 | | |
| EfficientNetV2B1 | 8,212,124 | | |
| EfficientNetV2B2 | 10,178,374 | | |
| EfficientNetV2B3 | 14,486,374 | | |
| EfficientNetV2S | 21,612,360 | | |
| EfficientNetV2M | 54,431,388 | | |
| EfficientNetV2L | 119,027,848 | | |
- Vision Transformer (ViT) Family
| Architecture | Parameters | FLOPs | Size (MB) |
|---|---|---|---|
| ViTTiny16 | 5,717,416 | | |
| ViTS16 | 22,050,664 | | |
| ViTB16 | 86,567,656 | | |
| ViTL16 | 304,326,632 | | |
| ViTTiny32 | 6,131,560 | | |
| ViTS32 | 22,878,952 | | |
| ViTB32 | 88,224,232 | | |
| ViTL32 | 306,535,400 | | |
- ResNetV2 Family
| Architecture | Parameters | FLOPs | Size (MB) |
|---|---|---|---|
| ResNet18V2 | 11,696,488 | | |
| ResNet34V2 | 21,812,072 | | |
| ResNet50V2 | 25,613,800 | | |
| ResNet101V2 | 44,675,560 | | |
| ResNet152V2 | 60,380,648 | | |
- SegFormer Family
| Architecture | Parameters | FLOPs | Size (MB) |
|---|---|---|---|
| SegFormerB0 | 3,714,915 | | |
| SegFormerB1 | 13,678,019 | | |
| SegFormerB2 | 27,348,931 | | |
| SegFormerB3 | 47,224,771 | | |
| SegFormerB4 | 63,995,331 | | |
| SegFormerB5 | 84,595,651 | | |
- Mix-Transformer (MiT) Family:
| Architecture | Parameters | FLOPs | Size (MB) |
|---|---|---|---|
| MiTB0 | 3,321,962 | | |
| MiTB1 | 13,156,554 | | |
| MiTB2 | 24,201,418 | | |
| MiTB3 | 44,077,258 | | |
| MiTB4 | 60,847,818 | | |
| MiTB5 | 81,448,138 | | |
- Segment Anything Model (SAM) Family

| Architecture | Parameters | FLOPs | Size (MB) |
|---|---|---|---|
| SAM_B | 93,735,472 | | |
| SAM_L | 312,342,832 | | |
| SAM_H | 641,090,608 | | |
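As a quick sanity check of the parameter parity between backends, the counts can be compared directly - a minimal sketch using standard Keras and PyTorch parameter counting (a 1000-class head is assumed here; the exact number depends on the configuration):

```python
import tensorflow as tf
import torch
import deepvision

tf_model = deepvision.models.ResNet18V2(include_top=True, classes=1000,
                                        input_shape=(224, 224, 3), backend='tensorflow')
pt_model = deepvision.models.ResNet18V2(include_top=True, classes=1000,
                                        input_shape=(3, 224, 224), backend='pytorch')

tf_model(tf.ones([1, 224, 224, 3]))                        # build the Keras model before counting
tf_params = tf_model.count_params()
pt_params = sum(p.numel() for p in pt_model.parameters())
print(tf_params, pt_params)                                # the two counts should be identical
```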
Models and architectures are built on top of each other. VGGNets begat ResNets, which begat a plethora of other architectures, with incremental improvements, small changes and new ideas building on top of already accepted ideas to bring about new advances. To make architectures more approachable, easily buildable and more readable - and to make experimentation and building new architectures simpler - we want to expose as many internal building blocks as possible as part of the general DeepVision API. If an architecture uses a certain block repeatedly, it's likely going to be exposed as part of the public API.
Most importantly, all blocks share the same API, and are agnostic to the backend, with an identical implementation.
You can prototype and debug in PyTorch and then move to TensorFlow to build a model, or vice versa. For instance, a generic `TransformerEncoder` deals with the same arguments, in the same order, and performs the same operation on both backends:
```python
import torch
import tensorflow as tf
import deepvision

tensor = torch.rand(1, 197, 1024)
trans_encoded = deepvision.layers.TransformerEncoder(project_dim=1024,
                                                     mlp_dim=3072,
                                                     num_heads=8,
                                                     backend='pytorch')(tensor)
print(trans_encoded.shape) # torch.Size([1, 197, 1024])

tensor = tf.random.normal([1, 197, 1024])
trans_encoded = deepvision.layers.TransformerEncoder(project_dim=1024,
                                                     mlp_dim=3072,
                                                     num_heads=8,
                                                     backend='tensorflow')(tensor)
print(trans_encoded.shape) # TensorShape([1, 197, 1024])
```
Similarly, you can create something funky with the building blocks! Say, pass an image through an `MBConv` block (MobileNet and EfficientNet style) and through a `PatchingAndEmbedding`/`TransformerEncoder` (ViT style) duo, and add the results together:
```python
inputs = torch.rand(1, 3, 224, 224)

x = deepvision.layers.MBConv(input_filters=3,
                             output_filters=32,
                             backend='pytorch')(inputs)

y = deepvision.layers.PatchingAndEmbedding(project_dim=32,
                                           patch_size=16,
                                           input_shape=(3, 224, 224),
                                           backend='pytorch')(inputs)
y = deepvision.layers.TransformerEncoder(project_dim=32,
                                         num_heads=8,
                                         mlp_dim=64,
                                         backend='pytorch')(y)

y = y.mean(1)                                # average over the token dimension
y = y.reshape(y.shape[0], y.shape[1], 1, 1)  # reshape for broadcasting against the feature map

add = x + y
print(add.shape) # torch.Size([1, 32, 224, 224])
```
Would this make sense in an architecture? Maybe. Maybe not. Your imagination is your limit.
We want DeepVision to host a suite of datasets and data loading utilities that can be easily used in production, as well as datasets suited for use with DeepVision models and with vanilla PyTorch and TensorFlow models, in an attempt to lower the barrier to entry for some domains of computer vision.
For instance, you can easily load the Tiny NeRF dataset used to train Neural Radiance Fields with DeepVision, as either a `tf.data.Dataset` or a `torch.utils.data.Dataset`:
```python
import deepvision

train_ds, valid_ds = deepvision.datasets.load_tiny_nerf(save_path='tiny_nerf.npz',
                                                        validation_split=0.2,
                                                        backend='tensorflow')

print('Train dataset length:', len(train_ds)) # Train dataset length: 84
train_ds # <ZipDataset element_spec=(TensorSpec(shape=(100, 100, 3), dtype=tf.float32, name=None),
         # (TensorSpec(shape=(320000, 99), dtype=tf.float32, name=None), TensorSpec(shape=(100, 100, 32), dtype=tf.float32, name=None)))>

print('Valid dataset length:', len(valid_ds)) # Valid dataset length: 22
valid_ds # <ZipDataset element_spec=(TensorSpec(shape=(100, 100, 3), dtype=tf.float32, name=None),
         # (TensorSpec(shape=(320000, 99), dtype=tf.float32, name=None), TensorSpec(shape=(100, 100, 32), dtype=tf.float32, name=None)))>
```
```python
import deepvision
import torch

train_ds, valid_ds = deepvision.datasets.load_tiny_nerf(save_path='tiny_nerf.npz',
                                                        validation_split=0.2,
                                                        backend='pytorch')

train_loader = torch.utils.data.DataLoader(train_ds, batch_size=16, drop_last=True)
valid_loader = torch.utils.data.DataLoader(valid_ds, batch_size=16, drop_last=True)

print('Train dataset length:', len(train_ds)) # Train dataset length: 84
train_ds # <deepvision.datasets.tiny_nerf.tiny_nerf_pt.TinyNerfDataset at 0x25e97f4dfd0>

print('Valid dataset length:', len(valid_ds)) # Valid dataset length: 22
valid_ds # <deepvision.datasets.tiny_nerf.tiny_nerf_pt.TinyNerfDataset at 0x25e94939080>
```
If you'd like to take a look at an example of training NeRFs with PyTorch and TensorFlow, take a look at the "Training Neural Radiance Field (NeRF) Models with DeepVision" notebook.
We want DeepVision to host a suite of training frameworks, from classic supervised, to weakly-supervised and unsupervised learning. These frameworks would serve as a high-level API that you can optionally use, while still focusing on the non-proprietary classes and architectures you're used to, such as pure `tf.keras.Model`s and `torch.nn.Module`s.
We want DeepVision to host easy, backend-agnostic image operations (resizing, colorspace conversion, etc.), as well as data augmentation layers, losses and metrics.
If DeepVision plays a part of your research, we'd really appreciate a citation!
```bibtex
@misc{landup2023deepvision,
  title={DeepVision},
  author={David Landup},
  year={2023},
  howpublished={\url{https://github.com/DavidLandup0/deepvision/}},
}
```