v1.0
This is a Keras implementation of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. The pre-trained and fine-tuned weights are ported from the official PyTorch model. Below is the list of all available models in `.h5` format; it includes both pre-trained and fine-tuned checkpoints.

The naming convention for these models is `TFVideoMAE_{size}_{dataset}_{input_frame}x{input_size}_FT/PT`. Here, `size` is one of `base`, `small`, `large`, and `huge`. The `PT` (pre-trained) model is the video masked autoencoder, trained in a self-supervised manner, and the `FT` (fine-tuned) model is the encoder part of `PT` plus a task-specific classification head. For the downstream tasks, the standard benchmark datasets are used, i.e. Kinetics-400, Something-Something-V2, and UCF101.
In this Keras implementation, the models are available in both SavedModel and `.h5` formats; check the release page of v1.1 for the other checkpoints. Please note that, officially, one more `huge` model variant is available for Kinetics-400. However, its official `PT` checkpoint is corrupted (MCG-NJU/VideoMAE#89), and the `FT` checkpoint is above 2 GB, which makes it impossible to upload here, but it can be found here.
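As a rough usage sketch (not an official API of this release), a downloaded checkpoint can be loaded with the standard Keras utilities. The file name and custom-object handling below are assumptions; see the repository's demo for the exact loading procedure.

```python
import tensorflow as tf

# Hypothetical sketch: loading a fine-tuned checkpoint downloaded from this
# release page. If the .h5 file contains custom layers, Keras may need the
# matching classes from this repository in scope (e.g. via custom_objects);
# the SavedModel variant can usually be loaded without them.
ft_model = tf.keras.models.load_model(
    "TFVideoMAE_B_K400_16x224_FT.h5",
    compile=False,  # optimizer state is not needed for inference
)
ft_model.summary()
```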
| Model Name | Arch | Params |
| --- | --- | --- |
| TFVideoMAE_S_K400_16x224_FT.h5 | encoder | 22M |
| TFVideoMAE_S_K400_16x224_PT.h5 | encoder + decoder | 24M |
| TFVideoMAE_B_K400_16x224_FT.h5 | encoder | 87M |
| TFVideoMAE_B_K400_16x224_PT.h5 | encoder + decoder | 94M |
| TFVideoMAE_L_K400_16x224_FT.h5 | encoder | 304M |
| TFVideoMAE_L_K400_16x224_PT.h5 | encoder + decoder | 343M |
| TFVideoMAE_S_SSv2_16x224_FT.h5 | encoder | 22M |
| TFVideoMAE_S_SSv2_16x224_PT.h5 | encoder + decoder | 24M |
| TFVideoMAE_B_SSv2_16x224_FT.h5 | encoder | 86M |
| TFVideoMAE_B_SSv2_16x224_PT.h5 | encoder + decoder | 94M |
| TFVideoMAE_B_UCF_16x224_FT.h5 | encoder | 86M |
| TFVideoMAE_B_UCF_16x224_PT.h5 | encoder + decoder | 94M |
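A minimal inference sketch with one of the fine-tuned checkpoints might look like the following. The input layout and the [0, 1] scaling are assumptions, so treat the repository's demo as the reference for the actual preprocessing.

```python
import numpy as np
import tensorflow as tf

# Hypothetical usage sketch: run a fine-tuned Kinetics-400 checkpoint on a
# dummy clip of 16 frames at 224x224 resolution.
ft_model = tf.keras.models.load_model("TFVideoMAE_B_K400_16x224_FT.h5", compile=False)

# Dummy video clip: (batch, frames, height, width, channels), values in [0, 1].
clip = np.random.uniform(0.0, 1.0, size=(1, 16, 224, 224, 3)).astype("float32")

logits = ft_model.predict(clip)              # expected shape: (1, 400) for Kinetics-400
top_class = int(tf.argmax(logits, axis=-1)[0])
print("predicted class id:", top_class)
```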