This file documents a collection of models reported in our paper. All numbers were obtained on Big Basin servers with 8 NVIDIA V100 GPUs & NVLink (except that Swin-L models were trained with 16 NVIDIA V100 GPUs).
- The "Name" column contains a link to the config file. Running
train_net.py --num-gpus 8
with this config file will reproduce the model (except Swin-L models are trained with 16 NVIDIA V100 GPUs with distributed training on two nodes). - The model id column is provided for ease of reference. To check downloaded file integrity, any model on this page contains its md5 prefix in its file name.
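As a hedged illustration of that integrity check, here is a minimal Python sketch. It assumes Detectron2's usual `model_final_<md5 prefix>.pkl` naming convention; the example file name in the comment is hypothetical, not one taken from this page.

```python
# Minimal sketch: verify that a checkpoint's md5 digest starts with the prefix
# embedded in its file name (e.g. a hypothetical "model_final_f10217.pkl").
# The naming pattern is an assumption based on Detectron2's convention.
import hashlib
import re
import sys

def verify_md5_prefix(path: str) -> bool:
    match = re.search(r"_([0-9a-f]{6,})\.pkl$", path)
    if match is None:
        raise ValueError(f"no md5 prefix found in file name: {path}")
    expected_prefix = match.group(1)
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(expected_prefix)

if __name__ == "__main__":
    print(verify_md5_prefix(sys.argv[1]))
```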
It's common to initialize from backbone models pre-trained on ImageNet classification tasks. The following backbone models are available:
- R-50.pkl (torchvision): converted copy of torchvision's ResNet-50 model. More details can be found in the conversion script.
- R-103.pkl: a ResNet-101 with its first 7x7 convolution replaced by three 3x3 convolutions (a sketch is shown below). This modification, referred to as ResNet101c in our paper, is commonly used in semantic segmentation papers. We pre-train this backbone on ImageNet using the default recipe of pytorch examples.
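For illustration only, a minimal PyTorch sketch of this "deep stem" replacement follows. The intermediate channel widths (32, 32, 64) follow the common ResNet-C convention and are an assumption here, not values taken from this page.

```python
# Sketch of the R-103 stem: the single 7x7 stride-2 convolution of a vanilla
# ResNet is replaced by three 3x3 convolutions with the same overall stride (2)
# and output width (64). Channel widths are assumed, per the ResNet-C convention.
import torch
import torch.nn as nn

def deep_stem(in_channels: int = 3) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(32),
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(32),
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )

x = torch.randn(1, 3, 224, 224)
print(deep_stem()(x).shape)  # torch.Size([1, 64, 112, 112]), same as a 7x7 stem
```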
Note: the following pretrained models are available in Detectron2 but are not used in our paper.
- R-50.pkl: converted copy of MSRA's original ResNet-50 model.
- R-101.pkl: converted copy of MSRA's original ResNet-101 model.
- X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB.
Our paper also uses ImageNet pretrained models that are not part of Detectron2; please refer to tools to obtain those pretrained models.
All models available for download through this document are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
Name | Backbone | iterations | PQ | AP | mIoU | model id | download |
---|---|---|---|---|---|---|---|
Mask2Former (200 queries) | Swin-L (IN21k) | 160k | 48.1 | 34.2 | 54.5 | 48267279 | model |
Mask2Former (200 queries) + RankSeg | Swin-L (IN21k) | 160k | 48.9 | - | 56.2 | - | model |
Name | Backbone | iterations | mIoU | mIoU (ms+flip) | model id | download |
---|---|---|---|---|---|---|
Mask2Former | Swin-B (IN21k) | 160k | 53.9 | 55.1 | 48333157_5 | model |
Mask2Former + RankSeg | Swin-B (IN21k) | 160k | 54.9 | 55.6 | - | model |
Mask2Former + GT | Swin-B (IN21k) | 160k | 68.0 | - | - | model |
Mask2Former | Swin-L (IN21k) | 160k | 56.1 | 57.3 | 48004474_0 | model |
Mask2Former + RankSeg | Swin-L (IN21k) | 160k | 56.5 | 58.0 | - | model |
Name | Backbone | iterations | AP | model id | download |
---|---|---|---|---|---|
Mask2Former | R101 | 6k | 49.2 | 50897581_1 | model |
Mask2Former + RankSeg | R101 | 6k | 50.5 | - | model |
Mask2Former | Swin-B (IN21k) | 6k | 59.5 | 50897733_2 | model |
Mask2Former + RankSeg | Swin-B (IN21k) | 6k | 60.3 | - | model |
Mask2Former (200 queries) | Swin-L (IN21k) | 6k | 60.4 (60.7) | 50908813_0 | model |
Mask2Former (200 queries) + RankSeg | Swin-L (IN21k) | 6k | 61.1 (61.4) | - | model |
* To evaluate YouTube-VIS 2019 models, upload result.json to the online server (see the sketch below). Considering the variance in results, we report the average result of 3 models for our methods.
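A minimal sketch of packaging the prediction file for upload, assuming the server expects a zip archive with result.json at its root; this layout is an assumption, so follow the evaluation server's own submission instructions.

```python
# Package result.json into a zip for the YouTube-VIS 2019 evaluation server.
# NOTE: the archive layout here is an assumption, not specified on this page.
import zipfile

with zipfile.ZipFile("submission.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("result.json")  # stored at the archive root
```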
Name | Backbone | iterations | mIoU | model id | download |
---|---|---|---|---|---|
Mask2Former | R101 | 6k | 45.9 | - | model |
Mask2Former + RankSeg | R101 | 6k | 47.0 | - | model |
Mask2Former + GT | R101 | 6k | 62.3 | - | model |
Mask2Former | Swin-B (IN21k) | 6k | 59.4 | - | - |
Mask2Former + RankSeg | Swin-B (IN21k) | 6k | 60.1 | - | - |
* Considering the variance in results, we report the average result of 3 models for both the baseline and our methods.