Yanghao Li, Hanzi Mao, Ross Girshick†, Kaiming He†
In this repository, we provide configs and models in Detectron2 for ViTDet as well as MViTv2 and Swin backbones with our implementation and settings as described in the ViTDet paper.
## COCO

### Mask R-CNN

| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTDet, ViT-B | IN1K, MAE | 0.314 | 0.079 | 10.9 | 51.6 | 45.9 | 325346929 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.603 | 0.125 | 20.9 | 55.5 | 49.2 | 325599698 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.098 | 0.178 | 31.5 | 56.7 | 50.2 | 329145471 | model |
### Cascade Mask R-CNN

| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | IN21K, sup | 0.389 | 0.077 | 8.7 | 53.9 | 46.2 | 342979038 | model |
| Swin-L | IN21K, sup | 0.508 | 0.097 | 12.6 | 55.0 | 47.2 | 342979186 | model |
| MViTv2-B | IN21K, sup | 0.475 | 0.090 | 8.9 | 55.6 | 48.1 | 325820315 | model |
| MViTv2-L | IN21K, sup | 0.844 | 0.157 | 19.7 | 55.7 | 48.3 | 325607715 | model |
| MViTv2-H | IN21K, sup | 1.655 | 0.285 | 18.4* | 55.9 | 48.3 | 326187358 | model |
| ViTDet, ViT-B | IN1K, MAE | 0.362 | 0.089 | 12.3 | 54.0 | 46.7 | 325358525 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.643 | 0.142 | 22.3 | 57.6 | 50.0 | 328021305 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.137 | 0.196 | 32.9 | 58.7 | 51.0 | 328730692 | model |
## LVIS

### Mask R-CNN

| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViTDet, ViT-B | IN1K, MAE | 0.317 | 0.085 | 14.4 | 40.2 | 38.2 | 329225748 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.576 | 0.137 | 24.7 | 46.1 | 43.6 | 329211570 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.059 | 0.186 | 35.3 | 49.1 | 46.0 | 332434656 | model |
### Cascade Mask R-CNN

| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | IN21K, sup | 0.368 | 0.090 | 11.5 | 44.0 | 39.6 | 329222304 | model |
| Swin-L | IN21K, sup | 0.486 | 0.105 | 13.8 | 46.0 | 41.4 | 329222724 | model |
| MViTv2-B | IN21K, sup | 0.475 | 0.100 | 11.8 | 46.3 | 42.0 | 329477206 | model |
| MViTv2-L | IN21K, sup | 0.844 | 0.172 | 21.0 | 49.4 | 44.2 | 329661552 | model |
| MViTv2-H | IN21K, sup | 1.661 | 0.290 | 21.3* | 49.5 | 44.1 | 330445165 | model |
| ViTDet, ViT-B | IN1K, MAE | 0.356 | 0.099 | 15.2 | 43.0 | 38.9 | 329226874 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.629 | 0.150 | 24.9 | 49.2 | 44.5 | 329042206 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.100 | 0.204 | 35.5 | 51.5 | 46.6 | 332552778 | model |
Note: Unlike the system-level comparisons in the paper, these models use a lower resolution (1024 instead of 1280) and standard NMS (instead of soft NMS). As a result, they have slightly lower box and mask AP.
We observed higher variance in LVIS evaluation results compared to COCO. For example, the standard deviations of box AP and mask AP were 0.30% (compared to 0.10% on COCO) when we trained ViTDet, ViT-B five times with varying random seeds.
The above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total. *: activation checkpointing is used.
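The starred rows use activation checkpointing, which trades extra compute for memory by recomputing activations during the backward pass. Below is a minimal sketch of enabling it through a config override, assuming the `use_act_checkpoint` flag exposed by this repo's backbone implementations and the `model.backbone.net` config path used by the ViTDet configs (both worth verifying against your checkout):

```python
from detectron2.config import LazyConfig

# Assumption: `model.backbone.net` is the ViT/MViT trunk and it exposes a
# `use_act_checkpoint` flag, as in this repo's backbone implementations.
cfg = LazyConfig.load("configs/path/to/config.py")
cfg.model.backbone.net.use_act_checkpoint = True  # recompute activations to save memory
```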
All configs can be trained with:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py
```
By default, we use 64 GPUs with a total batch size of 64 for training.
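When training on fewer GPUs, the total batch size (and, by the linear scaling rule, the learning rate) should be reduced accordingly. A minimal sketch of such an override, assuming the `dataloader.train.total_batch_size` and `optimizer.lr` keys used by Detectron2's LazyConfig baselines:

```python
from detectron2.config import LazyConfig

cfg = LazyConfig.load("configs/path/to/config.py")
# Halve the batch size for a 32-GPU run and scale the learning rate linearly.
cfg.dataloader.train.total_batch_size = 32
cfg.optimizer.lr *= 32 / 64
```

The same overrides can also be appended to the training command as space-separated `key=value` pairs.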
Model evaluation can be done similarly:
```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```
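Checkpoints can also be used programmatically. Below is a minimal inference sketch with Detectron2's LazyConfig API, using placeholder config and checkpoint paths and a dummy image at the 1024 training resolution:

```python
import torch
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer

# Build the model exactly as the config defines it, then load trained weights.
cfg = LazyConfig.load("configs/path/to/config.py")
model = instantiate(cfg.model)
model.eval()
DetectionCheckpointer(model).load("/path/to/model_checkpoint")

# Detectron2 detection models take a list of dicts, each with a CHW image tensor.
image = torch.zeros(3, 1024, 1024)
with torch.no_grad():
    outputs = model([{"image": image, "height": 1024, "width": 1024}])
print(outputs[0]["instances"])
```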
## Citing ViTDet

If you use ViTDet, please use the following BibTeX entry.
```BibTeX
@article{li2022exploring,
  title={Exploring plain vision transformer backbones for object detection},
  author={Li, Yanghao and Mao, Hanzi and Girshick, Ross and He, Kaiming},
  journal={arXiv preprint arXiv:2203.16527},
  year={2022}
}
```