Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*
In this repository, we provide detection configs and models for MViTv2 (CVPR 2022) in Detectron2. For image classification tasks, please refer to the MViTv2 repo.
Name | pre-train | Method | epochs | box AP | mask AP | #params | FLOPS | model id | download |
---|---|---|---|---|---|---|---|---|---|
MViTV2-T | IN1K | Mask R-CNN | 36 | 48.3 | 43.8 | 44M | 279G | 307611773 | model |
MViTV2-T | IN1K | Cascade Mask R-CNN | 36 | 52.2 | 45.0 | 76M | 701G | 308344828 | model |
MViTV2-S | IN1K | Cascade Mask R-CNN | 36 | 53.2 | 46.0 | 87M | 748G | 308344647 | model |
MViTV2-B | IN1K | Cascade Mask R-CNN | 36 | 54.1 | 46.7 | 103M | 814G | 308109448 | model |
MViTV2-B | IN21K | Cascade Mask R-CNN | 36 | 54.9 | 47.4 | 103M | 814G | 309003202 | model |
MViTV2-L | IN21K | Cascade Mask R-CNN | 50 | 55.8 | 48.3 | 270M | 1519G | 308099658 | model |
MViTV2-H | IN21K | Cascade Mask R-CNN | 36 | 56.1 | 48.5 | 718M | 3084G | 309013744 | model |
Note that the above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total. The ImageNet pre-trained model weights are obtained from the MViTv2 repo.
All configs can be trained with:
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py
By default, we use 64 GPUs with a total batch size of 64 for training.
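When fewer than 64 GPUs are available, the batch size usually has to shrink with them. The sketch below is an illustration, not part of the released configs. It keeps the per-GPU batch size fixed and rescales the learning rate with the linear scaling heuristic, which is a common rule of thumb rather than a setting the authors validated. The helper name `scaled_overrides` is hypothetical, the override keys `dataloader.train.total_batch_size` and `optimizer.lr` are assumed to match the config you train with, and `base_lr` is a placeholder that should be read from that config.

```python
def scaled_overrides(num_gpus, base_lr, ref_gpus=64, ref_batch_size=64):
    """Return key=value overrides for training on `num_gpus` GPUs.

    Assumes the reference setup from this README (64 GPUs, batch size 64)
    and that the config exposes the two keys used below.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    # Keep the per-GPU batch size constant (64 / 64 = 1 image per GPU).
    per_gpu = ref_batch_size // ref_gpus
    batch_size = per_gpu * num_gpus
    # Linear scaling heuristic: learning rate proportional to total batch size.
    lr = base_lr * batch_size / ref_batch_size
    return [
        f"dataloader.train.total_batch_size={batch_size}",
        f"optimizer.lr={lr:g}",
    ]


# base_lr=1e-4 is a placeholder; use the value from your config.
print(" ".join(scaled_overrides(num_gpus=8, base_lr=1e-4)))
```

The printed overrides can be appended to the training command after the config path. Results obtained this way may differ from the numbers in the table above, which were measured with the 64-GPU setup.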
Model evaluation can be done similarly:
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
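To evaluate several checkpoints from a script, the two commands above can be assembled programmatically. The following is a minimal sketch: `build_command` is a hypothetical helper, the paths are the same placeholders used above, and `--num-gpus` is assumed to be accepted by the training script through Detectron2's default argument parser.

```python
import shlex


def build_command(config_file, checkpoint=None, num_gpus=None,
                  script="../../tools/lazyconfig_train_net.py"):
    """Build the argv list for training, or for evaluation if a checkpoint is given."""
    argv = [script, "--config-file", config_file]
    if num_gpus is not None:
        # Assumed flag: number of GPUs to use on this machine.
        argv += ["--num-gpus", str(num_gpus)]
    if checkpoint is not None:
        # Evaluation mode: skip training and load the given weights.
        argv += ["--eval-only", f"train.init_checkpoint={checkpoint}"]
    return argv


cmd = build_command("configs/path/to/config.py",
                    checkpoint="/path/to/model_checkpoint", num_gpus=8)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting problems when checkpoint paths contain spaces.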
If you use MViTv2, please use the following BibTeX entry.
@inproceedings{li2021improved,
title={MViTv2: Improved multiscale vision transformers for classification and detection},
author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
booktitle={CVPR},
year={2022}
}