GLEE: General Object Foundation Model for Images and Videos at Scale

Junfeng Wu*, Yi Jiang*, Qihao Liu, Zehuan Yuan, Xiang Bai^†, and Song Bai^†

* Equal Contribution, ^†Correspondence

[Project Page] [Paper] [HuggingFace Demo] [Video Demo]

Highlight:

GLEE has been accepted to CVPR 2024 as a Highlight!
GLEE is a general object foundation model jointly trained on over ten million images from various benchmarks with diverse levels of supervision.
GLEE is capable of addressing a wide range of object-centric tasks simultaneously while maintaining SOTA performance.
GLEE demonstrates remarkable versatility and robust zero-shot transferability across a spectrum of object-level image and video tasks, and can serve as a foundational component for enhancing other architectures or models.

We will release the following contents for GLEE❗

Demo Code
Model Zoo
Comprehensive User Guide
Training Code and Scripts
Detailed Evaluation Code and Scripts
Tutorial for Zero-shot Testing or Fine-tuning GLEE on New Datasets

Getting started

Installation: Please refer to INSTALL.md for more details.
Data preparation: Please refer to DATA.md for more details.
Training: Please refer to TRAIN.md for more details.
Testing: Please refer to TEST.md for more details.
Model zoo: Please refer to MODEL_ZOO.md for more details.

Run the demo APP

Try our online demo app on [HuggingFace Demo] or use it locally:

git clone https://github.com/FoundationVision/GLEE
# enter the repository so that app.py can be found
cd GLEE
# the demo runs on either CPU or GPU
python app.py

Introduction

GLEE has been trained on over ten million images from 16 datasets, fully harnessing both existing annotated data and cost-effective automatically labeled data to construct a diverse training set. This extensive training regime endows GLEE with formidable generalization capabilities.

GLEE consists of an image encoder, a text encoder, a visual prompter, and an object decoder, as illustrated in the figure. The text encoder processes arbitrary descriptions related to the task, including 1) object category lists, 2) object names in any form, 3) captions about objects, and 4) referring expressions. The visual prompter encodes user inputs given during interactive segmentation, such as 1) points, 2) bounding boxes, and 3) scribbles, into visual representations of the target objects. These text and visual representations are then fed into a detector that extracts objects from images according to the textual and visual input.
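The division of labor between the four components can be pictured with a small, self-contained sketch. It only models which component would handle which kind of prompt, following the description above; it does not run any model. Every class, function, and string name in it (TextPrompt, VisualPrompt, route_prompt, describe_pipeline, and the stage labels) is a hypothetical name chosen for illustration and is not part of the GLEE codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# NOTE: all names here are hypothetical and for illustration only;
# they are assumptions, not the GLEE API.

# Prompt kinds listed in the paragraph above.
TEXT_KINDS = {"categories", "names", "caption", "referring"}
VISUAL_KINDS = {"point", "box", "scribble"}


@dataclass
class TextPrompt:
    kind: str          # one of TEXT_KINDS
    text: List[str]    # e.g. a category list or a single referring expression


@dataclass
class VisualPrompt:
    kind: str                          # one of VISUAL_KINDS
    coords: List[Tuple[float, float]]  # (x, y) pixel coordinates


Prompt = Union[TextPrompt, VisualPrompt]


def route_prompt(prompt: Prompt) -> str:
    """Return the label of the component that would encode this prompt."""
    if isinstance(prompt, TextPrompt):
        if prompt.kind not in TEXT_KINDS:
            raise ValueError(f"unknown text prompt kind: {prompt.kind}")
        return "text_encoder"
    if isinstance(prompt, VisualPrompt):
        if prompt.kind not in VISUAL_KINDS:
            raise ValueError(f"unknown visual prompt kind: {prompt.kind}")
        return "visual_prompter"
    raise TypeError(f"unsupported prompt type: {type(prompt).__name__}")


def describe_pipeline(prompts: List[Prompt]) -> List[str]:
    """List, in order, the stages a request with these prompts would use."""
    stages = ["image_encoder"]           # the image is always encoded first
    for prompt in prompts:
        stage = route_prompt(prompt)
        if stage not in stages:          # each encoder appears at most once
            stages.append(stage)
    stages.append("object_decoder")      # the decoder consumes all encodings
    return stages


if __name__ == "__main__":
    request = [
        TextPrompt("categories", ["person", "dog"]),
        VisualPrompt("box", [(10.0, 20.0), (110.0, 220.0)]),
    ]
    print(describe_pipeline(request))
```

A request that carries only a category list would skip the visual prompter in this sketch, while an interactive request with a point or box would add it. In both cases the image encoder comes first and the object decoder last.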

Based on the above designs, GLEE can seamlessly unify a wide range of object perception tasks in images and videos, including object detection, instance segmentation, grounding, multi-target tracking (MOT), video instance segmentation (VIS), video object segmentation (VOS), and interactive segmentation and tracking. It also supports open-world and large-vocabulary image and video detection and segmentation tasks.

Results

Image-level tasks

Video-level tasks


Citing GLEE

@misc{wu2023GLEE,
  author = {Junfeng Wu and Yi Jiang and Qihao Liu and Zehuan Yuan and Xiang Bai and Song Bai},
  title = {General Object Foundation Model for Images and Videos at Scale},
  year={2023},
  eprint={2312.09158},
  archivePrefix={arXiv}
}

Acknowledgments

Thanks to UNINEXT for the implementation of multi-dataset training and data processing.
Thanks to VNext for sharing experience with video instance segmentation (VIS).
Thanks to SEEM for providing the implementation of the visual prompter.
Thanks to MaskDINO for providing a powerful detector and segmenter.
