Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Jingkang Yang*¹, Yuhao Dong*²,⁶, Shuai Liu*³,⁶, Bo Li*¹
Ziyue Wang†¹, Chencheng Jiang†⁴, Haoran Tan†³, Jiamu Kang†²
Yuanhan Zhang¹, Kaiyang Zhou⁵, Ziwei Liu¹,✉

¹S-Lab, Nanyang Technological University    ²Tsinghua University
³Beijing University of Posts and Telecommunications
⁴Xi'an Jiaotong University  ⁵Hong Kong Baptist University
⁶Shanghai AI Laboratory

*Equal Contribution   †Equal Engineering Contribution   ✉Corresponding Author

Abstract

Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. When seamlessly integrated into an embodied agent, they mark a crucial stride towards autonomous and context-aware systems capable of formulating plans and executing commands with precision. In this paper, we introduce Octopus, an embodied VLM designed to 1) proficiently decipher an agent's visual and textual task objectives, 2) formulate intricate action sequences, and 3) generate executable code. Our design allows the agent to handle a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games. Octopus is trained by leveraging GPT-4 to control an explorative agent that generates training data, i.e., action blueprints and the corresponding executable code, within our experimental environment called OctoVerse. We also collect environmental feedback, which enables an enhanced training scheme, Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments, we illuminate Octopus's functionality, present compelling results, and show that the proposed RLEF refines the agent's decision-making. By open-sourcing our model architecture, simulator, and dataset, we aspire to ignite further innovation and foster collaborative applications within the broader embodied AI community. The codebase is released at https://github.com/dongyh20/Octopus.

Method

Collection Example for "Cook a Bacon" Task. GPT-4 perceives the environment through the environmental message and produces anticipated plans and code in accordance with the detailed system message. This code is subsequently executed in the simulator, directing the agent to the subsequent state. For each state, we gather the environmental message, wherein observed objects and relations are substituted by egocentric images to serve as the training input. The response from GPT-4 acts as the training output. Environmental feedback, specifically the determination of whether each target state is met, is documented for RLEF training.
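The collection loop described in this caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: `ToyKitchen`, `scripted_planner`, `StepRecord`, and `collect_episode` are hypothetical names, the toy environment stands in for the simulator, and the scripted planner stands in for the GPT-4 call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StepRecord:
    """One collected state: training input, training output, and feedback."""
    env_message: str            # text the planner saw (objects and relations)
    image_ref: str              # placeholder for the egocentric image(s)
    response: str               # planner output: plan plus code
    goals_met: Dict[str, bool]  # per-target-state environmental feedback


class ToyKitchen:
    """Hypothetical stand-in for the simulator; not the OctoVerse API."""

    def __init__(self) -> None:
        self.state = {"holding_bacon": False, "bacon_cooked": False}
        self.step_id = 0

    def observe(self) -> Tuple[str, str]:
        msg = f"objects: bacon, pan; state: {sorted(self.state.items())}"
        return msg, f"frame_{self.step_id}.png"

    def execute(self, code: str) -> None:
        # A real simulator would run the generated program; here two
        # made-up commands flip flags so the loop has something to do.
        if "pick_bacon" in code:
            self.state["holding_bacon"] = True
        if "cook_bacon" in code and self.state["holding_bacon"]:
            self.state["bacon_cooked"] = True
        self.step_id += 1

    def check_goals(self) -> Dict[str, bool]:
        return dict(self.state)


def scripted_planner(system_message: str, env_message: str) -> str:
    """Stands in for the GPT-4 call; returns a plan and code as text."""
    if "('holding_bacon', False)" in env_message:
        return "Plan: pick up the bacon.\nCode: pick_bacon()"
    return "Plan: cook the bacon.\nCode: cook_bacon()"


def collect_episode(env: ToyKitchen,
                    planner: Callable[[str, str], str],
                    system_message: str,
                    max_steps: int = 5) -> List[StepRecord]:
    """Observe, plan, execute, and log feedback until all goals are met."""
    records: List[StepRecord] = []
    for _ in range(max_steps):
        env_message, image_ref = env.observe()
        response = planner(system_message, env_message)
        env.execute(response)
        goals = env.check_goals()
        records.append(StepRecord(env_message, image_ref, response, goals))
        if all(goals.values()):
            break
    return records


if __name__ == "__main__":
    for rec in collect_episode(ToyKitchen(), scripted_planner, "system message"):
        print(rec.image_ref, rec.goals_met)
```

In this sketch each `StepRecord` keeps the pieces the caption names: the image reference would replace the textual objects and relations as training input, `response` is the training output, and `goals_met` is the feedback retained for RLEF.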

Data Collection and Training Pipeline

The figure depicts the full pipeline for data collection and training. In the Data Collection Pipeline, environmental information is captured, parsed into a scene graph, and combined to generate the environment message and the system message. These messages then drive agent control, culminating in executable code. In the Octopus Training Pipeline, the agent's vision and code are input to the Octopus model for training with both SFT and RLEF. The accompanying text emphasizes the importance of a well-structured system message for GPT-4's effective code generation and notes the challenges caused by errors, underscoring the adaptability of the model in handling a wide range of tasks. In short, the pipeline covers agent training from environment understanding to action execution.
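As a rough illustration of the step from scene graph to environment message, the sketch below flattens a list of objects and relation triples into a text message, and builds the training-time variant in which the objects and relations are replaced by an image placeholder. The message layout, the field labels, and the function names `build_env_message` and `build_training_input` are assumptions made for this example; the source does not specify the actual format.

```python
from typing import List, Tuple

# A scene-graph edge as (subject, predicate, object); an assumed representation.
Relation = Tuple[str, str, str]


def build_env_message(task: str,
                      objects: List[str],
                      relations: List[Relation]) -> str:
    """Flatten a parsed scene graph into a text message for the planner."""
    object_line = ", ".join(sorted(set(objects)))  # de-duplicate and sort
    relation_line = "; ".join(f"({s}, {p}, {o})" for s, p, o in relations)
    return (f"Task Goal: {task}\n"
            f"Observed Objects: {object_line}\n"
            f"Observed Relations: {relation_line}")


def build_training_input(task: str, image_ref: str) -> str:
    """Training-time variant: the image stands in for objects and relations."""
    return f"Task Goal: {task}\nEgocentric Image: <{image_ref}>"


if __name__ == "__main__":
    print(build_env_message(
        "cook a bacon",
        ["pan", "bacon", "fridge", "pan"],
        [("bacon", "inside", "fridge"), ("pan", "ontop", "stove")],
    ))
    print(build_training_input("cook a bacon", "frame_0.png"))
```

Keeping the text message and the image-based input as two separate builders mirrors the split in the figure: the first feeds GPT-4 during collection, the second corresponds to what a vision-language model would see during training.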

Results

Main Results on OctoGibson. We compare various models (standalone language models, adapted vision-language planners, and our Octopus models) across different evaluation settings. In cells displaying two values, the first is the task completion rate across the target validation task sets, and the second is the conceptual accuracy of the model's planning as judged by human evaluators. GT denotes that the model input is directly parsed from the simulator, with information on objects (O) or relations (R). Octopus shows consistently better results on task completion.

Ablation Study. Ablations on model components, model size, and vision input. For bars with different colors, the upper bar denotes the number of successful reasoning tasks and the lower bar the number of successful routine tasks.

Qualitative Results on OctoGibson. A demonstration of the task "find a carboy" in the OctoGibson environment. All of the models shown can write executable code, but the proposed Octopus has stronger planning ability, especially after RLEF. We also explore the performance of GPT-4V on this task.

BibTeX

@misc{yang2023octopus,
  title={Octopus: Embodied Vision-Language Programmer from Environmental Feedback},
  author={Jingkang Yang and Yuhao Dong and Shuai Liu and Bo Li and Ziyue Wang and Chencheng Jiang and Haoran Tan and Jiamu Kang and Yuanhan Zhang and Kaiyang Zhou and Ziwei Liu},
  year={2023},
  eprint={2310.08588},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}