Optimus-3: Towards Generalist Multimodal Minecraft Agents With Scalable Task Experts | Awesome LLM Papers

Optimus-3: Towards Generalist Multimodal Minecraft Agents With Scalable Task Experts

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie. No Venue, 2025

Tags: Agentic, Compositional Generalization, Has Code, Image Text Integration, Interdisciplinary Approaches, Model Architecture, Multimodal Semantic Representation, Reinforcement Learning, Training Techniques, Variational Autoencoders, Visual Contextualization

Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging due to: insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent’s reasoning ability under the visual diversity of Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/
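The task-level routing mentioned in contribution 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the expert set, the shared path, and the combination rule are all assumptions. The key idea it shows is that the router keys on the task label of the whole request, so every token of one task goes through the same expert, which limits interference between heterogeneous tasks.

```python
# Hypothetical sketch of task-level MoE routing (illustrative names only).
# Each "expert" here is a toy transformation standing in for an expert subnetwork.

EXPERTS = {
    "planning":   lambda h: [x * 2.0 for x in h],
    "captioning": lambda h: [x + 1.0 for x in h],
    "grounding":  lambda h: [x - 1.0 for x in h],
}

def shared_expert(h):
    """A shared path applied to every request, regardless of task."""
    return [x * 0.5 for x in h]

def moe_forward(hidden, task):
    """Route a whole request's hidden states by its task label.

    Unlike token-level MoE, the routing decision is made once per task,
    not per token. Unknown tasks fall back to the shared expert alone.
    """
    expert = EXPERTS.get(task, shared_expert)
    routed = expert(hidden)
    shared = shared_expert(hidden)
    # Combine task-specific and shared paths (a simple sum in this sketch).
    return [r + s for r, s in zip(routed, shared)]

# Example: the same hidden states produce different outputs per task.
print(moe_forward([1.0, 2.0], "planning"))    # routed [2.0, 4.0] + shared [0.5, 1.0]
print(moe_forward([1.0, 2.0], "captioning"))  # routed [2.0, 3.0] + shared [0.5, 1.0]
```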

Similar Work