
MM-LLMs: Recent Advances in Multimodal Large Language Models

Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, Dong Yu. 2024

Compositional Generalization, Image Text Integration, Interdisciplinary Approaches, Model Architecture, Multimodal Semantic Representation, Survey Paper, Training Techniques, Visual Contextualization

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also enable a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research on MM-LLMs. Specifically, we first outline general design formulations for model architecture and the training pipeline. We then provide brief introductions to 26 existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes for enhancing their potency. Lastly, we explore promising directions for MM-LLMs while maintaining a real-time tracking website for the latest developments in the field. We hope this survey contributes to the ongoing advancement of the MM-LLMs domain.
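As background for the architectural design formulations the abstract mentions, the sketch below illustrates the pattern common to many of the MM-LLMs such surveys cover: a frozen pretrained modality encoder, a small trainable input projector that maps modality features into the LLM's embedding space, and a frozen LLM backbone. This is a minimal PyTorch illustration under assumed names and dimensions (DummyVisionEncoder, DummyLLM, VISION_DIM, and so on are all hypothetical), not code from the paper.

```python
import torch
import torch.nn as nn

# Illustrative dimensions -- assumptions for this sketch, not values from the paper.
VISION_DIM = 768    # feature width of a ViT-B-class vision encoder
LLM_DIM = 4096      # hidden size of a 7B-class LLM
NUM_PATCHES = 196   # 14 x 14 patch grid for a 224 x 224 image


class DummyVisionEncoder(nn.Module):
    """Stand-in for a frozen pretrained vision encoder (e.g., a ViT)."""
    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Pretend to emit one feature vector per image patch.
        return torch.randn(image.shape[0], NUM_PATCHES, VISION_DIM)


class DummyLLM(nn.Module):
    """Stand-in for a frozen pretrained decoder-only LLM backbone."""
    def __init__(self):
        super().__init__()
        self.block = nn.Linear(LLM_DIM, LLM_DIM)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        return self.block(embeds)


class MMLLMSketch(nn.Module):
    """Common MM-LLM pattern: frozen modality encoder -> trainable input
    projector -> frozen LLM. Training only the small projector is what
    makes the alignment stage cost-effective."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        # Freeze the pretrained components; only the projector learns.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # Input projector: maps visual features into the LLM's
        # token-embedding space so they act as "soft" prompt tokens.
        self.projector = nn.Linear(VISION_DIM, LLM_DIM)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.vision_encoder(image)    # (B, NUM_PATCHES, VISION_DIM)
        visual_tokens = self.projector(feats)     # (B, NUM_PATCHES, LLM_DIM)
        # Prepend the projected visual tokens to the text embeddings.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))


model = MMLLMSketch(DummyVisionEncoder(), DummyLLM())
image = torch.randn(2, 3, 224, 224)            # dummy image batch
text_embeds = torch.randn(2, 10, LLM_DIM)      # 10 embedded text tokens
print(model(image, text_embeds).shape)         # torch.Size([2, 206, 4096])
```

Many surveyed models vary mainly in which of these blocks are trained at each stage and in how the projector is realized, for example as a linear layer, an MLP, or a cross-attention resampler.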

Similar Work