MAGMA -- Multimodal Augmentation Of Generative Models Through Adapter-based Finetuning

Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, Anette Frank. Findings of the Association for Computational Linguistics: EMNLP 2022 – 40 citations

ACL, Compositional Generalization, EMNLP, Efficiency, Evaluation, Image Text Integration, In Context Learning, Interactive Environments, Interdisciplinary Approaches, Multimodal Semantic Representation, Training Techniques, Variational Autoencoders, Visual Contextualization

Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.
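To make the recipe in the abstract concrete, the following is a minimal PyTorch sketch of the described setup: the language model stays frozen, a trainable projection maps image-encoder features into the LM's embedding space as prefix "tokens", small adapter modules are the only trainable components inside the LM, and training uses a single next-token language-modeling loss. This is not the authors' code; the module names (`BottleneckAdapter`, `VisualPrefix`), the bottleneck and prefix sizes, and the assumption of a HuggingFace-style causal-LM interface (`get_input_embeddings`, `inputs_embeds`, `.logits`) are all illustrative choices. How the adapters are wired into each transformer block is model-specific and omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter inserted into an otherwise frozen LM block."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual form: at initialization the adapter barely perturbs the LM.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


class VisualPrefix(nn.Module):
    """Projects image-encoder features into LM embedding space as prefix tokens."""

    def __init__(self, image_dim: int, lm_dim: int, prefix_len: int = 4):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.proj = nn.Linear(image_dim, prefix_len * lm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, image_dim) -> (batch, prefix_len, lm_dim)
        return self.proj(image_features).view(-1, self.prefix_len, self.lm_dim)


def multimodal_lm_step(lm, visual_prefix, image_features, input_ids):
    """One training step: prepend the visual prefix and compute the LM loss.

    Only the adapters and `visual_prefix` should have requires_grad=True; the
    language model's own weights stay frozen, e.g.
        for p in lm.parameters():
            p.requires_grad = False
    """
    text_embeds = lm.get_input_embeddings()(input_ids)            # (B, T, H)
    prefix_embeds = visual_prefix(image_features)                 # (B, P, H)
    inputs_embeds = torch.cat([prefix_embeds, text_embeds], dim=1)
    logits = lm(inputs_embeds=inputs_embeds).logits               # (B, P+T, V)
    # Next-token prediction restricted to the text positions.
    pred = logits[:, prefix_embeds.size(1):-1, :]
    target = input_ids[:, 1:]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```

Because the loss is a plain language-modeling objective over the text positions, the same step works for any interleaving of image and text inputs, which is what enables the open-ended generative evaluation the abstract refers to.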

Similar Work