
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino. No Venue, 2024

[Paper]
Tags: Compositional Generalization, Datasets, Efficiency, Image Text Integration, Interdisciplinary Approaches, Multimodal Semantic Representation, Productivity Enhancement, Training Techniques, Visual Contextualization

The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). We propose a novel approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method pretrains a text-to-image model to synthesize image embeddings directly from captions generated by an LLM; these synthetic caption-embedding pairs are then used to train a VLM. Extensive experiments demonstrate that a VLM trained with synthetic data achieves image-captioning performance comparable to models trained solely on human-annotated data, while requiring only a fraction of that data. In particular, augmenting training with the synthetic dataset outperforms the baseline by 17%. Furthermore, synthesizing in the image-embedding space is 25% faster than synthesizing in pixel space. This research introduces a promising technique for generating large-scale, customizable image datasets, leading to enhanced VLM performance and wider applicability across various domains, all with improved data efficiency and resource utilization.
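
The abstract describes a three-stage pipeline: an LLM generates captions, a text-to-image model maps each caption to an image embedding (skipping pixel-space decoding, which the authors report is slower), and the resulting caption-embedding pairs are used to train the VLM. The sketch below illustrates that data flow only; every class and function name (generate_captions, TextToImageEmbedder, build_synthetic_dataset) is a hypothetical placeholder, not the authors' implementation, and the dummy embeddings stand in for real model outputs.

```python
# Minimal sketch of a Synth^2-style synthetic-data pipeline, under the
# assumptions stated above. All components are placeholders.
from dataclasses import dataclass
from typing import List
import random


@dataclass
class SyntheticPair:
    caption: str                   # caption produced by the LLM
    image_embedding: List[float]   # embedding produced by the text-to-image model


def generate_captions(n: int) -> List[str]:
    """Stand-in for LLM caption generation (assumed interface)."""
    templates = ["a photo of a {} on a table", "a {} in a park at sunset"]
    objects = ["dog", "bicycle", "red mug", "violin"]
    return [random.choice(templates).format(random.choice(objects)) for _ in range(n)]


class TextToImageEmbedder:
    """Stand-in for a text-to-image model used only up to its image-embedding
    space, i.e. without decoding to pixels."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def embed(self, caption: str) -> List[float]:
        # Deterministic dummy embedding in place of a real model forward pass.
        rng = random.Random(hash(caption) % (2 ** 32))
        return [rng.uniform(-1.0, 1.0) for _ in range(self.dim)]


def build_synthetic_dataset(n_pairs: int) -> List[SyntheticPair]:
    """Generate captions with the LLM stand-in and pair each with an embedding."""
    embedder = TextToImageEmbedder()
    return [SyntheticPair(c, embedder.embed(c)) for c in generate_captions(n_pairs)]


if __name__ == "__main__":
    # The synthetic pairs would then be mixed with human-annotated data to
    # train the VLM; the training loop itself is omitted here.
    for pair in build_synthetic_dataset(4):
        print(pair.caption, pair.image_embedding[:3])
```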

Similar Work