
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

Xianhang Li, Yanqing Liu, Haoqin Tu, Hongru Zhu, Cihang Xie. 2025

[Paper]
Tags: Applications, Compositional Generalization, Efficiency, Image Text Integration, Productivity Enhancement Tools, Training Techniques, Visual Contextualization, Visual Question Answering

OpenAI’s CLIP, released in early 2021, has long been the go-to vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills the gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI’s CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing work (e.g., CLIPS for the training framework and Recap-DataComp-1B for the training data) while revealing multiple key insights into enhancing encoder quality and showcasing practical benefits for advancing multimodal models. By releasing vision encoders spanning 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency when building multimodal models: larger models deliver stronger multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.
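
As a rough illustration of how a CLIP-style encoder like OpenVision slots in as the image tower of a LLaVA-like pipeline, the sketch below loads a ViT-L/14 model with open_clip and extracts image features that a projector would map into an LLM's token space. The checkpoint path and the exact model configuration are assumptions for illustration, not the paper's released artifacts.

```python
# Minimal sketch: using a CLIP-style vision encoder as the image tower of a
# LLaVA-like multimodal model. The checkpoint path below is a hypothetical
# placeholder, not a confirmed OpenVision release identifier.
import torch
import open_clip
from PIL import Image

# Load a ViT-L/14 image tower from a local CLIP-style checkpoint
# (substitute the actual released OpenVision weights here).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="/path/to/openvision_vit_l_14.pt"
)
model.eval()

# Preprocess a single image into a [1, 3, H, W] tensor.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    # Image embedding that a projector (e.g., a small MLP) would map into
    # the LLM's embedding space in a LLaVA-style framework.
    image_features = model.encode_image(image)

print(image_features.shape)  # e.g., torch.Size([1, 768]) for a ViT-L/14 tower
```

Swapping in a smaller or larger encoder from the family trades capacity for efficiency: a compact tower suits edge deployment, while the 632.1M-parameter variant targets maximum multimodal performance.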
