
Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao. 2024

[Paper]
Retrieval Systems · Training Techniques

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-size vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
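Since the paper's central claim is that a single CLIP-style model can serve as both the image-text and the text-text retriever, a minimal usage sketch may help illustrate the idea. The snippet below is not from the paper: it assumes the Hugging Face checkpoint `jinaai/jina-clip-v1` and its custom `encode_text` / `encode_image` helpers loaded via `trust_remote_code`, as described on its model card; the query, passages, image URL, and the `cosine` helper are illustrative.

```python
# Minimal sketch: one jina-clip-v1 model handling text-text and text-image retrieval.
# Assumes the checkpoint's remote code exposes encode_text / encode_image helpers
# that return numpy embeddings in a shared space (per the model card).
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

# Text-text retrieval: embed a query and candidate passages with the same text tower.
query_vecs = model.encode_text(["What is contrastive language-image pretraining?"])
passage_vecs = model.encode_text([
    "CLIP trains image and text encoders with a contrastive objective.",
    "Gradient checkpointing trades compute for memory during training.",
])

# Text-image retrieval: embed images into the same space as the text query.
image_vecs = model.encode_image(["https://example.com/cat.jpg"])  # path or URL (hypothetical)

def cosine(a, b):
    # Cosine similarity between two batches of embeddings.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

print("text-text scores:", cosine(query_vecs, passage_vecs))
print("text-image scores:", cosine(query_vecs, image_vecs))
```

Because both encoders are trained into one embedding space, the same query vector is scored against text passages and images alike, which is what lets a retrieval system drop its separate text-only embedding model.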

https://huggingface.co/discussions/paper/66593db1d6898d357e15f9c9
