Towards Language Models That Can See: Computer Vision Through The LENS Of Natural Language

William Berrios, Gautam Mittal, Tristan Thrush, Douwe Kiela, Amanpreet Singh. No Venue 2023

3D Representation, Compositional Generalization, Few Shot, Has Code, Image Text Integration, Interactive Environments, Interdisciplinary Approaches, Multimodal Semantic Representation, Training Techniques, Visual Contextualization

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
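The modular flow described in the abstract can be illustrated with a minimal sketch: independent vision modules each turn an image into text, and a text-only language model reasons over the combined descriptions. Every name here (`build_prompt`, `answer`, the stub modules, the prompt layout) is a hypothetical stand-in chosen for illustration and is not taken from the LENS codebase. The stubs return fixed strings so the example runs without any model weights.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not the LENS API):
# a vision module maps an image path to a list of text descriptions,
# and a language model maps a text prompt to a text completion.
VisionModule = Callable[[str], List[str]]
LanguageModel = Callable[[str], str]


def build_prompt(descriptions: Dict[str, List[str]], question: str) -> str:
    """Flatten per-module text outputs into one text-only prompt."""
    lines = [f"{name}: {', '.join(items)}" for name, items in descriptions.items()]
    lines.append(f"Question: {question}")
    lines.append("Short answer:")
    return "\n".join(lines)


def answer(
    image_path: str,
    question: str,
    modules: Dict[str, VisionModule],
    llm: LanguageModel,
) -> str:
    """Run every vision module on the image, then let the LLM reason over the text."""
    descriptions = {name: module(image_path) for name, module in modules.items()}
    return llm(build_prompt(descriptions, question))


# Stub modules standing in for real vision models (illustrative only).
def stub_tags(_image_path: str) -> List[str]:
    return ["dog", "frisbee", "grass"]


def stub_captions(_image_path: str) -> List[str]:
    return ["a dog jumping to catch a frisbee"]


# Stub LLM standing in for any off-the-shelf language model (illustrative only).
def stub_llm(prompt: str) -> str:
    return "frisbee" if "frisbee" in prompt else "unknown"


if __name__ == "__main__":
    modules = {"Tags": stub_tags, "Captions": stub_captions}
    print(answer("photo.jpg", "What is the dog catching?", modules, stub_llm))
```

Because the language model only ever sees text, swapping in a different LLM or adding another vision module changes one dictionary entry or one callable, which is the sense in which the abstract says the approach needs no multimodal training.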

Similar Work