
Vision Language Models Are Blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen. No Venue, 2024

[Code] [Paper]
Applications Compositional Generalization Has Code Interdisciplinary Approaches Model Architecture Multimodal Semantic Representation Visual Contextualization

Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks that are absurdly easy for humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) how many circles appear in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia who sees fine details as blurry, and at worst, like that of an intelligent person who is blind and making educated guesses. Code is available at: https://vlmsareblind.github.io/
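The tasks are simple enough that test images and their ground-truth answers can be generated programmatically. Below is a minimal sketch of one such task, the circle-overlap test; it is illustrative only and not the authors' released benchmark code, and the helper name, radii, and drawing parameters are assumptions chosen for the example.

```python
# Minimal sketch (not the authors' benchmark code): generate a
# "do these two circles overlap?" image with a known ground-truth label.
# Assumes matplotlib is installed; sizes and spacing are illustrative.
import math
import random
import matplotlib.pyplot as plt

def make_circle_pair(overlap: bool, radius: float = 1.0, path: str = "circles.png"):
    # Two circles overlap iff the distance between their centers is less
    # than the sum of their radii; pick a center distance accordingly.
    dist = radius * (1.5 if overlap else 2.5)  # < 2r overlaps, > 2r does not
    angle = random.uniform(0, 2 * math.pi)
    c1 = (0.0, 0.0)
    c2 = (dist * math.cos(angle), dist * math.sin(angle))

    fig, ax = plt.subplots(figsize=(4, 4))
    for (x, y) in (c1, c2):
        ax.add_patch(plt.Circle((x, y), radius, fill=False, linewidth=2))
    ax.set_xlim(-4, 4)
    ax.set_ylim(-4, 4)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path, overlap  # image file plus the ground-truth answer

# Usage: the VLM is shown the saved image, asked "Do the two circles overlap?",
# and its answer is compared against the known label.
path, label = make_circle_pair(overlap=random.choice([True, False]))
```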

Similar Work