
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao. arXiv 2023

Tags: Image Text Integration · Model Architecture · Prompting · Visual Contextualization

We present Set-of-Mark (SoM), a new visual prompting method that unleashes the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, or boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM outperforms the state-of-the-art fully fine-tuned referring segmentation model on RefCOCOg in a zero-shot setting.
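
The segment-then-mark-then-prompt pipeline described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' released implementation: the SAM checkpoint path (`sam_vit_h.pth`), the input image (`example.jpg`), the model id (`gpt-4o`), and the prompt wording are all placeholder assumptions.

```python
# Minimal Set-of-Mark (SoM) style sketch:
# 1) segment the image with SAM, 2) overlay a numeric mark on each region,
# 3) send the marked image to a GPT-4V-class model with a grounding question.
import base64

import cv2
import numpy as np
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry


def mark_image(image_rgb: np.ndarray, checkpoint: str = "sam_vit_h.pth") -> np.ndarray:
    """Partition the image with SAM and overlay one numeric mark per region."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)  # assumed local checkpoint
    masks = SamAutomaticMaskGenerator(sam).generate(image_rgb)
    marked = image_rgb.copy()
    # Draw large regions first so marks on small regions stay visible on top.
    for i, m in enumerate(sorted(masks, key=lambda m: -m["area"]), start=1):
        ys, xs = np.nonzero(m["segmentation"])
        cx, cy = int(xs.mean()), int(ys.mean())  # place the mark at the region centroid
        cv2.putText(marked, str(i), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX,
                    0.8, (255, 255, 255), 2, cv2.LINE_AA)
    return marked


def ask_gpt4v(marked_rgb: np.ndarray, question: str) -> str:
    """Send the marked image plus a question to a GPT-4V-capable chat model."""
    ok, png = cv2.imencode(".png", cv2.cvtColor(marked_rgb, cv2.COLOR_RGB2BGR))
    data_url = "data:image/png;base64," + base64.b64encode(png.tobytes()).decode()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any GPT-4V-capable model id works here
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "The image has numbered marks on its regions. " + question},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
    print(ask_gpt4v(mark_image(image), "Which mark is on the dog, and what is it doing?"))
```

This sketch uses plain alphanumeric labels at region centroids; the paper also considers mask and box overlays as mark types, and the choice of mark placement affects how reliably the model can tie its answer back to a specific region.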

Similar Work