
Referring Transformer: A One-step Approach To Multi-task Visual Grounding

Muchen Li, Leonid Sigal. arXiv 2021 – 67 citations

[Paper]
Compositional Generalization · Datasets · Model Architecture · Tools · Training Techniques · Visual Question Answering

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require designing complex, task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture in which the two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries, which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
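The pipeline the abstract describes (joint visual-lingual encoding, language-conditioned decoder queries, and parallel box/mask heads) can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the authors' implementation: the module names, feature dimensions, single-region pooling, and coarse 32×32 mask head below are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of a one-stage multi-task grounding
# transformer: image and text tokens are fused in a joint encoder, the decoder
# attends with language-conditioned queries, and two heads regress a bounding
# box and predict a coarse segmentation mask. Sizes and names are illustrative.
import torch
import torch.nn as nn

class ReferringTransformerSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6,
                 vocab_size=30522, visual_dim=2048, mask_size=32):
        super().__init__()
        self.mask_size = mask_size
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)  # project CNN grid features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.box_head = nn.Linear(d_model, 4)                      # (cx, cy, w, h)
        self.mask_head = nn.Linear(d_model, mask_size * mask_size)  # coarse mask logits

    def forward(self, visual_feats, text_ids):
        # visual_feats: (B, N_v, visual_dim) flattened image grid features
        # text_ids:     (B, N_t) token ids of the referring expression
        vis = self.visual_proj(visual_feats)
        txt = self.text_embed(text_ids)
        fused = self.encoder(torch.cat([vis, txt], dim=1))  # joint visual-lingual fusion
        # Language-conditioned queries: here simply the fused text tokens.
        queries = fused[:, vis.size(1):, :]
        decoded = self.decoder(queries, fused)
        pooled = decoded.mean(dim=1)                         # one referred region per image
        box = self.box_head(pooled).sigmoid()                # normalized box coordinates
        mask = self.mask_head(pooled).view(-1, self.mask_size, self.mask_size)
        return box, mask

# Toy usage with random inputs.
model = ReferringTransformerSketch()
box, mask = model(torch.randn(2, 49, 2048), torch.randint(0, 30522, (2, 12)))
print(box.shape, mask.shape)  # torch.Size([2, 4]) torch.Size([2, 32, 32])
```

Because both heads read from the same decoded representation, the REC (box) and RES (mask) objectives can be trained jointly in one forward pass, which is the multi-task aspect the abstract highlights.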

Similar Work