[Paper]
An LBYL ('Look Before You Leap') Network is proposed for end-to-end trainable
one-stage visual grounding. The idea behind LBYL-Net is intuitive and
straightforward: we follow the language description to localize the target object based on its relative spatial relation to 'Landmarks', which is characterized by spatial positional words and descriptive words about
the object. The core of our LBYL-Net is a landmark feature convolution module
that transmits the visual features along different directions under the guidance of the linguistic description. Consequently, such a module encodes the
relative spatial positional relations between the current object and its
context. Then we combine the contextual information from the landmark feature
convolution module with the target’s visual features for grounding. To make
this landmark feature convolution lightweight, we introduce a dynamic
programming algorithm (termed dynamic max pooling) with low complexity to
extract the landmark feature. Thanks to the landmark feature convolution
module, we mimic the human behavior of 'Look Before You Leap' to design an
LBYL-Net, which takes full consideration of contextual information. Extensive
experiments show our method’s effectiveness on four grounding datasets.
Specifically, our LBYL-Net outperforms all state-of-the-art two-stage and
one-stage methods on ReferitGame. On RefCOCO and RefCOCO+, our LBYL-Net also achieves results comparable to, or even better than, those of existing one-stage methods.
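
To make the directional context aggregation idea concrete, below is a minimal sketch of linear-time directional max pooling over a PyTorch-style (B, C, H, W) feature map. The function name directional_max_pool, the four-direction cumulative-max formulation, and the final concatenation are illustrative assumptions, not the paper's exact landmark feature convolution, which is additionally guided by the linguistic description.

```python
import torch


def directional_max_pool(feat):
    """Running channel-wise max of a feature map along four directions.

    feat: (B, C, H, W) visual feature map.
    Entry (i, j) of `left`, for example, holds the per-channel max over
    columns 0..j of row i, so every location summarizes the context lying
    to one side of it. Each direction costs O(HW) thanks to the
    cumulative-max recurrence, instead of an explicit max over every
    sub-region at every location.
    """
    left = torch.cummax(feat, dim=3).values                        # context to the left (incl. self)
    right = torch.cummax(feat.flip([3]), dim=3).values.flip([3])   # context to the right
    top = torch.cummax(feat, dim=2).values                         # context above
    bottom = torch.cummax(feat.flip([2]), dim=2).values.flip([2])  # context below
    return left, right, top, bottom


# Toy usage: concatenate the directional context with the target's own
# features, as a stand-in for the "combine contextual information with the
# target's visual features" step described above. Shapes are hypothetical.
feat = torch.randn(2, 256, 32, 32)
left, right, top, bottom = directional_max_pool(feat)
grounding_input = torch.cat([feat, left, right, top, bottom], dim=1)  # (2, 1280, 32, 32)
```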