
Reducing The Vision And Language Bias For Temporal Sentence Grounding

Daizong Liu, Xiaoye Qu, Wei Hu. Proceedings of the 30th ACM International Conference on Multimedia, 2022 – 45 citations

Tags: Compositional Generalization, Content Enrichment, Datasets, Efficiency, Ethics & Fairness, Evaluation, Image Text Integration, Question Answering, RAG, Visual Contextualization

Temporal sentence grounding (TSG) is an important yet challenging task in multimedia information retrieval. Although previous TSG methods have achieved decent performance, they tend to capture the selection biases of frequently appearing video-query pairs in the dataset rather than exhibit robust multimodal reasoning abilities, especially on rarely appearing pairs. In this paper, we study this selection-bias issue and propose a Debiasing-TSG (D-TSG) model that filters and removes the negative biases in both the vision and language modalities to enhance the model's generalization ability. Specifically, we alleviate the issue from two perspectives: 1) Feature distillation. We build a multi-modal debiasing branch that first captures the vision and language biases, and then apply a bias identification module to explicitly recognize the true negative biases and remove them from the benign multi-modal representations. 2) Contrastive sample generation. We construct two types of negative samples to force the model to accurately learn the aligned multi-modal semantics and perform complete semantic reasoning. We apply the proposed model to both commonly and rarely appearing TSG cases, and demonstrate its effectiveness by achieving state-of-the-art performance on three benchmark datasets (ActivityNet Caption, TACoS, and Charades-STA).
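To make the two ideas above more concrete, here is a minimal, illustrative PyTorch sketch of (1) a debiasing branch with a gate that identifies and removes negative bias from a fused representation, and (2) an InfoNCE-style contrastive loss against a constructed negative sample. All module names, dimensions, the simple additive fusion, and the gated subtraction are assumptions for illustration only; this is not the authors' exact D-TSG implementation.

```python
# Illustrative sketch only: module names, dimensions, and the gated
# bias-subtraction are assumptions, not the original D-TSG code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DebiasingBranch(nn.Module):
    """Captures a modality-specific bias vector and gates how much of it
    to remove from the fused multi-modal representation."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.bias_encoder = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Bias identification: estimates, per feature, how "negative" the bias is.
        self.bias_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, modality_feat: torch.Tensor, fused_feat: torch.Tensor) -> torch.Tensor:
        bias = self.bias_encoder(modality_feat)   # candidate bias representation
        gate = self.bias_gate(bias)               # which components are truly negative
        return fused_feat - gate * bias           # remove the identified bias


def contrastive_loss(anchor, positive, negative, temperature: float = 0.1):
    """InfoNCE-style loss pulling the debiased representation toward the
    matched query and pushing it away from a constructed negative sample."""
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    pos = (anchor * positive).sum(-1, keepdim=True) / temperature
    neg = (anchor * negative).sum(-1, keepdim=True) / temperature
    logits = torch.cat([pos, neg], dim=-1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, D = 4, 256
    video_feat, query_feat = torch.randn(B, D), torch.randn(B, D)
    fused = video_feat + query_feat               # stand-in for a real fusion module
    debias_v, debias_q = DebiasingBranch(D), DebiasingBranch(D)
    debiased = debias_q(query_feat, debias_v(video_feat, fused))
    # Negative sample here is simply a mismatched (shuffled) query.
    loss = contrastive_loss(debiased, query_feat, query_feat[torch.randperm(B)])
    print(debiased.shape, float(loss))
```

In this sketch the shuffled query stands in for the paper's constructed negative samples; in practice the negatives would be built to break the video-query alignment in targeted ways.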
