Multilingual Multimodal Pre-training For Zero-shot Cross-lingual Transfer Of Vision-language Models

Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, Alexander Hauptmann. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021 – 46 citations

Tags: ACL, Compositional Generalization, Datasets, Has Code, Image Text Integration, Interdisciplinary Approaches, Model Architecture, Multimodal Semantic Representation, NAACL, Training Techniques, Visual Contextualization

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.
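
At inference time, the text-to-video search described above amounts to nearest-neighbor ranking in a shared multilingual multimodal embedding space. The sketch below is illustrative only and not the authors' code: it uses randomly generated stand-in embeddings (the paper obtains them from a Transformer-based multilingual text encoder and a video encoder pre-trained on Multi-HowTo100M) and ranks videos by cosine similarity to each query, which is the retrieval step that zero-shot cross-lingual transfer relies on.

```python
# Minimal sketch of zero-shot text-to-video search via cosine similarity
# in a shared embedding space. The embeddings here are random stand-ins;
# in the paper they come from pre-trained multilingual text and video encoders.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length so the dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)

# Stand-in embeddings: 5 candidate videos and 3 queries (e.g. in English,
# German, Chinese) projected into the same 512-d space by the encoders.
video_emb = l2_normalize(rng.normal(size=(5, 512)))
query_emb = l2_normalize(rng.normal(size=(3, 512)))

# Rank videos for each query by cosine similarity (higher = better match).
scores = query_emb @ video_emb.T          # shape: (num_queries, num_videos)
ranking = np.argsort(-scores, axis=1)     # best-matching video first

for q, order in enumerate(ranking):
    print(f"query {q}: top video = {order[0]}, "
          f"sorted scores = {np.round(scores[q, order], 3)}")
```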

Similar Work