Unpaired Cross-lingual Image Caption Generation With Self-supervised Rewards

Yuqing Song, Shizhe Chen, Yida Zhao, Qin Jin. Proceedings of the 27th ACM International Conference on Multimedia, 2019 – 45 citations

Tags: Compositional Generalization, Datasets, Image Text Integration, Interdisciplinary Approaches, Neural Machine Translation, Reinforcement Learning, Tools, Training Techniques, Visual Contextualization

Generating image descriptions in different languages is essential to satisfy users worldwide. However, it is prohibitively expensive to collect a large-scale paired image-caption dataset for every target language, which is critical for training decent image captioning models. Previous works tackle the unpaired cross-lingual image captioning problem through a pivot language, i.e., with the help of paired image-caption data in the pivot language and pivot-to-target machine translation models. However, such a language-pivoted approach suffers from inaccuracies introduced by the pivot-to-target translation, including disfluency and visual-irrelevancy errors. In this paper, we propose to generate cross-lingual image captions with self-supervised rewards in a reinforcement learning framework to alleviate these two types of errors. We employ self-supervision from a mono-lingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards. We conduct extensive experiments on unpaired cross-lingual image captioning for both English and Chinese on two widely used image caption corpora. The proposed approach achieves significant performance improvements over state-of-the-art methods.
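The abstract describes an objective built from three self-supervised signals: a fluency reward from a target-language corpus and sentence-level plus concept-level visual relevancy rewards from a matching model. The snippet below is a minimal, hypothetical sketch (not the authors' code) of how such rewards could be combined in a self-critical, REINFORCE-style policy-gradient update; the function names, reward weights, and dummy reward functions are illustrative assumptions only.

```python
import torch

def total_reward(caption, image_feat, fluency_fn, sent_rel_fn, con_rel_fn,
                 w_flu=1.0, w_sent=1.0, w_con=1.0):
    # Weighted combination of the fluency reward and the two
    # visual relevancy rewards described in the abstract.
    return (w_flu * fluency_fn(caption)
            + w_sent * sent_rel_fn(caption, image_feat)
            + w_con * con_rel_fn(caption, image_feat))

def self_critical_loss(token_log_probs, sampled_reward, baseline_reward):
    # Self-critical policy gradient: the reward of a sampled caption,
    # relative to a greedy-decoded baseline, scales the negative
    # log-likelihood of the sampled tokens.
    advantage = sampled_reward - baseline_reward
    return -advantage * token_log_probs.sum()

if __name__ == "__main__":
    # Toy usage with dummy reward functions standing in for a
    # target-language language model and a visual-semantic matching model.
    dummy = lambda *args: 0.5
    img = torch.zeros(2048)                        # placeholder image feature
    log_probs = torch.log(torch.full((12,), 0.1))  # per-token log-probs of a sample
    r_sample = total_reward("a dog runs on grass", img, dummy, dummy, dummy)
    r_greedy = total_reward("a dog", img, dummy, dummy, dummy)
    print(float(self_critical_loss(log_probs, r_sample, r_greedy)))
```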

Similar Work