
Hierarchical Cross-modal Agent For Robotics Vision-and-language Navigation

Muhammad Zubair Irshad, Chih-Yao Ma, Zsolt Kira. 2021 IEEE International Conference on Robotics and Automation (ICRA) – 45 citations

Tags: Agentic, Compositional Generalization, Content Enrichment, Evaluation, ICRA, Productivity Enhancement, Training Techniques, Variational Autoencoders, Visual Question Answering

Deep Learning has revolutionized our ability to solve complex problems such as Vision-and-Language Navigation (VLN). This task requires an agent to navigate to a goal purely from visual sensory inputs, given natural language instructions. However, prior work formulates the problem as navigation over a graph with a discrete action space. In this work, we lift the agent off the navigation graph and propose a more challenging VLN setting in continuous 3D reconstructed environments. Our proposed setting, Robo-VLN, more closely mimics the challenges of real-world navigation: tasks have longer trajectory lengths, continuous action spaces, and obstacles along the way. We provide a suite of baselines inspired by state-of-the-art work in discrete VLN and show that they are less effective in this setting. We further propose that decomposing the task into specialized high- and low-level policies tackles it more effectively. With extensive experiments, we show that by using layered decision making, modularized training, and decoupling reasoning and imitation, our proposed Hierarchical Cross-Modal (HCM) agent outperforms existing baselines on all key metrics and sets a new benchmark for Robo-VLN.
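To illustrate the hierarchical decomposition the abstract describes, the sketch below shows one plausible way to split the agent into a high-level policy that fuses instruction and visual features into a subgoal at a coarse time scale, and a low-level policy that conditions on that subgoal to emit continuous control commands. This is a minimal PyTorch sketch under assumed feature dimensions, module names, and a two-dimensional velocity action; it is not the authors' HCM implementation.

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Cross-modal high-level policy (illustrative): fuses instruction and
    visual features and emits a subgoal embedding at a coarse time scale."""
    def __init__(self, instr_dim=256, vis_dim=512, hidden_dim=256, subgoal_dim=128):
        super().__init__()
        self.fuse = nn.Linear(instr_dim + vis_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.subgoal_head = nn.Linear(hidden_dim, subgoal_dim)

    def forward(self, instr_feat, vis_feat, h):
        x = torch.relu(self.fuse(torch.cat([instr_feat, vis_feat], dim=-1)))
        h = self.rnn(x, h)
        return self.subgoal_head(h), h

class LowLevelPolicy(nn.Module):
    """Low-level controller (illustrative): conditions on the current subgoal
    and visual features and outputs continuous velocity commands."""
    def __init__(self, subgoal_dim=128, vis_dim=512, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(subgoal_dim + vis_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # assumed action: [linear_vel, angular_vel]
        )

    def forward(self, subgoal, vis_feat):
        return self.net(torch.cat([subgoal, vis_feat], dim=-1))

# Toy rollout with placeholder features: the high-level policy refreshes the
# subgoal every few steps, while the low-level policy acts at every step.
high, low = HighLevelPolicy(), LowLevelPolicy()
h = torch.zeros(1, 256)
instr_feat = torch.randn(1, 256)       # placeholder instruction encoding
subgoal = torch.zeros(1, 128)
for t in range(10):
    vis_feat = torch.randn(1, 512)     # placeholder visual features at step t
    if t % 5 == 0:                     # coarse time scale for high-level decisions
        subgoal, h = high(instr_feat, vis_feat, h)
    action = low(subgoal, vis_feat)    # continuous control command
```

Separating the two policies like this is what allows modularized training: each module can be supervised on its own time scale before the full agent is evaluated end to end.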

Similar Work