
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024 – 73 citations

[Paper] [Code]    
Has Code · Model Architecture · Tools · Multimodal Models · Agentic · GPT · Evaluation

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the under-explored field of video-based conversation by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating detailed conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs, acquired via a manual and semi-automated annotation pipeline that is easily scalable and robust to label noise, and use it to train Video-ChatGPT. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyze their strengths and weaknesses. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.
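The core design couples frame-level features from a video-adapted visual encoder with the LLM through a lightweight learnable projection: per-frame features are average-pooled along the temporal and spatial axes, concatenated, and mapped into the LLM's embedding space. Below is a minimal, hypothetical sketch of that adapter; the class name and dimensions (CLIP ViT-L/14-style 1024-dim features, a 4096-dim LLM embedding space) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class VideoAdapterSketch(nn.Module):
    """Hypothetical Video-ChatGPT-style adapter.

    Assumes input of shape (T, N, D): T frames, N spatial patch tokens
    per frame, D-dim visual features (e.g. CLIP ViT-L/14 -> D = 1024).
    """

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Single learnable linear projection into the LLM token space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # Temporal pooling: average over frames -> (N, D) spatial tokens.
        spatial = frame_feats.mean(dim=0)
        # Spatial pooling: average over patches -> (T, D) temporal tokens.
        temporal = frame_feats.mean(dim=1)
        # Concatenate into (T + N, D) video-level features.
        video_feats = torch.cat([temporal, spatial], dim=0)
        # Project to (T + N, llm_dim), ready to prepend to text embeddings.
        return self.proj(video_feats)

# Usage: 100 frames, 256 patch tokens each, 1024-dim features.
feats = torch.randn(100, 256, 1024)
tokens = VideoAdapterSketch()(feats)  # -> shape (356, 4096)
```

Keeping the adapter to a single linear layer keeps the trainable parameter count small, so instruction tuning on the 100K video-instruction pairs can leave the visual encoder frozen.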

Similar Work