
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu. No Venue, 2025

Tags: Compositional Generalization · Datasets · Evaluation · Has Code · Image Text Integration · Reinforcement Learning · Visual Contextualization · Visual Question Answering

Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
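As a rough illustration of the training signal behind GRPO (Group Relative Policy Optimization), the RL method the abstract names: for each question, a group of candidate answers is sampled and each answer's reward is normalized against the group mean and standard deviation. The sketch below is a minimal, self-contained version of that group-relative advantage; the exact-match reward and the sample values are hypothetical, not the paper's actual counting reward.

```python
# Minimal sketch of the group-relative advantage used in GRPO.
# The reward function below (exact-match on the count) is an
# illustrative assumption, not the paper's reward design.

def grpo_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical counting question with ground-truth answer 7:
# sample a group of candidate answers, reward exact matches.
ground_truth = 7
group_answers = [7, 5, 7, 12]
rewards = [1.0 if a == ground_truth else 0.0 for a in group_answers]
advantages = grpo_advantages(rewards)
# Correct answers get a positive advantage, wrong ones negative.
```

Because advantages are centered within each group, a question where every sampled answer is wrong contributes no gradient pressure toward any particular answer, which is why the choice of training tasks (the curriculum) matters.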
