Large multimodal models (LMMs) with advanced video analysis capabilities are becoming increasingly important. Their ability to understand and interpret video opens up new possibilities across a wide range of applications. Evaluating these models, however, is complex and time-consuming. Existing benchmarks such as VideoMME and LongVideoBench rely mostly on multiple-choice questions, which often fail to reflect the complexity of real-world application scenarios.
To address this challenge, VideoAutoArena was developed: an automated benchmark inspired by the LMSYS Chatbot Arena. VideoAutoArena uses user simulation to generate open-ended, adaptive questions that comprehensively assess LMM performance in video analysis. The system builds on a scalable evaluation framework with a modified Elo rating system, enabling fair and continuous comparisons between different LMMs.
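The arena-style ranking can be pictured as a running Elo update over pairwise battles. The sketch below is a minimal illustration using the standard Elo expected-score formula and a fixed K-factor; the function names, starting rating, K value, and battle tuples are illustrative assumptions, not the benchmark's actual implementation of its modified scheme.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, outcome, k=32):
    """Update both ratings in place after one battle.
    outcome: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

# Illustrative usage: every model starts at 1000 and is updated battle by battle.
ratings = defaultdict(lambda: 1000.0)
battles = [("model-a", "model-b", 1.0), ("model-b", "model-c", 0.5)]
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)
print(dict(ratings))
```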
In contrast to arena-style benchmarks that depend on human evaluation, VideoAutoArena relies on automation: LMM agents simulate user behavior and make the preference judgments, eliminating the need for human annotators. This reduces costs and significantly accelerates the evaluation process. The system's scalability allows a large number of models and questions to be evaluated, which is essential for comprehensive performance analysis.
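Conceptually, each automated battle shows a judge model two answers to the same simulated-user question and asks for a preference. The sketch below illustrates only that flow; `call_judge_model`, the prompt wording, and the position-bias handling are hypothetical assumptions rather than the paper's actual judging setup.

```python
import random

def judge_battle(question: str, answer_a: str, answer_b: str, call_judge_model) -> str:
    """Return 'A', 'B', or 'tie' for a pair of anonymized answers.

    `call_judge_model` is a hypothetical callable that sends a text prompt to
    an LMM judge and returns its raw verdict. Answer order is randomized to
    reduce position bias, then mapped back to the original labels."""
    flipped = random.random() < 0.5
    if flipped:
        answer_a, answer_b = answer_b, answer_a

    prompt = (
        "You are evaluating two answers to a user's question about a video.\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer is more helpful and relevant to the user? Reply with 'A', 'B', or 'tie'."
    )
    verdict = call_judge_model(prompt).strip().upper()
    if verdict not in {"A", "B"}:
        return "tie"
    if flipped:
        return "B" if verdict == "A" else "A"
    return verdict
```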
Another advantage of VideoAutoArena is its error-driven evolution strategy for questions: question complexity is increased step by step based on model performance, pushing the models to their limits and revealing room for improvement. This ensures that the benchmark also probes demanding video analysis scenarios.
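The evolution step can be thought of as a loop that inspects answers, identifies weak ones, and asks a generator to produce a harder follow-up question targeting the observed failure. The sketch below is a schematic rendering of that idea under assumed interfaces; `make_harder`, the scoring threshold, and the number of rounds are hypothetical, not the authors' implementation.

```python
def evolve_questions(questions, answer_fn, score_fn, make_harder, threshold=0.7, rounds=3):
    """Error-driven escalation sketch.

    answer_fn(question) -> model answer
    score_fn(question, answer) -> quality score in [0, 1]
    make_harder(question, answer) -> harder follow-up question (hypothetical helper)
    Questions whose answers expose weaknesses seed harder variants next round.
    """
    pool = list(questions)
    for _ in range(rounds):
        evolved = []
        for q in pool:
            answer = answer_fn(q)
            if score_fn(q, answer) < threshold:
                # Weak answer: generate a harder question that digs deeper
                # into the same failure mode.
                evolved.append(make_harder(q, answer))
            else:
                # Strong answer: keep the question unchanged for now.
                evolved.append(q)
        pool = evolved
    return pool
```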
Complementing VideoAutoArena, VideoAutoBench was developed as a benchmark for faster and simpler evaluation of LMMs. It uses a curated selection of battles from VideoAutoArena in which human annotators chose the winning answers. GPT-4o serves as an automatic evaluator, comparing model responses against the human-selected and the rejected answers. This approach offers an efficient and cost-effective evaluation method.
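Scoring against such a curated set can be pictured as tallying how often a new model's answer is preferred over the human-selected winner and over the rejected answer. The sketch below shows only that bookkeeping; `ask_gpt4o` is a hypothetical wrapper around the judge, and the item fields and win-rate metrics are assumptions, not VideoAutoBench's exact protocol.

```python
def score_on_curated_battles(items, model_answer_fn, ask_gpt4o):
    """Score a model against human-selected answers from curated battles.

    Each item carries a question, the human-selected (winning) answer, and the
    rejected answer. `ask_gpt4o` is a hypothetical callable returning 'model',
    'reference', or 'tie' for a pairwise comparison."""
    wins_vs_selected = 0
    wins_vs_rejected = 0
    for item in items:
        candidate = model_answer_fn(item["question"])
        if ask_gpt4o(item["question"], candidate, item["selected"]) == "model":
            wins_vs_selected += 1
        if ask_gpt4o(item["question"], candidate, item["rejected"]) == "model":
            wins_vs_rejected += 1
    n = len(items)
    return {
        "win_rate_vs_selected": wins_vs_selected / n,
        "win_rate_vs_rejected": wins_vs_rejected / n,
    }
```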
Experiments with eleven well-known proprietary and open-source LMMs show that open-source models still lag behind the leading closed-source model, GPT-4o, in video analysis. This performance gap is significantly larger than on conventional multiple-choice benchmarks and grows with video length and question complexity. The gap widens particularly with respect to user relevance and the usefulness of the answers. These results underscore VideoAutoArena's user-centered approach and provide valuable insights for the further development of LMMs.
VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centered video analysis. The automated evaluation and adaptive question development enable efficient and comprehensive performance analysis. The insights gained contribute to identifying the strengths and weaknesses of current models and drive the development of future, more powerful LMMs for video analysis.