November 21, 2024

ViBe Benchmark Measures Hallucinations in Text-to-Video Models

Hallucinations in Text-to-Video Models: The ViBe Benchmark

The rapid advancement of large multimodal models (LMMs) has expanded their capabilities to video understanding. Text-to-video (T2V) models in particular have made remarkable progress in quality, comprehension, and duration, producing impressive videos from simple text prompts. Nevertheless, they frequently generate hallucinated content that clearly reveals a video as AI-generated.

To address this problem, ViBe was developed: a large-scale benchmark of hallucinated videos produced by T2V models. Five main types of hallucination were identified: Vanishing Object, Numerical Variability, Temporal Dysmorphia, Omission Error, and Physical Inconsistency. Using 10 open-source T2V models, the authors created the first large dataset of hallucinated videos, comprising 3,782 videos that human annotators manually assigned to these five categories.

ViBe offers a unique resource for evaluating the reliability of T2V models and forms a basis for improving hallucination detection and mitigation in video generation. As a classification baseline, various ensemble classifier configurations were evaluated; the combination of TimeSFormer and a CNN performed best, reaching an accuracy of 0.345 and an F1 score of 0.342. The benchmark aims to drive the development of more robust T2V models that generate videos more faithfully aligned with their text prompts.
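The article does not detail how the ensemble baseline combines its components, so the following is only a minimal sketch of one common approach: averaging the class logits of two per-video classifiers (stand-ins for the TimeSFormer and CNN heads) and scoring the result with accuracy and macro-F1. All model outputs and labels below are illustrative placeholders, not the paper's actual data.

```python
# Hedged sketch: logit-averaging ensemble over the five ViBe categories.
# The real baseline architecture is not specified in the article.
import numpy as np

CATEGORIES = ["vanishing_object", "numerical_variability",
              "temporal_dysmorphia", "omission_error",
              "physical_inconsistency"]

def ensemble_predict(logits_a, logits_b, weight_a=0.5):
    """Combine two classifiers' logits (e.g. a TimeSFormer head and a
    CNN head) by weighted averaging, then take the argmax class."""
    combined = weight_a * logits_a + (1.0 - weight_a) * logits_b
    return combined.argmax(axis=1)

def macro_f1(y_true, y_pred, n_classes=5):
    """Macro-averaged F1 over the hallucination categories."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))

if __name__ == "__main__":
    # Purely synthetic demonstration data.
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 5, size=100)
    logits_a = rng.normal(size=(100, 5))   # stand-in TimeSFormer logits
    logits_b = rng.normal(size=(100, 5))   # stand-in CNN logits
    y_pred = ensemble_predict(logits_a, logits_b)
    print(float(np.mean(y_pred == y_true)), macro_f1(y_true, y_pred))
```

Logit averaging is only one of several plausible ensembling schemes; majority voting or learned fusion weights would fit the same interface.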

The Creation of the ViBe Dataset

The ViBe dataset was created by randomly selecting 700 captions from the MS-COCO dataset. These diverse, descriptive text inputs are well suited for probing the generative performance of T2V models. The selected captions were fed to ten different open-source T2V models, chosen to span a range of architectures, model sizes, and training paradigms. The resulting 3,782 videos were then manually annotated by humans to identify the hallucination types.
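The generation step described above (sample captions, run every caption through every model, collect the outputs for annotation) can be sketched as a simple job-building loop. The function name, the job-dict schema, and the seed are illustrative assumptions; the article does not specify the actual pipeline code.

```python
# Hedged sketch of the ViBe generation pipeline: sample N captions and
# pair each with every T2V model, yielding one generation job per
# (caption, model) combination for later human annotation.
import random

def build_generation_jobs(captions, models, n_captions=700, seed=42):
    """Return one job dict per (sampled caption, model) pair."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    sample = rng.sample(captions, min(n_captions, len(captions)))
    return [{"caption": c, "model": m} for c in sample for m in models]
```

With 700 captions and 10 models this yields up to 7,000 jobs; the published dataset's 3,782 videos suggest that not every generation survived to annotation, though the article does not say how the final set was filtered.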

The Five Categories of Hallucinations

The hallucination types identified within ViBe offer a detailed categorization of the most common errors in T2V-generated videos:

Vanishing Object: The object or a part of it disappears intermittently at arbitrary points in the video.

Numerical Variability: When the text input specifies the number of objects, the generated video increases or decreases the number of object instances.

Temporal Dysmorphia: Objects in the video exhibit continuous temporal deformation, changing their shape, size, or orientation throughout the sequence.

Omission Error: The generated video omits essential components of the original text input—excluding cases with specified object counts—resulting in an incomplete or inaccurate representation.

Physical Inconsistency: The generated video violates fundamental physical laws or juxtaposes incompatible elements, leading to perceptual discrepancies or cognitive dissonance in the viewer.
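For clarity, the five-way taxonomy above maps naturally onto a small annotation schema. The enum values, record fields, and helper below are illustrative assumptions, not the dataset's actual format.

```python
# Hedged sketch: the five ViBe hallucination categories as an enum,
# plus a minimal per-video annotation record and a tally helper.
from dataclasses import dataclass
from enum import Enum

class Hallucination(Enum):
    VANISHING_OBJECT = "vanishing_object"
    NUMERICAL_VARIABILITY = "numerical_variability"
    TEMPORAL_DYSMORPHIA = "temporal_dysmorphia"
    OMISSION_ERROR = "omission_error"
    PHYSICAL_INCONSISTENCY = "physical_inconsistency"

@dataclass
class Annotation:
    video_id: str       # identifier of the generated video
    caption: str        # MS-COCO caption used as the prompt
    model: str          # which T2V model produced the video
    label: Hallucination

def count_labels(annotations):
    """Tally annotations per hallucination category."""
    counts = {h: 0 for h in Hallucination}
    for a in annotations:
        counts[a.label] += 1
    return counts
```

A single label per video is assumed here for simplicity; a video exhibiting several failure modes would need a multi-label variant.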

The Significance of ViBe for the Future of T2V Models

ViBe makes a significant contribution to the research and improvement of T2V models. By providing a large, annotated dataset and a standardized evaluation framework, ViBe enables the systematic investigation of hallucinations. This is essential for the development of more robust and reliable T2V models that generate videos that accurately correspond to the given text inputs. The research findings of ViBe serve as a foundation for future work on hallucination detection and mitigation, contributing to realizing the full potential of T2V models.

Bibliography:
https://arxiv.org/abs/2411.10867
https://arxiv.org/html/2411.10867v1
https://vibe-t2v-bench.github.io/
http://paperreading.club/page?id=266654
https://www.galileo.ai/blog/survey-of-hallucinations-in-multimodal-models
https://github.com/Yangyi-Chen/Multimodal-AND-Large-Language-Models
https://www.linkedin.com/posts/usabilitysmith_this-ai-paper-by-reka-ai-introduces-vibe-eval-activity-7192031285396094976-1RCk
https://paperswithcode.com/?ref=steemhunt&page=7
https://publications.reka.ai/reka-vibe-eval.pdf
https://openaccess.thecvf.com/content/CVPR2024/papers/Liu_EvalCrafter_Benchmarking_and_Evaluating_Large_Video_Generation_Models_CVPR_2024_paper.pdf