Processing video with artificial intelligence (AI) is challenging because of the enormous amount of data involved. Large multimodal models (LMMs), which process both text and visual information, are particularly demanding in terms of computing power. A new approach called "Quicksviewer" promises a remedy through a new video compression method.
Conventional LMMs analyze each video frame with the same intensity, regardless of its information content. Quicksviewer instead divides a video dynamically into so-called "Video Cubes." These cubes vary in length and correspond to sections of differing information density. The segmentation is based on Gumbel-Softmax, a differentiable relaxation of categorical sampling, which allows frames to be assigned to cubes probabilistically while keeping the segmentation step trainable by gradient descent.
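To make the idea more concrete, the following is a minimal sketch of Gumbel-Softmax sampling applied to a toy segmentation problem. It is not the Quicksviewer cubing network: the function names, the two-way score per frame (continue the current cube versus start a new one), and the temperature `tau` are assumptions made purely for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over `logits` (Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep the logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def split_into_cubes(boundary_logits, tau=0.5, rng=random):
    """Split frame indices into variable-length cubes.

    boundary_logits[i] is a hypothetical score pair
    (continue_current_cube, start_new_cube) for frame i.
    """
    cubes = [[0]]  # frame 0 always opens the first cube
    for i in range(1, len(boundary_logits)):
        probs = gumbel_softmax(boundary_logits[i], tau, rng)
        if probs[1] > probs[0]:  # hard decision: a new cube starts here
            cubes.append([i])
        else:
            cubes[-1].append(i)
    return cubes


# Eight frames; the scores suggest a scene change at frame 4.
rng = random.Random(0)
logits = [(2.0, -2.0)] * 4 + [(-2.0, 2.0)] + [(2.0, -2.0)] * 3
cubes = split_into_cubes(logits, rng=rng)
```

Because the sampling is stochastic, the resulting partition can differ from run to run. In a trainable model, the soft probabilities would carry gradients, while the hard decision shown here would only be used for the actual split.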
Each cube is then uniformly resampled, leading to a significant reduction in spatial and temporal redundancy. The developers report an average compression rate of 45x. This approach not only enables more efficient processing but also allows training with a larger receptive field, which improves the understanding of longer video sequences.
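One simple way to picture uniform resampling is to give every cube the same fixed budget, regardless of its length, and to pick evenly spaced frames from it. The sketch below follows that reading; it is an assumption rather than the published resampler, and the helper names and the budget `k` are hypothetical. It only covers the temporal side, so the ratio it computes is not comparable to the reported 45x, which also includes spatial compression.

```python
def resample_cube(frames, k):
    """Pick at most k evenly spaced items from `frames`, keeping their order."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(frames)
    if n <= k:
        return list(frames)  # short cubes are kept as they are
    if k == 1:
        return [frames[n // 2]]  # a single representative from the middle
    # Evenly spaced indices from the first to the last frame of the cube.
    return [frames[round(i * (n - 1) / (k - 1))] for i in range(k)]


def compression_ratio(cubes, k):
    """Frames before resampling divided by frames kept after resampling."""
    original = sum(len(cube) for cube in cubes)
    kept = sum(len(resample_cube(cube, k)) for cube in cubes)
    return original / kept


# Three cubes of very different length, each reduced to two frames.
cubes = [list(range(90)), list(range(10)), list(range(20))]
ratio = compression_ratio(cubes, k=2)  # 120 frames in, 6 frames out
```

Under this scheme a long, uneventful cube is compressed far more strongly than a short one, which matches the intuition that low-information sections should cost fewer tokens.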
Quicksviewer is trained in three progressive stages, starting from a language model. Each stage uses long videos with an average length of 420 seconds, sampled at 1 frame per second. The dynamic compression is what makes it practical to train on video data of this length. Remarkably, the model was trained with only 0.8 million video-text examples.
Compared to a baseline model with fixed partitioning, Quicksviewer improves accuracy by up to 8.72 percentage points. On the Video-MME benchmark, it also achieves state-of-the-art results at moderate sequence lengths while requiring at most 5% of the tokens per frame used by baseline models. Scaling the number of input frames reveals a clear power law in the model's capabilities.
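A power-law relationship of the form y = a * x^b appears as a straight line when both axes are logarithmic, so it can be checked with an ordinary least-squares fit in log-log space. The following sketch shows that check. The function name is hypothetical, and the numbers are synthetic placeholders generated from an exact power law; they are not measurements from Quicksviewer.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                        # slope of the line = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept of the line = log(a)
    return a, b


# Synthetic placeholder data, NOT results from the paper:
# a generic metric that follows 3.0 * frames ** -0.25 exactly.
frames = [16, 32, 64, 128, 256]
metric = [3.0 * f ** -0.25 for f in frames]
a, b = fit_power_law(frames, metric)
```

With real measurements the points would scatter around the fitted line, and the quality of the fit indicates how well a power law describes the scaling behavior.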
Furthermore, empirical tests have shown that the segments generated by the cubing network support the analysis of continuous events in videos. This opens up new possibilities for the application of AI in areas such as video analysis, video surveillance, and content creation.
Quicksviewer represents a promising approach for efficient video understanding. The dynamic compression through Video Cubes allows for a significantly reduced computational load and simultaneously improved performance. Future research could focus on optimizing the cubing algorithm and applying the model to further video tasks.