AI-Powered Video Compression Method Improves Efficiency and Understanding

Efficient Video Understanding through AI: A New Approach to Video Compression

Processing video with artificial intelligence (AI) is challenging because of the sheer volume of data involved. Large multimodal models (LMMs), which process both text and visual information, require particularly large amounts of computing power. A new approach called "Quicksviewer" promises a remedy through an innovative video compression method.

Dynamic Compression through "Video Cubes"

Conventional LMMs analyze each video frame with the same intensity, regardless of its information content. Quicksviewer, on the other hand, dynamically divides videos into so-called "Video Cubes." These cubes vary in length and represent sections with different information densities. The segmentation is based on Gumbel-Softmax, a differentiable relaxation of categorical sampling. It allows frames to be assigned to cubes probabilistically while the assignment remains trainable by backpropagation.
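To make the idea of a probabilistic frame-to-cube assignment more concrete, here is a minimal sketch in plain Python. It assumes a hypothetical setup in which some network has already produced, for every frame, two logits ("continue the current cube" versus "start a new cube"). The article does not describe the actual inputs or architecture of the cubing network, so the function names and the two-logit formulation are illustrative assumptions only.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed categorical sample from unnormalized logits."""
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def partition_into_cubes(boundary_logits, tau=1.0, rng=random):
    """Group frame indices into variable-length cubes.

    boundary_logits[i] is a hypothetical pair
    (logit_continue, logit_new_cube) for frame i.
    """
    cubes = []
    for i, logits in enumerate(boundary_logits):
        probs = gumbel_softmax(logits, tau, rng)
        starts_new = probs[1] > probs[0]  # hard decision from the soft sample
        if i == 0 or starts_new:
            cubes.append([i])
        else:
            cubes[-1].append(i)
    return cubes
```

Lower values of `tau` push each sample closer to a one-hot decision, while higher values keep it softer. In a trainable model the soft probabilities would carry the gradient; this sketch only shows the sampling and grouping step.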

Each cube is then uniformly resampled, leading to a significant reduction in spatial and temporal redundancy. The developers report an average compression rate of 45x. This approach not only makes processing more efficient but also allows training with a larger receptive field, which improves the understanding of longer video sequences.
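The arithmetic behind such a compression figure can be sketched as follows, assuming that every cube is reduced to the same fixed token budget regardless of how many frames it contains. The function name, the parameters, and all numbers below are hypothetical and chosen for easy arithmetic; they are not taken from Quicksviewer.

```python
def token_budget(cube_lengths, tokens_per_frame, tokens_per_cube):
    """Compare token counts before and after per-cube compression."""
    num_frames = sum(cube_lengths)
    tokens_before = num_frames * tokens_per_frame
    tokens_after = len(cube_lengths) * tokens_per_cube
    return {
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "compression_ratio": tokens_before / tokens_after,
        "tokens_per_frame_after": tokens_after / num_frames,
    }


# Hypothetical example: three cubes covering 60 frames in total.
stats = token_budget(cube_lengths=[10, 20, 30], tokens_per_frame=64, tokens_per_cube=64)
print(stats["compression_ratio"])       # 20.0
print(stats["tokens_per_frame_after"])  # 3.2
```

Under this assumption the ratio depends only on the average cube length and the two token budgets, so longer cubes over low-information stretches are what drive the ratio up.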

Three-Stage Training for Improved Performance

Quicksviewer is trained in three progressive stages, starting from a language model. Each stage uses long videos with an average length of 420 seconds, sampled at 1 frame per second (roughly 420 frames per video). The efficiency gained from dynamic compression is what makes training on such long videos feasible. Remarkably, the model was trained on only 0.8 million video-text examples.

Convincing Results Compared to Conventional Methods

Compared with a baseline model that uses fixed partitioning, Quicksviewer improves accuracy by up to 8.72 percentage points. On the Video-MME benchmark, it achieves state-of-the-art results at moderate sequence lengths while using at most 5% of the tokens per frame that baseline models require. Scaling the number of input frames shows that model capability follows a clear power law.
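For readers unfamiliar with the term, a power law is a relation of the form y = a * x^b, which becomes a straight line when both axes are plotted logarithmically. The sketch below shows one common way to estimate the exponent, a least-squares line fit in log-log space. The data points are synthetic, generated from a known curve purely for illustration; they are not results reported for Quicksviewer, and the variable names are assumptions.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line fit in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    var = sum((lx - mean_x) ** 2 for lx in log_x)
    b = cov / var                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b


# Synthetic points generated from y = 2 * x**0.5 (not measurements from the article).
frame_counts = [8, 16, 32, 64, 128]
synthetic_scores = [2.0 * f ** 0.5 for f in frame_counts]
a, b = fit_power_law(frame_counts, synthetic_scores)
```

With real benchmark numbers, the quality of the straight-line fit in log-log space is what would indicate whether a power law describes the scaling well.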

Furthermore, empirical tests have shown that the segments generated by the cubing network support the analysis of continuous events in videos. This opens up new possibilities for the application of AI in areas such as video analysis, video surveillance, and content creation.

Outlook