Text-to-video (T2V) models have made impressive progress in recent years. Generating videos from a simple text prompt opens up a range of applications, from the automated creation of marketing materials to the production of personalized learning videos. Despite this progress, one challenge remains: precisely aligning generated videos with human perceptions and expectations.
A recent research paper titled "LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment" addresses exactly this issue. The researchers argue that human preferences are subjective and complex, and therefore difficult to capture with objective evaluation metrics. Conventional T2V models trained on large datasets can deliver impressive results, but they do not always match users' tastes or intentions.
The approach proposed by the researchers, LiFT, is based on a three-stage process. First, a dataset of human evaluations, LiFT-HRA, was created, comprising around 10,000 human annotations of generated videos. Each annotation contains a rating and an accompanying justification. These detailed evaluations form the basis for the second stage of the process.
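The annotation format described above can be pictured as a simple record pairing a score with its rationale. The following is a minimal sketch; the field names and rating scale are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class VideoAnnotation:
    """One human annotation for a generated video (illustrative schema)."""
    video_id: str        # identifier of the generated clip
    prompt: str          # the text input used to generate the video
    rating: int          # human quality score, e.g. on a 1-5 scale (assumed)
    justification: str   # free-text reason accompanying the rating

# Example entry resembling the kind of data LiFT-HRA collects:
ann = VideoAnnotation(
    video_id="vid_0001",
    prompt="a cat chasing a red ball across a lawn",
    rating=2,
    justification="The ball disappears mid-motion; temporal consistency is poor.",
)
```

Storing the justification alongside the numeric score is what makes the dataset useful for training a critic that can explain its judgments, not just score them.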
In the second step, a reward model, LiFT-Critic, is trained on these human evaluations. It learns to assess the quality of generated videos and how well they correspond to the text input, serving as a proxy for human judgment and enabling automated evaluation of the generated videos.
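Conceptually, a reward model of this kind maps a (video, prompt) pair to a scalar score that predicts the human rating. The sketch below shows this idea in miniature with NumPy, assuming precomputed video-text feature vectors and synthetic ratings; the actual LiFT-Critic is a fine-tuned multimodal model, not a linear regressor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: each (video, prompt) pair is represented by a feature vector
# (e.g. from a frozen video-text encoder); targets stand in for human ratings.
X = rng.normal(size=(200, 16))                    # 200 annotated pairs, 16-dim features
true_w = rng.normal(size=16)
y = X @ true_w + rng.normal(scale=0.1, size=200)  # synthetic "human ratings"

# Train a linear reward model by gradient descent on squared error.
w = np.zeros(16)
lr = 0.05
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad

def reward(features: np.ndarray) -> float:
    """Predicted human-preference score for one (video, prompt) pair."""
    return float(features @ w)

# The learned critic should closely approximate the ratings on the training data.
mse = float(np.mean((X @ w - y) ** 2))
```

Once such a critic exists, it can score arbitrarily many new generations automatically, which is what makes the fine-tuning stage practical.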
In the third and final stage, the T2V model is fine-tuned using the reward model: by maximizing the reward-weighted likelihood, the model learns to generate videos that better match human preferences. The researchers demonstrated the effectiveness of LiFT on the CogVideoX-2B model. The results show that the fine-tuned model outperformed the significantly larger CogVideoX-5B model on all 16 evaluated metrics.
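The reward-weighted likelihood idea can be sketched as follows: each sample's log-likelihood under the model is weighted by the critic's reward, so high-reward videos contribute more to the training signal. The softmax normalization here is a common modeling choice for turning scores into weights, not necessarily the paper's exact formulation:

```python
import math

def reward_weighted_nll(log_likelihoods, rewards):
    """Reward-weighted negative log-likelihood (illustrative form).

    Rewards are converted to weights via a softmax (an assumption here),
    so samples the critic rates highly dominate the loss being minimized.
    """
    m = max(rewards)
    exp_r = [math.exp(r - m) for r in rewards]  # subtract max for stability
    z = sum(exp_r)
    weights = [e / z for e in exp_r]
    return -sum(w * ll for w, ll in zip(weights, log_likelihoods))

# Samples with higher reward contribute more to the training signal:
loss = reward_weighted_nll(log_likelihoods=[-2.0, -1.0, -3.0],
                           rewards=[0.2, 0.9, 0.1])
```

Minimizing this loss pushes probability mass toward generations the critic, and by extension the human annotators, prefer.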
The results of the study underscore the importance of human feedback for the development and improvement of AI models. The integration of subjective human evaluations makes it possible to align the models more precisely with the needs of the users and to increase the quality of the generated content. This approach is particularly relevant for creative applications like text-to-video generation, where the evaluation of the results strongly depends on individual preferences.
The development of LiFT is an important step towards human-centered AI development. By incorporating human feedback, AI systems can be made not only more powerful, but also more user-friendly and trustworthy. For companies like Mindverse, which specialize in the development of AI-powered content solutions, these research results offer valuable impetus for the further development of their products and services.
Bibliography:
Wang, Y. et al. (2024). LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment. arXiv preprint arXiv:2412.04814.
Lee, K. et al. (2023). Aligning Text-to-Image Models using Human Feedback. arXiv preprint arXiv:2302.12192.
Xie, A. et al. (2024). Leveraging Human Revisions for Improving Text-to-Layout Models. arXiv preprint arXiv:2405.13026.
Wu, X. et al. (2024). Boosting Text-to-Video Generative Model with MLLMs Feedback. NeurIPS 2024.
Liang, J. et al. (2024). Rich Human Feedback for Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1153-1162).
Zhao, W. et al. (2023). Learning Video Representations From Large Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15368-15378).