November 21, 2024

JoyVASA: A Two-Stage Diffusion Model for Audio-Driven Portrait and Animal Animation

JoyVASA: An Innovative Approach to Animating Portraits and Animal Pictures

The automated animation of images, particularly portraits, from audio has made significant progress in recent years. Diffusion-based models have considerably improved the quality of generated videos and the accuracy of lip synchronization. However, the growing complexity of these models brings its own challenges: inefficient training and inference, limits on video length, and discontinuities between consecutive frames.

JoyVASA, a new diffusion-based method, addresses these issues and generates both facial dynamics and head movements for audio-driven facial animation. The approach rests on a two-stage process. In the first stage, a decoupled facial representation framework separates dynamic facial expressions from static 3D facial representations; this decoupling allows longer videos to be created by combining any static 3D facial representation with dynamic motion sequences. In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independently of the character's identity. Finally, a generator trained in the first stage takes the 3D facial representation and the generated motion sequences as input and renders high-quality animations.
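To make the data flow of this two-stage pipeline concrete, here is a minimal Python sketch. All function names, feature dimensions, and stub implementations are illustrative assumptions, not the actual JoyVASA API:

```python
import numpy as np

# Hypothetical stand-ins for the two JoyVASA stages; the real models live in
# the jdh-algo/JoyVASA repository, and every name and shape below is a
# placeholder chosen for illustration.

def encode_static_face(image: np.ndarray) -> np.ndarray:
    """Stage 1, appearance branch: map a reference image to a static 3D
    facial representation. A random vector stands in for the encoder."""
    return np.random.randn(128)

def generate_motion(audio_features: np.ndarray, num_frames: int) -> np.ndarray:
    """Stage 2: an audio-conditioned diffusion transformer would sample one
    motion code (expressions + head pose) per frame. Stubbed with noise."""
    return np.random.randn(num_frames, 63)

def render(static_face: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stage 1, renderer: combine the static representation with each
    per-frame motion code. Placeholder frames stand in for real output."""
    return np.stack([np.zeros((256, 256, 3)) for _ in motion])

reference_image = np.zeros((256, 256, 3))  # any portrait or animal image
audio_features = np.random.randn(200, 80)  # e.g. log-mel spectrogram frames

static_face = encode_static_face(reference_image)
motion = generate_motion(audio_features, num_frames=150)
video = render(static_face, motion)
print(video.shape)  # (150, 256, 256, 3)
```

Because the static representation and the motion sequence are independent, a longer audio clip only requires sampling a longer motion sequence; the same static face is reused for every frame, which is what enables longer videos.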

Functionality and Advantages of JoyVASA

JoyVASA builds on LivePortrait, a decoupled facial representation framework, to separate dynamic facial expressions from static 3D facial representations. This allows static representations to be combined flexibly with dynamically generated sequences, resulting in more precise and adaptable animations. A diffusion transformer synthesizes the motion sequences, including dynamic facial expressions and head movements, from audio cues alone. Because this generator is independent of character identity, the method is versatile enough to animate both human and animal faces.
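As a rough illustration of how such a motion generator samples from audio alone, the following sketch runs a standard DDPM-style reverse-diffusion loop over a whole motion sequence. The denoiser is a stub where a trained diffusion transformer would sit, and the schedule, dimensions, and conditioning shape are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x_t, t, audio_cond):
    """Stand-in for the diffusion transformer: predicts the noise in x_t
    given the timestep and audio conditioning. A zero prediction keeps
    this sketch runnable; a real model would be trained for this."""
    return np.zeros_like(x_t)

# Standard linear beta schedule (DDPM)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

num_frames, motion_dim = 150, 63
audio_cond = rng.standard_normal((num_frames, 32))  # hypothetical audio embedding

# Reverse process: start from Gaussian noise over the full motion sequence
x = rng.standard_normal((num_frames, motion_dim))
for t in reversed(range(T)):
    eps = denoiser(x, t, audio_cond)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x - coef * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise

motion_sequence = x  # per-frame expression and head-pose codes
```

Note that nothing identity-specific enters the loop: the conditioning is audio only, which is what lets the same motion generator drive human and animal faces alike.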

A renderer, trained within the decoupled representation framework, integrates the static 3D facial representations with the generated motion sequences to create high-quality animated outputs. JoyVASA was trained on a hybrid dataset combining a private Chinese dataset with two publicly available English datasets to ensure better multilingual support.

Compared to end-to-end approaches that generate video directly from audio, JoyVASA offers more flexible control over facial expressions and head movements through its intermediate representations, which leads to more realistic and coherent animations. Its two-stage diffusion approach also outperforms older models that map audio to facial landmarks or 3DMM coefficients.
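The intermediate motion representation is precisely what makes this control possible: the sequence can be edited before it ever reaches the renderer. A small sketch, assuming (hypothetically) that the first six dimensions of each motion code carry head rotation and translation:

```python
import numpy as np

def damp_head_pose(motion: np.ndarray,
                   pose_slice: slice = slice(0, 6),
                   scale: float = 0.5) -> np.ndarray:
    """Scale down the head-pose dimensions of each per-frame motion code
    to calm the head movement, leaving expressions untouched. The layout
    of the motion code is an assumption made for illustration."""
    edited = motion.copy()
    edited[:, pose_slice] *= scale
    return edited

motion = np.random.randn(150, 63)        # sequence from the diffusion stage
calmer = damp_head_pose(motion, scale=0.3)
# The edited sequence is then passed to the stage-1 renderer unchanged.
```

An end-to-end audio-to-video model offers no such hook: its pixels would have to be regenerated from scratch to change the head motion.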

Applications and Future Developments

The technology behind JoyVASA finds application in various fields, including digital avatars, virtual assistants, and entertainment. The realistic animations contribute to increased user engagement. Future work will focus on improving real-time performance and refining expression control to further expand the application possibilities in the field of portrait animation. The ability to animate still images with such accuracy could revolutionize the way we interact with digital content and personalize our online experiences.

Despite the advancements that JoyVASA represents, there are still some limitations. The model primarily focuses on frontal faces and might struggle with profile views or extreme head rotations. Further research could address expanding the model to handle a wider range of head poses and facial expressions. Additionally, the computational cost of diffusion models can be high, limiting real-time applications. Future work could investigate optimizing the model for faster inference.
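One common way to cut diffusion inference cost, which such future work could plausibly apply, is to sample on a strided subset of timesteps (DDIM-style deterministic sampling), collapsing many training steps into few inference steps. A minimal sketch with a stub denoiser; the step counts and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x_t, t, cond):
    """Stub noise predictor; a trained motion diffusion model goes here."""
    return np.zeros_like(x_t)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# DDIM (eta = 0): 1000 training timesteps collapsed to 25 inference steps
steps = np.linspace(T - 1, 0, 25).astype(int)

x = rng.standard_normal((150, 63))        # noisy motion sequence
cond = rng.standard_normal((150, 32))     # hypothetical audio embedding
for i, t in enumerate(steps):
    eps = denoiser(x, t, cond)
    # Predict the clean sequence, then step deterministically to t_prev
    x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
    t_prev = steps[i + 1] if i + 1 < len(steps) else None
    ab_prev = alpha_bars[t_prev] if t_prev is not None else 1.0
    x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps
```

A 40x reduction in denoising steps translates almost directly into a 40x reduction in motion-generation time, at some cost in sample diversity.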

Bibliography:
https://arxiv.org/abs/2411.09209
https://arxiv.org/html/2411.09209v3
https://github.com/jdh-algo/JoyVASA
https://www.reddit.com/r/StableDiffusion/comments/1guu5kp/joyvasa_portrait_and_animal_image_animation_with/
https://www.aimodels.fyi/papers/arxiv/joyvasa-portrait-animal-image-animation-diffusion-based
https://x.com/CSVisionPapers/status/1857524344974405783
https://bytez.com/docs/arxiv/2411.09209/paper
https://gradio.app/