The AI startup Genmo has released its video model Mochi 1 as open source. With 10 billion parameters, it is, according to the company, the largest publicly available AI model for video generation. The model was developed from scratch and, Genmo says, sets new standards in two crucial areas: motion quality and how faithfully text prompts are followed.
Mochi 1 generates videos at 30 frames per second and up to 5.4 seconds in length. According to Genmo, it simulates physical effects such as liquids, fur, and hair movement with great realism. The model is optimized for photorealistic content and less suited to animated content; distortions can occasionally occur with extreme motion.
The current version of Mochi 1 generates videos at a resolution of 480p. An HD version with 720p resolution is expected to follow later this year.
Technically, Mochi 1 is based on a new architecture called Asymmetric Diffusion Transformer (AsymmDiT). It processes text and video content in separate streams, with the visual stream using about four times as many parameters as the text-processing stream. Unlike other modern diffusion models, Mochi 1 relies on a single language model (T5-XXL) to encode prompts, which is intended to increase efficiency.
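To make the asymmetric design concrete, here is a minimal sketch in PyTorch of a joint-attention block in which a wide visual stream and a narrower text stream share a single attention operation. All dimensions, names, and the projection scheme are illustrative assumptions, not Genmo's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AsymmetricJointBlock(nn.Module):
    """Joint-attention block with unequal stream widths (illustrative sketch only)."""

    def __init__(self, visual_dim=3072, text_dim=1536, attn_dim=1536, n_heads=12):
        super().__init__()
        # Each modality keeps its own residual stream; the visual stream is
        # wider, so its projections carry the bulk of the parameters.
        self.visual_qkv = nn.Linear(visual_dim, 3 * attn_dim)
        self.text_qkv = nn.Linear(text_dim, 3 * attn_dim)
        self.visual_out = nn.Linear(attn_dim, visual_dim)
        self.text_out = nn.Linear(attn_dim, text_dim)
        self.n_heads = n_heads
        self.head_dim = attn_dim // n_heads

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (B, Nv, visual_dim), text: (B, Nt, text_dim)
        B, Nv, _ = visual.shape

        # Project both streams into a shared attention space and concatenate,
        # so video tokens can attend to text tokens and vice versa.
        qkv = torch.cat([self.visual_qkv(visual), self.text_qkv(text)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)

        def to_heads(x: torch.Tensor) -> torch.Tensor:
            return x.view(B, -1, self.n_heads, self.head_dim).transpose(1, 2)

        out = F.scaled_dot_product_attention(to_heads(q), to_heads(k), to_heads(v))
        out = out.transpose(1, 2).reshape(B, -1, self.n_heads * self.head_dim)

        # Route the result back into each modality's own (differently sized) stream.
        visual = visual + self.visual_out(out[:, :Nv])
        text = text + self.text_out(out[:, Nv:])
        return visual, text


if __name__ == "__main__":
    block = AsymmetricJointBlock()
    video_tokens = torch.randn(1, 8, 3072)  # stand-in for patchified video latents
    text_tokens = torch.randn(1, 4, 1536)   # stand-in for T5-XXL prompt embeddings
    v, t = block(video_tokens, text_tokens)
    print(v.shape, t.shape)  # torch.Size([1, 8, 3072]) torch.Size([1, 4, 1536])
```

Because both streams project into a shared attention space, the prompt can steer every block while most of the parameters stay on the visual side.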
In benchmarks, the model follows text prompts more accurately than the competition and, in terms of motion quality, simulates complex physical effects more realistically.
Genmo also describes Mochi 1 as a world model, although recent studies have cast doubt on whether video generators actually possess this capability.
Coinciding with the model's release, Genmo announced a $28.4 million Series A funding round led by NEA. The Genmo team includes core members of major AI projects such as DDPM, DreamFusion, and Emu Video.
The model weights and code are available under the Apache-2.0 license on Hugging Face and GitHub. The model can also be tried out for free in a basic playground on the Genmo website, which additionally showcases numerous community examples along with their prompts.
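For anyone who wants to work with the weights directly, a minimal sketch of downloading them with the official huggingface_hub client might look like this; note that the repo id below is an assumption based on the announcement and should be checked against the actual model card:

```python
# Minimal sketch: fetching the open weights with the huggingface_hub client.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="genmo/mochi-1-preview",  # assumed repo id; verify on Hugging Face
    local_dir="mochi-1-weights",
)
print(f"Weights downloaded to {local_dir}")
```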
Although the quality is impressive for an open video model, commercial models such as Runway Gen-3 currently have the edge. Runway's tool produces longer, higher-resolution clips and supports additional features such as image prompts, virtual camera movements, and the transfer of facial expressions to an AI character. Comparable offerings are available from Kling, Vidu, and MiniMax, and Meta also recently introduced its own video model, Movie Gen. Thanks to Mochi 1's open-source license and active community, however, the model could evolve quickly and become a serious challenger to commercial offerings. Its straightforward architecture and free availability also make it an attractive tool for research and development in AI video generation.