Introduction to VideoFusion
Decomposed Diffusion Models for High-Quality Video Generation
In this article, you will learn how to perform text-to-video generation with TextToVideoSDPipeline, a new pipeline based on the VideoFusion paper. At the time of writing, it is available in the development version of the diffusers package (0.15.0.dev0).
VideoFusion is a recent research project by the Damo Vilab team that decomposes diffusion models for high-quality video generation. According to the official repository, the text-to-video diffusion model consists of three sub-networks: a text feature extraction model, a text-feature-to-video latent space diffusion model, and a video latent space to video visual space model. The overall model has about 1.7 billion parameters and currently supports only English input. The diffusion model adopts a UNet3D structure and generates a video through an iterative denoising process, starting from a pure Gaussian noise video.
The model is licensed under CC BY-NC-ND 4.0, and is meant for research purposes only.
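To make the decomposition above more concrete, here is a minimal sketch that loads the pipeline and prints the components corresponding to the three sub-networks. It assumes the publicly released damo-vilab/text-to-video-ms-1.7b checkpoint on the Hugging Face Hub, which is not named in this article; adjust the model ID to the checkpoint you plan to use.

```python
# Minimal sketch: inspect how the three sub-networks map onto the pipeline's components.
# The "damo-vilab/text-to-video-ms-1.7b" model ID is an assumption, not taken from this article.
import torch
from diffusers import TextToVideoSDPipeline

pipe = TextToVideoSDPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)

print(type(pipe.text_encoder).__name__)  # text feature extraction (CLIP text encoder)
print(type(pipe.unet).__name__)          # text-feature-to-video latent space diffusion (UNet3D)
print(type(pipe.vae).__name__)           # video latent space to video visual space (VAE decoder)
```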
Let’s proceed to the next section for setup and installation.
Setup
First and foremost, it is recommended to create a new virtual environment. Activate it and run the following command to install all the base dependencies: