Introduction to VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
By reading this article, you will learn to perform text-to-video generation using TextToVideoSDPipeline, a new pipeline based on the VideoFusion paper. It is available in the development version of the diffusers package (0.15.0.dev0).
VideoFusion is a new research initiative by the Damo Vilab team that decomposes diffusion models for high-quality video generation. According to the official repository, the text-to-video generation diffusion model consists of three sub-networks: a text feature extraction model, a text-feature-to-video-latent-space diffusion model, and a video-latent-space-to-video-visual-space model. The overall model has about 1.7 billion parameters and currently supports only English input. The diffusion model adopts a UNet3D structure and generates video through an iterative denoising process starting from a pure Gaussian noise video.
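As a preview, here is a minimal sketch of how such a pipeline can be used once everything is installed. The checkpoint name (damo-vilab/text-to-video-ms-1.7b), the prompt, and the assumption of a CUDA GPU are illustrative rather than prescriptive:

import torch
from diffusers import TextToVideoSDPipeline
from diffusers.utils import export_to_video

# Load the text-to-video pipeline in half precision (assumes a CUDA GPU is available)
pipe = TextToVideoSDPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate a short clip from an English prompt and export the frames to a video file
frames = pipe("an astronaut riding a horse on the moon").frames
video_path = export_to_video(frames)
print(video_path)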
The model is licensed under CC BY-NC-ND 4.0, and is meant for research purposes only.
Let’s proceed to the next section for setup and installation.
Setup
First and foremost, it is recommended to create a new virtual environment. Activate it and run the following command to install all the base dependencies:
pip install transformers accelerate
diffusers
Next, run the following command to install the latest development version of diffusers:
pip install git+https://github.com/huggingface/diffusers
At the time of this writing, the stable version of diffusers is 0.14.0, which does not support text-to-video generation. Make sure to install the latest development version until the release of version 0.15.0.
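To verify that the development build is picked up, you can print the installed version from Python; it should report 0.15.0.dev0 or later:

import diffusers

# The development build should report 0.15.0.dev0 (or a later version)
print(diffusers.__version__)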
opencv-python
Note that opencv-python is required for frames-to-video conversion (a minimal example follows the package list below). opencv-python comes with 4 different packages:

opencv-python — main package
opencv-contrib-python — full package (comes with contrib/extra modules)
opencv-python-headless — main package without GUI
opencv-contrib-python-headless — full package (comes with contrib/extra modules) without GUI
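For reference, here is a minimal sketch of frames-to-video conversion with OpenCV. The frame size, frame rate, and output filename are illustrative assumptions, and the randomly generated frames merely stand in for the pipeline's output:

import cv2
import numpy as np

# Placeholder frames standing in for the pipeline output: 16 RGB images of size 256x256
frames = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8) for _ in range(16)]

height, width = frames[0].shape[:2]
writer = cv2.VideoWriter(
    "output.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 8, (width, height)  # 8 frames per second
)

for frame in frames:
    # OpenCV expects BGR channel order, so convert each RGB frame before writing
    writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))

writer.release()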