Introduction to Kandinsky 2.1
A multilingual text2image latent diffusion model
Kandinsky 2.1 is a new multilingual text2image latent diffusion model that inherits best practices from DALL-E 2 and Latent Diffusion, while introducing new ideas for text-guided image manipulation and image fusing (interpolation).
Most open-sourced multilingual models rely on their own multilingual version of CLIP. Kandinsky 2.1 instead uses CLIP to encode images and text, together with a diffusion image prior that maps between the latent spaces of the CLIP text and image modalities. This approach has proven effective at improving the visual quality of generated images while enabling new image-manipulation capabilities.
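As a rough sketch of what this prior step looks like in practice, the snippet below uses the Diffusers library's `KandinskyPriorPipeline` to map a prompt into the CLIP image-embedding space. The model ID, prompt, and settings here are illustrative assumptions, not part of this document.

```python
# Illustrative sketch: use the diffusion image prior to map a text prompt
# into the CLIP image-embedding space (model ID and settings are assumptions).
import torch
from diffusers import KandinskyPriorPipeline

pipe_prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
)
pipe_prior.to("cuda")

prompt = "a portrait of a red fox, art nouveau style"
negative_prompt = "low quality, blurry"

# The prior predicts CLIP *image* embeddings from the text embedding.
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt).to_tuple()
```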
Kandinsky 2.1 is based on the following architecture:
- Transformer (num_layers=20, num_heads=32, hidden_size=2048)
- Text encoder (XLM-Roberta-Large-Vit-L-14)
- Diffusion Image Prior
- CLIP image encoder (ViT-L/14)
- Latent Diffusion U-Net
- MoVQ encoder/decoder
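Putting these components together, a minimal end-to-end text2image sketch looks roughly as follows, continuing from the prior sketch above: the `KandinskyPipeline` wraps the latent diffusion U-Net and the MoVQ decoder, and the model ID and settings below are assumptions.

```python
# Illustrative sketch (continues from the prior sketch above).
import torch
from diffusers import KandinskyPipeline

# Decoder stage: the U-Net denoises latents conditioned on the CLIP image
# embeddings produced by the prior, and the MoVQ decoder maps latents to pixels.
pipe = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    num_inference_steps=50,
).images[0]
image.save("fox.png")
```

Recent Diffusers releases also expose a combined pipeline (for example via `AutoPipelineForText2Image`) that runs both stages with a single call.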
Kandinsky 2.1 works especially well with creative prompts, as it tends to follow the input prompt more closely than Stable Diffusion. This is mainly because it adopts the same practice as DALL-E 2: a diffusion image prior first maps the text embedding into the CLIP image-embedding space, and the image is then generated conditioned on that image embedding.