Introduction to Kandinsky 2.1

Ng Wai Foong
5 min read · May 26, 2023

A multilingual text2image latent diffusion model

Image by the author

Kandinsky 2.1 is a new multilingual text2image latent diffusion model that inherits best practices from DALL-E 2 and Latent Diffusion. It also introduces a few new ideas for text-guided image manipulation and image fusing (interpolation).

Most open-sourced multilingual models use their own version of CLIP that supports multiple languages. Kandinsky 2.1, on the other hand, uses CLIP to encode both images and text, together with a diffusion image prior that maps between the latent spaces of the CLIP modalities. This approach has proven effective at improving the visual quality of the generated images while enabling new image-manipulation capabilities.
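To make the prior step concrete, here is a minimal sketch of how the text-to-image-embedding mapping can be exercised through the Hugging Face diffusers integration. Note that the checkpoint name kandinsky-community/kandinsky-2-1-prior and the KandinskyPriorPipeline API are assumptions on my part about the packaging, not something dictated by the model itself.

import torch
from diffusers import KandinskyPriorPipeline

# Load the diffusion image prior: it maps a CLIP text embedding
# to a CLIP image embedding in the shared latent space.
pipe_prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")

prompt = "a portrait of a fox in a forest, watercolor"
negative_prompt = "low quality, blurry"

# The prior returns CLIP image embeddings for the prompt and for the negative prompt;
# these embeddings are what the decoder part of the model conditions on later.
prior_output = pipe_prior(prompt, negative_prompt=negative_prompt, guidance_scale=1.0)
image_embeds = prior_output.image_embeds
negative_image_embeds = prior_output.negative_image_embeds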

Kandinsky 2.1 is based on the following architecture (a short usage sketch follows the list):

  • Transformer (num_layers=20, num_heads=32 and hidden_size=2048)
  • Text encoder (XLM-Roberta-Large-Vit-L-14)
  • Diffusion Image Prior
  • CLIP image encoder (ViT-L/14)
  • Latent Diffusion U-Net
  • MoVQ encoder/decoder
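Putting these components together, a hedged end-to-end sketch might look like the following: the prior produces CLIP image embeddings, then the decoder pipeline (Latent Diffusion U-Net plus MoVQ decoder) turns them into a picture. Again, the checkpoint names and the diffusers KandinskyPipeline API are assumptions about the packaging, not part of the original release.

import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

pipe_prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")
pipe = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a red cat wearing a spacesuit, 4k photo"

# Step 1: the diffusion image prior maps the text to CLIP image embeddings.
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()

# Step 2: the U-Net denoises in latent space and the MoVQ decoder
# converts the final latents into pixels.
image = pipe(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    num_inference_steps=50,
).images[0]
image.save("red_cat.png")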

Kandinsky 2.1 works extremely well with creative prompts, as it understands the input prompt much better than Stable Diffusion. This is mainly because it uses the same practice as DALL-E 2 when…

