Introduction to Kandinsky 2.1

Ng Wai Foong
5 min read · May 26

A multilingual text2image latent diffusion model

Image by the author

Kandinsky 2.1 is a new multilingual text2image latent diffusion model that inherits best practices from its predecessors, DALL-E 2 and Latent Diffusion. It also introduces a few new ideas for text-guided image manipulation and image fusing (interpolation).

Most open-source multilingual models use their own version of CLIP that supports multiple languages. Kandinsky 2.1, on the other hand, uses CLIP to encode both images and text, plus a diffusion image prior that maps between the latent spaces of the two CLIP modalities. This approach has proven effective at improving the visual quality of the generated images, while also enabling new image manipulation capabilities.
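To make the idea of working in CLIP latent space more concrete, here is a toy sketch of image fusing. It assumes that fusing amounts to a weighted sum of CLIP image embeddings, which this article does not spell out, so treat it as an illustration rather than the model's actual implementation. The function name `fuse_embeddings` and the short vectors are hypothetical stand-ins for real embeddings.

```python
from typing import List, Sequence


def fuse_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Blend several embedding vectors into one by a weighted sum.

    Assumption: each vector stands in for a CLIP image embedding, either
    taken from an image directly or produced from text by the prior.
    """
    if not embeddings:
        raise ValueError("at least one embedding is required")
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")

    dim = len(embeddings[0])
    if any(len(emb) != dim for emb in embeddings):
        raise ValueError("all embeddings must have the same length")

    fused = [0.0] * dim
    for emb, weight in zip(embeddings, weights):
        for i, value in enumerate(emb):
            fused[i] += weight * value
    return fused


# Two toy 2-dimensional "embeddings" mixed in equal parts.
mixed = fuse_embeddings([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

Changing the weights shifts the result toward one input or the other, which is the intuition behind interpolating between two concepts.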

Kandinsky 2.1 is based on the following architecture:

  • Transformer (num_layers=20, num_heads=32 and hidden_size=2048)
  • Text encoder (XLM-Roberta-Large-Vit-L-14)
  • Diffusion Image Prior
  • CLIP image encoder (ViT-L/14)
  • Latent Diffusion U-Net
  • MoVQ encoder/decoder

Kandinsky 2.1 works extremely well with creative prompts, as it tends to understand the input prompt better than Stable Diffusion. This is mainly because it follows the same practice as DALL-E 2 when it comes to text encoding.

First, it encodes the text prompt with CLIP. Then, it diffuses the CLIP text embeddings into CLIP image embeddings. Finally, it uses the image embeddings for image generation.
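The three steps above can be sketched with the diffusers package as follows. This is a minimal sketch, not a definitive implementation: the checkpoint names, the image size, and the step count are assumptions taken from the diffusers Kandinsky pipelines rather than from this article, and the `generate` helper is a hypothetical wrapper. It also assumes a CUDA-capable GPU.

```python
# Assumed Hugging Face Hub checkpoints for Kandinsky 2.1.
PRIOR_REPO = "kandinsky-community/kandinsky-2-1-prior"
DECODER_REPO = "kandinsky-community/kandinsky-2-1"


def generate(prompt: str, negative_prompt: str = "", device: str = "cuda"):
    """Run the prior and the decoder and return one PIL image."""
    # Imported lazily so the heavy dependencies load only when needed.
    import torch
    from diffusers import KandinskyPipeline, KandinskyPriorPipeline

    # Steps 1 and 2: encode the prompt with CLIP, then let the diffusion
    # prior turn the text embeddings into CLIP image embeddings.
    prior = KandinskyPriorPipeline.from_pretrained(
        PRIOR_REPO, torch_dtype=torch.float16
    ).to(device)
    image_embeds, negative_image_embeds = prior(
        prompt, negative_prompt
    ).to_tuple()

    # Step 3: generate the image from the image embeddings.
    decoder = KandinskyPipeline.from_pretrained(
        DECODER_REPO, torch_dtype=torch.float16
    ).to(device)
    result = decoder(
        prompt,
        image_embeds=image_embeds,
        negative_image_embeds=negative_image_embeds,
        height=768,  # assumed output size
        width=768,
        num_inference_steps=50,  # assumed step count
    )
    return result.images[0]
```

Calling `generate("a red cat, 4k photo")` downloads both checkpoints on first use. The prior returns a pair of embeddings, one for the prompt and one for the negative prompt, and the decoder needs both.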

Based on a local experiment, certain concepts may affect the whole image rather than just individual parts. For example,

  • the prompt “silky hair” may affect the clothes as well as the hair
  • the prompt “blue hair” may also affect the color of the eyes

Image by the author

This tutorial is based on the diffusers package. For the original implementation, kindly check the following notebooks:
