Introduction to ControlNet for Stable Diffusion

Ng Wai Foong
7 min read · Mar 8

Better control for text-to-image generation

Image by the author. Generated examples taken from HuggingFace.

This tutorial is a step-by-step guide to text-to-image generation with ControlNet conditioning, using Hugging Face's diffusers package.

ControlNet is a neural network structure that controls diffusion models by adding extra conditions. It provides a way to augment Stable Diffusion with conditional inputs such as scribbles, edge maps, segmentation maps, pose keypoints, etc., during text-to-image generation. As a result, the generated image follows the structure of the conditioning input much more closely, a marked improvement over traditional approaches such as image-to-image generation.

In addition, a ControlNet model can be trained on a small dataset using a consumer GPU. The trained model can then be paired with any pre-trained Stable Diffusion model for text-to-image generation.
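To make the workflow concrete, here is a minimal sketch of ControlNet-conditioned generation with diffusers. It assumes diffusers>=0.14.0, torch, and transformers are installed, and uses the `lllyasviel/sd-controlnet-canny` checkpoint (edge-map conditioning) together with Stable Diffusion v1.5 as an example pairing; any other ControlNet checkpoint and base model can be substituted.

```python
def generate(prompt, condition_image,
             controlnet_id="lllyasviel/sd-controlnet-canny",
             base_model_id="runwayml/stable-diffusion-v1-5"):
    """Generate an image from `prompt`, steered by `condition_image`
    (e.g. a Canny edge map) through a ControlNet checkpoint."""
    # Imports are deferred so the sketch can be read without the heavy deps.
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Load the ControlNet weights and attach them to the base pipeline.
    controlnet = ControlNetModel.from_pretrained(
        controlnet_id, torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        base_model_id, controlnet=controlnet, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    # The conditioning image steers the layout; the prompt steers the content.
    result = pipe(prompt, image=condition_image, num_inference_steps=20)
    return result.images[0]
```

The same function works for other conditioning types (depth, pose, segmentation, etc.) by swapping `controlnet_id` and supplying the matching conditioning image.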

The initial release of ControlNet came with the following checkpoints:

Let’s proceed to the next section for the setup and installation.


It is highly recommended to create a new virtual environment before the package installation.
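For example, using Python's built-in venv module (the environment name `controlnet-env` is arbitrary):

```shell
# Create an isolated environment; the name "controlnet-env" is arbitrary
python3 -m venv controlnet-env
```

Activate it with `source controlnet-env/bin/activate` on Linux/macOS, or `controlnet-env\Scripts\activate` on Windows.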


Activate the virtual environment and run the following command to install the stable version of the diffusers package:

pip install diffusers

Note that ControlNet requires diffusers>=0.14.0.
