Personalized text-to-image generation with custom datasets
Previously, I wrote an article on How to Fine-tune SDXL 0.9 using Dreambooth LoRA. Dreambooth is a specialized method that needs only a few images to personalize a model on a subject or style. It works really well for generating images of a single subject or style.
Note that some frameworks do support Dreambooth training with image-caption pair datasets. Refer to the corresponding repositories for more information.
This tutorial covers vanilla text-to-image fine-tuning using LoRA. Training uses a dataset of image-caption pairs, with SDXL 1.0 as the base model. Prefer this method when training a model on multiple subjects and styles.
This tutorial is based on the diffusers package, which does not support image-caption datasets for Dreambooth training. Training has been tested on version 0.19.3. Note that the output LoRA can only be used via the diffusers package and is not compatible with the original implementation (most open-source web UIs use the original implementation).
Based on local experiments, VRAM consumption and training time are as follows:
GeForce RTX 3060 GPU (12GB): consumes about 12.3 GB for training. Training takes about 7 hours 20 minutes for 24 images (100 epochs) and 63 hours for 209 images (100 epochs).
GeForce RTX 4090 GPU (24GB): consumes about 16.8 GB for training. Training takes about 12 hours 30 minutes for 252 images (100 epochs).
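To compare these runs on a common footing, the reported times can be converted to seconds per image per epoch. The snippet below is a rough back-of-the-envelope sketch, not a benchmark: the helper name seconds_per_image_epoch is made up for illustration, and it assumes training time scales linearly with the number of images times the number of epochs.

```python
def seconds_per_image_epoch(hours, minutes, images, epochs):
    """Average wall-clock seconds spent on one image in one epoch.

    Assumes time scales linearly with images * epochs (an approximation).
    """
    total_seconds = hours * 3600 + minutes * 60
    return total_seconds / (images * epochs)


# Figures reported in this article: (hours, minutes, images, epochs).
runs = {
    "RTX 3060, 24 images": (7, 20, 24, 100),
    "RTX 3060, 209 images": (63, 0, 209, 100),
    "RTX 4090, 252 images": (12, 30, 252, 100),
}

for name, args in runs.items():
    print(f"{name}: {seconds_per_image_epoch(*args):.2f} s per image per epoch")
```

Under that assumption, both RTX 3060 runs come out at roughly 11 seconds per image per epoch, while the RTX 4090 run comes out at under 2 seconds. This can be used to roughly estimate training time for a dataset of a different size.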
Let’s proceed to the next section for the installation process.
It is highly recommended to create a new virtual environment before you continue with the installation.
Activate the virtual environment and run the following command to install PyTorch 2 (the index URL selects the CUDA 11.8 build):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
Next, install the latest stable version of the diffusers package as follows:
pip install diffusers
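After both commands finish, it can be useful to confirm which versions ended up in the virtual environment, since this tutorial was tested on diffusers 0.19.3. The following is a minimal sketch using only the standard library (importlib.metadata, available in Python 3.8 and later); the helper name installed_version is an assumption for illustration and is not part of diffusers or PyTorch.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the pip commands above.
for name in ("torch", "torchvision", "diffusers"):
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Run it with the virtual environment activated. If a package is reported as not installed, the corresponding pip command most likely ran in a different environment.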