Introduction to Stable Diffusion
A new world of image generation
A few years ago, generating images like the ones above would have been impossible, or at best extremely tedious. But with deep learning models like Stable Diffusion and Imagen, built on components such as CLIP, a simple prompt is all it takes to generate such an image.
We’ll walk through the basic architecture of Stable Diffusion and see how each module plays a role in generating these images.
Stable Diffusion is a text-to-image model: we pass in a text prompt describing the image to be generated. It consists of three key modules:
- Text Encoder
- Diffusion Model
- Decoder
[Flow diagram: text prompt → text encoder → diffusion model → decoder → image]
What is a text encoder?
A text encoder is a language model that takes in a text prompt and turns it into an embedding. For stable diffusion, a frozen CLIP ViT-L/14 text encoder is used to condition the model on text prompts.
A text encoder helps constrain the output image by conditioning the diffusion model on the prompt given by the user.
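To make the idea concrete, here is a toy sketch of what a text encoder does: tokenize a prompt and map each token to an embedding vector. The vocabulary, dimensions, and random embedding table below are illustrative assumptions, not the real CLIP ViT-L/14 encoder (which uses a 77-token context and 768-dimensional embeddings).

```python
import numpy as np

# Toy illustration (NOT the real CLIP encoder): a text encoder maps a prompt
# to a fixed-size sequence of embedding vectors that conditions the diffusion model.
rng = np.random.default_rng(0)

vocab = {"<pad>": 0, "photograph": 1, "of": 2, "an": 3, "astronaut": 4,
         "riding": 5, "a": 6, "horse": 7}
embed_dim = 8      # illustrative; CLIP ViT-L/14 uses 768
max_length = 10    # illustrative; CLIP uses 77 tokens
embedding_table = rng.normal(size=(len(vocab), embed_dim))

def encode(prompt: str) -> np.ndarray:
    """Tokenize by whitespace, pad to max_length, and look up embeddings."""
    ids = [vocab[w] for w in prompt.lower().split()]
    ids += [vocab["<pad>"]] * (max_length - len(ids))
    return embedding_table[ids]  # shape: (max_length, embed_dim)

context = encode("photograph of an astronaut riding a horse")
print(context.shape)  # (10, 8)
```

The diffusion model then receives this `(tokens, embed_dim)` array as its conditioning signal alongside the noisy latent.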
What is a diffusion model?
Diffusion is the process of converting a structured signal (an image) into noise, step by step. By simulating this forward process, we can generate noisy versions of our training images and train a neural network to denoise them, i.e., map a noisy image back toward a valid one.
With a trained model, we can then simulate the reverse diffusion process: starting from noise, the network iteratively recovers a valid image.
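The forward (noising) process can be sketched in a few lines. This follows the standard DDPM formulation, where a clean image x0 can be jumped directly to its noisy version at step t; the linear schedule below is illustrative, not Stable Diffusion's actual schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

def add_noise(x0: np.ndarray, t: int) -> np.ndarray:
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.uniform(-1.0, 1.0, size=(64, 64, 3))  # a fake "image" scaled to [-1, 1]
x_early = add_noise(x0, t=10)    # still close to the original
x_late = add_noise(x0, t=999)    # almost pure Gaussian noise
```

Training pairs a noisy `x_t` (plus the step index and text conditioning) with the noise that was added, and the network learns to predict that noise so the process can be reversed.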
In Stable Diffusion, the diffusion model operates on a small latent "patch" (64x64) rather than full-resolution pixels: a U-Net, stepped by a noise-scheduling algorithm, iteratively denoises a random latent into one that matches the text prompt.
During training, the output of the text encoder is fed into the U-Net through cross-attention layers, conditioning the model on the text input.
What is a decoder?
To generate a high-quality image, we cannot output the 64x64 latent from the diffusion model directly. Instead, we pass it through a decoder network that upsamples it into the final high-resolution image, similar in spirit to super-resolution networks such as the Efficient Sub-Pixel CNN. In Stable Diffusion, this decoder is the decoder half of a VAE (variational autoencoder), a model family well known for image generation.
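The core upsampling idea behind the sub-pixel approach mentioned above can be shown with a toy "pixel shuffle": rearranging channel depth into spatial resolution. This is an illustrative sketch only; Stable Diffusion's VAE decoder uses learned convolutional layers, not a bare pixel shuffle.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange an (H, W, C*r*r) tensor into (H*r, W*r, C)."""
    h, w, c = x.shape
    out_c = c // (r * r)
    x = x.reshape(h, w, r, r, out_c)
    x = x.transpose(0, 2, 1, 3, 4)  # interleave the r factors spatially
    return x.reshape(h * r, w * r, out_c)

# A fake 64x64 latent with 48 channels becomes a 256x256 RGB-shaped array.
latent = np.random.default_rng(0).normal(size=(64, 64, 48))
image = pixel_shuffle(latent, r=4)
print(image.shape)  # (256, 256, 3)
```

A real decoder stacks convolutions around upsampling steps like this, learning to fill in plausible high-frequency detail as it expands the latent.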
Stable Diffusion with KerasCV
import keras_cv
import matplotlib.pyplot as plt

# Build the Stable Diffusion pipeline (pretrained weights download on first use).
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

# Generate one 512x512 image from a text prompt.
images = model.text_to_image("photograph of an astronaut riding a horse", batch_size=1)

def plot_images(images):
    plt.figure(figsize=(8, 8))
    for i in range(len(images)):
        plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")

plot_images(images)
plt.show()