Introduction to Stable Diffusion

Mayur Jain
3 min read · Feb 11, 2024


A new world of image generation

Stable Diffusion

A few years ago, generating images like the ones above would have been impossible, or at best extremely tedious. With deep learning models like Stable Diffusion, Imagen, and many more, it now takes a simple text prompt to produce them.

We’ll walk through the basic architecture of Stable Diffusion and see how each module plays a role in generating these images.

Stable Diffusion is a text-to-image model: we pass in a text prompt, and it generates a matching image. It consists of three key modules:

  • Text Encoder
  • Diffusion Model
  • Decoder

Flow Diagram
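To see how these three pieces ship together in practice, the Hugging Face diffusers library (an alternative to the KerasCV example at the end of this post) exposes them as a single pipeline. This is a minimal sketch; the checkpoint name below is one common choice, not the only one:

from diffusers import StableDiffusionPipeline

# The pipeline bundles all three modules: a CLIP text encoder,
# a U-Net diffusion model, and a VAE whose decoder produces the image.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

print(type(pipe.text_encoder).__name__)  # CLIPTextModel
print(type(pipe.unet).__name__)          # UNet2DConditionModel
print(type(pipe.vae).__name__)           # AutoencoderKL

image = pipe("photograph of an astronaut riding a horse").images[0]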

What is a text encoder?

A text encoder is a language model that takes in a text prompt and turns it into an embedding. Stable Diffusion uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts.

The text encoder constrains the output image by conditioning the diffusion model on the prompt given by the user.
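As a standalone illustration (the KerasCV example at the end of this post loads the encoder internally), the same frozen CLIP ViT-L/14 text encoder can be pulled from the Hugging Face transformers library. This is a minimal sketch, not the pipeline’s actual wiring:

from transformers import CLIPTokenizer, CLIPTextModel

# The CLIP ViT-L/14 checkpoint whose text tower Stable Diffusion uses.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize the prompt, pad it to CLIP's fixed length of 77 tokens,
# and run it through the frozen encoder.
tokens = tokenizer(
    "photograph of an astronaut riding a horse",
    padding="max_length", max_length=77, return_tensors="pt",
)
embedding = text_encoder(**tokens).last_hidden_state
print(embedding.shape)  # torch.Size([1, 77, 768]): one vector per token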

Text Encoder

What is a diffusion model?

Diffusion is the process of converting a structured signal (an image) into noise step by step. By simulating diffusion, we can generate noisy versions of our training images and train a neural network to denoise them, i.e., to recover a valid image from a noisy one.

With a trained model, we can then run the process in reverse: starting from a noisy input, the network gradually produces a valid image.
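Here is a minimal NumPy sketch of the forward (noising) direction. The alpha_bar values are illustrative stand-ins for a real noise schedule:

import numpy as np

def add_noise(x0, alpha_bar):
    # Blend the clean image x0 with Gaussian noise. alpha_bar near 1
    # keeps most of the signal; alpha_bar near 0 leaves almost pure noise.
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x0 = np.random.rand(64, 64, 3)           # stand-in for a training image
slightly_noisy = add_noise(x0, 0.99)     # early diffusion step
almost_pure_noise = add_noise(x0, 0.01)  # late diffusion step

The denoising network is trained on exactly these pairs: given the noisy version and the step, predict the noise that was added.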

Diffusion Model

In Stable Diffusion, the diffusion model does not operate on full-resolution pixels. Instead, a neural network (a U-Net), stepped by a scheduling algorithm, iteratively denoises a small 64x64 latent representation until it becomes the latent of a valid image matching the text prompt.

During both training and generation, the output of the text encoder is fed into the U-Net through cross-attention layers, which conditions the denoising on the text input.
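In simplified, runnable form, the loop looks like this. toy_unet is a stand-in for the real U-Net; the point is that the text embedding is an extra input at every denoising step:

import numpy as np

def toy_unet(latent, t, text_embedding):
    # A real U-Net would predict the noise in `latent` at timestep t,
    # attending to `text_embedding` through its cross-attention layers.
    return 0.1 * latent

text_embedding = np.random.randn(77, 768)  # output of the text encoder
latent = np.random.randn(64, 64, 4)        # start from pure noise
for t in reversed(range(50)):              # scheduler timesteps
    predicted_noise = toy_unet(latent, t, text_embedding)
    latent = latent - predicted_noise      # simplified scheduler update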

What is a decoder?

The diffusion model outputs a denoised latent, not a finished picture, so we cannot use its output directly. We pass the latent through a decoder network that turns it into a high-quality, full-resolution image, much as super-resolution networks like the Efficient Sub-Pixel CNN upscale low-resolution inputs. In Stable Diffusion, this decoder is the decoder half of a VAE (variational autoencoder).
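As a rough illustration with the diffusers library, a standalone VAE can decode a 64x64x4 latent into a 512x512 image. The checkpoint name and the 0.18215 scaling factor below are the ones commonly used with Stable Diffusion v1 models, stated here as assumptions:

import torch
from diffusers import AutoencoderKL

# A VAE trained for use with Stable Diffusion v1 checkpoints.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

latents = torch.randn(1, 4, 64, 64)  # stand-in for the diffusion output
with torch.no_grad():
    # Undo the latent scaling used during training, then decode.
    image = vae.decode(latents / 0.18215).sample

print(image.shape)  # torch.Size([1, 3, 512, 512])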

Stable Diffusion with KerasCV

import keras_cv
import matplotlib.pyplot as plt

# Load the pretrained Stable Diffusion pipeline bundled with KerasCV.
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

# Generate one 512x512 image from a text prompt.
images = model.text_to_image(
    "photograph of an astronaut riding a horse", batch_size=1
)

def plot_images(images):
    plt.figure(figsize=(8, 8))
    for i in range(len(images)):
        plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")
    plt.show()

plot_images(images)
Astronaut riding a horse

Reference

https://jalammar.github.io/illustrated-stable-diffusion/
