The image generation capabilities of tools like DALL·E, Imagen, and Stable Diffusion all rely on a key concept called diffusion models. In this post, let’s explore how they work and implement a simple version of the forward diffusion process.

How Diffusion Models Work

At their core, diffusion models operate through two phases:

  1. Forward Diffusion: Gradually corrupt training data with noise
  2. Reverse Diffusion: Learn to reverse this noise-adding process

Forward Diffusion Process

The forward process is a Markov chain that slowly adds Gaussian noise to an image over \(T\) steps. Mathematically, each step is defined as:

$$ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}) $$

The complete process for all timesteps can be expressed as:

$$ q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) $$

Visually, each iteration makes the image progressively noisier until it is indistinguishable from pure Gaussian noise.

Variance Schedules

The noise addition rate is controlled by a variance schedule \(\beta_t\) that determines how aggressively we add noise at each step. Common schedules include:

Schedule   | Characteristics
Linear     | \(\beta_t\) increases linearly
Cosine     | Smoother noise ramp (introduced in improved DDPM, Nichol & Dhariwal 2021)
Quadratic  | Faster initial noise addition

The choice of schedule impacts:

  • Training stability
  • Generation quality
  • Model convergence speed
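
To make the schedules concrete, here is a minimal NumPy sketch of a linear and a cosine schedule. The cosine form follows Nichol & Dhariwal (2021); the offset s=0.008 and the 0.999 clipping bound are the values suggested there, while the remaining constants are illustrative choices rather than fixed requirements.

import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Betas increase linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule: define the cumulative alpha_bar with a squared
    cosine, then recover per-step betas from its successive ratios."""
    steps = np.arange(T + 1)
    alpha_bar = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)

linear_betas = linear_beta_schedule(100)   # same kind of schedule used later in this post
cosine_betas = cosine_beta_schedule(100)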

Closed-Form Sampling

A useful property allows direct sampling at any timestep without iterative steps:

$$ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I}) $$

Where: $$ \bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s) $$

This cumulative product captures how much of the original signal remains after \(t\) steps.

This closed form follows from the reparameterization trick. The key observation is that each \(\mathbf{x}_t\) is a linear transformation of \(\mathbf{x}_{t-1}\) plus Gaussian noise, so by induction:

$$ \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) $$

Code Implementation

Let’s see this in practice with a Python implementation:

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def load_image(path, size=(1334, 819)):
    """Load and normalize image"""
    img = Image.open(path).convert('RGB').resize(size)
    return np.asarray(img) / 255.0

def forward_diffusion(img, timesteps, beta_schedule):
    """Apply forward diffusion process"""
    x = img.copy()
    noisy_images = [x]
    
    for t in range(1, timesteps + 1):
        beta_t = beta_schedule[t-1]
        noise = np.random.normal(0, 1, size=x.shape)
        x = np.sqrt(1 - beta_t)*x + np.sqrt(beta_t)*noise
        noisy_images.append(x)
    
    return noisy_images

# Example usage
T = 100  # Total timesteps
beta_schedule = np.linspace(1e-4, 0.06, T)  # Linear schedule
img = load_image("forest.jpg")  # Your image here
noisy_imgs = forward_diffusion(img, T, beta_schedule)

# Visualize diffusion steps
def plot_diffusion(noisy_images, steps_to_show):
    plt.figure(figsize=(15,5))
    for i, t in enumerate(steps_to_show):
        plt.subplot(1, len(steps_to_show), i+1)
        plt.imshow(np.clip(noisy_images[t], 0, 1))
        plt.title(f"Step {t}"), plt.axis('off')
    plt.show()

plot_diffusion(noisy_imgs, [0, 10, 50, 100])
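
The closed-form expression from earlier lets us jump straight to any timestep without running the loop. A minimal sketch, reusing img and beta_schedule from the example above:

def sample_xt_closed_form(x0, t, beta_schedule):
    """Sample x_t directly from q(x_t | x_0)."""
    alpha_bar_t = np.prod(1 - beta_schedule[:t])   # cumulative product over the first t steps
    noise = np.random.normal(0, 1, size=x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * noise

# One-shot sample at t = 50, statistically equivalent to running 50 iterative steps
x_50 = sample_xt_closed_form(img, 50, beta_schedule)
plt.imshow(np.clip(x_50, 0, 1))
plt.axis('off')
plt.show()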

Reverse Diffusion Process

To reverse the diffusion, we train a neural network to predict the noise that was added at each step.

During training we know the noise \(\epsilon\) exactly, because we generated it ourselves; the model has to predict it from the noisy input alone. That is why the loss function compares the actual noise \(\epsilon\) with the predicted noise \(\epsilon_\theta(\mathbf{x}_t, t)\). The full reverse process is modeled as a Markov chain that starts from pure Gaussian noise \(\mathbf{x}_T\) and factorizes as:

$$ p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) $$

Each reverse step is a Gaussian distribution:

$$ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t)) $$

The network learns to predict the mean (and sometimes the variance) of the denoised image at each step.
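
To make this concrete, here is a sketch of a single reverse step under the common DDPM parameterization, where the network predicts the noise and the variance is fixed to \(\beta_t\) (learned variances are also possible). It assumes a noise-prediction model like the one sketched in the next section, plus precomputed betas and alpha_bars tensors:

import torch

@torch.no_grad()
def reverse_step(model, x_t, t, betas, alpha_bars):
    """One step of p_theta(x_{t-1} | x_t) with the variance fixed to beta_t."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alpha_bars[t]

    # The model predicts the noise; from it we recover the mean of x_{t-1}
    t_batch = torch.full((x_t.shape[0],), float(t))   # timestep encoding is up to the model
    eps_pred = model(x_t, t_batch)
    mean = (x_t - beta_t / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)

    if t > 0:
        return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)
    return mean   # no noise is added at the final step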

The Loss Function

The loss function is:

$$ \mathcal{L}_{\text{simple}} = \mathbb{E}_{\mathbf{x}_0, \epsilon, t} \left[ \left\| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t) \right\|^2 \right] $$

This objective is closely related to denoising score matching, and minimizing it teaches the model how to reverse the diffusion process. In practice, training proceeds as follows:

  1. Pick a random image \(\mathbf{x}_0\) from the dataset.
  2. Sample a timestep \(t \sim \text{Uniform}(1, T)\).
  3. Generate noisy data using: $$ \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, $$ where \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\).
  4. Feed \(\mathbf{x}_t\) and \(t\) into the model to get the predicted noise: $$ \hat{\epsilon} = \epsilon_\theta(\mathbf{x}_t, t) $$
  5. Compare the prediction with the true noise \(\epsilon\) using the squared error: $$ \| \epsilon - \hat{\epsilon} \|^2 $$
  6. Average this over many samples to estimate the expectation in the loss.

Reverse Model in PyTorch

Here’s an example of a small convolutional neural network that predicts the noise:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDenoiseModel(nn.Module):
    def __init__(self, img_channels=3, base_channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(img_channels + 1, base_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(base_channels, base_channels, 3, padding=1)
        self.conv3 = nn.Conv2d(base_channels, img_channels, 3, padding=1)

    def forward(self, x, t):
        t_embed = t.view(-1, 1, 1, 1).expand(-1, 1, x.shape[2], x.shape[3])
        x_in = torch.cat([x, t_embed], dim=1)
        h = F.relu(self.conv1(x_in))
        h = F.relu(self.conv2(h))
        return self.conv3(h)

model = SimpleDenoiseModel()
x_t = torch.randn(8, 3, 64, 64)
t = torch.randint(0, 100, (8,)).float() / 100.0
pred_noise = model(x_t, t)
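
Putting the pieces together, a minimal training step that follows the six-step recipe above might look like this. It is a sketch that reuses the SimpleDenoiseModel instance defined above and assumes image batches are already normalized; the schedule constants are illustrative:

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1 - betas, dim=0)              # \bar{alpha}_t for each timestep

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(x0):
    """One optimization step of the simplified noise-prediction loss."""
    t = torch.randint(0, T, (x0.shape[0],))               # step 2: sample timesteps
    eps = torch.randn_like(x0)                            # step 3: true noise
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps   # closed-form q(x_t | x_0)

    eps_pred = model(x_t, t.float() / T)                  # step 4: predict the noise
    loss = F.mse_loss(eps_pred, eps)                      # step 5: squared error

    optimizer.zero_grad()
    loss.backward()                                       # step 6 emerges by averaging over batches
    optimizer.step()
    return loss.item()

# Example with random tensors standing in for a real image batch
print(training_step(torch.randn(8, 3, 64, 64)))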

Larger models use U-Net structures, attention layers, and learnable timestep embeddings to generate very high-quality images.

Model Architecture

Diffusion models use specialized architectures to predict and remove noise. Here, we briefly cover three key architectural building blocks:

U-Net Architecture

The U-Net is a convolutional encoder-decoder with skip connections that preserves spatial detail across resolutions. In diffusion models, it predicts the noise \(\epsilon_\theta(x_t, t)\) at each timestep. A simplified formulation of the skip connection is:

$$ \hat{x}_l = \text{Up}(x_{l+1}) + x_l $$

where \(x_l\) is a feature map at layer \(l\) and \(\text{Up}(\cdot)\) denotes the up-sampling operation. This structure allows U-Net to capture both global context and fine-grained details.
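
As a minimal sketch of this pattern in PyTorch (reusing the imports from the previous section): one downsampling and one upsampling stage joined by a skip connection. Here the skip features are concatenated, as in the original U-Net, rather than added, and timestep conditioning is omitted for brevity:

class TinyUNet(nn.Module):
    """Two-resolution U-Net: encode, downsample, upsample, decode, with a skip."""
    def __init__(self, channels=3, width=64):
        super().__init__()
        self.enc = nn.Conv2d(channels, width, 3, padding=1)
        self.down = nn.Conv2d(width, width * 2, 3, stride=2, padding=1)
        self.up = nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1)
        self.dec = nn.Conv2d(width * 2, channels, 3, padding=1)   # width*2 because of the concat

    def forward(self, x):
        h1 = F.relu(self.enc(x))                      # high-resolution features (x_l)
        h2 = F.relu(self.down(h1))                    # low-resolution features (x_{l+1})
        u = F.relu(self.up(h2))                       # Up(x_{l+1})
        return self.dec(torch.cat([u, h1], dim=1))    # skip connection merges both resolutions

out = TinyUNet()(torch.randn(1, 3, 64, 64))           # output keeps the input shape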

Diffusion Transformer (DiT)

Diffusion Transformers integrate transformer blocks to capture long-range dependencies in image data. A transformer block typically applies self-attention and a multi-layer perceptron (MLP):

$$ z' = \text{Attention}(z) + z, \quad z'' = \text{MLP}(z') + z' $$

The self-attention operation is defined as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, $$

where \(Q\), \(K\), and \(V\) are the query, key, and value matrices extracted from the input \(z\), and \(d_k\) is the dimensionality of the key vectors. DiT processes latent patches and incorporates timestep embeddings to guide the denoising process.
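
A sketch of one such block in PyTorch, following the residual pattern of the equations above (with the common pre-norm layout). The real DiT also injects timestep and class information through adaptive layer norm; that conditioning is omitted here to keep the sketch short:

class TransformerBlock(nn.Module):
    """Self-attention and MLP, each wrapped in a residual connection."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = self.attn(h, h, h)[0] + z                 # z' = Attention(z) + z
        return self.mlp(self.norm2(z)) + z            # z'' = MLP(z') + z'

tokens = torch.randn(8, 64, 256)                      # a batch of 8 images as 64 latent patches
out = TransformerBlock()(tokens)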

Attention Mechanisms

Attention mechanisms enhance the diffusion model by focusing on relevant spatial regions. The general form of the attention mechanism is:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V, $$

where the scaling factor \(\sqrt{d}\) (with \(d\) being the feature dimension) stabilizes gradients. This mechanism allows the network to dynamically balance local and global features during noise prediction.
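
Written out explicitly, scaled dot-product attention is only a few tensor operations; a sketch with arbitrary small dimensions:

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, computed over the last two dimensions."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5       # similarity of every query to every key
    weights = torch.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ V

Q = torch.randn(8, 64, 32)                            # (batch, tokens, feature dim)
K = torch.randn(8, 64, 32)
V = torch.randn(8, 64, 32)
out = scaled_dot_product_attention(Q, K, V)           # shape (8, 64, 32)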

Comparison of U-Net, Diffusion Transformer, and U-Net with Attention:

Feature     | U-Net          | Diffusion Transformer     | U-Net + Attention
Backbone    | CNN            | Transformer               | CNN + Self-Attention
Strengths   | Local features | Global context            | Hybrid of both
Efficiency  | High           | Low                       | Medium
Scalability | Moderate       | High                      | Moderate
Use case    | 2D images      | Large models, multi-modal | General-purpose

References

  • Aditya Ramesh et al. “Hierarchical Text-Conditional Image Generation with CLIP Latents.” arXiv:2204.06125 (2022).
  • Alex Nichol & Prafulla Dhariwal. “Improved denoising diffusion probabilistic models.” arXiv:2102.09672 (2021).
  • Alex Nichol, Prafulla Dhariwal & Aditya Ramesh et al. “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.” ICML 2022.
  • Chitwan Saharia & William Chan et al. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.” arXiv:2205.11487 (2022).
  • Jascha Sohl-Dickstein et al. “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” ICML 2015.
  • Jiaming Song et al. “Denoising diffusion implicit models.” arXiv:2010.02502 (2020).
  • Jonathan Ho & Tim Salimans. “Classifier-Free Diffusion Guidance.” NeurIPS 2021 Workshop.
  • Jonathan Ho et al. “Denoising diffusion probabilistic models.” arXiv:2006.11239 (2020).
  • Max Welling & Yee Whye Teh. “Bayesian learning via stochastic gradient langevin dynamics.” ICML 2011.
  • Olaf Ronneberger et al. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” MICCAI 2015.
  • Prafulla Dhariwal & Alex Nichol. “Diffusion Models Beat GANs on Image Synthesis.” arXiv:2105.05233 (2021).
  • Lilian Weng. “What are diffusion models?” Lil’Log (Jul 2021).
  • William Peebles & Jun-Yan Zhu. “Scalable diffusion models with transformers.” ICCV 2023.
  • Yang Song & Stefano Ermon. “Generative modeling by estimating gradients of the data distribution.” NeurIPS 2019.
  • Yang Song & Stefano Ermon. “Improved techniques for training score-based generative models.” NeurIPS 2020.
  • Yang Song et al. “Score-Based Generative Modeling through Stochastic Differential Equations.” ICLR 2021.
  • Lvmin Zhang et al. “Adding Conditional Control to Text-to-Image Diffusion Models.” arXiv:2302.05543 (2023).