October 12, 2024
Diffusion and flow matching models are the state of the art for generating high-quality, diverse data in many domains (e.g., images, audio, video, and graphs). They are appealing for two main reasons:
Despite their popularity and wide applicability, diffusion and flow matching models remain challenging for newcomers to fully grasp, particularly in terms of their intricate relationships. The literature presents diffusion models from various angles, such as denoising diffusion probabilistic models (DDPM) and denoising score matching. Flow matching, on the other hand, is often viewed as a distinct category of generative models, drawing inspiration from normalizing flows. This blog post aims to elucidate a unified perspective on diffusion and flow matching models, offering a clearer, more concise, and comprehensive understanding of both frameworks. Additionally, we'll explore how to flexibly transition between these approaches, providing a more holistic view of these powerful generative techniques.
Both diffusion and flow matching fall under continuous normalizing flow based generative models [Lipman et al., 2023]. The main idea is to continuously morph one distribution
Both diffusion and flow matching can be characterized by a continuous time-dependent probability distribution
Flow-matching uses the following setting:
Diffusion models generally use the following setting:
Figure 1: [Left] Evolution of
Flow matching is inspired by normalizing flow based generative models [Papamakarios et al., 2021]. There the idea is to estimate an invertible map
Denoting
Given
Flow matching (FM) aims at estimating the velocity field
A common choice of
The optimal solution of Eq. (6), assuming a sufficiently expressive function class for
Eq. (9) suggests that the only data-dependent quantities that the optimal vector field depends upon are the conditional means
Figure 2: Trajectory of the samples following the vector field
Here's a simple Python implementation of the Flow Matching objective using PyTorch which was used to generate Figure 2:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

# Hyperparameters
batch_size = 512
epochs = 200
learning_rate = 1e-3

# Generate a dataset of two Gaussian blobs centered at (4, 4) and (4, -4)
X, _ = make_blobs(n_samples=10000, centers=[(4,4), (4,-4)], cluster_std=0.5, random_state=42)
X = torch.tensor(X, dtype=torch.float32)

# Simple MLP model for the vector field; the input is (x, t) concatenated
class VectorFieldMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 1, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

    def forward(self, x, t):
        x = torch.cat([x, t], dim=-1)
        return self.net(x)

# Each schedule function returns the pair (value, time derivative)
def compute_alpha_t(t):
    return 1 - t, -1

def compute_sigma_t(t):
    return t, 1

# Training function: one pass over the dataloader, returns the mean batch loss
def train(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for x_0 in tqdm(dataloader):
        x_0 = x_0[0].to(device)
        optimizer.zero_grad()
        # Generate noise
        x_1 = torch.randn_like(x_0)
        # Sample t uniformly
        t = torch.rand(x_0.shape[0], device=device)
        t = t.unsqueeze(-1)
        alpha_t, d_alpha_t = compute_alpha_t(t)
        sigma_t, d_sigma_t = compute_sigma_t(t)
        # Interpolate between x_0 and x_1
        x_t = alpha_t * x_0 + sigma_t * x_1
        # Compute vector field
        v_theta = model(x_t, t)
        # Compute loss: regress onto the conditional velocity d_alpha_t * x_0 + d_sigma_t * x_1
        loss = torch.mean(torch.sum((v_theta - (d_alpha_t * x_0 + d_sigma_t * x_1))**2, dim=-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

# Initialize model, optimizer, and dataloader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = VectorFieldMLP().to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
dataloader = DataLoader(TensorDataset(X), batch_size=batch_size, shuffle=True)

# Training loop
losses = []
for epoch in range(epochs):
    loss = train(model, dataloader, optimizer, device)
    losses.append(loss)
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss:.4f}")

def static_sample(model, num_samples, device, source_samples, num_steps=100):
    # Convert source samples to tensor and move to device
    x = torch.tensor(source_samples, dtype=torch.float32).to(device)
    # Time steps
    dt = 1 / num_steps
    with torch.no_grad():
        # Euler integration backwards in time, from t = 1 (noise) towards t = 0 (data)
        for frame in range(num_steps):
            t = torch.ones(num_samples, device=device) * (1 - frame * dt)
            t = t.unsqueeze(-1)
            v = model(x, t)
            x = x - v * dt
    return x.cpu().numpy()
```
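As a quick sanity check on the Euler sampler, the sketch below replaces the trained network with a closed-form velocity field. It assumes one-dimensional Gaussian data x_0 ~ N(m, s^2), noise x_1 ~ N(0, 1), and the linear schedule alpha_t = 1 - t, sigma_t = t from the code above. Under those assumptions the conditional means are available analytically. The names `gaussian_velocity` and `euler_sample` are hypothetical helpers introduced here for illustration; they are not part of the implementation used for Figure 2.

```python
def gaussian_velocity(x, t, m=4.0, s=0.5):
    """Closed-form velocity for x_0 ~ N(m, s^2), x_1 ~ N(0, 1),
    assuming alpha_t = 1 - t and sigma_t = t."""
    alpha, sigma = 1.0 - t, t
    var_t = alpha**2 * s**2 + sigma**2          # variance of x_t
    resid = x - alpha * m                       # x minus the mean of x_t
    mean_x0 = m + alpha * s**2 * resid / var_t  # E[x_0 | x_t = x]
    mean_x1 = sigma * resid / var_t             # E[x_1 | x_t = x]
    # d_alpha_t * E[x_0 | x_t] + d_sigma_t * E[x_1 | x_t], with d_alpha_t = -1, d_sigma_t = 1
    return -mean_x0 + mean_x1


def euler_sample(x, velocity, num_steps=1000):
    """Integrate dx/dt = v(x, t) backwards from t = 1 to t = 0 with Euler steps."""
    dt = 1.0 / num_steps
    for frame in range(num_steps):
        t = 1.0 - frame * dt
        x = x - velocity(x, t) * dt
    return x


# A noise sample at +1 standard deviation should land near m + s = 4.5.
x_final = euler_sample(1.0, gaussian_velocity)
print(x_final)
```

For Gaussian marginals the exact ODE solution maps a noise value z to m + s * z, so the printed value should be close to 4.5, up to Euler discretization error.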
In diffusion models, the continuous morphing of
Different choices of the noise schedule
The VP diffusion process was proposed in the Denoising Diffusion Probabilistic Models (DDPM) paper [Ho et al., 2020]. It is characterized by selecting
VE diffusion process was proposed for Score Matching with Langevin Dynamics (SMLD) in [Song et al., 2019]. It is characterized by setting
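To make the two families concrete, here is a minimal sketch of one commonly used continuous-time parametrization of each. It may differ from the exact choice made in this post. The VP schedule assumes a linear beta(t) and satisfies alpha_t^2 + sigma_t^2 = 1; the VE schedule assumes alpha_t = 1 with a geometrically growing sigma_t. The function names and default constants (`beta_min`, `beta_max`, `sigma_min`, `sigma_max`) are illustrative assumptions. Each function returns a (value, time derivative) pair, mirroring `compute_alpha_t` and `compute_sigma_t` above.

```python
import math


def vp_alpha(t, beta_min=0.1, beta_max=20.0):
    """VP: alpha_t = exp(-0.5 * integral of beta), assuming linear beta(t)."""
    beta_t = beta_min + t * (beta_max - beta_min)
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    alpha = math.exp(-0.5 * int_beta)
    return alpha, -0.5 * beta_t * alpha


def vp_sigma(t, beta_min=0.1, beta_max=20.0):
    """VP: sigma_t = sqrt(1 - alpha_t^2); only valid for t > 0."""
    alpha, d_alpha = vp_alpha(t, beta_min, beta_max)
    sigma = math.sqrt(1.0 - alpha**2)
    return sigma, -alpha * d_alpha / sigma


def ve_alpha(t):
    """VE: the data is never rescaled, so alpha_t = 1 with zero derivative."""
    return 1.0, 0.0


def ve_sigma(t, sigma_min=0.01, sigma_max=50.0):
    """VE: sigma_t grows geometrically from sigma_min to sigma_max."""
    ratio = sigma_max / sigma_min
    sigma = sigma_min * ratio**t
    return sigma, sigma * math.log(ratio)


a, _ = vp_alpha(0.5)
s, _ = vp_sigma(0.5)
print(a**2 + s**2)  # variance preserving: the two squares sum to 1 (up to rounding)
```

Swapping these in for the linear schedule only changes the interpolation coefficients; the training loop itself would stay the same.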
The probability path
Score matching [Song et al., 2019] aims at estimating the score of the data distribution
The advantage of this formulation is that the solutions of Eq. (17) and Eq. (16) are the same; however, one is free to choose the conditioning information
Hence the score matching objective is given by:
The optimal solution of Eq. (19) is therefore given by:
Instead of the score, one can also estimate the denoiser, as proposed in DDPM [Ho et al., 2020]:
The optimal solution of the above objective is given by:
Recall the optimal velocity field:
We saw two different ways to generate samples using continuous normalizing flows:
This observation motivates the purpose of this section: to flexibly sample from the SDE or the ODE, using either of the two quantities
Figure 3: Trajectory of the samples following the reverse SDE with
Here's a Python implementation that demonstrates how to convert a vector field into a score and use that score for reverse-SDE-based sampling. It was used to generate the animation in Figure 3 using the vector field in Figure 2.
```python
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
import torch
import math

# compute_alpha_t and compute_sigma_t are the schedule functions defined in the
# flow matching code above; each returns (value, time derivative).

def get_score_from_velocity(velocity, x, t):
    t = t.unsqueeze(-1)
    alpha_t, d_alpha_t = compute_alpha_t(t)
    sigma_t, d_sigma_t = compute_sigma_t(t)
    alpha_ratio = d_alpha_t / alpha_t
    lambda_t = d_sigma_t - alpha_ratio * sigma_t
    score = (alpha_ratio * x - velocity) / (sigma_t * lambda_t)
    return score

def get_drift_and_diffusion(x, t):
    t = t.unsqueeze(-1)
    alpha_t, d_alpha_t = compute_alpha_t(t)
    sigma_t, d_sigma_t = compute_sigma_t(t)
    alpha_ratio = d_alpha_t / alpha_t
    drift = alpha_ratio * x
    lambda_t = d_sigma_t - alpha_ratio * sigma_t
    # Note: this is the squared diffusion coefficient; its square root scales the noise below
    diffusion = 2 * sigma_t * lambda_t
    return drift, diffusion

def euler_step(x, mean_x, t, model, dt, sde=False):
    w_cur = torch.randn(x.size()).to(x)
    t = torch.ones(x.size(0)).to(x) * t
    dw = w_cur * math.sqrt(dt)
    drift, diffusion = get_drift_and_diffusion(x, t)
    if not sde:
        # Probability flow ODE: deterministic step
        velocity = drift - 0.5 * diffusion * get_score_from_velocity(model(x, t.unsqueeze(-1)), x, t)
        mean_x = x - velocity * dt
        x = mean_x
    else:
        # Reverse SDE: deterministic drift step plus injected noise
        reverse_drift = drift - diffusion * get_score_from_velocity(model(x, t.unsqueeze(-1)), x, t)
        mean_x = x - reverse_drift * dt
        x = mean_x - torch.sqrt(diffusion) * dw
    return x, mean_x

@torch.no_grad()
def sample_sde(source_samples, model, device, steps=100, sde=False):
    model.eval()
    x = torch.tensor(source_samples, dtype=torch.float32).to(device)
    mean_x = x.clone()
    dt = 1.0 / steps
    # Start at i = 1 so that t < 1 and alpha_t = 1 - t is never zero
    for i in range(1, steps):
        t = torch.ones(x.size(0)).to(x) * (1 - i * dt)
        x, mean_x = euler_step(x, mean_x, t, model, dt, sde=sde)
    return x
```
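The velocity-to-score conversion above can be checked in a setting where both quantities are known in closed form. The sketch below assumes one-dimensional Gaussian data x_0 ~ N(m, s^2), standard normal noise, and the linear schedule alpha_t = 1 - t, sigma_t = t. In that case x_t is Gaussian with mean alpha_t * m and variance alpha_t^2 * s^2 + sigma_t^2, so its score is available analytically. The helper names are hypothetical and use plain floats rather than tensors.

```python
def score_from_velocity(v, x, t):
    """Scalar version of the conversion, assuming alpha_t = 1 - t and sigma_t = t."""
    alpha, d_alpha = 1.0 - t, -1.0
    sigma, d_sigma = t, 1.0
    alpha_ratio = d_alpha / alpha
    lambda_t = d_sigma - alpha_ratio * sigma
    return (alpha_ratio * x - v) / (sigma * lambda_t)


def gaussian_velocity_and_score(x, t, m=4.0, s=0.5):
    """Closed-form velocity and score of x_t for x_0 ~ N(m, s^2), x_1 ~ N(0, 1)."""
    alpha, sigma = 1.0 - t, t
    var_t = alpha**2 * s**2 + sigma**2
    resid = x - alpha * m
    # Velocity: -E[x_0 | x_t] + E[x_1 | x_t] for the linear schedule
    velocity = -(m + alpha * s**2 * resid / var_t) + sigma * resid / var_t
    # Score of N(alpha * m, var_t) evaluated at x
    score = -resid / var_t
    return velocity, score


# Compare the converted score with the analytic one at a few (x, t) pairs.
max_err = 0.0
for t in (0.1, 0.5, 0.9):
    for x in (-1.0, 0.3, 5.0):
        v, s_true = gaussian_velocity_and_score(x, t)
        max_err = max(max_err, abs(score_from_velocity(v, x, t) - s_true))
print(max_err)  # should be at floating-point rounding level
```

The check is restricted to 0 < t < 1 because the conversion divides by alpha_t and sigma_t, which vanish at the endpoints of the linear schedule.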
[Eyring et al., 2024] Eyring, L., Klein, D., Uscidda, T., Palla, G., Kilbertus, N., Akata, Z., & Theis, F. (2024). Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation. arXiv preprint arXiv:2311.15100.
[HuggingFace Blogs 2022] https://huggingface.co/blog/annotated-diffusion
[Ho et al., 2020] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.
[Lipman et al., 2023] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
[Chen et al., 2018] Chen, R. T., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in neural information processing systems, 31.
[Albergo et al., 2023] Albergo, M. S., & Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571.
[Liu et al., 2023] Liu, X., Gong, C., & Liu, Q. (2023). Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
[Papamakarios et al., 2021] Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57), 1-64.
[Ma et al., 2024] Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., & Xie, S. (2024). SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740.
[Karras et al., 2022] Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35, 26565-26577.