Unified Perspective on Diffusion and Flow Matching

October 12, 2024

Density Transport Animation

Diffusion and flow matching models are state-of-the-art generative models for producing high-quality and diverse data in many domains (e.g., images, audio, video, graphs). They are appealing for two main reasons:

  • They have simple least-squares training objectives, in contrast to the min-max optimization of GANs, which makes them easier to train and scale to large datasets and models.
  • They can generate high-quality and diverse data samples.

Despite their popularity and wide applicability, diffusion and flow matching models remain challenging for newcomers to fully grasp, particularly in terms of their intricate relationships. The literature presents diffusion models from various angles, such as denoising diffusion probabilistic models (DDPM) and denoising score matching. Flow matching, on the other hand, is often viewed as a distinct category of generative models, drawing inspiration from normalizing flows. This blog post aims to elucidate a unified perspective on diffusion and flow matching models, offering a clearer, more concise, and comprehensive understanding of both frameworks. Additionally, we'll explore how to flexibly transition between these approaches, providing a more holistic view of these powerful generative techniques.

1. The Density Transport Perspective

Both diffusion and flow matching fall under continuous normalizing flow-based generative models [Lipman et al., 2023]. The main idea is to continuously morph one distribution $p_0$ into another $p_T$. Diffusion models can be regarded as a special case of flow matching where one of the distributions, $p_T$, is a simple, easy-to-sample distribution such as a Gaussian, and the other distribution $p_0$ is the data distribution. In flow matching, on the other hand, both $p_0$ and $p_T$ are allowed to be data distributions. (Note: We follow the notation convention of diffusion models, where the ultimate goal is to transport samples from $p_T$ to $p_0$; the flow matching literature generally uses the reverse direction.)

1.1. Probability Path and Stochastic Process

Both diffusion and flow matching can be characterized by a continuous, time-dependent probability distribution $p_t,\ t \in [0, T]$ that changes smoothly between $p_0$ and $p_T$. In general, $p_t$ is defined as the sequence of probability distributions associated with the random process $\bar{x}_t$ defined as follows:

$$\bar{x}_t := I(x_0, x_R, t), \quad \text{such that } x_0 \sim p_0,\ x_R \sim p_R, \tag{1}$$

where $I(x_0, x_R, t)$ is an interpolating function that is differentiable with respect to $t$, and $p_R$ is a reference distribution closely related to $p_T$. The process $\bar{x}_t$ is constructed such that $\bar{x}_{t=0} = x_0 \sim p_0$ and $\bar{x}_{t=T} \sim p_T$. In practice, both flow matching and diffusion models generally select $I(x_0, x_R, t)$ of the following form:

$$\bar{x}_t = I(x_0, x_R, t) = \alpha(t)\, x_0 + \sigma(t)\, x_R, \tag{2}$$

where $\alpha(t)$ is a smooth decreasing function of time $t$ and $\sigma(t)$ is a smooth increasing function of time $t$.

Flow matching uses the following setting:

  • $x_R = x_T \sim p_T$
  • $\alpha(0) = 1$ and $\alpha(T) = 0$
  • $\sigma(0) = 0$ and $\sigma(T) = 1$
  • $T = 1$

Hence, $\bar{x}_{t=0} = x_0 \sim p_0$ and $\bar{x}_{t=T} = x_T \sim p_T$.

Diffusion models generally use the following setting (both settings are sketched in code after this list):

  • $x_R \sim \mathcal{N}(0, I)$
  • $p_T$ is a Gaussian distribution with variance $\sigma(T)^2 I$, i.e., $p_T = \mathcal{N}(0, \sigma(T)^2 I)$
  • $T$ is large enough that $\bar{x}_{t=T}$ approaches $p_T = \mathcal{N}(0, \sigma(T)^2 I)$.
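To make the interpolant in Eq. (2) concrete, here is a minimal NumPy sketch of drawing $\bar{x}_t$ under both settings. The schedule $\alpha(t) = 1 - t$, $\sigma(t) = t$ and the 2-D toy distributions are illustrative assumptions, not choices prescribed above.

import numpy as np

# Illustrative schedule: alpha(t) = 1 - t, sigma(t) = t (an assumption for this sketch)
def interpolant(x0, xR, t):
    return (1 - t) * x0 + t * xR

rng = np.random.default_rng(0)
x0 = rng.normal(loc=4.0, scale=0.5, size=(1000, 2))    # toy samples standing in for the data distribution p0

# Flow matching setting: x_R = x_T are samples from a second (toy) data distribution p_T
xT = rng.normal(loc=-4.0, scale=0.5, size=(1000, 2))
x_bar_fm = interpolant(x0, xT, t=0.5)

# Diffusion setting: x_R is standard Gaussian noise
eps = rng.normal(size=(1000, 2))
x_bar_diff = interpolant(x0, eps, t=0.5)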

2. Flow Matching

2.1. Flow and the velocity field

Flow Matching Visualization

Figure 1: [Left] Evolution of $p_t$; [Middle] the time-dependent velocity field $u_t$; [Right] the time-dependent flow $\Psi_t$.

Flow matching is inspired by normalizing flow based generative models [Papamakarios et al., 2021]. There, the idea is to estimate an invertible map $\Psi: \mathbb{R}^d \to \mathbb{R}^d$, called a flow, that maps samples from $p_0$ to samples from $p_T$, i.e., $x_0 \sim p_0 \Rightarrow \Psi(x_0) \sim p_T$, and vice versa, i.e., $x_T \sim p_T \Rightarrow \Psi^{-1}(x_T) \sim p_0$. However, directly estimating such a map is challenging [Flow matching blog]. Continuous normalizing flows [Chen et al., 2018] were proposed to circumvent the need for estimating the flow $\Psi$ directly by instead defining a differentiable, time-dependent flow $\Psi_t$ for $t \in [0, T]$ such that the final flow at $t = T$ is the desired map $\Psi = \Psi_T$. This time-dependent flow is related to the probability path $p_t,\ t \in [0, T]$ as $\Psi_t(x_0) \sim p_t,\ t \in [0, T]$. Such a time-dependent flow $\Psi_t$ can be characterized by a time-dependent vector field $u_t: \mathbb{R}^d \to \mathbb{R}^d$, also called the velocity field, defined as $u_t(\Psi_t(x_0)) = \dot{\Psi}_t(x_0)$, where $\dot{\Psi}_t(x_0) = \frac{\partial}{\partial t}\Psi_t(x_0)$.

Denoting $\Psi_t(x_0) = x_t$, we can rewrite the above as the following ordinary differential equation (ODE):

$$dx = u_t(x)\, dt. \tag{3}$$

Given $u_t$, one can transport samples between $p_0$ and $p_T$ by integrating the velocity $u_t$ with an off-the-shelf ODE solver. For example, to transport a sample $x_0 \sim p_0$ to $x_T \sim p_T$, one can use Euler's method, which yields the update rule $x_{t + \Delta t} = x_t + u_t(x_t)\, \Delta t$.
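A minimal sketch of this Euler update, assuming a callable u(x, t) for the velocity field (the trained-model version appears in Section 2.3):

import numpy as np

def euler_transport(u, x0, num_steps=100, T=1.0):
    # Integrate dx = u(x, t) dt from t = 0 to t = T with Euler's method
    x = np.asarray(x0, dtype=np.float64)
    dt = T / num_steps
    for k in range(num_steps):
        x = x + u(x, k * dt) * dt
    return x

# Purely illustrative velocity field: pulls samples toward the origin
x1 = euler_transport(lambda x, t: -x, x0=np.ones(2))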

2.2. The flow matching objective

Flow matching (FM) aims at estimating the velocity field $u_t$ that continuously morphs $p_0$ to $p_T$, tracing the probability path $p_t,\ t \in [0, T]$. The ideal flow matching objective estimates the true velocity field $u_t$ by solving the following least-squares problem:

$$\min_\theta\ \mathcal{L}_{FM}(\theta) = \mathbb{E}_{x_t \sim p_t}\left[\left\| v_\theta(x_t, t) - u_t(x_t) \right\|^2\right]. \tag{4}$$

However, this ideal objective is intractable because the true velocity field $u_t$ is unknown. Instead, FM considers the following conditional flow matching objective, which has the same minimizer as Eq. (4) [Lipman et al., 2023; Albergo et al., 2023; Liu et al., 2023] [my FM blog]:

$$\min_\theta\ \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{z \sim q(z),\, x_t \sim p_t(x_t | z)}\left[\left\| v_\theta(x_t, t) - u_t(x_t | z) \right\|^2\right], \tag{5}$$

where $z$ is any conditioning information, and $u_t(x_t | z)$ is the conditional velocity field that traces the conditional probability path $p_t(x_t | z)$ for a given $z$.

A common choice of $z$ is the pair of endpoints $z = (x_0, x_T)$, with $q(z) = \rho(x_0, x_T)$ being any joint distribution of $(x_0, x_T)$ with the correct marginals $p_0$ and $p_T$. With this choice of $z$, one can set $p_t(x_t | x_0, x_T) = \delta_{I(x_0, x_T, t)}(x_t)$, where $\delta_y(x)$ is the Dirac measure centered at $y$, and $I(x_0, x_T, t)$ is the interpolant function defined in Eq. (1). This implies that $u_t(x_t | z) = u_t(x_t | x_0, x_T) = \partial_t I(x_0, x_T, t)$ is the associated unique conditional vector field. Hence, the practical objective used throughout the FM literature is:

$$\min_\theta\ \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t,\, (x_0, x_T) \sim \rho(x_0, x_T)}\left[\left\| v_\theta\big(I(x_0, x_T, t),\, t\big) - \partial_t I(x_0, x_T, t) \right\|^2\right]. \tag{6}$$

The optimal solution of Eq. (6), assuming an expressive enough function class for $v_\theta$, is given by:

$$v_\theta(x, t) = \mathbb{E}_{(x_0, x_T) \sim \rho(x_0, x_T)}\left[\dot{I}(x_0, x_T, t)\ \middle|\ x = I(x_0, x_T, t)\right]. \tag{7}$$

Substituting the realization of $I(x_0, x_T, t)$ from Eq. (2) into Eq. (7), and writing $I(x_0, x_T, t)$ as $\bar{x}_t$, we get:

$$v_\theta(x, t) = \mathbb{E}_{(x_0, x_T) \sim \rho(x_0, x_T)}\left[\dot{\alpha}(t)\, x_0 + \dot{\sigma}(t)\, x_T\ \middle|\ x = \bar{x}_t\right]. \tag{8}$$

Since $\dot{\alpha}(t)$ and $\dot{\sigma}(t)$ are constants with respect to $x_0$ and $x_T$, we can further simplify Eq. (8) as:

$$v_\theta(x, t) = \dot{\alpha}(t)\, \mathbb{E}[x_0\ |\ x = \bar{x}_t] + \dot{\sigma}(t)\, \mathbb{E}[x_T\ |\ x = \bar{x}_t]. \tag{9}$$

Eq. (9) shows that the only data-dependent quantities the optimal vector field depends on are the conditional means $\mu_0 = \mathbb{E}[x_0\ |\ x = \bar{x}_t]$ and $\mu_T = \mathbb{E}[x_T\ |\ x = \bar{x}_t]$. These conditional means will reappear in the next section on diffusion models, and will allow us to express the vector field in terms of the score (or denoiser) and vice versa.

Flow Matching Visualization

Figure 2: Trajectory of the samples following the vector field $v_\theta(x, t)$, the solution of Eq. (6) for $\alpha(t) = 1 - t,\ \sigma(t) = t$.

2.3. Flow Matching Implementation

Here's a simple PyTorch implementation of the flow matching objective, which was used to generate Figure 2:


import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

# Hyperparameters
batch_size = 512
epochs = 200
learning_rate = 1e-3

# Generate two Gaussians dataset in corners (4,4) and (4,-4)
X, _ = make_blobs(n_samples=10000, centers=[(4,4), (4,-4)], cluster_std=0.5, random_state=42)
X = torch.tensor(X, dtype=torch.float32)

# Simple MLP model for vector field
class VectorFieldMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 1, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

    def forward(self, x, t):
        x = torch.cat([x, t], dim=-1)
        return self.net(x)

def compute_alpha_t(t):
    # alpha(t) = 1 - t and its time derivative alpha'(t) = -1
    return 1 - t, -1

def compute_sigma_t(t):
    # sigma(t) = t and its time derivative sigma'(t) = 1
    return t, 1

# Training function
def train(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for x_0 in tqdm(dataloader):
        x_0 = x_0[0].to(device)
        optimizer.zero_grad()

        # Generate noise
        x_1 = torch.randn_like(x_0)
        
        # Sample t uniformly
        t = torch.rand(x_0.shape[0], device=device)
        t = t.unsqueeze(-1)
        alpha_t, d_alpha_t = compute_alpha_t(t)
        sigma_t, d_sigma_t = compute_sigma_t(t)
        # Interpolate between x_0 and x_1
        x_t = alpha_t * x_0 + sigma_t * x_1
        
        # Compute vector field
        v_theta = model(x_t, t)
        
        # Compute loss
        loss = torch.mean(torch.sum((v_theta - (d_alpha_t * x_0 + d_sigma_t * x_1))**2, dim=-1))

        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(dataloader)


# Initialize model, optimizer, and dataloader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = VectorFieldMLP().to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
dataloader = DataLoader(TensorDataset(X), batch_size=batch_size, shuffle=True)

# Training loop
losses = []
for epoch in range(epochs):
    loss = train(model, dataloader, optimizer, device)
    losses.append(loss)
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss:.4f}")


def static_sample(model, num_samples, device, source_samples, num_steps=100):
    # Convert source samples to tensor and move to device
    x = torch.tensor(source_samples, dtype=torch.float32).to(device)

    # Time steps
    dt = 1 / num_steps

    with torch.no_grad():
        # Euler integration backward in time: t goes from 1 (reference) down to 0 (data)
        for step in range(num_steps):
            t = torch.ones(x.shape[0], device=device) * (1 - step * dt)
            t = t.unsqueeze(-1)
            v = model(x, t)
            x = x - v * dt

    return x.cpu().numpy()
        
Here $p_0$ is a mixture of two Gaussians, whereas $p_T$ is the standard Gaussian distribution. Since $v_\theta$ transports samples from $p_0$ to $p_T$ as $t$ increases from 0 to 1, we integrate it backward in time (hence the minus sign in the Euler update above) to transport samples from $p_T$ to $p_0$.
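For instance, one can draw source samples from $p_T = \mathcal{N}(0, I)$ and push them back to $p_0$ with the sampler above (the sample count and the plotting are illustrative):

# Draw source samples from p_T = N(0, I) and transport them to p_0
source = np.random.randn(1000, 2)
generated = static_sample(model, 1000, device, source, num_steps=100)

plt.scatter(generated[:, 0], generated[:, 1], s=2)
plt.title("Samples transported from N(0, I) to the two-Gaussian mixture")
plt.show()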

3. Diffusion Models

3.1. Instances of the diffusion process

In diffusion models, the continuous morphing of $p_0$ to $p_T$ is called the forward diffusion process, whereas the reverse direction from $p_T$ to $p_0$ is called the reverse diffusion process. The forward diffusion is a random process in which Gaussian noise is gradually added to the data $x_0 \sim p_0$ as follows:

$$\bar{x}_t = I(x_0, \epsilon, t) = \alpha(t)\, x_0 + \sigma(t)\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \tag{10}$$

where $\sigma(t)$ is a monotonically increasing function of $t$ and $\alpha(t)$ is a non-increasing function of $t$. Note that we replaced $x_R$ with $\epsilon$. The choices of $\alpha(t), \sigma(t), T$ are made in such a way that at $t = T$, the variance of the second term $\sigma(t)\, \epsilon$ in Eq. (10) dominates, and thus $\bar{x}_T$ is distributed approximately as $\mathcal{N}(0, \sigma(T)^2 I)$.

Different choices of the noise schedule $\alpha(t), \sigma(t)$ give different instances of the diffusion process. Two notable instances are the variance preserving (VP) diffusion [Ho et al., 2020] and the variance exploding (VE) diffusion [Song et al., 2019].

The VP diffusion process was proposed in the Denoising Diffusion Probabilistic Models (DDPM) paper [Ho et al., 2020]. It is characterized by selecting $\alpha(t)$ and $\sigma(t)$ such that the variance of $\bar{x}_t$ is preserved over time, i.e., $\mathrm{Var}(\bar{x}_t) = \mathrm{Var}(x_0)$ (assuming the data is normalized to unit variance). To that end, $\sigma(t)$ is set to $\sigma(t) = \sqrt{1 - \alpha(t)^2}$. Hence,

$$\bar{x}_t = \alpha(t)\, x_0 + \sqrt{1 - \alpha(t)^2}\, \epsilon. \tag{11}$$

Generally, $\alpha(t)$ is chosen such that $\alpha(t) \to 0$ as $t \to T$, which implies that $\bar{x}_T$ is approximately distributed as $\mathcal{N}(0, I)$.

The VE diffusion process was proposed for Score Matching with Langevin Dynamics (SMLD) in [Song et al., 2019]. It is characterized by setting $\alpha(t) = 1$:

$$\bar{x}_t = x_0 + \sigma(t)\, \epsilon. \tag{12}$$

Here $\sigma(T)$ is chosen large enough that $\bar{x}_T \sim \mathcal{N}(x_0, \sigma(T)^2 I)$ is approximately distributed as $\mathcal{N}(0, \sigma(T)^2 I)$, i.e., nearly independent of $x_0$.
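As a minimal sketch of the two forward processes (the linear VP schedule $\alpha(t) = 1 - t$ and the VE schedule $\sigma(t) = \sigma_{\max}\, t$ are illustrative assumptions, not the schedules used in the original papers):

import torch

def vp_forward(x0, t):
    # VP: x_t = alpha(t) x0 + sqrt(1 - alpha(t)^2) eps, with the illustrative choice alpha(t) = 1 - t
    alpha_t = 1 - t
    return alpha_t * x0 + torch.sqrt(1 - alpha_t**2) * torch.randn_like(x0)

def ve_forward(x0, t, sigma_max=10.0):
    # VE: x_t = x0 + sigma(t) eps, with the illustrative choice sigma(t) = sigma_max * t
    return x0 + (sigma_max * t) * torch.randn_like(x0)

x0 = torch.randn(8, 2)   # stand-in for data samples
t = torch.rand(8, 1)     # uniform times in [0, 1]
x_vp, x_ve = vp_forward(x0, t), ve_forward(x0, t)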

3.2. Forward and Reverse SDE

The probability path $p_t,\ t \in [0, T]$ traced out by $\bar{x}_t$ can alternatively be described as the solution of the following forward SDE [Song et al., 2021]:

$$dx = f(x, t)\, dt + g(t)\, dw, \tag{13}$$

where $f(x, t)$ is called the drift coefficient and $g(t)$ is called the diffusion coefficient. These coefficients are functions of $\alpha(t)$ and $\sigma(t)$, and different instances of these coefficients give different instances of the diffusion process. The forward SDE is not very interesting on its own, as it only describes the process of corrupting a data sample with noise, which is easy to do. What's more interesting is that there exists an SDE whose solution traces the path $p_t,\ t \in [0, T]$ in exactly the reverse direction. This SDE is given by:

$$\text{(Reverse SDE)}\quad dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{w}. \tag{14}$$

Any off-the-shelf SDE solver can be used to sample data using the above SDE. For instance, Euler-Maruyama integration amounts to the following update step to sample from $p_0$, beginning with a sample from $p_T$:

$$x_{t - \Delta t} = x_t - \left[f(x_t, t) - g(t)^2\, \nabla_x \log p_t(x_t)\right] \Delta t + g(t)\, \sqrt{\Delta t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$$

The above SDE says that if one has access to the score function $\nabla_x \log p_t(x)$, then one can gradually transform a sample $x_T \sim p_T$ into a sample from $p_0$ by following the path prescribed by the reverse SDE. Therefore, the literature on diffusion models focuses on estimating the score function $\nabla_x \log p_t(x)$, which we describe in the next subsections.
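A minimal sketch of this Euler-Maruyama step, assuming generic callables f(x, t), g(t), and a score estimate score(x, t) (Section 4.2 shows the concrete version used to generate Figure 3):

import math
import torch

def reverse_sde_step(x, t, dt, f, g, score):
    # One Euler-Maruyama step of the reverse SDE, moving from time t to t - dt
    eps = torch.randn_like(x)
    drift = f(x, t) - g(t) ** 2 * score(x, t)
    return x - drift * dt + g(t) * math.sqrt(dt) * eps

# Toy usage: f = 0, g = 1, and the score of a standard Gaussian (illustrative only)
x = torch.randn(4, 2)
x = reverse_sde_step(x, t=1.0, dt=0.01,
                     f=lambda x, t: torch.zeros_like(x),
                     g=lambda t: 1.0,
                     score=lambda x, t: -x)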

3.3. The score matching objective

Score matching [Song et al., 2019] aims at estimating the score of the data distribution, $s_\theta(x) = \nabla_x \log p_0(x)$. However, directly estimating this score is challenging [Song et al., 2019]. Instead, [Song et al., 2019] proposed to estimate the score of the noisy data distribution, i.e., of the probability density of $\bar{x}_t$, denoted by $p_t$. The goal is then to estimate

$$s_\theta(x, t) = \nabla_x \log p_t(x). \tag{15}$$

The ideal noisy score matching objective is:

$$\min_\theta\ \mathbb{E}_{x_0 \sim p_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\ \left\| s_\theta(\bar{x}_t, t) - \nabla_{\bar{x}_t} \log p_t(\bar{x}_t) \right\|^2. \tag{16}$$

As in the case of flow matching, this ideal objective is intractable because the true score $\nabla_{\bar{x}_t} \log p_t(\bar{x}_t)$ is unknown. Therefore, similar to conditional flow matching, the objective in Eq. (16) is replaced by the following conditional score matching objective:

$$\min_\theta\ \mathcal{L}_{CSM}(\theta) = \mathbb{E}_{z \sim q(z),\, x_t \sim p_t(\cdot | z)}\left[\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t | z) \right\|^2\right], \tag{17}$$

where $z$ corresponds to any conditioning information, as in the case of conditional flow matching.

The advantage of this formulation is that the solutions of Eq. (17) and Eq. (16) are the same; however, one is free to choose the conditioning information $z$ so as to make $\nabla_{x_t} \log p_t(x_t | z)$ easy to compute. To that end, $q(z)$ is chosen to be the data distribution $p_0$, and $p_t(x_t | z) = p_{t0}(x_t | x_0) = \mathcal{N}(x_t; \alpha(t)\, x_0, \sigma(t)^2 I)$, i.e., the distribution of the random variable $I(x_0, \epsilon, t)$ where only $x_0$ is held fixed (compare this with the Dirac measure $\delta_{I(x_0, x_T, t)}(\bar{x}_t)$ in flow matching, where both $x_0$ and $x_T = \epsilon$ were held fixed).

Hence the score matching objective is:

$$\min_\theta\ \mathcal{L}_{CSM}(\theta) = \mathbb{E}_{t,\, x_0 \sim p_0,\, \epsilon \sim \mathcal{N}(0, I)}\left[\left\| s_\theta(\bar{x}_t, t) - \nabla_{\bar{x}_t} \log p_{t0}(\bar{x}_t | x_0) \right\|^2\right]. \tag{18}$$

Here, $\nabla_{\bar{x}_t} \log p_{t0}(\bar{x}_t | x_0)$ can be evaluated in closed form:

$$\nabla_{\bar{x}_t} \log p_{t0}(\bar{x}_t | x_0) = \nabla_{\bar{x}_t} \log \mathcal{N}(\bar{x}_t; \alpha(t)\, x_0, \sigma(t)^2 I) = -\frac{\bar{x}_t - \alpha(t)\, x_0}{\sigma(t)^2} \overset{\text{(defn. of } \bar{x}_t)}{=} -\frac{\alpha(t)\, x_0 + \sigma(t)\, \epsilon - \alpha(t)\, x_0}{\sigma(t)^2} = -\frac{\epsilon}{\sigma(t)}.$$

Hence, the score matching objective in Eq. (18) can be rewritten as:

$$\min_\theta\ \mathcal{L}_{CSM}(\theta) = \mathbb{E}_{t,\, x_0 \sim p_0,\, \epsilon \sim \mathcal{N}(0, I)}\left[\left\| s_\theta(\bar{x}_t, t) + \frac{\epsilon}{\sigma(t)} \right\|^2\right]. \tag{19}$$
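A minimal PyTorch sketch of the loss in Eq. (19), reusing the schedule helpers compute_alpha_t and compute_sigma_t from Section 2.3 and assuming a score network s_theta(x, t) with the same interface as VectorFieldMLP:

# Conditional (denoising) score matching loss of Eq. (19) -- a sketch, not the exact training code
def score_matching_loss(s_theta, x_0):
    eps = torch.randn_like(x_0)
    t = torch.rand(x_0.shape[0], 1, device=x_0.device)
    alpha_t, _ = compute_alpha_t(t)        # alpha(t) = 1 - t (illustrative schedule)
    sigma_t, _ = compute_sigma_t(t)        # sigma(t) = t
    x_t = alpha_t * x_0 + sigma_t * eps    # noisy sample
    target = -eps / sigma_t                # conditional score of p_t0(x_t | x_0)
    # Note: sigma(t) -> 0 as t -> 0, so in practice the loss is often reweighted
    # or t is sampled away from 0 to avoid the singularity.
    return torch.mean(torch.sum((s_theta(x_t, t) - target) ** 2, dim=-1))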

The optimal solution of Eq. (19) is therefore:

$$s_\theta(x, t) = \mathbb{E}_{x_0 \sim p_0,\, \epsilon \sim \mathcal{N}(0, I)}\left[-\frac{\epsilon}{\sigma(t)}\ \middle|\ x = \bar{x}_t\right] = -\frac{1}{\sigma(t)}\, \mathbb{E}_{x_0 \sim p_0,\, \epsilon \sim \mathcal{N}(0, I)}\left[\epsilon\ \middle|\ x = \bar{x}_t\right]. \tag{20}$$

Observe that the conditional expectation in Eq. (20) is the same as the conditional mean $\mu_T = \mathbb{E}[x_T\ |\ x = \bar{x}_t]$ in Eq. (9) when $p_T$ is the standard Gaussian distribution $\mathcal{N}(0, I)$. In Section 4 we will use this fact to connect the optimal velocity field $v_\theta(x, t)$ in Eq. (9) to the optimal score $s_\theta(x, t)$.

3.4. The denoising diffusion objective

Instead of the score, one can also estimate the denoiser, as proposed in DDPM [Ho et al., 2020]:

$$\mathcal{L}_{\text{diffusion}}(\theta) = \mathbb{E}_{x_0 \sim p_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\ \left\| \epsilon - \epsilon_\theta(\bar{x}_t, t) \right\|^2.$$

The optimal solution of the above objective is $\epsilon_\theta(x, t) = \mathbb{E}_{x_0 \sim p_0,\, \epsilon \sim \mathcal{N}(0, I)}\left[\epsilon\ |\ x = \bar{x}_t\right]$. Therefore, estimating the score is equivalent to estimating the denoiser up to a constant scaling factor, i.e., $\epsilon_\theta(x, t) = -\sigma(t)\, s_\theta(x, t)$.
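In code, converting between a denoiser output and a score estimate is a one-liner; a sketch assuming a hypothetical network eps_theta(x, t):

def score_from_denoiser(eps_theta, x, t, sigma_t):
    # s_theta(x, t) = -eps_theta(x, t) / sigma(t)
    return -eps_theta(x, t) / sigma_t

def denoiser_from_score(s_theta, x, t, sigma_t):
    # eps_theta(x, t) = -sigma(t) * s_theta(x, t)
    return -sigma_t * s_theta(x, t)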

4. Connection between FM and Diffusion

Recall the optimal velocity field:

$$v_\theta(x, t) = \dot{\alpha}(t)\, \mathbb{E}[x_0\ |\ x = \bar{x}_t] + \dot{\sigma}(t)\, \mathbb{E}[x_T\ |\ x = \bar{x}_t]. \tag{21}$$

Taking $p_T$ to be the standard Gaussian,

$$v_\theta(x, t) = \dot{\alpha}(t)\, \mathbb{E}[x_0\ |\ x = \bar{x}_t] + \dot{\sigma}(t)\, \mathbb{E}[\epsilon\ |\ x = \bar{x}_t] \overset{(a)}{=} \dot{\alpha}(t)\, \mathbb{E}[x_0\ |\ x = \bar{x}_t] - \dot{\sigma}(t)\, \sigma(t)\, s_\theta(x, t), \tag{22}$$

where (a) follows from Eq. (20). Now we need an expression for $\mathbb{E}[x_0\ |\ x = \bar{x}_t]$. To that end, note that

$$x = \mathbb{E}[\bar{x}_t\ |\ \bar{x}_t = x] = \mathbb{E}[\alpha(t)\, x_0 + \sigma(t)\, \epsilon\ |\ \bar{x}_t = x] = \alpha(t)\, \mathbb{E}[x_0\ |\ \bar{x}_t = x] + \sigma(t)\, \mathbb{E}[\epsilon\ |\ \bar{x}_t = x].$$

Solving for $\mathbb{E}[x_0\ |\ \bar{x}_t = x]$ (using $\mathbb{E}[\epsilon\ |\ \bar{x}_t = x] = -\sigma(t)\, s_\theta(x, t)$ from Eq. (20)), we get:

$$\mathbb{E}[x_0\ |\ \bar{x}_t = x] = \frac{x + \sigma(t)^2\, s_\theta(x, t)}{\alpha(t)}.$$

Plugging this into Eq. (22) and solving for $s_\theta(x, t)$, we get:

$$s_\theta(x, t) = \frac{\alpha(t)\, v_\theta(x, t) - \dot{\alpha}(t)\, x}{\dot{\alpha}(t)\, \sigma(t)^2 - \alpha(t)\, \sigma(t)\, \dot{\sigma}(t)}. \tag{23}$$
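As a quick sanity check, for the linear schedule used in Figure 2, $\alpha(t) = 1 - t$, $\sigma(t) = t$ (so $\dot{\alpha}(t) = -1$, $\dot{\sigma}(t) = 1$), Eq. (23) reduces to:

$$s_\theta(x, t) = \frac{(1 - t)\, v_\theta(x, t) + x}{-t^2 - (1 - t)\, t} = -\frac{(1 - t)\, v_\theta(x, t) + x}{t},$$

which is exactly what get_score_from_velocity in Section 4.2 computes for this schedule.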

4.1. Generating samples: Probability Flow ODE and the Reverse SDE

We saw two different ways to generate samples using continuous normalizing flows:

  • Estimate the vector field $v_\theta(x, t)$, and solve the ODE in Eq. (3) to generate samples.
  • Estimate the score $s_\theta(x, t)$, and solve the reverse SDE in Eq. (14) to generate samples.

But why do we need two methods if they are equivalent? Although in theory solving the ODE and the SDE gives the same result, in practice both are solved only approximately. As such, there is generally a trade-off between sample quality and sampling time. It has been observed that sampling with the SDE often yields higher quality samples (closer to the data distribution $p_0$) than sampling with the ODE [Karras et al., 2022; Ma et al., 2024], whereas, for a given quality, sampling with the ODE requires fewer iterations than with the SDE [Karras et al., 2022].

The aforementioned observation motivates the purpose of this section: to flexibly sample from the SDE or the ODE using either of the two quantities $v_\theta(x, t)$ or $s_\theta(x, t)$. To that end, suppose we have obtained the vector field $v_\theta(x, t)$ by solving Eq. (6) for the toy data in Figure 2. One can obtain the corresponding score $s_\theta(x, t)$ using Eq. (23). However, in order to use the reverse SDE

$$dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{w}, \tag{24}$$

we need to know the drift coefficient $f(x, t)$ and the diffusion coefficient $g(t)$. These can be obtained using the following simple fact.

Probability Flow ODE [Song et al., 2021]: There exists an ODE, called the Probability Flow ODE, that induces the same probability path $p_t,\ t \in [0, T]$ as the SDE in Eq. (24):

$$dx = \left[f(x, t) - \tfrac{1}{2} g(t)^2\, \nabla_x \log p_t(x)\right] dt. \tag{25}$$

This implies that the vector field corresponding to the score is:

$$u_t(x) = f(x, t) - \tfrac{1}{2} g(t)^2\, \nabla_x \log p_t(x). \tag{26}$$

Let's rewrite Eq. (23) as follows:

$$s_\theta(x, t) = \frac{\frac{\dot{\alpha}(t)}{\alpha(t)}\, x - v_\theta(x, t)}{\sigma(t)\, \lambda(t)}, \quad \text{where } \lambda(t) = \dot{\sigma}(t) - \frac{\dot{\alpha}(t)}{\alpha(t)}\, \sigma(t).$$

This implies that

$$v_\theta(x, t) = \frac{\dot{\alpha}(t)}{\alpha(t)}\, x - s_\theta(x, t)\, \sigma(t)\, \lambda(t). \tag{27}$$

Comparing Eq. (26) and Eq. (27), we get:

$$f(x, t) = \frac{\dot{\alpha}(t)}{\alpha(t)}\, x, \qquad g(t)^2 = 2\, \sigma(t)\, \lambda(t).$$

This allows one to estimate the vector field during training and still use the reverse SDE to sample, which has often been observed to be an effective approach in practice [Ma et al., 2024].

SDE sampling Visualization

Figure 3: Trajectory of the samples following the reverse SDE, with $s_\theta(x, t)$ computed from $v_\theta(x, t)$ using Eq. (23) for $\alpha(t) = 1 - t,\ \sigma(t) = t$.

4.2. Implementation: Sampling with Score based SDE

Here's a Python implementation that demonstrates how to convert a vector field into a score, and how to use the score for reverse-SDE based sampling. It was used to generate the animation in Figure 3 using the vector field from Figure 2.



import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
import torch
import math

def get_score_from_velocity(velocity, x, t):
    # Convert the learned velocity field v_theta into a score estimate via Eq. (23)
    t = t.unsqueeze(-1)
    alpha_t, d_alpha_t = compute_alpha_t(t)
    sigma_t, d_sigma_t = compute_sigma_t(t)
    alpha_ratio = d_alpha_t / alpha_t
    lambda_t = d_sigma_t - alpha_ratio * sigma_t
    score = (alpha_ratio * x - velocity) / (sigma_t * lambda_t)
    return score

def get_drift_and_diffusion(x, t):
    # Drift f(x, t) = (alpha'(t) / alpha(t)) x and squared diffusion g(t)^2 = 2 sigma(t) lambda(t), per Section 4.1
    t = t.unsqueeze(-1)
    alpha_t, d_alpha_t = compute_alpha_t(t)
    sigma_t, d_sigma_t = compute_sigma_t(t)
    alpha_ratio = d_alpha_t / alpha_t
    drift = alpha_ratio * x
    lambda_t = d_sigma_t - alpha_ratio * sigma_t
    diffusion = 2 * sigma_t * lambda_t  # this is g(t)^2
    return drift, diffusion

def euler_step(x, mean_x, t, model, dt, sde=False):
    w_cur = torch.randn(x.size()).to(x)
    t = torch.ones(x.size(0)).to(x) * t
    dw = w_cur * math.sqrt(dt)
    drift, diffusion = get_drift_and_diffusion(x, t)
    
    if not sde:
        # Probability flow ODE step (Eq. 25): deterministic update using the score derived from the velocity
        velocity = drift - 0.5 * diffusion * get_score_from_velocity(model(x, t.unsqueeze(-1)), x, t)
        mean_x = x - velocity * dt
        x = mean_x
    else:
        # Reverse SDE step (Eq. 24): drift correction plus Gaussian noise scaled by g(t) = sqrt(diffusion)
        reverse_drift = drift - diffusion * get_score_from_velocity(model(x, t.unsqueeze(-1)), x, t)
        mean_x = x - reverse_drift * dt
        x = mean_x - torch.sqrt(diffusion) * dw
    return x, mean_x

@torch.no_grad()
def sample_sde(source_samples, model, device, steps=100, sde=False):
    model.eval()
    x = torch.tensor(source_samples, dtype=torch.float32).to(device)
    mean_x = x.clone()
    dt = 1.0 / steps
    
    # Integrate backward in time from t = 1 - dt down to t = dt
    # (skipping the endpoints avoids the alpha(1) = 0 and sigma(0) = 0 singularities)
    for i in range(1, steps):
        t = torch.ones(x.size(0)).to(x) * (1 - i * dt)
        x, mean_x = euler_step(x, mean_x, t, model, dt, sde=sde)

    return x
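A usage example (the sample count and the sde=True flag are illustrative; model and device are the trained vector field and device from Section 2.3):

# Draw source samples from p_T = N(0, I) and run the reverse SDE sampler
source_samples = np.random.randn(1000, 2)
samples = sample_sde(source_samples, model, device, steps=100, sde=True)
samples = samples.cpu().numpy()

plt.scatter(samples[:, 0], samples[:, 1], s=2)
plt.title("Samples generated with the reverse SDE")
plt.show()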
        



5. References

[Eyring et al., 2024] Eyring, L., Klein, D., Uscidda, T., Palla, G., Kilbertus, N., Akata, Z., & Theis, F. (2024). Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation. arXiv preprint arXiv:2311.15100.

[HuggingFace Blogs 2022] https://huggingface.co/blog/annotated-diffusion

[Ho et al., 2020] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.

[Lipman et al., 2023] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[Chen et al., 2018] Chen, R. T., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in neural information processing systems, 31.

[Albergo et al., 2023] Albergo, M. S., & Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571.

[Liu et al., 2023] Liu, X., Gong, C., & Liu, Q. (2023). Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.

[Papamakarios et al., 2021] Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57), 1-64.

[Ma et al., 2024] Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., & Xie, S. (2024). Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740.

[Karras et al., 2022] Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35, 26565-26577.
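
[Song et al., 2019] Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.

[Song et al., 2021] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.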