Optimal Solutions for Diffusion and Flow Matching Objectives

April 31, 2024

Diffusion and flow matching are the state-of-the-art generative models for producing high-quality and diverse data in many domains (e.g., images, audio, video, graphs, etc.). From an optimization perspective, what makes them so appealing, compared to the previously state-of-the-art GAN-based generative models, is that they have simple least-squares training objectives rather than the min-max optimization of GANs.

However, the presentation of the diffusion and flow matching objectives is sometimes oversimplified in blog articles and even in published papers, mostly for the sake of brevity. This can be detrimental to the understanding of the models and can lead to wrong conclusions. For example, the least-squares objective could convey the impression that the optimal objective value is zero [Eyring et al., 2024, Sec. 3.1]. For another example, the denoising diffusion objective is often described as trying to estimate the noise in the noisy sample [HuggingFace Blogs 2022]. This may suggest that if the optimal solution to the objective were obtained (i.e., no optimization error), then removing the estimated noise would result in a perfect reconstruction of the original sample, and hence no iterative/Langevin sampling would be needed. However, this is not the case. The correct interpretation has a subtle but important difference: diffusion models estimate the "expected" value of the noise given the noisy sample. Therefore, the estimated denoising direction is only correct for infinitesimally small steps.

An important observation that stems from this understanding is that the objective value is not saturated (zero) at the optimal solution in general. Rather, it is a positive quantity that depends on the data distributions and on hyperparameter choices such as the noise scale schedule. This implies that the absolute value of the objective is not, by itself, an informative quantity for hyperparameter tuning or convergence analysis.

In this blog post, we will carefully analyze the diffusion and flow matching objectives and provide a clearer perspective into the optimal solutions.

Notations

Both diffusion and flow matching fall under continuous normalizing flow based generative models [Chen et al., 2018, Lipman et al., 2023]. The main idea is to continuously morph a base distribution \(p_1\) into a target distribution \(p_0\). Diffusion models can be regarded as a special case of flow matching where the base distribution is a simple, easy-to-sample distribution, such as a Gaussian, and the target distribution is the data distribution. In flow matching, by contrast, both \(p_0\) and \(p_1\) are allowed to be data distributions. (Note: we follow diffusion model notation, where the goal is to transport samples from \(p_1\) to \(p_0\); the flow matching literature generally uses the reverse direction.)

Diffusion Models

Diffusion models are characterized by a forward diffusion process in which a data sample \(x_0 \sim p_0\) is degraded using the following diffusion process [Ho et al., 2020]: \[ x_t = I(x_0, \epsilon, t) = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, ~ \epsilon \sim \mathcal{N}(0, I), \tag{1}\] where \(\bar{\alpha}_t \in (0, 1)\) represents the noise scale at time \(t\) (Note: here we consider variance preserving (VP) diffusion models), and \(I(x_0, \epsilon, t)\) is the interpolation between the data sample \(x_0\) and the noise \(\epsilon\) at time \(t\). The denoising diffusion objective is then given by \[ \mathcal{L}_{\text{diffusion}}(\theta) = \mathbb{E}_{x_0 \sim p_0, \epsilon \sim \mathcal{N}(0, I), t} \left[ \lambda(t)\left\| \epsilon - \epsilon_\theta(I(x_0, \epsilon, t), t) \right\|^2 \right], \tag{2} \] where the weighting \(\lambda(t)\) is a function of \(\bar{\alpha}_t\). For simplicity, we will assume \(\lambda(t) = 1\) for the rest of the blog post.
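To make Eqs. (1) and (2) concrete, here is a minimal PyTorch sketch of a single loss evaluation. The cosine-style schedule `alpha_bar` and the generic network `eps_model` are illustrative assumptions for this sketch, not a particular published configuration.

```python
import torch

def alpha_bar(t):
    # Illustrative noise schedule (an assumption): abar(0) ~ 1 (clean), abar(1) ~ 0 (pure noise).
    return torch.cos(0.5 * torch.pi * t) ** 2

def diffusion_loss(eps_model, x0):
    # x0: a batch of data samples of shape (B, D); eps_model is any network mapping (x_t, t) -> noise estimate.
    B = x0.shape[0]
    t = torch.rand(B, 1)                                  # t ~ Uniform(0, 1)
    eps = torch.randn_like(x0)                            # eps ~ N(0, I)
    abar = alpha_bar(t)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps      # forward process, Eq. (1)
    # Least-squares regression of the model output onto the sampled noise, Eq. (2) with lambda(t) = 1.
    return ((eps - eps_model(x_t, t)) ** 2).sum(dim=-1).mean()
```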

If we think of \(\epsilon_\theta(x_t, t)\) as estimating the noise in the noisy sample \(x_t\), one could arrive at the conclusion that \(\min_{\theta} \mathcal{L}_{\text{diffusion}}(\theta) = 0\) at the optimal solution. This is because, if the optimal solution were obtained, the estimated noise should be exactly the noise \(\epsilon\) that was added to the data sample \(x_0\) to produce \(x_t\); removing it would then perfectly reconstruct the original data sample. However, this is not the case. The reason is that there are many combinations of \(x_0\) and \(\epsilon\) that can produce the same \(x_t\). Therefore, intuitively, the optimal solution will be the average of all such \(\epsilon\). More precisely, the optimal solution is given by: \[ \epsilon_\theta^*(x, t) = \mathbb{E}_{x_0 \sim p_0, \epsilon \sim \mathcal{N}(0, I)} \left[\epsilon ~|~ x = I(x_0, \epsilon, t) \right]. \tag{3} \]
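A standard bias–variance decomposition makes the residual objective value explicit. Conditioning on \(x_t\) and \(t\), and using the fact that the conditional expectation is the minimizer of the mean squared error, the objective in Eq. (2) (with \(\lambda(t) = 1\)) splits as \[ \mathcal{L}_{\text{diffusion}}(\theta) = \mathbb{E}_{t, x_t}\left[ \left\| \epsilon_\theta(x_t, t) - \epsilon_\theta^*(x_t, t) \right\|^2 \right] + \mathbb{E}_{t, x_t}\left[ \mathrm{tr}\, \mathrm{Cov}\left( \epsilon ~|~ x_t, t \right) \right]. \] Minimizing over \(\theta\) can only remove the first term; the second term, the total conditional variance of the noise given the noisy sample, remains no matter how well the model is trained.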

Therefore, the optimal objective value is greater than zero in general. This also corroborates the fact that one cannot obtain the clean sample \(x_0\) from \(x_t\) by simply removing the noise estimate \(\epsilon_\theta^*(x_t, t)\). Instead, \(\epsilon_\theta^*(x_t, t)\) is simply an instantaneous direction along which to slightly nudge the noisy sample \(x_t\) towards the true data distribution \(p_0\).
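As a quick numerical sanity check (a toy setup of our own choosing, not taken from any of the cited papers), consider a one-dimensional data distribution supported on two points, \(p_0 = \tfrac{1}{2}\delta_{-1} + \tfrac{1}{2}\delta_{+1}\), at a fixed noise level \(\bar{\alpha}_t = 0.5\). The optimal predictor of Eq. (3) has a closed form here (a posterior-weighted average over the two data points), and a Monte Carlo estimate of the objective at this optimum comes out well above zero (roughly 0.4–0.5 for this setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D data distribution (an assumption for illustration): p_0 = 0.5 * delta(-1) + 0.5 * delta(+1)
mus = np.array([-1.0, 1.0])

abar = 0.5                          # fixed noise level abar_t for this check
a, s = np.sqrt(abar), np.sqrt(1 - abar)

def eps_star(x_t):
    # Optimal predictor, Eq. (3): equivalently (x_t - a * E[x_0 | x_t]) / s,
    # with Gaussian posterior weights over the two data points.
    logw = -(x_t[:, None] - a * mus[None, :]) ** 2 / (2 * s ** 2)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    x0_mean = (w * mus).sum(axis=1)
    return (x_t - a * x0_mean) / s

# Monte Carlo estimate of the objective evaluated at the optimal solution.
n = 1_000_000
x0 = rng.choice(mus, size=n)
eps = rng.standard_normal(n)
x_t = a * x0 + s * eps
print(np.mean((eps - eps_star(x_t)) ** 2))   # strictly positive, not zero
```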

[Figures: Diffusion Optimal Solution; Diffusion Objective. Top: Diffusion from the source to the target distribution. Bottom: Objective value during training.]

Flow Matching

Flow matching (FM) aims to estimate a time-dependent vector field \(u_t\) that continuously morphs \(p_1\) into \(p_0\), tracing the probability path \(p_t, t \in [0, 1]\). The ideal flow matching objective regresses onto this vector field: \[ \min_{\theta} ~~\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t, x_t \sim p_t} \left[ \left\| v_\theta(x_t, t) - u_t(x_t) \right\|^2 \right]. \tag{4} \] However, this ideal objective is intractable because the velocity field \(u_t\) is unknown. Instead, FM considers the following conditional flow matching objective, which has the same minimizer as Eq. (4) [Lipman et al., 2023, Albergo et al., 2023, Liu et al., 2023]: \[ \min_{\theta} ~~\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, z \sim q(z), x_t \sim p_t(x_t|z)} \left[ \left\| v_\theta(x_t, t) - u_t(x_t|z) \right\|^2 \right], \tag{5} \] where \(u_t(x_t|z)\) is the conditional velocity field that traces the conditional probability path \(p_t(x_t|z)\) for a given \(z\). For example, \(q(z)\) can be the joint distribution of \( x_0 \sim p_0, x_1 \sim p_1 \), and \(p_t(x_t|x_0, x_1)= \delta_{I(x_1, x_0, t)} (x_t)\), where \(\delta_y(x)\) is the Dirac measure centered at \(y\), and \(I(x_1, x_0, t)\) is the so-called stochastic interpolant function [Albergo et al., 2023] (cf. Eq. (1)) such that \(I(x_1, x_0, 0) = x_0\) and \(I(x_1, x_0, 1) = x_1\). This implies that \(u_t(x_t|z) = u_t(x_t|x_0, x_1) = \frac{\partial}{\partial t} I(x_1, x_0, t)\) is the unique conditional vector field associated with this path. Hence, the practical objective used throughout the FM literature is given by: \[ \min_{\theta} ~~\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, x_0 \sim p_0, x_1 \sim p_1 } \left[ \left\| v_\theta(I(x_1, x_0, t), t) - \frac{\partial}{\partial t} I(x_1, x_0, t) \right\|^2 \right]. \tag{6} \]
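To make Eq. (6) concrete, here is a minimal PyTorch sketch of the conditional flow matching loss under two illustrative assumptions: the coupling \(q(z)\) is the independent product \(p_0 \times p_1\), and the interpolant is linear, \(I(x_1, x_0, t) = (1-t)\, x_0 + t\, x_1\), so that \(\frac{\partial}{\partial t} I(x_1, x_0, t) = x_1 - x_0\). The network `v_model` is a placeholder for any velocity-field model.

```python
import torch

def cfm_loss(v_model, x0, x1):
    # x0 ~ p_0 (target) and x1 ~ p_1 (base), both of shape (B, D), drawn independently.
    B = x0.shape[0]
    t = torch.rand(B, 1)
    # Linear interpolant (an assumption): I(x1, x0, t) = (1 - t) * x0 + t * x1,
    # so I(., ., 0) = x0, I(., ., 1) = x1, and d/dt I = x1 - x0.
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    # Conditional flow matching objective, Eq. (6).
    return ((v_model(x_t, t) - target) ** 2).sum(dim=-1).mean()
```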

The optimal solution of Eq. (6) is given by: \[ v_\theta^*(x, t) = \mathbb{E}_{x_0 \sim p_0, x_1 \sim p_1} \left[ \frac{\partial}{\partial t} I(x_1, x_0, t) ~|~ x = I(x_1, x_0, t) \right]. \tag{7} \]

Similar to the case of diffusion models, the optimal vector field at a point \(x\) and time \(t\) is the average of the conditional vector field (the derivative of the interpolant) over all choices of endpoints \(x_0\) and \(x_1\) for which the interpolant at time \(t\) takes the value \(x\). The optimal objective value is obtained by substituting this optimal vector field into Eq. (6), and in general it is greater than zero.
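As in the diffusion case, the residual value at the optimum can be written down explicitly: substituting Eq. (7) into Eq. (6) and applying the same bias–variance decomposition gives \[ \mathcal{L}_{\text{CFM}}(\theta^*) = \mathbb{E}_{t, x_t} \left[ \mathrm{tr}\, \mathrm{Cov}\left( \frac{\partial}{\partial t} I(x_1, x_0, t) ~|~ x_t = I(x_1, x_0, t) \right) \right], \] which vanishes only in degenerate cases where the pair \((x_t, t)\) already determines the conditional velocity (e.g., when \(p_0\) and \(p_1\) are both point masses).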


References

[Eyring et al., 2024] Eyring, L., Klein, D., Uscidda, T., Palla, G., Kilbertus, N., Akata, Z., & Theis, F. (2024). Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation. arXiv preprint arXiv:2311.15100.

[HuggingFace Blogs 2022] The Annotated Diffusion Model. https://huggingface.co/blog/annotated-diffusion

[Ho et al., 2020] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33, 6840-6851.

[Lipman et al., 2023] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[Chen et al., 2018] Chen, R. T., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in neural information processing systems, 31.

[Albergo et al., 2023] Albergo, M. S., & Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571.

[Liu et al., 2023] Liu, X., Gong, C., & Liu, Q. (2023). Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.