Recently, a series of concurrent works, Flow Matching (Lipman et al., 2023), Rectified Flow (Liu et al., 2023), and Stochastic Interpolants (Albergo et al., 2023), proposed a new class of generative models based on Continuous Normalizing Flows (CNFs), which we will refer to as Flow Matching (FM) based models. In principle, FM is a more flexible alternative to the current state-of-the-art Diffusion Models (DMs), and can be viewed as a generalization of DMs in two important ways:

  • Transport between arbitrary distributions:
    DMs require the base probability distribution to be Gaussian. FM allows the base distribution to be arbitrary (e.g., an image distribution), which enables applications such as image-to-image translation.
  • Arbitrary probability paths:
    The path followed by a DM is just one of the infinitely many probability paths attainable with FM models. This is important because it allows for flexible, application-specific designs of probability paths, such as the optimal transport path.

Note: The blog assumes some familiarity with the problem of generative modeling, probability theory, basic differential equations, and diffusion models.

Normalizing Flow

FM can be seen as a subclass of general flow-based generative models. A flow model aims to transport a base distribution $\rho_0$ to a target distribution $\rho_1$, both defined over $\mathbb{R}^d$, using a transport map $\Psi: \mathbb{R}^d \to \mathbb{R}^d$, i.e., if $x_0 \sim \rho_0$, then $\Psi(x_0) \sim \rho_1$. A popular framework, called Normalizing Flow, learns such a transport map $\Psi$ with a maximum likelihood objective to model a data distribution $\rho_1$, fixing the base distribution $\rho_0$ to be a simple distribution, e.g., a Gaussian, that is amenable to easy sampling and density evaluation. The objective of normalizing flow follows from the change of variables formula for probability density functions:

$$\max_{\Psi}\ \mathbb{E}_{x \sim \rho_1} \log p_{\Psi}(x), \quad \text{where} \quad \log p_{\Psi}(x) := \log \rho_0\left(\Psi^{-1}(x)\right) + \log \left| \det J_{\Psi^{-1}}(x) \right|.$$

As one can see, the above objective involves computing the determinant of the Jacobian of the inverse transport map $\Psi^{-1}$. For general functions $\Psi$, this is a computationally prohibitive operation, especially for high-dimensional data. To avoid this expensive computation, $\Psi$ is parameterized as a sequence of simple invertible transformations whose Jacobian determinants are easy to compute. This restriction limits the expressiveness of normalizing flow models, and consequently their performance compared to other generative models such as GANs.
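
To make the change-of-variables objective concrete, here is a minimal sketch in Python, assuming an elementwise affine map $\Psi(z) = a \odot z + b$ (a toy flow whose inverse and Jacobian determinant are trivial to compute); all names here are illustrative, not from the cited papers.

```python
# A minimal sketch of the change-of-variables log-likelihood, assuming an
# elementwise affine map Psi(z) = a * z + b with a standard Gaussian base.
import numpy as np

d = 2
a = np.array([2.0, 0.5])   # per-dimension scale (nonzero, so Psi is invertible)
b = np.array([1.0, -1.0])  # per-dimension shift

def log_rho0(z):
    # log density of the standard Gaussian base distribution rho_0
    return -0.5 * np.sum(z**2) - 0.5 * d * np.log(2 * np.pi)

def log_likelihood(x):
    # log p_Psi(x) = log rho_0(Psi^{-1}(x)) + log |det J_{Psi^{-1}}(x)|
    z = (x - b) / a                       # Psi^{-1}(x)
    log_det = -np.sum(np.log(np.abs(a)))  # Jacobian of Psi^{-1} is diag(1/a)
    return log_rho0(z) + log_det

x = np.array([1.5, -0.5])
print(log_likelihood(x))
```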

Continuous Normalizing Flow (CNF)

CNF uses a continuous-time perspective of the aforementioned transport process. Consider a continuous, time-dependent map $\Psi_t$ for $t \in [0,1]$ such that $[\Psi_1]_{\#}\rho_0 = \rho_1$ and $[\Psi_0]_{\#}\rho_0 = \rho_0$, a time-dependent velocity field $v_t: \mathbb{R}^d \to \mathbb{R}^d$, and a corresponding time-dependent probability path $p_t: \mathbb{R}^d \to \mathbb{R}_{>0}$. The vector field $v_t$ is related to the transport map $\Psi_t$ via the following ordinary differential equation (ODE):

$$\frac{d}{dt}\Psi_t(x) = v_t(\Psi_t(x)),$$

and the time-dependent probability path $p_t$ is related to the end distributions via the pushforward

$$p_t = [\Psi_t]_{\#}\rho_0.$$
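
To make this concrete, below is a minimal sketch of pushing base samples through a CNF by Euler-integrating the ODE; the velocity field `v` is a hypothetical placeholder standing in for a trained network.

```python
# Transporting base samples through a CNF via Euler discretization of
# d/dt Psi_t(x) = v_t(Psi_t(x)) from t=0 to t=1.
import numpy as np

def v(t, x):
    # placeholder velocity field (in practice, a learned neural network)
    return np.tanh(x) * (1 - t)

def transport(x0, n_steps=100):
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(i * dt, x)  # one Euler step along the velocity field
    return x  # approximately distributed as rho_1 if v_t is accurate

x0 = np.random.randn(1000, 2)  # samples from rho_0
x1 = transport(x0)             # pushed-forward samples
```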

Figure 1: Illustration of the CNF idea.

Mass Continuity Equation

A velocity field $v_t$ generates the probability path $p_t$ if and only if the pair satisfies the mass continuity equation:

$$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v_t) = 0.$$

This equation follows from Gauss's divergence theorem by enforcing conservation of probability mass. One can use it to verify whether a vector field $v_t$ generates a given probability path $p_t$, or even to find such a vector field (which is what we will do). A quick numerical check of this identity is sketched below.
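
The following sketch numerically verifies the continuity equation for a simple 1D case chosen here for illustration: the translating Gaussian path $p_t = \mathcal{N}(tm, 1)$, which is generated by the constant velocity field $v_t(x) = m$.

```python
# Numerical check of dp/dt + div(p * v) = 0 for a translating 1D Gaussian.
import numpy as np

m = 2.0  # translation speed
p = lambda t, x: np.exp(-0.5 * (x - t * m)**2) / np.sqrt(2 * np.pi)
v = lambda t, x: m * np.ones_like(x)

t, x, eps = 0.3, np.linspace(-4.0, 6.0, 1001), 1e-5
dp_dt = (p(t + eps, x) - p(t - eps, x)) / (2 * eps)  # time derivative of p_t
flux = lambda xx: p(t, xx) * v(t, xx)
div = (flux(x + eps) - flux(x - eps)) / (2 * eps)    # divergence of p_t * v_t
print(np.max(np.abs(dp_dt + div)))  # ~0, up to finite-difference error
```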

Flow Matching (FM)

Given a probability density path $p_t$ with $p_0 = \rho_0$, $p_1 = \rho_1$, and a corresponding vector field $u_t$, the flow matching objective minimizes the following loss:

$$\mathcal{L}_{FM}(v_t) = \mathbb{E}_{t \sim U([0,1]),\, x \sim p_t} \left\| v_t(x) - u_t(x) \right\|^2.$$

$\mathcal{L}_{FM}$ is simple but intractable in practice, as it requires access to two quantities we have no prior knowledge of: (i) samples from $p_t$ for all $t$, and (ii) the vector field $u_t$. In the following sections, we discuss how to solve problems (i) and (ii) to make the flow matching objective practical.

Stochastic Interpolant

In (Albergo et al., 2023), the authors define a time-differentiable interpolant function $I_t: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$, $t \in [0,1]$, such that

$$I_{t=0}(x_0, x_1) = x_0, \quad I_{t=1}(x_0, x_1) = x_1.$$

A typical example of $I_t$ is the linear interpolant: $I_t(x_0, x_1) = (1-t)x_0 + t x_1$. Next, a joint distribution (coupling) $\rho(x_0, x_1)$ is chosen such that

$$\int \rho(x_0, x_1)\, dx_0 = \rho_1(x_1), \quad \text{and} \quad \int \rho(x_0, x_1)\, dx_1 = \rho_0(x_0).$$

The independent coupling $\rho(x_0, x_1) = \rho_0(x_0)\,\rho_1(x_1)$ is one particular choice that satisfies the above conditions. The final tractable flow matching objective takes the following form:

$$\mathcal{L} = \mathbb{E}_{t,\, (x_0,x_1) \sim \rho(x_0,x_1),\, x = I_t(x_0,x_1)} \left\| v_t(x) - \partial_t I_t(x_0, x_1) \right\|^2$$

For the linear interpolant $I_t(x_0,x_1) = (1-t)x_0 + t x_1$ and the independent coupling, the above objective becomes:

$$\mathcal{L} = \mathbb{E}_{t,\, x_0 \sim \rho_0,\, x_1 \sim \rho_1,\, x = (1-t)x_0 + t x_1} \left\| v_t(x) - (x_1 - x_0) \right\|^2$$

The above objective is very simple to implement: one just needs to randomly sample $x_0$ and $x_1$ from the two distributions and a time stamp $t$, then regress the vector field at $I_t(x_0, x_1)$ onto the vector pointing from $x_0$ to $x_1$ (see the training sketch below). At first glance, it almost seems too simple and naive: how can matching the vector field with random directions recover the desired vector field $u_t$? The short answer is that, in expectation, these random directions average out to the desired vector field $u_t$. To understand why this is the case, the following section will guide you through the detailed proof.
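
Here is a minimal training sketch of this objective, assuming PyTorch, the linear interpolant, and the independent coupling; `velocity_net` and `sample_rho1` are hypothetical stand-ins for the model and the data source.

```python
# Minimal flow matching training loop (linear interpolant, independent coupling).
import torch
import torch.nn as nn

d = 2
velocity_net = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def sample_rho1(n):
    # stand-in for drawing a batch from the target data distribution rho_1
    return torch.randn(n, d) + 3.0

for step in range(1000):
    x0 = torch.randn(256, d)        # x0 ~ rho_0 (Gaussian base)
    x1 = sample_rho1(256)           # x1 ~ rho_1 (data)
    t = torch.rand(256, 1)          # t ~ U([0, 1])
    xt = (1 - t) * x0 + t * x1      # x = I_t(x0, x1), a sample from p_t
    target = x1 - x0                # d/dt I_t(x0, x1) for the linear interpolant
    pred = velocity_net(torch.cat([xt, t], dim=1))
    loss = ((pred - target)**2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```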

Proof:

In the previous section, we noted that the flow matching objective $\mathcal{L}_{FM}$ is intractable for two reasons: (i) sampling from $p_t$ and (ii) estimating $u_t$. In this section, we show how addressing these two issues yields the tractable and simple flow matching objective $\mathcal{L}$.

(i) Sampling from $p_t$

Given samples $(x_0, x_1) \sim \rho(x_0, x_1)$, $x_t = I_t(x_0, x_1)$ defines a stochastic process (hence the name stochastic interpolant). The probability path $p_t$ induced by $x_t$ is a valid time-dependent probability path for constructing a CNF, because when $(x_0, x_1) \sim \rho(x_0, x_1)$,

$$I_0(x_0, x_1) = x_0 \sim p_0 = \rho_0, \quad I_1(x_0, x_1) = x_1 \sim p_1 = \rho_1.$$

Therefore, the procedure for sampling from $p_t$ is straightforward (see the sketch after the list):
  1. Sample $(x_0, x_1) \sim \rho(x_0, x_1)$.
  2. Compute $x_t = I_t(x_0, x_1)$; $x_t$ is our required sample.
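
The same two steps in code, as a tiny sketch (assuming the linear interpolant and illustrative Gaussian choices for $\rho_0$ and $\rho_1$ under the independent coupling):

```python
# Sampling from p_t in two steps; the distributions here are illustrative.
import numpy as np

x0 = np.random.randn(512, 2)        # step 1: x0 ~ rho_0 ...
x1 = np.random.randn(512, 2) + 3.0  #         ... and x1 ~ rho_1 (independent coupling)
t = 0.7
xt = (1 - t) * x0 + t * x1          # step 2: x_t = I_t(x0, x1) is a sample from p_t
```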

(ii) Estimating $u_t$

Obtaining $u_t$ is a bit long, but not difficult to follow. First, note that we can express $p_t$ using the Dirac delta function as follows:

$$p_t(x) = \int \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0, x_1)\, dx_0\, dx_1,$$

where $\delta_{I_t(x_0,x_1)}$ is the Dirac delta function centered at $I_t(x_0, x_1)$. We know that our desired velocity field $u_t$ corresponding to $p_t$ should satisfy the mass continuity equation:

$$\frac{\partial p_t}{\partial t} = -\nabla \cdot (p_t u_t)$$
$$\Rightarrow \frac{\partial}{\partial t} \int \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0,x_1)\, dx_0\, dx_1 = -\nabla \cdot (p_t(x)\, u_t(x))$$
$$\Rightarrow \int \left( \frac{\partial}{\partial t} \delta_{I_t(x_0,x_1)}(x) \right) \rho(x_0,x_1)\, dx_0\, dx_1 = -\nabla \cdot (p_t(x)\, u_t(x)) \tag{1}$$

One important fact that will help us obtain $u_t$ is that $\delta_{I_t(x_0,x_1)}$ itself defines a time-dependent probability path between $\delta_{x_0}$ and $\delta_{x_1}$. Further, $I_t(x_0, x_1)$ is the corresponding flow that achieves this continuous transport, i.e.,

$$[I_t(x_0,x_1)]_{\#}\, \delta_{x_0} = \delta_{I_t(x_0,x_1)}.$$

Let us define the conditional vector field $u_t(x | x_0, x_1)$ as

$$u_t(x | x_0, x_1) = \begin{cases} \partial_t I_t(x_0, x_1) & \text{if } x = I_t(x_0, x_1) \\ 0 & \text{otherwise} \end{cases}$$

By the definition of the velocity field of a flow, $u_t(x | x_0, x_1)$ is the velocity field that induces the probability path $\delta_{I_t(x_0,x_1)}$. Hence, the pair $(\delta_{I_t(x_0,x_1)}, u_t(x | x_0, x_1))$ satisfies the continuity equation. Therefore,

$$\frac{\partial}{\partial t} \delta_{I_t(x_0,x_1)}(x) = -\nabla \cdot \left( \delta_{I_t(x_0,x_1)}(x)\, u_t(x | x_0, x_1) \right)$$

Using the above in Eq. (1), we get

$$-\int \nabla \cdot \left( \delta_{I_t(x_0,x_1)}(x)\, u_t(x | x_0, x_1) \right) \rho(x_0,x_1)\, dx_0\, dx_1 = -\nabla \cdot (p_t(x)\, u_t(x))$$
$$\Rightarrow \nabla \cdot \left( \int u_t(x | x_0, x_1)\, \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0,x_1)\, dx_0\, dx_1 \right) = \nabla \cdot (p_t(x)\, u_t(x))$$
$$\Rightarrow \nabla \cdot \left( p_t(x)\, \frac{\int u_t(x | x_0, x_1)\, \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0,x_1)\, dx_0\, dx_1}{p_t(x)} \right) = \nabla \cdot (p_t(x)\, u_t(x))$$

The above equation implies that

$$u_t(x) = \frac{\int u_t(x | x_0, x_1)\, \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0,x_1)\, dx_0\, dx_1}{p_t(x)}$$

is a valid velocity field that satisfies the mass continuity equation with respect to $p_t$. Now, let's use this expression for $u_t(x)$ in $\mathcal{L}_{FM}$ to obtain a practical objective for flow matching:

$$\arg\min_{v_t}\ \mathcal{L}_{FM}(v_t) = \arg\min_{v_t}\ \mathbb{E}_{t,\, x \sim p_t} \| v_t(x) - u_t(x) \|^2$$
$$= \arg\min_{v_t}\ \mathbb{E}_{t, p_t} \| v_t(x) \|^2 - 2\, \mathbb{E}_{t, p_t} \langle v_t(x), u_t(x) \rangle + \mathbb{E}_{t, p_t} \| u_t(x) \|^2$$
$$= \arg\min_{v_t}\ \mathbb{E}_{t, p_t} \| v_t(x) \|^2 - 2\, \mathbb{E}_{t, p_t} \langle v_t(x), u_t(x) \rangle \tag{2}$$

where the last step drops $\mathbb{E}_{t, p_t} \| u_t(x) \|^2$ because it does not depend on $v_t$. Consider the first term. We can express it as follows:

$$\mathbb{E}_{t, p_t} \| v_t(x) \|^2 = \mathbb{E}_t \int \| v_t(x) \|^2\, p_t(x)\, dx$$
$$= \mathbb{E}_t \int \| v_t(x) \|^2 \int \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0,x_1)\, dx_0\, dx_1\, dx$$
$$= \mathbb{E}_t \int \int \| v_t(x) \|^2\, \delta_{I_t(x_0,x_1)}(x)\, dx\ \rho(x_0,x_1)\, dx_0\, dx_1$$
$$= \mathbb{E}_{t,\, (x_0,x_1) \sim \rho(x_0,x_1),\, x = I_t(x_0,x_1)} \| v_t(x) \|^2 \tag{3}$$

Now, consider the second term:

$$\mathbb{E}_{t, p_t} \langle v_t(x), u_t(x) \rangle = \mathbb{E}_{t, p_t} \left\langle v_t(x), \frac{\int u_t(x | x_0, x_1)\, \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0,x_1)\, dx_0\, dx_1}{p_t(x)} \right\rangle$$
$$= \mathbb{E}_t \int \left\langle v_t(x), \frac{\int u_t(x | x_0, x_1)\, \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0,x_1)\, dx_0\, dx_1}{p_t(x)} \right\rangle p_t(x)\, dx$$
$$= \mathbb{E}_t \int \left\langle v_t(x), \int u_t(x | x_0, x_1)\, \delta_{I_t(x_0,x_1)}(x)\, \rho(x_0,x_1)\, dx_0\, dx_1 \right\rangle dx$$
$$= \mathbb{E}_t \int \int \langle v_t(x), u_t(x | x_0, x_1) \rangle\, \delta_{I_t(x_0,x_1)}(x)\, dx\ \rho(x_0,x_1)\, dx_0\, dx_1$$
$$= \mathbb{E}_{t,\, (x_0,x_1) \sim \rho(x_0,x_1),\, x = I_t(x_0,x_1)} \langle v_t(x), u_t(x | x_0, x_1) \rangle$$
$$= \mathbb{E}_{t,\, (x_0,x_1) \sim \rho(x_0,x_1),\, x = I_t(x_0,x_1)} \langle v_t(x), \partial_t I_t(x_0, x_1) \rangle \tag{4}$$

Using Eqs. (3) and (4) in (2), we get

$$\arg\min_{v_t}\ \mathcal{L}_{FM}(v_t) = \arg\min_{v_t}\ \mathbb{E}_{t,\, (x_0,x_1) \sim \rho(x_0,x_1),\, x = I_t(x_0,x_1)} \left\| v_t(x) - \partial_t I_t(x_0, x_1) \right\|^2$$

which is exactly the tractable objective $\mathcal{L}$.
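
Note that the expression for $u_t(x)$ above is precisely the conditional expectation $\mathbb{E}[\partial_t I_t(x_0, x_1) \mid I_t(x_0, x_1) = x]$. As an optional numerical check, the sketch below estimates this conditional expectation by Monte Carlo and compares it against the closed form available for 1D Gaussians; the distributions, time, and query point are illustrative choices, not from the original papers.

```python
# Monte Carlo check that u_t(x) = E[x1 - x0 | I_t(x0, x1) = x] for the
# linear interpolant with rho_0 = N(0,1), rho_1 = N(3,1), independent coupling.
import numpy as np

rng = np.random.default_rng(0)
t, x_query = 0.4, 1.0
x0 = rng.normal(0.0, 1.0, 2_000_000)    # x0 ~ rho_0
x1 = rng.normal(3.0, 1.0, 2_000_000)    # x1 ~ rho_1
xt = (1 - t) * x0 + t * x1              # x_t = I_t(x0, x1)

mask = np.abs(xt - x_query) < 0.01      # condition on x_t ~ x_query
mc_estimate = np.mean((x1 - x0)[mask])  # E[x1 - x0 | x_t ~ x_query]

# closed form for this jointly Gaussian case: Var(x_t) = (1-t)^2 + t^2
var = (1 - t)**2 + t**2
closed_form = 3.0 + ((2*t - 1) / var) * (x_query - 3.0*t)
print(mc_estimate, closed_form)         # the two should closely agree
```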

Note on Diffusion Models

A diffusion model (stochastic, or its deterministic probability-flow ODE) can be viewed as a special case of flow matching. Consider the special instance of the interpolant function $I_t(x_0, x_1) = \alpha_t x_0 + \sigma_t x_1$, and let $\rho_0 = \mathcal{N}(0, I)$. Then the variance preserving (VP) SDE follows the probability path determined by $\sigma_t = \sqrt{1 - \alpha_t^2}$, where $\alpha_t = \exp\left(-\frac{1}{2}\int_0^t \beta(s)\, ds\right)$ and $\beta$ is the noise schedule function. A small sketch of these coefficients is given below.
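
For concreteness, here is a sketch of the VP interpolant coefficients, assuming the commonly used linear noise schedule $\beta(s) = \beta_{\min} + s(\beta_{\max} - \beta_{\min})$; the schedule constants are illustrative.

```python
# VP interpolant coefficients for a linear noise schedule (illustrative values).
import numpy as np

beta_min, beta_max = 0.1, 20.0

def alpha(t):
    # alpha_t = exp(-1/2 * integral_0^t beta(s) ds), closed form for linear beta
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    return np.exp(-0.5 * integral)

def vp_interpolant(x0, x1, t):
    # I_t(x0, x1) = alpha_t * x0 + sigma_t * x1 with sigma_t = sqrt(1 - alpha_t^2)
    a = alpha(t)
    return a * x0 + np.sqrt(1 - a**2) * x1

t = np.linspace(0, 1, 5)
print(alpha(t))  # decays from 1 toward ~0, so I_0 = x0 (noise) and I_1 ~ x1 (data)
```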

Results

Figure 2: Quantitative comparison with score-based models and the diffusion path (Image Source: Lipman et al., 2023).
Figure 3: [Top] Samples generated by evaluating $I_t(x_0, x_1)$. [Bottom] Samples generated by integrating the learnt velocity field (Image Source: Albergo et al., 2023).
Figure 4: [Left] Samples from a score matching diffusion model. [Middle] Flow matching with the diffusion probability path. [Right] Flow matching with the linear interpolant $I_t(x_0, x_1) = (1-t)x_0 + t x_1$ (Image Source: Lipman et al., 2023).
Figure 5: Image-to-image translation results (Image Source: Liu et al., 2023).


References

[Lipman et al., 2023] Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[Chen et al., 2018] Chen, R. T., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in neural information processing systems, 31.

[Albergo et al., 2023] Albergo, M. S., & Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571.

[Liu et al., 2023] Liu, X., Gong, C., & Liu, Q. (2023). Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.