What is diffusion and what has it to do with generative AI?

In its most general sense, diffusion captures the notion of entities moving from locations of higher concentration to locations of lower concentration, thereby reducing the concentration at the original locations. In other words, it embodies the notion of entropy and is a direct consequence of the second law of thermodynamics. It is also a stochastic process: diffusion is defined as a time-dependent random process that causes a spread in space.

Given this wide-ranging definition, it is unsurprising that diffusion is ubiquitous and can be used to model a variety of phenomena in mathematics, the physical sciences, biology, chemistry, economics, finance, and beyond. It is perhaps more surprising, however, that it can also be used for generative modeling in the field of Artificial Intelligence. But what exactly is generative modeling in the first place?

Given a training data set x, generative models aim to estimate an unknown data distribution

(1) p(x)

in the form of a model distribution

(2) p_θ(x)

parameterized by θ, and then to sample from this generative model (2) to generate new data, as if it were sampled from the original, unknown distribution (1).

It can be thought of in classical statistical terms as a form of maximum likelihood estimation, where the learnable parameters θ of the generative model are chosen such that the original and the model distribution resemble each other as much as possible. When dealing with high-dimensional data such as image or video data, however, this approach quickly becomes computationally intractable.
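As a toy illustration (my own sketch, not part of the referenced material), here is maximum likelihood estimation for a one-dimensional Gaussian model, where θ = (μ, σ) is chosen so that the model distribution matches the observed samples as closely as possible:

```python
# Toy maximum likelihood estimation: fit a 1-D Gaussian model p_theta(x)
# to samples from an "unknown" data distribution. (Illustrative sketch only.)
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=10_000)  # samples from the unknown p(x)

# For a Gaussian, the maximum likelihood estimates are available in closed form:
mu_hat = data.mean()
sigma_hat = data.std()
print(f"theta = (mu={mu_hat:.3f}, sigma={sigma_hat:.3f})")

# New data "as if" drawn from the original distribution:
new_samples = rng.normal(mu_hat, sigma_hat, size=5)
```

For high-dimensional image data there is, of course, no such simple closed-form likelihood to optimize, which is exactly the intractability referred to above.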

Deep generative models, i.e., generative models built from multi-layered neural networks, address this problem by modeling the density function explicitly, either in an approximate, constrained form that keeps it tractable, e.g., Variational Autoencoders (VAEs) and diffusion models, or in a fully tractable form, e.g., autoregressive and normalizing flow models. Alternatively, Generative Adversarial Networks (GANs) model the density function implicitly via a data-generating stochastic process (Foster, 2023).

Diffusion models, as a form of explicit density modeling, have revolutionized and now dominate Computer Graphics and Computer Vision in particular. They represent an instance of the Flow Matching (Lipman et al., 2023) family of generative models. Flow Matching, or, more generally, Generator Matching (Holderrieth et al., 2024), can in fact be shown to unify all of the generative model families mentioned above across modalities and is the new state of the art in image, video and audio generation (Po et al., 2023). Diffusion models (Ho et al., 2020; Song et al., 2021) and Flow Matching (Esser et al., 2024) are at the core of image and video generation products such as OpenAI's DALL·E 3, Google's Imagen, Black Forest Labs' Flux, Runway's Gen-3 Alpha and OpenAI's Sora.

This blog post deals primarily with diffusion models. A second blog post will extend the discussion to Flow Matching and, more generally, Generator Matching. Their mathematical derivation is not straightforward. The discussion is targeted at readers who want to understand the primary forces behind these groundbreaking generative achievements and their implementation, without working through the line-by-line mathematical derivations (for which I recommend consulting the papers referenced below). It assumes intermediate knowledge of probability theory and Python.


What does diffusion modeling actually do?

The basic idea of diffusion models is that repeatedly adding noise to a given training data set, e.g., a set of training images, diffuses the data in latent space until it becomes indistinguishable from Gaussian white noise. By leveraging findings from thermodynamics and stochastic processes to learn the reverse, denoising process numerically or in closed form, one obtains a generative model that can be used to generate new samples, e.g., new images, from the unknown, training-data-generating distribution.

Diffusion models can in fact be motivated and derived in a number of different ways. They were originally introduced as discrete-time Markov chains with Gaussian transitions (Sohl-Dickstein et al., 2015). Later, Song and Ermon (2019) motivated them via score matching, a view that was subsequently generalized to continuous-time Stochastic Differential Equations (SDEs). More recently, Flow Matching regards diffusion as modeling a probability path from a source to a target distribution via a forward noising process, where the generator of this process is parameterized with the help of the score function (Lipman et al., 2023; Lipman et al., 2024). Let's have a look at the score-based derivation, called score matching, first.

The idea of score matching is to parameterize the generative model (2) in such a way that the gradient, and only the gradient, of its log-likelihood function resembles the gradient of the log-likelihood of the unknown data distribution (1). That is, θ is chosen such that the gradients of the two log-likelihoods, rather than the likelihood functions themselves, match as closely as possible. This objective limits the information learnt about the underlying, unknown structure of the data-generating density function to its gradient. On the plus side, however, the density function approximation becomes tractable.

The gradient of the generative model’s log-likelihood function with respect to the input data x is represented by the (Stein) score function (3):

(3) s_θ(x) = ∇_x log p_θ(x)

Intuitively, this represents a vector field pointing in the direction of steepest growth of the log-likelihood function. The idea of score matching is to minimize the divergence between this model score function and the score function of the data distribution, e.g., in the form of the Fisher divergence (4):

(4) ½ E_p(x) [ || ∇_x log p(x) − s_θ(x) ||² ]

That is, upon optimization of this objective function, the model learns the probability path of the unknown data-generating distribution. Results by Hyvärinen (2005) and Song and Ermon (2019) allow (4) to be rewritten in such a way that it depends on the generative model (2) only, so that it can be optimized with standard methods such as gradient descent on a neural network.
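To make this concrete, here is a minimal denoising score matching sketch on toy two-dimensional data (my own illustrative code with an assumed small MLP for s_θ; this is not the flower-image implementation discussed later). Perturbing the data with Gaussian noise and regressing the network output onto the score of the perturbation kernel is one standard, tractable rewriting of (4):

```python
# Minimal denoising score matching sketch on toy 2-D data (assumption: a small MLP
# as the score network s_theta).
import numpy as np
import tensorflow as tf

# Toy data distribution: points on a unit circle (a 1-D manifold in 2-D space).
n = 4096
angles = np.random.uniform(0.0, 2.0 * np.pi, size=(n, 1)).astype("float32")
data = np.concatenate([np.cos(angles), np.sin(angles)], axis=1)

# Score network: maps a (noisy) 2-D point to an estimate of grad_x log p(x).
score_net = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
sigma = 0.1  # magnitude of the Gaussian perturbation added to the data

for step in range(1000):
    x = data[np.random.randint(0, n, size=256)]
    noise = np.random.normal(0.0, sigma, size=x.shape).astype("float32")
    x_noisy = x + noise
    # Denoising score matching target: the score of the perturbation kernel,
    # grad log q(x_noisy | x) = -(x_noisy - x) / sigma^2.
    target = -noise / sigma**2
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.reduce_sum((score_net(x_noisy) - target) ** 2, axis=1))
    grads = tape.gradient(loss, score_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, score_net.trainable_variables))
```

Note that the noise perturbation is not just a training trick here; as discussed next, it is what makes the score well defined in the first place.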

For most real-world data sets, however, even very high-dimensional ones, the training data can be observed to sit on a low-dimensional manifold embedded within a higher-dimensional ambient space, the so-called manifold hypothesis (Whiteley et al., 2022). Since the score is a gradient computed in the ambient space, it becomes undefined where the data is confined to such a low-dimensional manifold. Song and Ermon (2019) show that adding small perturbations of various magnitudes to the input data, e.g., in the form of Gaussian noise, resolves this problem and makes the neural network estimate of the gradient converge, i.e., noise diffusion makes the optimization of (4) tractable.

Diffusion and score matching are in fact equivalent, and the same underlying principle is at work in the discrete-time Denoising Diffusion Probabilistic Model put forward by Ho et al. (2020). Widely considered the breakthrough paper of the field, it leveraged the connection between diffusion and score matching to train a diffusion model that was state of the art at the time of publication.


The Denoising Diffusion Probabilistic Model

Denoising diffusion models generate samples from the unknown data distribution (1) by sequentially denoising samples of random noise into samples from a computationally tractable model of (1). The training regime consists of two parts: a forward process, called the encoder, and a reverse (denoising) process, called the decoder. The job of the encoder is to turn the input training data into a representation, the latent variable. This latent variable is meant to capture the essence of the training data; it may be lossy, but it maintains key characteristics of the input data. It is not part of the training data but part of the model, and it serves as the sampling space from which the decoder can then generate new, unobserved output data such as a newly generated image. Now, how does this work in more detail?

Let us assume the image generation scenario and denote (5) x_0 as the original image and (6) x_T as the latent variable, with (7) x_1, …, x_{T−1} representing intermediate latent states. The joint distribution of the sequence x_0, …, x_T is denoted q(x_{1:T} | x_0) for the forward process and p_θ(x_{0:T}) for the reverse process.

Sohl-Dickstein et al. (2015) suggest, firstly, making each transition distribution dependent on its immediate predecessor only, i.e., a “memoryless” Markov chain, thereby breaking the process down into individual steps with simple(r) distributions and making the computations tractable, i.e., (8) and (9):

(8) q(x_{1:T} | x_0) = ∏_{t=1..T} q(x_t | x_{t−1})

(9) p_θ(x_{0:T}) = p(x_T) ∏_{t=1..T} p_θ(x_{t−1} | x_t)

Secondly, they propose to set the latent variable (6) to a zero-mean, unit-variance Gaussian, i.e., x_T ∼ N(0, I), which permits treating each transition distribution as a Gaussian as well. As a result, the computation of the transition distributions and of the (in general non-zero-mean, non-unit-variance) joint distributions during the forward pass depends only on the two moments of the Gaussian distribution and becomes tractable.
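As a minimal sketch (toy NumPy code with an assumed linear β schedule), a single forward transition in the standard DDPM parameterization, q(x_t | x_{t−1}) = N(√(1 − β_t) x_{t−1}, β_t I), simply scales the previous image down slightly and adds a small amount of fresh Gaussian noise:

```python
# One forward diffusion step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
# Assumption: a linear noise schedule; x is any array, e.g., a 64x64x3 image.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # beta_1 ... beta_T (assumed linear schedule)

def forward_step(x_prev, t, rng):
    """Sample x_t given x_{t-1} for step t (1-indexed)."""
    beta_t = betas[t - 1]
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64, 3))  # stand-in for a training image x_0
x = forward_step(x, t=1, rng=rng)     # first noisy latent x_1
```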

A key point of these ideas is to obtain a closed-form representation of the forward process. This is useful because, although the actual image generation runs backwards starting from white noise, i.e., happens during the decoding (denoising) process, learning it is closely coupled with the corresponding, now tractable, step of the forward process.

Note that the forward process does not require any parametric modelling: Once (5) is given in the form of a well-defined training image, the intermediate noisy images (7) can simply be obtained by adding Gaussian noise according to a noise variance schedule (a.k.a. diffusion schedule)

β_1, …, β_T

at each step t, with (10)

x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε,   ε ∼ N(0, I),   where α_t = 1 − β_t and ᾱ_t = α_1 · … · α_t,

representing the noisy image at step t. Note that due to this so-called reparameterization trick, the noisy image (7) can be obtained from (5) directly in a single computational step rather than t computational steps.
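A minimal sketch of this shortcut (again toy NumPy code with an assumed linear β schedule; the referenced implementations use a cosine-type schedule, see below):

```python
# Reparameterization trick: jump from x_0 to x_t in a single step,
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear variance schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # alpha_bar_t = prod_{s<=t} alpha_s

def noisy_image(x0, t, rng):
    """Sample x_t directly from x_0 for step t (1-indexed)."""
    a_bar = alpha_bars[t - 1]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64, 3))        # stand-in for a training image
x_t, eps = noisy_image(x0, t=500, rng=rng)   # noisy image halfway through the schedule
```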

The denoiser is then conditioned on denoising this specific noisy latent image at step t, and this noisy image only. Here is where the interesting part comes in: with training pairs available from the training image set and the forward diffusion process, respectively, the decoder can be trained using supervised learning (with the same neural network shared across all steps). More specifically, the denoiser (11) x̂_θ(x_t, t) is trained on pairs of source and noisy images, with the noisy image being well defined by the forward process, by backpropagating the gradient of the loss between the image predicted by the denoiser at step t and the true image at step t−1, which is known from the non-parametric forward process. Once the denoiser (11) has been trained, it can be applied sequentially T times to a white noise input image (6) to obtain a newly generated image (12) x_0. These forward and reverse diffusion processes are illustrated below for the example of a source image of a flower (left image).

Source: András Béres, Denoising Diffusion Implicit Models blog post, June 2022


In summary, for every training image (5) and a single denoising neural network (11), the training algorithm for a denoising diffusion probabilistic model reads as follows (Chan, 2024)¹; a code sketch of this loop follows the list:

1. Select a random timestamp t,

2. Draw a noisy sample

   x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε,   ε ∼ N(0, I),

3. Take a gradient descent step on the squared error between the denoised latent representation and the true image across all of the M training samples,

   ∇_θ Σ_{m=1..M} || x̂_θ(x_t^(m), t) − x_0^(m) ||²,

4. Repeat from 1. until convergence.
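A compact sketch of this training loop (my own toy version on flattened random data, with a small Keras MLP standing in for the denoising network (11); the referenced implementations use a U-Net on 64x64 images):

```python
# Sketch of DDPM-style training with an image-prediction objective:
# minimize || x_hat_theta(x_t, t) - x_0 ||^2 over random timestamps t.
import numpy as np
import tensorflow as tf

T = 1000
betas = np.linspace(1e-4, 0.02, T).astype("float32")   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

D = 64 * 64 * 3                                  # flattened image dimension
data = np.random.rand(512, D).astype("float32")  # stand-in for training images x_0

# Tiny MLP denoiser taking (x_t, t/T) and predicting x_0 (a U-Net in practice).
denoiser = tf.keras.Sequential([
    tf.keras.Input(shape=(D + 1,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(D),
])
optimizer = tf.keras.optimizers.Adam(1e-4)

for step in range(200):
    x0 = data[np.random.randint(0, len(data), size=64)]
    t = np.random.randint(1, T + 1, size=(64, 1))            # 1. random timestamp
    a_bar = alpha_bars[t - 1]
    eps = np.random.normal(size=x0.shape).astype("float32")
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps   # 2. noisy sample via (10)
    inp = np.concatenate([x_t, t / T], axis=1).astype("float32")
    with tf.GradientTape() as tape:
        x0_pred = denoiser(inp)                               # denoised prediction
        loss = tf.reduce_mean(
            tf.reduce_sum((x0_pred - x0) ** 2, axis=1))       # 3. squared error
    grads = tape.gradient(loss, denoiser.trainable_variables)
    optimizer.apply_gradients(zip(grads, denoiser.trainable_variables))  # 4. repeat
```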

This trained denoiser (11) can then be used for inference, i.e., to generate images from a white noise vector by running it T times, from the white noise image (6) all the way to a newly generated image (12). That is, given a white noise image

x_T ∼ N(0, I),

repeat for timestamps t = T, T−1, …, 1:

x_{t−1} = [√α_t (1 − ᾱ_{t−1}) / (1 − ᾱ_t)] · x_t + [√ᾱ_{t−1} (1 − α_t) / (1 − ᾱ_t)] · x̂_θ(x_t, t) + σ_t ε,   ε ∼ N(0, I),

where σ_t is the standard deviation, or “radius”, of the Gaussian noise re-injected at step t. The overall noise level is large, i.e., x_t resembles white noise, for t = T, and it approaches zero for t = 1. At the end of this sequential process, a newly generated image in the form of (12) is obtained.
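A corresponding sketch of the sampling loop (toy NumPy code; `denoiser` stands for the trained network (11), and the posterior-mean update shown is the standard DDPM form, which may differ in detail from the referenced implementations):

```python
# Sketch of the DDPM sampling loop with an image-predicting denoiser x_hat_theta.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(denoiser, shape, rng):
    """Run the reverse process from white noise x_T down to a generated image x_0."""
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_t, a_bar_t = alphas[t - 1], alpha_bars[t - 1]
        a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
        x0_pred = denoiser(x, t)                          # predicted clean image
        # Posterior mean of q(x_{t-1} | x_t, x_0), with x_0 replaced by the prediction:
        mean = (np.sqrt(a_t) * (1 - a_bar_prev) / (1 - a_bar_t)) * x \
             + (np.sqrt(a_bar_prev) * (1 - a_t) / (1 - a_bar_t)) * x0_pred
        sigma_t = np.sqrt(betas[t - 1] * (1 - a_bar_prev) / (1 - a_bar_t))
        noise = rng.standard_normal(shape) if t > 1 else 0.0   # no noise on the last step
        x = mean + sigma_t * noise
    return x

# Example call with a dummy "denoiser" (identity), just to show the interface:
# x_new = sample(lambda x, t: x, shape=(64 * 64 * 3,), rng=np.random.default_rng(0))
```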

Note that the Gaussian noise term ε makes the derivations straightforward, but it also makes sampling slow: it can take a large number of iterations for the reverse diffusion process to converge. Starting with Denoising Diffusion Implicit Models (Song et al., 2021), there have been numerous improvements to the efficiency of the original algorithm, including the efficiency of the forward process. I suggest reading Po et al. (2023) for a discussion of these improvements.


Diffusion model implementation

Excellent Python implementations of the diffusion model processes have been made available by András Béres and David Foster, respectively.² I recommend using Foster’s Docker image to work through the implementation details and to experiment with diffusion models. If you are planning to run the diffusion model implementation on your local machine, it should be equipped with a dedicated GPU, since training times can otherwise run into several hours.

The training dataset is the “Oxford Flowers 102” dataset provided by the Visual Geometry Group of Oxford University. It consists of images of 102 flower categories commonly found across the United Kingdom, with 40 to 258 images per category. It is conveniently available as a TensorFlow dataset pre-split into training, validation and test sets. Some training samples scaled to 64x64 pixels are shown below.
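Loading the dataset and its pre-defined splits via TensorFlow Datasets looks roughly as follows (a sketch; the exact preprocessing, e.g., center cropping, may differ in the referenced implementations):

```python
# Load Oxford Flowers 102 via TensorFlow Datasets and resize the images to 64x64 pixels.
import tensorflow as tf
import tensorflow_datasets as tfds

def preprocess(example):
    image = tf.image.resize(example["image"], (64, 64))
    return tf.cast(image, tf.float32) / 255.0  # scale pixel values to [0, 1]

train_ds = (
    tfds.load("oxford_flowers102", split="train", shuffle_files=True)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
```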


Although decent-looking results are obtained with fewer epochs, for good results I recommend running the diffusion model implementations by András Béres/David Foster for a minimum of 50 epochs. All other hyperparameter settings can be left unchanged.

In my experiments, the more steadily increasing offset cosine diffusion schedule yielded the most visually appealing results.
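For reference, an offset cosine schedule of the kind used in these implementations has roughly the following form (adapted from the Keras/Foster code; the signal-rate bounds shown are commonly used values and should be checked against the actual implementation):

```python
# Offset cosine diffusion schedule: maps diffusion times in [0, 1] to signal and
# noise rates, with start/end angles offset so the signal rate never reaches 0 or 1.
import tensorflow as tf

MIN_SIGNAL_RATE = 0.02
MAX_SIGNAL_RATE = 0.95

def offset_cosine_diffusion_schedule(diffusion_times):
    start_angle = tf.acos(MAX_SIGNAL_RATE)
    end_angle = tf.acos(MIN_SIGNAL_RATE)
    diffusion_angles = start_angle + diffusion_times * (end_angle - start_angle)
    signal_rates = tf.cos(diffusion_angles)  # multiplies the clean image
    noise_rates = tf.sin(diffusion_angles)   # multiplies the added Gaussian noise
    return noise_rates, signal_rates

# Example: noise_rates, signal_rates = offset_cosine_diffusion_schedule(tf.linspace(0.0, 1.0, 5))
```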

   


The implementations support TensorBoard for checking training progress and convergence and for plotting intermediate training results, examples of which are shown below for epochs 1, 5, 15, 25 and 40 of a training run.

      


The trained model can then be used to generate entirely new samples of synthetic flower imagery:


This completes the discussion of the diffusion model concept and its implementation. Diffusion modeling, like the other prominent multimodal generative AI approaches, can be shown to be a special case of Flow/Generator Matching, already touched upon above, which I will discuss in more detail in a second and final blog post.


Footnotes

1 Ho et al. (2020) actually formulate the objective function in terms of noise prediction. The formulations are equivalent and, for the sake of intuition, I stick with the objective function in terms of image prediction.

2 András Béres provides an implementation of the Denoising Diffusion Implicit Model (DDIM) (Song et al., 2021) rather than the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020). The difference, however, “only” amounts to deterministic (DDIM) vs. stochastic (DDPM) sampling from the trained diffusion model.


References

S. H. Chan, Tutorial on Diffusion Models for Imaging and Vision, Sept. 2024
https://arxiv.org/abs/2403.18103

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, R. Rombach, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, Proc. ICML, 2024
https://export.arxiv.org/pdf/2403.03206

D. Foster, Generative Deep Learning, 2nd ed., O’Reilly, Sebastopol, USA, 2023

J. Ho, A. Jain, and P. Abbeel, Denoising diffusion probabilistic models, Advances in Neural Information Processing Systems (NeurIPS), 2020
https://arxiv.org/abs/2006.11239

P. Holderrieth, M. Havasi, J. Yim, N. Shaul, I. Gat, T. Jaakkola, B. Karrer, R.T.Q. Chen, Y. Lipman, Generator Matching: Generative Modeling with Arbitrary Markov Processes, Computing Research Repository, 2024
https://arxiv.org/pdf/2410.20587

A. Hyvärinen, Estimation of non-normalized statistical models by score matching, Journal of Machine Learning Research (JMLR), Vol. 6, Issue 24, pp. 695–709, 2005
https://jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf

Y. Lipman, R.T.Q. Chen, H. Ben-Hamu, M. Nickel and M. Lee, Flow Matching for Generative Modeling, Proc. ICLR, 2023
https://arxiv.org/pdf/2210.02747

Y. Lipman, M. Havasi, P. Holderrieth, N. Shaul, M. Le, B. Karrer, R.T.Q. Chen, D. Lopez-Paz, H. Ben-Hamu, I. Gat, Flow Matching Guide and Code, Dec. 2024
https://arxiv.org/pdf/2412.06264

R. Po, W. Yifan, V. Golyanik, K. Aberman, J.T. Barron, A. Bermano, E. Chan, T. Dekel, A. Holynski, A. Kanazawa, C.K. Liu, L. Liu, B. Mildenhall, M. Nießner, B. Ommer, C. Theobalt, P. Wonka, G. Wetzstein, State of the Art on Diffusion Models for Visual Computing, Oct. 2023
https://arxiv.org/pdf/2310.07204

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Proc. of Int. Conf. on Machine Learning (ICML), Vol. 27, pp. 2256–2265, 2015
https://arxiv.org/pdf/1503.03585

Y. Song and S. Ermon, Generative modeling by estimating gradients of the data distribution, In: Advances in Neural Information Processing Systems, pp. 11895–11907, 2019

J. Song, C. Meng, and S. Ermon, Denoising diffusion implicit models, Int. Conf. on Learning Representations (ICLR), 2021
https://openreview.net/forum?id=St1giarCHLP

N. Whiteley, A. Gray and P. Rubin-Delanchy, Statistical exploration of the Manifold Hypothesis, 2022
https://arxiv.org/pdf/2208.11665

Originally published at https://technoids.substack.com.