https://theaisummer.com/diffusion-models/

How diffusion models work: the math from scratch

Sergios Karagiannakos, Nikolas Adaloglou · 2022-09-29 · 14 mins

Generative Learning · Computer Vision

Diffusion models are a new class of state-of-the-art generative models that generate diverse, high-resolution images. They have attracted a lot of attention since OpenAI, Nvidia and Google managed to train large-scale models. Example architectures based on diffusion models are GLIDE, DALLE-2, Imagen, and the fully open-source Stable Diffusion.

But what is the main principle behind them? In this blog post, we will dig our way up from the basic principles. There are already a bunch of different diffusion-based architectures. We will focus on the most prominent one: the Denoising Diffusion Probabilistic Model (DDPM), as introduced by Sohl-Dickstein et al. and later proposed by Ho et al. 2020.
Various other approaches, such as Stable Diffusion and score-based models, will be discussed to a smaller extent.

Diffusion models are fundamentally different from all the previous generative methods. Intuitively, they aim to decompose the image generation process (sampling) into many small "denoising" steps. The intuition behind this is that the model can correct itself over these small steps and gradually produce a good sample. To some extent, this idea of refining the representation has already been used in models like AlphaFold. But hey, nothing comes at zero cost: this iterative process makes them slow at sampling, at least compared to GANs.

Diffusion process

The basic idea behind diffusion models is rather simple. They take the input image $\mathbf{x}_0$ and gradually add Gaussian noise to it through a series of $T$ steps. We will call this the forward process. Notably, this is unrelated to the forward pass of a neural network. If you'd like, the forward process is necessary to generate the targets for our neural network (the image after applying t
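To make the forward process concrete, here is a minimal NumPy sketch of it. Each step samples $\mathbf{x}_t \sim \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$; the linear $\beta$ schedule from 1e-4 to 0.02 and $T=1000$ are illustrative assumptions (they happen to match common DDPM defaults, but any monotone schedule works):

```python
import numpy as np

def forward_diffusion(x0, T=1000, beta_start=1e-4, beta_end=0.02, seed=0):
    """Gradually corrupt x0 with Gaussian noise over T steps.

    Each step draws x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
    Returns the whole trajectory [x_0, x_1, ..., x_T].
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, T)  # illustrative linear schedule
    x = x0.astype(np.float64)
    trajectory = [x]
    for beta_t in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * noise
        trajectory.append(x)
    return trajectory

x0 = np.zeros((8, 8))              # a toy "image"
traj = forward_diffusion(x0, T=1000)
print(len(traj))                   # T + 1 states, x_0 through x_T → 1001
```

After enough steps the result is statistically close to pure Gaussian noise (empirical standard deviation near 1), regardless of the starting image; this is exactly why the reverse process can start sampling from noise.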