[HN Gopher] Diffusion models from scratch, from a new theoretical perspective
___________________________________________________________________
Diffusion models from scratch, from a new theoretical perspective
Author : jxmorris12
Score : 116 points
Date : 2024-03-11 19:43 UTC (3 hours ago)
(HTM) web link (www.chenyang.co)
(TXT) w3m dump (www.chenyang.co)
| swyx wrote:
| oh this has code! great stuff. diffusion papers are famous for a
| lot of equations
| (https://twitter.com/cto_junior/status/1766518604395155830) but
| code is much more legible (and precise?) for the rest of us. all
| theory papers should come with reference impl code.
|
| i'd love an extension of this for the diffusion transformer,
| which drives Sora and other videogen models. maybe combine this
| post with https://jaykmody.com/blog/gpt-from-scratch/ and make
| the "diffusion transformer from scratch" intro
| GaggiX wrote:
| >i'd love an extension of this for the diffusion transformer
|
| All you need to do is replace the U-net with a transformer
| encoder (remove the embeddings, and project the image patches
| into vectors of size n_embd), and the diffusion process can
| remain the same.
| swyx wrote:
| seems too simple. isn't there also a temporal dimension you
| need to encode?
| GaggiX wrote:
| For the conditioning and t there are different possibilities.
| For example, the unpooled text embeddings (if the model is
| conditioned on text) usually go into cross-attention, while the
| pooled text embedding plus t is used in adaLN blocks, as in
| StyleGAN (the first one). But there are many other strategies.
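A sketch of the adaLN idea GaggiX mentions: a conditioning vector (e.g. pooled text embedding plus a timestep embedding) regresses a per-channel scale and shift that modulate a normalized activation. The class name and dimensions are assumptions for illustration, not taken from any particular model:

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Hypothetical adaptive LayerNorm block: the conditioning vector c
    produces the scale/shift instead of learned static affine params."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        # No learned affine params; they come from the conditioning instead
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, c):
        # x: (B, N, dim) token activations; c: (B, cond_dim) conditioning
        scale, shift = self.to_scale_shift(c).chunk(2, dim=-1)
        # Broadcast the per-sample modulation across all N tokens
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

In a text-conditioned model, c would typically be something like pooled_text_embedding + timestep_embedding, so every block's normalization is steered by both the prompt and the noise level.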
| ycy wrote:
| Author here. When I tried to understand diffusion models, I
| realized that the code and math could be greatly simplified,
| which led me to write this blog post and diffusion library.
|
| Happy to answer any questions.
| xchip wrote:
| Your post is awesome and explains something nobody else has,
| thanks!
| thomasahle wrote:
| Your `get_sigma_embeds(batches, sigma)` seems to not use its
| first input? Did you mean to broadcast sigma to shape (batches,
| 1)?
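A guess at what thomasahle is pointing at, sketched as a fix: use the first argument to broadcast a scalar sigma to shape (batches, 1) before computing the embedding features. Both the function body and the sin/cos-of-log-sigma features are hypothetical reconstructions, not the library's actual code:

```python
import torch

def get_sigma_embeds(batches, sigma):
    """Hypothetical sketch: embed a noise level sigma for a batch."""
    sigma = torch.as_tensor(sigma, dtype=torch.float32)
    # Broadcast a scalar sigma to shape (batches, 1), using the
    # first argument as the comment suggests
    sigma = sigma * torch.ones(batches, 1)
    s = sigma.log() / 2
    # Assumed embedding features; the real library may differ
    return torch.cat([torch.sin(s), torch.cos(s)], dim=1)  # (batches, 2)
```

If sigma instead arrives as a per-sample tensor of shape (batches,), reshaping it to (batches, 1) with unsqueeze(1) would achieve the same layout without the multiply.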
| skybrian wrote:
| This is a nice explanation of the theory. It seems to be dataset-
| independent. I'm wondering about the specifics of generating
| images.
|
| For example, what is it about image generators that makes it
| hard for them to generate piano keyboards? It seems like some
| better representation of medium-distance constraints is needed
| to get alternating groups of two and three black notes.
| Vecr wrote:
| It's the finger problem, you've got to get the number, size,
| angle, position, etc. right every single time or people can
| tell very fast. It's not like tree branches where people won't
| notice if the splitting positions are "wrong".
| strangecasts wrote:
| Super interesting!
|
| Immediately reminded of Iterative alpha-(de)Blending [1], which
| likewise sets out to build a conceptually simpler diffusion
| model and also arrives at formulating it as an approximate
| iterative projection process - I think this approach allows for
| more interesting experiments, like the denoiser error analysis,
| though.
|
| [1] https://arxiv.org/pdf/2305.03486.pdf
| adamnemecek wrote:
| All machine learning models are convolutions, mark my words.
| hotdogscout wrote:
| There's a secret society using the comments on this post to send
| a message do not Google
___________________________________________________________________
(page generated 2024-03-11 23:00 UTC)