[HN Gopher] Diffusion models from scratch, from a new theoretical perspective
       ___________________________________________________________________
        
       Diffusion models from scratch, from a new theoretical perspective
        
       Author : jxmorris12
       Score  : 116 points
       Date   : 2024-03-11 19:43 UTC (3 hours ago)
        
 (HTM) web link (www.chenyang.co)
 (TXT) w3m dump (www.chenyang.co)
        
       | swyx wrote:
       | oh this has code! great stuff. diffusion papers are famous for a
       | lot of equations
       | (https://twitter.com/cto_junior/status/1766518604395155830) but
       | code is much more legible (and precise?) for the rest of us. all
       | theory papers should come with reference impl code.
       | 
       | i'd love an extension of this for the diffusion transformer,
       | which drives Sora and other videogen models. maybe combine this
       | post with https://jaykmody.com/blog/gpt-from-scratch/ and make
       | the "diffusion transformer from scratch" intro
        
         | GaggiX wrote:
         | >i'd love an extension of this for the diffusion transformer
         | 
          | All you need to do is replace the U-Net with a transformer
          | encoder (drop the token embedding layer and instead project
          | the image patches into vectors of size n_embd), and the
          | diffusion process can remain the same.
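          | 
          | A rough PyTorch sketch of that swap (the class name and the
          | way the noise-level embedding is injected are illustrative
          | assumptions, not the post's actual API):
          | 
          |   import torch
          |   import torch.nn as nn
          | 
          |   class DiTSketch(nn.Module):
          |       # Patchify the noisy image with a strided conv, run a
          |       # plain transformer encoder over the patch tokens, then
          |       # project the tokens back to pixel patches.
          |       def __init__(self, img_size=32, patch=4, ch=3,
          |                    n_embd=256, n_head=8, n_layer=6):
          |           super().__init__()
          |           self.patch = patch
          |           n_tok = (img_size // patch) ** 2
          |           self.embed = nn.Conv2d(ch, n_embd, patch, stride=patch)
          |           self.pos = nn.Parameter(torch.zeros(1, n_tok, n_embd))
          |           layer = nn.TransformerEncoderLayer(
          |               n_embd, n_head, 4 * n_embd, batch_first=True)
          |           self.encoder = nn.TransformerEncoder(layer, n_layer)
          |           self.out = nn.Linear(n_embd, ch * patch * patch)
          | 
          |       def forward(self, x, sigma_embed):
          |           # x: (B, C, H, W) noisy image
          |           # sigma_embed: (B, n_embd) noise-level embedding,
          |           # here simply added to every token
          |           B, C, H, W = x.shape
          |           p = self.patch
          |           h = self.embed(x)                # (B, n_embd, H/p, W/p)
          |           h = h.flatten(2).transpose(1, 2) # (B, N, n_embd)
          |           h = h + self.pos + sigma_embed[:, None, :]
          |           h = self.encoder(h)
          |           h = self.out(h)                  # (B, N, C*p*p)
          |           h = h.reshape(B, H // p, W // p, C, p, p)
          |           h = h.permute(0, 3, 1, 4, 2, 5)
          |           return h.reshape(B, C, H, W)
          | 
          | The training loss and sampler should then carry over, since
          | only the denoiser architecture changes.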
        
           | swyx wrote:
           | seems too simple. isn't there also a temporal dimension you
           | need to encode?
        
             | GaggiX wrote:
              | For the conditioning and the timestep t there are several
              | possibilities. For example, the unpooled text embeddings
              | (if the model is conditioned on text) usually go into
              | cross-attention, while the pooled text embedding plus the
              | t embedding is injected through adaLN blocks, similar to
              | the style modulation in the original StyleGAN. But there
              | are many other strategies.
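              | 
              | A minimal sketch of such an adaLN block (DiT-style; the
              | names are illustrative, not any particular codebase):
              | 
              |   import torch
              |   import torch.nn as nn
              | 
              |   class AdaLNBlock(nn.Module):
              |       # The pooled conditioning vector (pooled text
              |       # embedding + t embedding) is mapped to per-channel
              |       # shift/scale/gate values that modulate the
              |       # normalized activations, rather than being
              |       # concatenated to the token sequence.
              |       def __init__(self, d, n_head):
              |           super().__init__()
              |           self.norm1 = nn.LayerNorm(d, elementwise_affine=False)
              |           self.attn = nn.MultiheadAttention(
              |               d, n_head, batch_first=True)
              |           self.norm2 = nn.LayerNorm(d, elementwise_affine=False)
              |           self.mlp = nn.Sequential(
              |               nn.Linear(d, 4 * d), nn.GELU(),
              |               nn.Linear(4 * d, d))
              |           # one projection emits shift/scale/gate twice
              |           self.ada = nn.Linear(d, 6 * d)
              | 
              |       def forward(self, x, cond):
              |           # x: (B, N, d) tokens, cond: (B, d) conditioning
              |           s1, sc1, g1, s2, sc2, g2 = \
              |               self.ada(cond)[:, None].chunk(6, dim=-1)
              |           h = self.norm1(x) * (1 + sc1) + s1
              |           x = x + g1 * self.attn(
              |               h, h, h, need_weights=False)[0]
              |           h = self.norm2(x) * (1 + sc2) + s2
              |           return x + g2 * self.mlp(h)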
        
       | ycy wrote:
        | Author here. When I tried to understand diffusion models, I
        | realized that the code and math could be greatly simplified,
        | which led me to write this blog post and diffusion library.
       | 
       | Happy to answer any questions.
        
         | xchip wrote:
          | Your post is awesome and explains something nobody else did,
          | thanks!
        
         | thomasahle wrote:
         | Your `get_sigma_embeds(batches, sigma)` seems to not use its
         | first input? Did you mean to broadcast sigma to shape (batches,
         | 1)?
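          | 
          | For what it's worth, a hypothetical version of that broadcast
          | (the sin/cos-of-log-sigma embedding here is only a guess for
          | illustration, not necessarily the library's actual code):
          | 
          |   import torch
          | 
          |   def get_sigma_embeds(batches, sigma):
          |       # broadcast a scalar sigma to shape (batches, 1) so
          |       # each batch item gets its own noise-level embedding
          |       sigma = sigma * torch.ones(batches, 1)
          |       return torch.cat([torch.sin(torch.log(sigma) / 2),
          |                         torch.cos(torch.log(sigma) / 2)],
          |                        dim=1)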
        
       | skybrian wrote:
       | This is a nice explanation of the theory. It seems to be dataset-
       | independent. I'm wondering about the specifics of generating
       | images.
       | 
        | For example, what is it about image generators that makes it
        | hard for them to generate piano keyboards? It seems like some
        | better representation of medium-distance constraints is needed
        | to get alternating groups of two and three black notes.
        
         | Vecr wrote:
          | It's the finger problem: you've got to get the number, size,
          | angle, position, etc. right every single time, or people can
          | tell very fast. It's not like tree branches, where people
          | won't notice if the splitting positions are "wrong".
        
       | strangecasts wrote:
       | Super interesting!
       | 
        | Immediately reminded of Iterative alpha-(de)Blending [1], which
        | also sets out to build a conceptually simpler diffusion model
        | and likewise arrives at formulating it as an approximate
        | iterative projection process. I think this post's approach
        | allows for more interesting experiments like the denoiser error
        | analysis, though.
       | 
       | [1] https://arxiv.org/pdf/2305.03486.pdf
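        | 
        | For reference, my rough reading of the deterministic sampler in
        | [1] looks something like this (a sketch, not the paper's exact
        | pseudocode):
        | 
        |   import torch
        | 
        |   def iadb_sample(model, x0, steps=128):
        |       # start from noise x0 ~ p0 and repeatedly nudge the
        |       # sample along the predicted (data - noise) direction
        |       x = x0
        |       alphas = torch.linspace(0, 1, steps + 1)
        |       for a, a_next in zip(alphas[:-1], alphas[1:]):
        |           d = model(x, a)        # estimates E[x1 - x0 | x_a]
        |           x = x + (a_next - a) * d
        |       return x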
        
       | adamnemecek wrote:
       | All machine learning models are convolutions, mark my words.
        
       | hotdogscout wrote:
       | There's a secret society using the comments on this post to send
       | a message do not Google
        
       ___________________________________________________________________
       (page generated 2024-03-11 23:00 UTC)