[HN Gopher] Direct pixel-space megapixel image generation with d...
       ___________________________________________________________________
        
       Direct pixel-space megapixel image generation with diffusion models
        
       Author : stefanbaumann
       Score  : 110 points
       Date   : 2024-01-23 18:38 UTC (4 hours ago)
        
 (HTM) web link (crowsonkb.github.io)
 (TXT) w3m dump (crowsonkb.github.io)
        
       | tbalsam wrote:
       | I enjoyed this paper (I share a discord with the author so I read
       | it a bit earlier).
       | 
       | It's not entirely clear from the comparison numbers at the end,
       | but I think the big argument here is efficiency for the amount of
        | performance achieved. One can get lower FID numbers, but only by
        | spending a ton more compute.
       | 
       | I can't really speak technically to it as I've not given it a
        | super in-depth look, but this seems like a nice set of motifs for
       | going halfway between a standard attention network and a convnet
       | in terms of compute cost (and maybe performance)?
       | 
       | The large-resolution scaling seems to be a strong suit as a
       | result. :)
        
         | stefanbaumann wrote:
         | Thanks a lot!
         | 
         | Yeah, the main motivation was trying to find a way to enable
         | transformers to do high-resolution image synthesis:
         | transformers are known to scale well to extreme, multi-billion
         | parameter scales and typically offer superior coherency &
         | composition in image generation, but current architectures are
         | too expensive to train at scale for high-resolution inputs.
         | 
         | By using a hierarchical architecture and local attention at
         | high-resolution scales (but retaining global attention at low-
         | resolution scales), it becomes viable to apply transformers at
         | these scales. Additionally, this architecture can now directly
         | be trained on megapixel-scale inputs and generate high-quality
          | results without having to progressively grow the resolution
          | over the course of training or apply other "tricks" typically
          | needed to make models work well at these resolutions.
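          | 
          | To make the shape of that concrete, here is a toy sketch of the
          | idea (illustrative only, not our actual code: the real model
          | uses neighbourhood attention via NATTEN rather than the plain
          | non-overlapping window attention below, and all module sizes
          | are made up). Local attention at the high-resolution level,
          | global attention only after downsampling, and a skip connection
          | back up:
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     def window_attn(x, attn, w=8):
          |         # local attention: each token attends only within
          |         # its own non-overlapping w*w window
          |         B, H, W, C = x.shape
          |         t = x.reshape(B, H // w, w, W // w, w, C)
          |         t = t.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
          |         t, _ = attn(t, t, t)
          |         t = t.reshape(B, H // w, W // w, w, w, C)
          |         return t.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
          | 
          |     def global_attn(x, attn):
          |         # global attention: every token attends to every token
          |         B, H, W, C = x.shape
          |         t = x.reshape(B, H * W, C)
          |         t, _ = attn(t, t, t)
          |         return t.reshape(B, H, W, C)
          | 
          |     def merge_2x2(x):
          |         # downsample: fold each 2x2 patch of tokens into
          |         # one token with 4x the channels
          |         B, H, W, C = x.shape
          |         t = x.reshape(B, H // 2, 2, W // 2, 2, C)
          |         t = t.permute(0, 1, 3, 2, 4, 5)
          |         return t.reshape(B, H // 2, W // 2, 4 * C)
          | 
          |     def split_2x2(x):
          |         # upsample: exact inverse of merge_2x2
          |         B, H, W, C = x.shape
          |         t = x.reshape(B, H, W, 2, 2, C // 4)
          |         t = t.permute(0, 1, 3, 2, 4, 5)
          |         return t.reshape(B, 2 * H, 2 * W, C // 4)
          | 
          |     dim, heads = 64, 4
          |     local = nn.MultiheadAttention(dim, heads, batch_first=True)
          |     glob = nn.MultiheadAttention(dim, heads, batch_first=True)
          |     down = nn.Linear(4 * dim, dim)
          |     up = nn.Linear(dim, 4 * dim)
          | 
          |     x = torch.randn(1, 64, 64, dim)   # (B, H, W, C) tokens
          |     hi = window_attn(x, local)        # cheap at high res
          |     # global attention only on the downsampled 32x32 grid
          |     lo = global_attn(down(merge_2x2(hi)), glob)
          |     # upsample, add the skip, and run another local level
          |     out = window_attn(split_2x2(up(lo)) + hi, local)
          |     print(out.shape)   # torch.Size([1, 64, 64, 64])
          | 
          | The quadratic cost of attention is only ever paid on the small
          | grid at the bottom of the hourglass, which is what makes direct
          | training on megapixel inputs viable.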
        
         | nwoli wrote:
          | Which Discord, if it's open to the public? I was on one with
          | kath in 2021 and loved her insights; I'd love to be on one
          | again.
        
       | sorenjan wrote:
       | This is probably a stupid question, but what kind of image
       | generation does this do? The architecture overview shows "input
       | image", and I don't see anything about text to image. Is it super
       | resolution? Does class-conditional mean that it takes a class
       | like "car" or "face" and generate a new random image of that
       | class?
        
         | GaggiX wrote:
          | It's class-conditional on ImageNet and unconditional on FFHQ.
         | 
         | >Does class-conditional mean that it takes a class like "car"
         | or "face" and generate a new random image of that class?
         | 
         | Yup
        
       | GaggiX wrote:
        | I hope that all these insights about diffusion model training
        | that have been explored in the last few years will be used by
        | Stability AI to train their large text-to-image models, because
        | when it comes to that they just use the most basic pipeline you
        | can imagine, with plenty of problems that get "solved" by
        | workarounds. For example, to train SDXL they used the scheduler
        | from the DDPM paper (2020), the epsilon objective, and noise
        | offset. Noise offset is an ugly workaround that was created when
        | people realized that SD v1.5 wasn't able to generate images that
        | were very dark or very bright, a problem related to the epsilon
        | objective, which causes the model to always generate images with
        | a mean close to 0 (the same as the Gaussian noise).
        | 
        | A few people have finetuned Stable Diffusion models on the
        | v-objective and solved the problem at the root.
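        | 
        | For reference, the two objectives differ only in what the network
        | is trained to predict. A rough sketch (illustrative only, using
        | the usual alpha/sigma noise parameterisation, not anyone's actual
        | training code):
        | 
        |     import torch
        | 
        |     def training_targets(x0, alpha_t, sigma_t):
        |         # x0: clean images; alpha_t, sigma_t: schedule at step t
        |         eps = torch.randn_like(x0)
        |         x_t = alpha_t * x0 + sigma_t * eps   # what the model sees
        |         eps_target = eps                     # epsilon objective
        |         v_target = alpha_t * eps - sigma_t * x0   # v objective
        |         return x_t, eps_target, v_target
        | 
        |     x0 = torch.full((1, 3, 64, 64), 0.8)   # uniformly bright image
        |     xt, et, vt = training_targets(x0, alpha_t=0.01, sigma_t=1.0)
        |     print(et.mean().item(), vt.mean().item())   # ~0.0 vs ~-0.8
        | 
        | At the highest noise levels (alpha_t near 0) the epsilon target is
        | essentially the same noise the model already sees as input, so the
        | model learns almost nothing about the image there, not even its
        | mean, while the v target becomes roughly -x0 and forces it to
        | commit to the overall brightness. That's the intuition for why
        | v-prediction addresses the dark/bright issue at the root.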
        
       | Birch-san wrote:
       | I'm one of the authors; happy to answer questions. this arch is
       | of course nice for high-resolution synthesis, but there's some
       | other cool stuff worth mentioning..
       | 
       | activations are small! so you can enjoy bigger batch sizes. this
       | is due to the 4x patching we do on the ingress to the model, and
       | the effectiveness of neighbourhood attention in joining patches
       | at the seams.
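        | 
        | rough numbers, as a sketch (using torch's pixel_unshuffle as a
        | stand-in for the patching; illustrative, not the actual code):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     img = torch.randn(1, 3, 1024, 1024)   # megapixel RGB input
        |     tok = F.pixel_unshuffle(img, downscale_factor=4)
        |     print(tok.shape)   # torch.Size([1, 48, 256, 256])
        | 
        | so a 1024x1024 image becomes a 256x256 token grid right at the
        | ingress, and every later level works on 16x fewer positions.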
       | 
       | the model's inductive biases are pretty different than (for
       | example) a convolutional UNet's. the innermost levels seem to
       | train easily, so images can have good global coherence early in
       | training.
       | 
       | there's no convolutions! so you don't need to worry about
       | artifacts stemming from convolution padding, or having canvas
       | edge padding artifacts leak an implicit position bias.
       | 
       | we can finally see what high-resolution diffusion outputs look
       | like _without_ latents! personally I think current latent VAEs
       | don't _really_ achieve the high resolutions they claim (otherwise
       | fine details like text would survive a VAE roundtrip faithfully);
       | it's common to see latent diffusion outputs with smudgy skin or
       | blurry fur. what I'd like to see in the future of latent
       | diffusion is to listen to the Emu paper and use more channels, or
       | a less ambitious upsample.
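        | 
        | if you want to check that for yourself, a quick roundtrip through
        | the SD VAE with diffusers makes it easy to see. the model id and
        | preprocessing below are just one reasonable choice, and the input
        | filename is a placeholder:
        | 
        |     import numpy as np
        |     import torch
        |     from PIL import Image
        |     from diffusers import AutoencoderKL
        | 
        |     vae = AutoencoderKL.from_pretrained(
        |         "stabilityai/sd-vae-ft-mse")
        |     # placeholder input; try something with small text or fur
        |     img = Image.open("input.png").convert("RGB")
        |     img = img.resize((512, 512))
        |     x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
        |     x = x.permute(2, 0, 1).unsqueeze(0)   # (1, 3, 512, 512)
        |     with torch.no_grad():
        |         z = vae.encode(x).latent_dist.mode()   # (1, 4, 64, 64)
        |         y = vae.decode(z).sample.clamp(-1, 1)
        |     out = ((y[0].permute(1, 2, 0) + 1) * 127.5).byte().numpy()
        |     Image.fromarray(out).save("roundtrip.png")
        | 
        | compare the two images side by side to see how much of the fine
        | detail actually survives.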
       | 
       | it's a transformer! so we can try applying to it everything we
       | know about transformers, like sigma reparameterisation or
       | multimodality. some tricks like masked training will require
       | extra support in [NATTEN](https://github.com/SHI-Labs/NATTEN),
       | but we're very happy with its featureset and performance so far.
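        | 
        | for anyone who wants to play with the local-attention piece on
        | its own, NATTEN's module interface is small (this is from memory,
        | so double-check the exact signature against its README):
        | 
        |     import torch
        |     from natten import NeighborhoodAttention2D
        | 
        |     # each token attends to its 7x7 neighbourhood; the sliding
        |     # window means adjacent patches still mix at the seams
        |     na = NeighborhoodAttention2D(dim=128, num_heads=4,
        |                                  kernel_size=7)
        |     x = torch.randn(1, 64, 64, 128)   # (B, H, W, C)
        |     print(na(x).shape)   # torch.Size([1, 64, 64, 128])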
       | 
       | but honestly I'm most excited about the efficiency. there's too
       | little work on making pretraining possible at GPU-poor scale. so
       | I was very happy to see HDiT could succeed at small-scale tasks
       | within the resources I had at home (you can get nice oxford
       | flowers samples at 256x256px with half an hour on a 4090). I
       | think with models that are better fits for the problem, perhaps
       | we can get good results with smaller models. and I'd like to see
       | big tech go that direction too!
       | 
       | -Alex Birch
        
       | artninja1988 wrote:
       | Looking at the output image examples, very nice, although they
       | seem a little blurry. But I guess that's a dataset issue? Have
       | you tried training anything above 1024x1024? Hope someone
       | releases a model based on this since open source pixel space
       | models are a rarity afaik
        
       ___________________________________________________________________
       (page generated 2024-01-23 23:00 UTC)