[HN Gopher] Direct pixel-space megapixel image generation with d...
___________________________________________________________________
Direct pixel-space megapixel image generation with diffusion models
Author : stefanbaumann
Score : 110 points
Date : 2024-01-23 18:38 UTC (4 hours ago)
(HTM) web link (crowsonkb.github.io)
(TXT) w3m dump (crowsonkb.github.io)
| tbalsam wrote:
| I enjoyed this paper (I share a discord with the author so I read
| it a bit earlier).
|
| It's not entirely clear from the comparison numbers at the end,
| but I think the big argument here is efficiency for the amount of
| performance achieved. One can get lower FID numbers, but only
| with a ton more compute.
|
| I can't really speak to it technically, as I haven't given it a
| super in-depth look, but this seems like a nice set of motifs for
| going halfway between a standard attention network and a convnet
| in terms of compute cost (and maybe performance)?
|
| The large-resolution scaling seems to be a strong suit as a
| result. :)
| stefanbaumann wrote:
| Thanks a lot!
|
| Yeah, the main motivation was trying to find a way to enable
| transformers to do high-resolution image synthesis:
| transformers are known to scale well to extreme, multi-billion
| parameter scales and typically offer superior coherency &
| composition in image generation, but current architectures are
| too expensive to train at scale for high-resolution inputs.
|
| By using a hierarchical architecture and local attention at
| high-resolution scales (but retaining global attention at low-
| resolution scales), it becomes viable to apply transformers at
| these scales. Additionally, this architecture can now directly
| be trained on megapixel-scale inputs and generate high-quality
| results without having to progressively grow the resolution
| over the training or applying other "tricks" typically needed
| to make models at these resolutions work well.
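|
| As a rough sketch of the idea (toy code, not the actual HDiT
| implementation; plain windowed attention stands in here for the
| neighbourhood attention we actually use): full attention is only
| applied where the token grid is small, and the high-resolution
| levels attend locally, so cost grows roughly linearly with pixel
| count instead of quadratically.
|
|   import torch
|   import torch.nn as nn
|
|   attn = nn.MultiheadAttention(128, num_heads=4, batch_first=True)
|
|   def global_attn(x):        # x: (B, H, W, C); fine at low res
|       B, H, W, C = x.shape
|       t = x.view(B, H * W, C)
|       t, _ = attn(t, t, t)
|       return t.view(B, H, W, C)
|
|   def local_attn(x, w=8):    # attention within w*w windows only
|       B, H, W, C = x.shape
|       t = x.view(B, H // w, w, W // w, w, C)
|       t = t.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
|       t, _ = attn(t, t, t)
|       t = t.view(B, H // w, W // w, w, w, C)
|       return t.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
|
|   print(global_attn(torch.randn(1, 16, 16, 128)).shape)
|   print(local_attn(torch.randn(1, 64, 64, 128)).shape)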
| nwoli wrote:
| Which discord, if it's open to the public? I was on one with
| kath in 2021 and loved her insights, would love to be again.
| sorenjan wrote:
| This is probably a stupid question, but what kind of image
| generation does this do? The architecture overview shows "input
| image", and I don't see anything about text to image. Is it super
| resolution? Does class-conditional mean that it takes a class
| like "car" or "face" and generate a new random image of that
| class?
| GaggiX wrote:
| It's class-conditioned if trained on ImageNet, and unconditioned
| if trained on FFHQ.
|
| >Does class-conditional mean that it takes a class like "car"
| or "face" and generate a new random image of that class?
|
| Yup
| GaggiX wrote:
| I hope that all these insights about diffusion model training
| that have been explored in last few years will be used by
| Stability AI to train their large text-to-image models, because
| when it comes to that they just use to most basic pipeline you
| can imagine with plenty of problems that get "solved" by some
| workarounds, for example to train SDXL they used the scheduler
| used by the DDPM paper(2020), epsilon-objective and noise-offset,
| an ugly workaround that was created when people realized that SD
| v1.5 wasn't able to generate images that were too dark or bright,
| a problem related to the epsilon-objective that cause the model
| to always generate images with a mean close to 0 (the same as the
| gaussian noise).
|
| A few people have finetuned Stable Diffusion models on
| v-objective and solved the problem from the root.
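|
| For reference, a minimal sketch of the difference (the standard
| variance-preserving parameterisation, not Stability's actual
| training code): with x_t = alpha_t * x0 + sigma_t * eps, the two
| objectives simply predict different targets.
|
|   import torch
|
|   def targets(x0, eps, alpha_t, sigma_t):
|       x_t = alpha_t * x0 + sigma_t * eps       # noised input
|       target_eps = eps                         # epsilon-objective
|       target_v = alpha_t * eps - sigma_t * x0  # v-objective
|       return x_t, target_eps, target_v
|
| At high noise levels (sigma_t close to 1) the epsilon target
| carries almost no information about x0, while the v target is
| dominated by -x0, which is why v-prediction fixes the brightness
| issue at the root instead of papering over it with noise-offset.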
| Birch-san wrote:
| I'm one of the authors; happy to answer questions. this arch is
| of course nice for high-resolution synthesis, but there's some
| other cool stuff worth mentioning..
|
| activations are small! so you can enjoy bigger batch sizes. this
| is due to the 4x patching we do on the ingress to the model, and
| the effectiveness of neighbourhood attention in joining patches
| at the seams.
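|
| (rough illustration, not our exact ingress code: folding each 4x4
| pixel block into the channel dimension cuts the token count 16x,
| which is where the activation savings come from)
|
|   import torch
|   import torch.nn.functional as F
|
|   img = torch.randn(1, 3, 1024, 1024)          # megapixel RGB
|   tok = F.pixel_unshuffle(img, downscale_factor=4)
|   print(tok.shape)                             # (1, 48, 256, 256)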
|
| the model's inductive biases are pretty different than (for
| example) a convolutional UNet's. the innermost levels seem to
| train easily, so images can have good global coherence early in
| training.
|
| there's no convolutions! so you don't need to worry about
| artifacts stemming from convolution padding, or having canvas
| edge padding artifacts leak an implicit position bias.
|
| we can finally see what high-resolution diffusion outputs look
| like _without_ latents! personally I think current latent VAEs
| don't _really_ achieve the high resolutions they claim (otherwise
| fine details like text would survive a VAE roundtrip faithfully);
| it's common to see latent diffusion outputs with smudgy skin or
| blurry fur. what I'd like to see in the future of latent
| diffusion is to listen to the Emu paper and use more channels, or
| a less ambitious upsample.
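|
| (if you want to check the roundtrip yourself, here's a sketch
| using the diffusers AutoencoderKL API; the image path is just a
| placeholder)
|
|   import numpy as np, torch
|   from PIL import Image
|   from diffusers import AutoencoderKL
|
|   vae = AutoencoderKL.from_pretrained(
|       "stabilityai/sd-vae-ft-mse").eval()
|   img = Image.open("sample.png").convert("RGB").resize((512, 512))
|   x = torch.from_numpy(np.array(img)).float()
|   x = x.permute(2, 0, 1)[None] / 127.5 - 1.0   # scale to [-1, 1]
|   with torch.no_grad():
|       z = vae.encode(x).latent_dist.mean       # 8x-downsampled latents
|       y = vae.decode(z).sample                 # back to pixel space
|   print((y - x).abs().mean().item())           # roundtrip error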
|
| it's a transformer! so we can try applying to it everything we
| know about transformers, like sigma reparameterisation or
| multimodality. some tricks like masked training will require
| extra support in [NATTEN](https://github.com/SHI-Labs/NATTEN),
| but we're very happy with its featureset and performance so far.
|
| but honestly I'm most excited about the efficiency. there's too
| little work on making pretraining possible at GPU-poor scale. so
| I was very happy to see HDiT could succeed at small-scale tasks
| within the resources I had at home (you can get nice oxford
| flowers samples at 256x256px with half an hour on a 4090). I
| think with models that are better fits for the problem, perhaps
| we can get good results with smaller models. and I'd like to see
| big tech go that direction too!
|
| -Alex Birch
| artninja1988 wrote:
| Looking at the output image examples, very nice, although they
| seem a little blurry. But I guess that's a dataset issue? Have
| you tried training anything above 1024x1024? Hope someone
| releases a model based on this since open source pixel space
| models are a rarity afaik
___________________________________________________________________
(page generated 2024-01-23 23:00 UTC)