[HN Gopher] Direct pixel-space megapixel image generation with d...
___________________________________________________________________
Direct pixel-space megapixel image generation with diffusion models
Author : stefanbaumann
Score : 268 points
Date : 2024-01-23 18:38 UTC (1 day ago)
(HTM) web link (crowsonkb.github.io)
(TXT) w3m dump (crowsonkb.github.io)
| tbalsam wrote:
| I enjoyed this paper (I share a discord with the author so I read
| it a bit earlier).
|
| It's not entirely clear from the comparison numbers at the end,
| but I think the big argument here is efficiency for the amount of
| performance achieved. One can get lower FID numbers, but only by
| spending a ton more compute.
|
| I can't really speak to it technically as I haven't given it a
| super in-depth look, but this seems like a nice set of motifs for
| landing halfway between a standard attention network and a convnet
| in terms of compute cost (and maybe performance)?
|
| The large-resolution scaling seems to be a strong suit as a
| result. :)
| stefanbaumann wrote:
| Thanks a lot!
|
| Yeah, the main motivation was trying to find a way to enable
| transformers to do high-resolution image synthesis:
| transformers are known to scale well to extreme, multi-billion
| parameter scales and typically offer superior coherency &
| composition in image generation, but current architectures are
| too expensive to train at scale for high-resolution inputs.
|
| By using a hierarchical architecture and local attention at
| high-resolution scales (but retaining global attention at low-
| resolution scales), it becomes viable to apply transformers at
| these scales. Additionally, this architecture can now directly
| be trained on megapixel-scale inputs and generate high-quality
| results without having to progressively grow the resolution
| over the training or applying other "tricks" typically needed
| to make models at these resolutions work well.
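|
| If it helps to see the shape of the idea, here's a very rough toy
| sketch in PyTorch (my simplification, not our actual code; plain
| windowed attention stands in for neighbourhood attention, and most
| blocks and all conditioning are omitted):
|
|     import torch
|     import torch.nn as nn
|
|     class WindowAttention(nn.Module):
|         """Self-attention inside non-overlapping windows (a crude
|         stand-in for neighbourhood attention)."""
|         def __init__(self, dim, window, heads=4):
|             super().__init__()
|             self.window = window
|             self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
|
|         def forward(self, x):            # x: (B, H, W, C)
|             B, H, W, C = x.shape
|             w = self.window
|             x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
|             x = x.reshape(-1, w * w, C)  # each window is one sequence
|             x = x + self.attn(x, x, x, need_weights=False)[0]
|             x = x.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
|             return x.reshape(B, H, W, C)
|
|     class ToyHourglass(nn.Module):
|         """Local attention at high resolution, global attention at
|         the low-res middle, learned skip merge on the way back up."""
|         def __init__(self, dim=64):
|             super().__init__()
|             self.patch = nn.PixelUnshuffle(4)          # 4x patching on ingress
|             self.proj_in = nn.Linear(3 * 16, dim)
|             self.hi = WindowAttention(dim, window=8)   # local attn, high res
|             self.down = nn.PixelUnshuffle(2)
|             self.mid_in = nn.Linear(dim * 4, dim)
|             self.mid = nn.MultiheadAttention(dim, 4, batch_first=True)
|             self.mid_out = nn.Linear(dim, dim * 4)
|             self.up = nn.PixelShuffle(2)
|             self.skip = nn.Linear(2 * dim, dim)        # learned skip merge
|             self.hi2 = WindowAttention(dim, window=8)
|             self.proj_out = nn.Linear(dim, 3 * 16)
|             self.unpatch = nn.PixelShuffle(4)
|
|         def forward(self, img):                        # img: (B, 3, 256, 256)
|             x = self.patch(img).permute(0, 2, 3, 1)    # (B, 64, 64, 48)
|             x = self.hi(self.proj_in(x))
|             skip = x
|             y = self.down(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
|             y = self.mid_in(y)                         # (B, 32, 32, dim)
|             B, h, w, c = y.shape
|             y = y.reshape(B, h * w, c)                 # global attention here
|             y = y + self.mid(y, y, y, need_weights=False)[0]
|             y = self.mid_out(y.reshape(B, h, w, c))
|             y = self.up(y.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
|             x = self.hi2(self.skip(torch.cat([y, skip], dim=-1)))
|             x = self.proj_out(x).permute(0, 3, 1, 2)
|             return self.unpatch(x)                     # (B, 3, 256, 256)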
| nwoli wrote:
| Which discord, if it's open to the public? I was on one with kath
| in 2021 and loved her insights; would love to be again.
| fpgaminer wrote:
| Same; a good ML focused discord would be great. Training ViTs
| all day is lonely work. I'm mostly locked into skimming the
| "Research" channels of image generation discords. LAION used
| to be decent with a good amount of interesting discussion,
| but it seems to have devolved into toxicity in the last year.
| l33tman wrote:
| LAION is good
| SEGyges wrote:
| See my other comment replying to that.
| SEGyges wrote:
| You and the guy below you in this thread should probably tag
| me on twitter, same tag as here, I can point you. I do not
| especially want to leave the discord link in a frontpage hn
| thread.
| sorenjan wrote:
| This is probably a stupid question, but what kind of image
| generation does this do? The architecture overview shows "input
| image", and I don't see anything about text to image. Is it super
| resolution? Does class-conditional mean that it takes a class
| like "car" or "face" and generate a new random image of that
| class?
| GaggiX wrote:
| It's ImageNet class-conditioned and FFHQ unconditioned.
|
| >Does class-conditional mean that it takes a class like "car"
| or "face" and generate a new random image of that class?
|
| Yup
| Birch-san wrote:
| > Is it super resolution?
|
| nope, we don't do Imagen-style super-resolution. we go direct
| to high resolution with a single-stage model.
| sorenjan wrote:
| I was referring to the input image in the diagram: what is that,
| and how is the output image generated from it? Is it 256x256
| noise that gets denoised into an image? I guess what I'm really
| asking is: what guides the process towards the final image if
| it's not text-to-image?
| stefanbaumann wrote:
| The "input image" is just the noisy sample from the
| previous timestep, yes.
|
| The overall architecture diagram does not explicitly show
| the conditioning mechanism, which is a small separate
| network. For this paper, we only trained on class-
| conditional ImageNet and completely unconditional
| megapixel-scale FFHQ.
|
| Training large-scale text-to-image models with this
| architecture is something we have not yet attempted,
| although there's no indication that this shouldn't work
| with a few tweaks.
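|
| In case it helps, here's roughly what a sampling loop looks like
| (a generic Karras-style Euler sampler, not our exact code; the
| model signature here is made up for illustration):
|
|     import torch
|
|     def sample(model, class_id, steps=50, shape=(1, 3, 256, 256)):
|         # noise levels from high to zero
|         sigmas = torch.linspace(80.0, 0.0, steps + 1)
|         x = torch.randn(shape) * sigmas[0]          # start from pure noise
|         for i in range(steps):
|             # the "input image" at each step is just the current
|             # noisy canvas; the class label conditions the model
|             denoised = model(x, sigmas[i], class_id)
|             d = (x - denoised) / sigmas[i]          # dx/dsigma estimate
|             x = x + d * (sigmas[i + 1] - sigmas[i]) # Euler step
|         return x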
| sorenjan wrote:
| Thank you, I'm not used to reading this kind of research
| paper, but I think I got the gist of it now.
|
| Can this architecture be used to distill models that need
| fewer timesteps, like LCMs or SDXL Turbo?
| stefanbaumann wrote:
| Both Latent Consistency Models and Adversarial Diffusion
| Distillation (the method behind SDXL Turbo) are methods
| that do not depend on any specific properties of the
| backbone. So, as Hourglass Diffusion Transformers are
| just a new kind of backbone that can be used just like
| the Diffusion U-Nets in Stable Diffusion (XL), these
| methods should also be applicable to it.
| GaggiX wrote:
| I hope that all these insights about diffusion model training
| that have been explored in last few years will be used by
| Stability AI to train their large text-to-image models, because
| when it comes to that they just use to most basic pipeline you
| can imagine with plenty of problems that get "solved" by some
| workarounds, for example to train SDXL they used the scheduler
| used by the DDPM paper(2020), epsilon-objective and noise-offset,
| an ugly workaround that was created when people realized that SD
| v1.5 wasn't able to generate images that were too dark or bright,
| a problem related to the epsilon-objective that cause the model
| to always generate images with a mean close to 0 (the same as the
| gaussian noise).
|
| A few people have finetuned Stable Diffusion models on the
| v-objective and solved the problem at the root.
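|
| For anyone wondering what the difference actually is, roughly (my
| notation, variance-preserving x_t = alpha_t*x0 + sigma_t*eps):
|
|     import torch
|
|     def training_targets(x0, alpha_t, sigma_t):
|         eps = torch.randn_like(x0)
|         x_t = alpha_t * x0 + sigma_t * eps
|         eps_target = eps                          # epsilon-objective
|         v_target = alpha_t * eps - sigma_t * x0   # v-objective
|         return x_t, eps_target, v_target
|
| The v target stays informative about the data at every noise level,
| including pure noise, which is why finetuning on it fixes the
| brightness/mean issue at the root.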
| SEGyges wrote:
| I have good news about who wrote this paper
| GaggiX wrote:
| Two authors are from Stability AI, that's the reason why I
| wrote the comment.
| Birch-san wrote:
| I'm one of the authors; happy to answer questions. this arch is
| of course nice for high-resolution synthesis, but there's some
| other cool stuff worth mentioning..
|
| activations are small! so you can enjoy bigger batch sizes. this
| is due to the 4x patching we do on the ingress to the model, and
| the effectiveness of neighbourhood attention in joining patches
| at the seams.
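|
| back-of-the-envelope, for a 1024x1024 RGB input:
|
|     pixels = 1024 * 1024              # 1,048,576 spatial positions
|     tokens = (1024 // 4) ** 2         # 65,536 tokens after 4x patching
|     print(pixels // tokens)           # 16x fewer positions to store
|
| and neighbourhood attention only ever materialises a small window per
| query instead of a full token-by-token attention matrix.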
|
| the model's inductive biases are pretty different than (for
| example) a convolutional UNet's. the innermost levels seem to
| train easily, so images can have good global coherence early in
| training.
|
| there's no convolutions! so you don't need to worry about
| artifacts stemming from convolution padding, or having canvas
| edge padding artifacts leak an implicit position bias.
|
| we can finally see what high-resolution diffusion outputs look
| like _without_ latents! personally I think current latent VAEs
| don't _really_ achieve the high resolutions they claim (otherwise
| fine details like text would survive a VAE roundtrip faithfully);
| it's common to see latent diffusion outputs with smudgy skin or
| blurry fur. what I'd like to see in the future of latent
| diffusion is to listen to the Emu paper and use more channels, or
| a less ambitious upsample.
|
| it's a transformer! so we can try applying to it everything we
| know about transformers, like sigma reparameterisation or
| multimodality. some tricks like masked training will require
| extra support in [NATTEN](https://github.com/SHI-Labs/NATTEN),
| but we're very happy with its featureset and performance so far.
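|
| for reference, using NATTEN looks roughly like this (argument names
| from memory, so check their README for the exact current signature;
| the layout is channels-last):
|
|     import torch
|     from natten import NeighborhoodAttention2D
|
|     na = NeighborhoodAttention2D(dim=256, num_heads=8, kernel_size=7)
|     x = torch.randn(1, 64, 64, 256)   # feature map as (B, H, W, C)
|     y = na(x)             # each token attends to its 7x7 neighbourhood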
|
| but honestly I'm most excited about the efficiency. there's too
| little work on making pretraining possible at GPU-poor scale. so
| I was very happy to see HDiT could succeed at small-scale tasks
| within the resources I had at home (you can get nice oxford
| flowers samples at 256x256px with half an hour on a 4090). I
| think with models that are better fits for the problem, perhaps
| we can get good results with smaller models. and I'd like to see
| big tech go that direction too!
|
| -Alex Birch
| sophrocyne wrote:
| Alex - I run Invoke (one of the popular OSS SD UIs for pros)
|
| Thanks for your work - it's been impactful since the early days
| of the project.
|
| Excited to see where we get to this year.
| Birch-san wrote:
| ah, originally lstein/stable-diffusion? yeah that was an
| important fork for us Mac users in the early days. I have to
| confess I've still never used a UI. :)
|
| this year I'm hoping for efficiency and small models! even if
| it's proprietary. if our work can reduce some energy usage
| behind closed doors that'd still be a good outcome.
| sophrocyne wrote:
| Yes, indeed. Lincoln's still an active maintainer.
|
| Energy efficiency is key - Especially with some of these
| extremely inefficient (wasteful, even) features like real-
| time canvas.
|
| Good luck - Let us know if/how we can help.
| ttul wrote:
| Hi Alex! Amazing work. I scanned the paper and dusted off my
| aging memories of Jeremy Howard's course. Will your model live
| happily alongside the existing SD infrastructure such as
| ControlNet, IPAdapter, and the like? Obviously we will have to
| retrain these to fit onto your model, but conceptually, does
| your model have natural places where adapters of various kinds
| can be attached?
| Birch-san wrote:
| regarding ControlNet: we have a U-Net-like backbone, so the idea of
| "make trainable copies of the encoder blocks" sounds
| possible. the other part, "use a zero-inited dense layer to
| project the peer-encoder output and add it to the frozen-
| decoder output" also sounds fine. not quite sure what they do
| with the mid-block but I doubt there'd be any problem there.
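|
| the zero-init trick is simple enough to sketch (generic ControlNet
| recipe with made-up names, not code from either paper; the adapter
| starts as an exact no-op because its output projection is zeroed):
|
|     import torch.nn as nn
|
|     def zero_linear(dim):
|         lin = nn.Linear(dim, dim)
|         nn.init.zeros_(lin.weight)
|         nn.init.zeros_(lin.bias)
|         return lin
|
|     class ControlledBlock(nn.Module):
|         def __init__(self, frozen_block, trainable_copy, dim):
|             super().__init__()
|             self.frozen = frozen_block.requires_grad_(False)
|             self.copy = trainable_copy   # trainable clone of the encoder block
|             self.proj = zero_linear(dim) # zero-inited projection
|
|         def forward(self, x, control):
|             # frozen path unchanged; the control branch contributes
|             # nothing at init and learns to steer the output later
|             return self.frozen(x) + self.proj(self.copy(control))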
|
| regarding IPAdapter: I'm not familiar with it, but from the
| code it looks like they just run cross-attention again and
| sum the two attention outputs. feels a bit weird to me,
| because the attention probabilities add up to 2 instead of 1.
| and they scale the bonus attention output only instead of
| lerping. it'd make more sense to me to formulate it as a
| cross-cross attention (Q against cat([key0, key1]) and
| cat([val0, val1])), but maybe they wanted it to begin as a
| no-op at the start of training or something. anyway.. yes,
| all of that should work fine with HDiT. the paper doesn't
| implement cross-attention, but it can be added in the
| standard way (e.g. like stable-diffusion) or as self-cross
| attention (e.g. DeepFloyd IF or Imagen).
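|
| concretely, by "cross-cross attention" I mean something like this
| (toy shapes, single head, projections omitted):
|
|     import torch
|     import torch.nn.functional as F
|
|     q = torch.randn(1, 4096, 64)        # image tokens as queries
|     k0, v0 = torch.randn(2, 1, 77, 64)  # e.g. text keys/values
|     k1, v1 = torch.randn(2, 1, 16, 64)  # e.g. image-prompt keys/values
|
|     out = F.scaled_dot_product_attention(
|         q, torch.cat([k0, k1], dim=1), torch.cat([v0, v1], dim=1))
|     # one softmax over both sources, so the weights still sum to 1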
|
| I'd recommend though to make use of HDiT's mapping network.
| in our attention blocks, the input gets AdaNormed against the
| condition from the mapping network. this is currently used to
| convey stuff like class conditions, Karras augmentation
| conditions and timestep embeddings. but it supports
| conditioning on custom (single-token) conditions of your
| choosing. so you could use this to condition on an image
| embed (this would give you the same image-conditioning
| control as IPAdapter but via a simpler mechanism).
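|
| the adanorm conditioning is roughly this shape (generic adaptive
| layer norm sketch, not our exact module):
|
|     import torch
|     import torch.nn as nn
|
|     class AdaNorm(nn.Module):
|         def __init__(self, dim, cond_dim):
|             super().__init__()
|             self.norm = nn.LayerNorm(dim, elementwise_affine=False)
|             self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)
|
|         def forward(self, x, cond):  # x: (B, N, dim), cond: (B, cond_dim)
|             scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
|             return self.norm(x) * (1 + scale[:, None]) + shift[:, None]
|
| so any single-token condition (class embed, image embed, timestep)
| fed through the mapping network ends up modulating every block this
| way.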
| bravura wrote:
| Regarding IPAdapter, I am curious whether there are useful GUIs
| for this? Creating image masks by uploading to Colab is not so
| cute.
| orbital-decay wrote:
| Here's one example: https://github.com/Acly/krita-ai-diffusion/
|
| But generally, most other UIs support it. It has serious
| limitations though; for example, it center-crops the input
| to 224x224px (which is enough for a surprisingly large
| number of uses, but not enough for many others).
| ttul wrote:
| Yes. I discussed this issue with the author of the
| ComfyUI IP-Adapter nodes. It would doubtless be handy if
| someone could end-to-end train a higher resolution IP-
| Adapter model that integrated its own variant of
| CLIPVision that is not subject to the 224px constraint. I
| have no idea what kind of horsepower would be required
| for that.
|
| A latent space CLIPVision model would be cool too.
| Presumably you could leverage the semantic richness of
| the latent space to efficiently train a more powerful
| CLIPVision. I don't know whether anyone has tried this.
| Maybe there is a good reason for that.
| bertdb wrote:
| Did you do any inpainting experiments? I can imagine a pixel-
| space diffusion model being better at it than one with a latent
| auto-encoder.
| stefanbaumann wrote:
| Not yet, we focused on the architecture for this paper. I
| totally agree with you though - pixel space is generally less
| limiting than a latent space for diffusion, so we would
| expect good behavior for inpainting and other editing tasks.
| michaelt wrote:
| I appreciate the restraint of showing the speedup on a log-
| scale chart rather than trying to show a 99% speed up any other
| way.
|
| I see your headline speed comparison is to "Pixel-space
| DiT-B/4" - but how does your model compare to the likes of
| SDXL? I gather they spent $$$$$$ on training etc, so I'd
| understand if direct comparisons don't make sense.
|
| And do you have any results on things that are traditionally
| challenging for generative AI, like clocks and mirrors?
| artninja1988 wrote:
| Looking at the output image examples, very nice, although they
| seem a little blurry. But I guess that's a dataset issue? Have
| you tried training anything above 1024x1024? Hope someone
| releases a model based on this since open source pixel space
| models are a rarity afaik
| Birch-san wrote:
| the FFHQ-1024 examples shouldn't be blurry. you can download
| the originals from the project page[0] -- click any image in
| the teaser, or download our 50k samples.
|
| the ImageNet-256 examples also aren't typically blurry (but
| they are 256x256 so your viewer may be bicubic scaling them or
| something). the ImageNet dataset _can_ have blurry, compressed
| or low resolution training samples, which can afflict some
| classes more than others, and we learn to produce samples like
| the training set.
|
| [0] https://crowsonkb.github.io/hourglass-diffusion-transformers...
| gutianpei wrote:
| I think NATTEN does not support cross-attention; I wonder if the
| authors have tried any text-conditioned cases? Can cross-attention
| only be added on top of regular attention, or can it be added
| through AdaNorm?
| Birch-san wrote:
| cross-attention doesn't need to involve NATTEN. there's no
| neighbourhood involved because it's not self-attention. so you
| can do it the stable-diffusion way: after self-attention, run
| torch sdp with Q=image and K=V=text.
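|
| i.e. something like (single head, projections omitted):
|
|     import torch
|     import torch.nn.functional as F
|
|     img = torch.randn(1, 4096, 64)   # image tokens, after self-attention
|     txt = torch.randn(1, 77, 64)     # text encoder tokens
|     out = F.scaled_dot_product_attention(img, txt, txt)  # Q=image, K=V=text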
|
| I tried adding "stable-diffusion-style" cross-attn to HDiT,
| text-conditioning on small class-conditional datasets (oxford
| flowers), embedding the class labels as text prompts with
| Phi-1.5. trained it for a few minutes, and the images were
| relevant to the prompts, so it seemed to be working fine.
|
| but if instead of a text condition you have a single-token
| condition (class label) then yeah the adanorm would be a
| simpler way.
| fpgaminer wrote:
| Seems like a solid paper from a skim through it. My rough
| summary:
|
| The popular large scale diffusion models like StableDiffusion are
| CNN based at their heart, with attention layers sprinkled
| throughout. This paper builds on recent research exploring
| whether competitive image diffusion models can be built out of
| purely transformers, no CNN layers.
|
| In this paper they build a similar U-Net like structure, but out
| of transformer layers, to improve efficiency compared to a
| straight Transformer. They also use local attention when the
| resolution is high to save on computational cost, but regular
| global attention in the middle to maintain global coherence.
|
| Based on ablation studies this allows them to maintain or
| slightly improve FID score compared to Transformer-only diffusion
| models that don't do U-net like structures, but at 1/10th the
| computation cost. An incredible feat for sure.
|
| There are a variety of details: RoPE positional encoding, GEGLU
| activations, RMSNorm, learnable skip connections, learnable
| cosine-similarity attention, neighborhood attention for the local
| attention, etc.
|
| The biggest gains in FID occur when the authors use "soft-min-
| snr" as the loss weighting; FID drops from 41 to 28!
|
| Lots of ablation study was done across all their changes (see
| Table 1).
|
| Training is otherwise completely standard: AdamW (lr 5e-4, weight
| decay 0.01), batch size 256, constant LR, 400k steps for most
| experiments at 128x128 resolution.
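|
| In PyTorch terms the optimizer bit is just (batch size, schedule
| and step count live in the training loop, not the optimizer):
|
|     import torch
|
|     def make_optimizer(model):
|         return torch.optim.AdamW(model.parameters(), lr=5e-4,
|                                  weight_decay=0.01)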
|
| So yeah, overall seems like solid work that combines a great
| mixture of techniques and pushes Transformer based diffusion
| forward.
|
| If scaled up I'm not sure it would be "revolutionary" in terms of
| FID compared to SDXL or DALLE3, mostly because SD and DALLE
| already use attention obviating the scaling issue, and lots of
| tricks like diffusion based VAEs. But it's likely to provide a
| nice incremental improvement in FID, since in general
| Transformers perform better than CNNs unless the CNNs are
| _heavily_ tuned.
|
| And being pixel based rather than latent based has many
| advantages.
| Birch-san wrote:
| FID doesn't reward high-resolution detail. the inception
| feature size is 299x299! so we are forced to _downsample_ our
| FFHQ-1024 samples to compute FID.
|
| it doesn't punish poor detail either! this advantages latent
| diffusion, which can claim to achieve high resolution without
| actually needing correct textures to get good metrics.
| imjonse wrote:
| Are there any public pretrained checkpoints available or planned?
| clauer2024 wrote:
| Can I ask one basic thing: what are the images generated from?
| ShamelessC wrote:
| Not text but an id representing a class/category of images from
| the dataset. Or they are "unconditional" and the model tries to
| output something similar to a random image from the dataset
| each time.
| stefanbaumann wrote:
| The models presented in the paper are trained on class-
| conditional ImageNet (where the input is Gaussian noise and one
| of 1000 classes, e.g., "car") and unconditional FFHQ (where the
| input is only Gaussian noise).
___________________________________________________________________
(page generated 2024-01-24 23:02 UTC)