[HN Gopher] Direct pixel-space megapixel image generation with d...
       ___________________________________________________________________
        
       Direct pixel-space megapixel image generation with diffusion models
        
       Author : stefanbaumann
       Score  : 268 points
       Date   : 2024-01-23 18:38 UTC (1 day ago)
        
 (HTM) web link (crowsonkb.github.io)
 (TXT) w3m dump (crowsonkb.github.io)
        
       | tbalsam wrote:
       | I enjoyed this paper (I share a discord with the author so I read
       | it a bit earlier).
       | 
       | It's not entirely clear from the comparison numbers at the end,
       | but I think the big argument here is efficiency for the amount of
       | performance achieved. One can get lower FID numbers, but also
       | with a ton of compute.
       | 
       | I can't really speak technically to it as I've not given it a
        | super in-depth look, but this seems like a nice set of motifs for
       | going halfway between a standard attention network and a convnet
       | in terms of compute cost (and maybe performance)?
       | 
       | The large-resolution scaling seems to be a strong suit as a
       | result. :)
        
         | stefanbaumann wrote:
         | Thanks a lot!
         | 
         | Yeah, the main motivation was trying to find a way to enable
         | transformers to do high-resolution image synthesis:
         | transformers are known to scale well to extreme, multi-billion
         | parameter scales and typically offer superior coherency &
         | composition in image generation, but current architectures are
         | too expensive to train at scale for high-resolution inputs.
         | 
         | By using a hierarchical architecture and local attention at
         | high-resolution scales (but retaining global attention at low-
         | resolution scales), it becomes viable to apply transformers at
         | these scales. Additionally, this architecture can now directly
         | be trained on megapixel-scale inputs and generate high-quality
          | results without having to progressively grow the resolution
          | over the course of training or apply other "tricks" typically
          | needed to make models at these resolutions work well.
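          | 
          | Roughly, the attention pattern per level looks like this (a
          | toy sketch, not our actual code: plain non-overlapping window
          | attention stands in here for the neighbourhood attention we
          | use via NATTEN, and the sizes are just illustrative):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     def window_attn(x, attn, win):
          |         # local attention: tokens only attend within their
          |         # own win x win window (the real model uses
          |         # neighbourhood attention at these levels)
          |         b, h, w, c = x.shape
          |         x = x.view(b, h // win, win, w // win, win, c)
          |         x = x.permute(0, 1, 3, 2, 4, 5)
          |         x = x.reshape(-1, win * win, c)
          |         x, _ = attn(x, x, x, need_weights=False)
          |         x = x.view(b, h // win, w // win, win, win, c)
          |         x = x.permute(0, 1, 3, 2, 4, 5)
          |         return x.reshape(b, h, w, c)
          | 
          |     def global_attn(x, attn):
          |         # global attention: every token attends to every token
          |         b, h, w, c = x.shape
          |         x = x.view(b, h * w, c)
          |         x, _ = attn(x, x, x, need_weights=False)
          |         return x.view(b, h, w, c)
          | 
          |     dim, heads = 256, 4
          |     attn = nn.MultiheadAttention(dim, heads, batch_first=True)
          |     x_hi = torch.randn(1, 64, 64, dim)     # high-res level
          |     x_lo = torch.randn(1, 16, 16, dim)     # low-res level
          |     y_hi = window_attn(x_hi, attn, win=8)  # cheap, local
          |     y_lo = global_attn(x_lo, attn)         # full, global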
        
         | nwoli wrote:
          | Which discord, if it's open to the public? I was on one with
          | Kath in 2021 and loved her insights; would love to again.
        
           | fpgaminer wrote:
           | Same; a good ML focused discord would be great. Training ViTs
           | all day is lonely work. I'm mostly locked into skimming the
           | "Research" channels of image generation discords. LAION used
           | to be decent with a good amount of interesting discussion,
           | but it seems to have devolved into toxicity in the last year.
        
             | l33tman wrote:
             | LAION is good
        
             | SEGyges wrote:
             | See my other comment replying to that.
        
           | SEGyges wrote:
           | You and the guy below you in this thread should probably tag
           | me on twitter, same tag as here, I can point you. I do not
           | especially want to leave the discord link in a frontpage hn
           | thread.
        
       | sorenjan wrote:
       | This is probably a stupid question, but what kind of image
       | generation does this do? The architecture overview shows "input
       | image", and I don't see anything about text to image. Is it super
       | resolution? Does class-conditional mean that it takes a class
       | like "car" or "face" and generate a new random image of that
       | class?
        
         | GaggiX wrote:
          | ImageNet is class-conditioned, FFHQ unconditioned.
         | 
         | >Does class-conditional mean that it takes a class like "car"
         | or "face" and generate a new random image of that class?
         | 
         | Yup
        
         | Birch-san wrote:
         | > Is it super resolution?
         | 
         | nope, we don't do Imagen-style super-resolution. we go direct
         | to high resolution with a single-stage model.
        
           | sorenjan wrote:
           | I was referring to the input image in the diagram, what is
           | that and how is the output image generated from it? Is it
           | 256x256 noise that gets denoised into an image? I guess what
           | I'm really asking is what guides the process into the final
           | image if it's not text to image?
        
             | stefanbaumann wrote:
             | The "input image" is just the noisy sample from the
             | previous timestep, yes.
             | 
             | The overall architecture diagram does not explicitly show
             | the conditioning mechanism, which is a small separate
             | network. For this paper, we only trained on class-
             | conditional ImageNet and completely unconditional
             | megapixel-scale FFHQ.
             | 
             | Training large-scale text-to-image models with this
             | architecture is something we have not yet attempted,
             | although there's no indication that this shouldn't work
             | with a few tweaks.
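              | 
              | In case it helps to picture it, a sampling loop has
              | roughly this shape (a generic Euler-style sketch, not
              | necessarily the exact sampler we use; `model`, `sigmas`,
              | and `class_label` are placeholders):
              | 
              |     import torch
              | 
              |     @torch.no_grad()
              |     def sample(model, sigmas, class_label, shape):
              |         # start from pure Gaussian noise at the highest
              |         # noise level
              |         x = torch.randn(shape) * sigmas[0]
              |         for sig, sig_next in zip(sigmas[:-1], sigmas[1:]):
              |             # the "input image" at each step is just the
              |             # current noisy sample x
              |             denoised = model(x, sig, class_label)
              |             d = (x - denoised) / sig
              |             x = x + d * (sig_next - sig)  # Euler step
              |         return x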
        
               | sorenjan wrote:
                | Thank you, I'm not used to reading this kind of
                | research paper, but I think I got the gist of it now.
               | 
               | Can this architecture be used to distill models that need
               | fewer timesteps like LCMs or SDXL turbo?
        
               | stefanbaumann wrote:
               | Both Latent Consistency Models and Adversarial Diffusion
               | Distillation (the method behind SDXL Turbo) are methods
               | that do not depend on any specific properties of the
               | backbone. So, as Hourglass Diffusion Transformers are
               | just a new kind of backbone that can be used just like
               | the Diffusion U-Nets in Stable Diffusion (XL), these
               | methods should also be applicable to it.
        
       | GaggiX wrote:
        | I hope all the insights about diffusion model training that
        | have been explored in the last few years will be used by
        | Stability AI to train their large text-to-image models, because
        | when it comes to that they just use the most basic pipeline you
        | can imagine, with plenty of problems that get "solved" by
        | workarounds. For example, to train SDXL they used the scheduler
        | from the DDPM paper (2020), the epsilon-objective, and
        | noise-offset, an ugly workaround that was created when people
        | realized that SD v1.5 wasn't able to generate images that were
        | very dark or very bright - a problem related to the
        | epsilon-objective, which causes the model to always generate
        | images with a mean close to 0 (the same as the Gaussian noise).
        | 
        | A few people have finetuned Stable Diffusion models on the
        | v-objective and solved the problem at the root.
        
         | SEGyges wrote:
         | I have good news about who wrote this paper
        
           | GaggiX wrote:
           | Two authors are from Stability AI, that's the reason why I
           | wrote the comment.
        
       | Birch-san wrote:
       | I'm one of the authors; happy to answer questions. this arch is
       | of course nice for high-resolution synthesis, but there's some
       | other cool stuff worth mentioning..
       | 
       | activations are small! so you can enjoy bigger batch sizes. this
       | is due to the 4x patching we do on the ingress to the model, and
       | the effectiveness of neighbourhood attention in joining patches
       | at the seams.
       | 
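        | to make the first point concrete: patching by 4 in each spatial
        | dimension cuts the token count by 16x before any attention
        | runs. a sketch with torch's pixel_unshuffle (however the real
        | ingress is implemented, the shapes work out the same; numbers
        | here are just illustrative):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     x = torch.randn(1, 3, 1024, 1024)   # megapixel RGB input
        |     patched = F.pixel_unshuffle(x, 4)   # (1, 48, 256, 256)
        |     tokens = patched.flatten(2).transpose(1, 2)
        |     print(tokens.shape)                 # (1, 65536, 48)
        |     # 65536 tokens instead of 1048576 raw pixels, before the
        |     # hierarchy downsamples further
        | 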
       | the model's inductive biases are pretty different than (for
       | example) a convolutional UNet's. the innermost levels seem to
       | train easily, so images can have good global coherence early in
       | training.
       | 
       | there's no convolutions! so you don't need to worry about
       | artifacts stemming from convolution padding, or having canvas
       | edge padding artifacts leak an implicit position bias.
       | 
       | we can finally see what high-resolution diffusion outputs look
       | like _without_ latents! personally I think current latent VAEs
       | don't _really_ achieve the high resolutions they claim (otherwise
       | fine details like text would survive a VAE roundtrip faithfully);
       | it's common to see latent diffusion outputs with smudgy skin or
       | blurry fur. what I'd like to see in the future of latent
       | diffusion is to listen to the Emu paper and use more channels, or
       | a less ambitious upsample.
       | 
       | it's a transformer! so we can try applying to it everything we
       | know about transformers, like sigma reparameterisation or
       | multimodality. some tricks like masked training will require
       | extra support in [NATTEN](https://github.com/SHI-Labs/NATTEN),
       | but we're very happy with its featureset and performance so far.
       | 
       | but honestly I'm most excited about the efficiency. there's too
       | little work on making pretraining possible at GPU-poor scale. so
       | I was very happy to see HDiT could succeed at small-scale tasks
       | within the resources I had at home (you can get nice oxford
       | flowers samples at 256x256px with half an hour on a 4090). I
       | think with models that are better fits for the problem, perhaps
       | we can get good results with smaller models. and I'd like to see
       | big tech go that direction too!
       | 
       | -Alex Birch
        
         | sophrocyne wrote:
         | Alex - I run Invoke (one of the popular OSS SD UIs for pros)
         | 
         | Thanks for your work - it's been impactful since the early days
         | of the project.
         | 
         | Excited to see where we get to this year.
        
           | Birch-san wrote:
           | ah, originally lstein/stable-diffusion? yeah that was an
           | important fork for us Mac users in the early days. I have to
           | confess I've still never used a UI. :)
           | 
           | this year I'm hoping for efficiency and small models! even if
           | it's proprietary. if our work can reduce some energy usage
           | behind closed doors that'd still be a good outcome.
        
             | sophrocyne wrote:
             | Yes, indeed. Lincoln's still an active maintainer.
             | 
             | Energy efficiency is key - Especially with some of these
             | extremely inefficient (wasteful, even) features like real-
             | time canvas.
             | 
             | Good luck - Let us know if/how we can help.
        
         | ttul wrote:
          | Hi Alex! Amazing work. I scanned the paper and dusted off my
         | aging memories of Jeremy Howard's course. Will your model live
         | happily alongside the existing SD infrastructure such as
         | ControlNet, IPAdapter, and the like? Obviously we will have to
         | retrain these to fit onto your model, but conceptually, does
         | your model have natural places where adapters of various kinds
         | can be attached?
        
           | Birch-san wrote:
           | regarding ControlNet: we have a UNet backbone, so the idea of
           | "make trainable copies of the encoder blocks" sounds
           | possible. the other part, "use a zero-inited dense layer to
           | project the peer-encoder output and add it to the frozen-
           | decoder output" also sounds fine. not quite sure what they do
           | with the mid-block but I doubt there'd be any problem there.
           | 
           | regarding IPAdapter: I'm not familiar with it, but from the
           | code it looks like they just run cross-attention again and
           | sum the two attention outputs. feels a bit weird to me,
           | because the attention probabilities add up to 2 instead of 1.
           | and they scale the bonus attention output only instead of
           | lerping. it'd make more sense to me to formulate it as a
           | cross-cross attention (Q against cat([key0, key1]) and
           | cat([val0, val1])), but maybe they wanted it to begin as a
           | no-op at the start of training or something. anyway.. yes,
           | all of that should work fine with HDiT. the paper doesn't
           | implement cross-attention, but it can be added in the
           | standard way (e.g. like stable-diffusion) or as self-cross
           | attention (e.g. DeepFloyd IF or Imagen).
           | 
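            | by "cross-cross attention" I just mean concatenating the
            | two contexts along the sequence dimension before a single
            | attention call, something like this (a sketch with made-up
            | shapes, not IPAdapter's actual code):
            | 
            |     import torch
            |     import torch.nn.functional as F
            | 
            |     b, heads, d = 2, 8, 64
            |     q = torch.randn(b, heads, 4096, d)     # image tokens
            |     k_txt = torch.randn(b, heads, 77, d)   # text context
            |     v_txt = torch.randn(b, heads, 77, d)
            |     k_img = torch.randn(b, heads, 257, d)  # image-prompt
            |     v_img = torch.randn(b, heads, 257, d)  # context
            | 
            |     # one softmax over both contexts, so the attention
            |     # probabilities still sum to 1
            |     out = F.scaled_dot_product_attention(
            |         q,
            |         torch.cat([k_txt, k_img], dim=-2),
            |         torch.cat([v_txt, v_img], dim=-2),
            |     )
            | 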
           | I'd recommend though to make use of HDiT's mapping network.
           | in our attention blocks, the input gets AdaNormed against the
           | condition from the mapping network. this is currently used to
           | convey stuff like class conditions, Karras augmentation
           | conditions and timestep embeddings. but it supports
           | conditioning on custom (single-token) conditions of your
           | choosing. so you could use this to condition on an image
           | embed (this would give you the same image-conditioning
           | control as IPAdapter but via a simpler mechanism).
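            | 
            | roughly, conditioning via the adaptive norm looks like this
            | (a generic sketch, not our exact implementation; dims and
            | names are illustrative):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     class AdaRMSNorm(nn.Module):
            |         # RMS-normalise x, then scale it with a per-channel
            |         # gain predicted from the mapping-network condition
            |         # (zero-init so training starts from a plain norm)
            |         def __init__(self, dim, cond_dim):
            |             super().__init__()
            |             self.to_gain = nn.Linear(cond_dim, dim,
            |                                      bias=False)
            |             nn.init.zeros_(self.to_gain.weight)
            | 
            |         def forward(self, x, cond):
            |             # x: (batch, tokens, dim); cond: (batch, cond_dim)
            |             var = x.pow(2).mean(-1, keepdim=True)
            |             gain = 1 + self.to_gain(cond).unsqueeze(1)
            |             return x * (var + 1e-6).rsqrt() * gain
            | 
            |     norm = AdaRMSNorm(dim=256, cond_dim=768)
            |     x = torch.randn(2, 4096, 256)   # image tokens
            |     cond = torch.randn(2, 768)      # e.g. an image embed
            |     y = norm(x, cond)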
        
           | bravura wrote:
            | Regarding IPAdapter, I'm curious whether there are useful
            | GUIs for this? Creating image masks by uploading to Colab
            | is not so cute.
        
             | orbital-decay wrote:
             | Here's one example: https://github.com/Acly/krita-ai-
             | diffusion/
             | 
              | But generally, most other UIs support it. It has serious
              | limitations, though; for example, it center-crops the
              | input to 224x224px (which is enough for a surprisingly
              | large number of uses, but not enough for many others).
        
               | ttul wrote:
               | Yes. I discussed this issue with the author of the
               | ComfyUI IP-Adapter nodes. It would doubtless be handy if
               | someone could end-to-end train a higher resolution IP-
               | Adapter model that integrated its own variant of
               | CLIPVision that is not subject to the 224px constraint. I
               | have no idea what kind of horsepower would be required
               | for that.
               | 
               | A latent space CLIPVision model would be cool too.
               | Presumably you could leverage the semantic richness of
               | the latent space to efficiently train a more powerful
               | CLIPVision. I don't know whether anyone has tried this.
               | Maybe there is a good reason for that.
        
         | bertdb wrote:
          | Did you do any inpainting experiments? I can imagine a pixel-
          | space diffusion model being better at it than one with a
          | latent auto-encoder.
        
           | stefanbaumann wrote:
            | Not yet, we focused on the architecture for this paper. I
            | totally agree with you though - pixel space is generally
            | less limiting than a latent space for diffusion, so we
            | would expect good behavior for inpainting and other editing
            | tasks.
        
         | michaelt wrote:
         | I appreciate the restraint of showing the speedup on a log-
         | scale chart rather than trying to show a 99% speed up any other
         | way.
         | 
         | I see your headline speed comparison is to "Pixel-space
         | DiT-B/4" - but how does your model compare to the likes of
         | SDXL? I gather they spent $$$$$$ on training etc, so I'd
         | understand if direct comparisons don't make sense.
         | 
         | And do you have any results on things that are traditionally
         | challenging for generative AI, like clocks and mirrors?
        
       | artninja1988 wrote:
       | Looking at the output image examples, very nice, although they
       | seem a little blurry. But I guess that's a dataset issue? Have
       | you tried training anything above 1024x1024? Hope someone
       | releases a model based on this since open source pixel space
       | models are a rarity afaik
        
         | Birch-san wrote:
         | the FFHQ-1024 examples shouldn't be blurry. you can download
         | the originals from the project page[0] -- click any image in
         | the teaser, or download our 50k samples.
         | 
         | the ImageNet-256 examples also aren't typically blurry (but
         | they are 256x256 so your viewer may be bicubic scaling them or
         | something). the ImageNet dataset _can_ have blurry, compressed
         | or low resolution training samples, which can afflict some
         | classes more than others, and we learn to produce samples like
         | the training set.
         | 
         | [0] https://crowsonkb.github.io/hourglass-diffusion-
         | transformers...
        
       | gutianpei wrote:
        | I think NATTEN does not support cross attention; I wonder if
        | the authors have tried any text-conditioned cases? Can cross-
        | attention only be added alongside the regular attention, or can
        | it be added through the adanorm?
        
         | Birch-san wrote:
         | cross-attention doesn't need to involve NATTEN. there's no
         | neighbourhood involved because it's not self-attention. so you
         | can do it the stable-diffusion way: after self-attention, run
         | torch sdp with Q=image and K=V=text.
         | 
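          | i.e. something like this (a minimal sketch; head counts and
          | sequence lengths are just illustrative):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     b, heads, d = 2, 8, 64
          |     img_q = torch.randn(b, heads, 4096, d)  # image tokens
          |     txt_k = torch.randn(b, heads, 77, d)    # text encoder out
          |     txt_v = torch.randn(b, heads, 77, d)
          | 
          |     # image queries attend over text keys/values
          |     out = F.scaled_dot_product_attention(img_q, txt_k, txt_v)
          | 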
         | I tried adding "stable-diffusion-style" cross-attn to HDiT,
         | text-conditioning on small class-conditional datasets (oxford
         | flowers), embedding the class labels as text prompts with
         | Phi-1.5. trained it for a few minutes, and the images were
         | relevant to the prompts, so it seemed to be working fine.
         | 
         | but if instead of a text condition you have a single-token
         | condition (class label) then yeah the adanorm would be a
         | simpler way.
        
       | fpgaminer wrote:
       | Seems like a solid paper from a skim through it. My rough
       | summary:
       | 
       | The popular large scale diffusion models like StableDiffusion are
       | CNN based at their heart, with attention layers sprinkled
       | throughout. This paper builds on recent research exploring
       | whether competitive image diffusion models can be built out of
       | purely transformers, no CNN layers.
       | 
        | In this paper they build a similar U-Net-like structure, but out
       | of transformer layers, to improve efficiency compared to a
       | straight Transformer. They also use local attention when the
       | resolution is high to save on computational cost, but regular
       | global attention in the middle to maintain global coherence.
       | 
       | Based on ablation studies this allows them to maintain or
       | slightly improve FID score compared to Transformer-only diffusion
        | models that don't use U-Net-like structures, but at 1/10th the
       | computation cost. An incredible feat for sure.
       | 
        | There is a variety of details: RoPE positional encoding, GEGLU
       | activations, RMSNorm, learnable skip connections, learnable
       | cosine-sim attention, neighborhood attention for the local
       | attention, etc.
       | 
       | The biggest gains in FID occur when the authors use "soft-min-
       | snr" as the loss function; FID drops from 41 to 28!
       | 
        | A lot of ablation work was done across all their changes (see
       | Table 1).
       | 
       | Training is otherwise completely standard AdamW, 5e-4, 0.01, 256
       | batch, constant LR, 400k steps for most experiments at 128x128
       | resolution.
       | 
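        | The optimizer setup in code is about as plain as it sounds (a
        | sketch with a stand-in model and loss; I'm reading the 0.01 as
        | weight decay):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     model = nn.Linear(8, 8)        # stand-in for the HDiT
        |     opt = torch.optim.AdamW(model.parameters(), lr=5e-4,
        |                             weight_decay=0.01)
        |     for step in range(400_000):    # constant LR, no schedule
        |         x = torch.randn(256, 8)    # batch size 256
        |         loss = (model(x) - x).pow(2).mean()  # stand-in loss
        |         opt.zero_grad()
        |         loss.backward()
        |         opt.step()
        | 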
       | So yeah, overall seems like solid work that combines a great
       | mixture of techniques and pushes Transformer based diffusion
       | forward.
       | 
       | If scaled up I'm not sure it would be "revolutionary" in terms of
       | FID compared to SDXL or DALLE3, mostly because SD and DALLE
        | already use attention, obviating the scaling issue, and lots of
        | tricks like diffusion-based VAEs. But it's likely to provide a
       | nice incremental improvement in FID, since in general
       | Transformers perform better than CNNs unless the CNNs are
       | _heavily_ tuned.
       | 
       | And being pixel based rather than latent based has many
       | advantages.
        
         | Birch-san wrote:
         | FID doesn't reward high-resolution detail. the inception
         | feature size is 299x299! so we are forced to _downsample_ our
         | FFHQ-1024 samples to compute FID.
         | 
         | it also doesn't punish poor detail either! this advantages
         | latent diffusion, which can claim to achieve a high resolution
         | but without actually needing to have correct textures to get
         | good metrics.
        
       | imjonse wrote:
       | Are there any public pretrained checkpoints available or planned?
        
       | clauer2024 wrote:
       | Can I ask one basic thing --> From what are the images generated?
        
         | ShamelessC wrote:
         | Not text but an id representing a class/category of images from
         | the dataset. Or they are "unconditional" and the model tries to
         | output something similar to a random image from the dataset
         | each time.
        
         | stefanbaumann wrote:
         | The models presented in the paper are trained on class-
         | conditional ImageNet (where the input is Gaussian noise and one
         | of 1000 classes, e.g., "car") and unconditional FFHQ (where the
         | input is only Gaussian noise).
        
       ___________________________________________________________________
       (page generated 2024-01-24 23:02 UTC)