https://crowsonkb.github.io/hourglass-diffusion-transformers/

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass
Diffusion Transformers

Katherine Crowson^*1, Stefan Andreas Baumann^*2, Alex Birch^*3,
Tanishq Mathew Abraham^1, Daniel Z Kaplan^4, Enrico Shippole^4
Stability AI^1, LMU Munich^2, Birchlabs^3, Independent Researchers^4
Preprint, 2024
^*Indicates Equal Contribution

Samples generated directly in RGB pixel space using our HDiT models
trained on FFHQ-1024^2 and ImageNet-256^2.

Abstract

We present the Hourglass Diffusion Transformer (HDiT), an image
generative model that exhibits linear scaling with pixel count,
supporting training at high-resolution (e.g. 1024^2) directly in
pixel-space. Building on the Transformer architecture, which is known
to scale to billions of parameters, it bridges the gap between the
efficiency of convolutional U-Nets and the scalability of
Transformers. HDiT trains successfully without typical
high-resolution training techniques such as multiscale architectures,
latent autoencoders or self-conditioning. We demonstrate that HDiT
performs competitively with existing models on ImageNet-256^2, and
sets a new state-of-the-art for diffusion models on FFHQ-1024^2.

Efficiency

Scaling of computational cost w.r.t. target resolution of our
HDiT-B/4 model vs. DiT-B/4 (Peebles & Xie, 2023), both in pixel
space. At megapixel resolutions, our model incurs less than 1% of the
computational cost of the standard diffusion transformer DiT at a
comparable size.

High-level Architecture Overview

High-level overview of our HDiT architecture, specifically the
version for ImageNet at an input resolution of 256^2 with patch size
p = 4, which has three levels. For each doubling in target
resolution, another neighborhood attention block is added. "lerp"
denotes a linear interpolation with a learnable interpolation weight.
All HDiT blocks have the noise level and the conditioning (embedded
jointly using a mapping network) as additional inputs.
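
The "lerp" merge can be sketched in a few lines. This is a minimal
sketch, not the released implementation: the class name LerpMerge, the
initial weight of 0.5, and the choice of which input the weight favors
are assumptions, and plain Python floats stand in for feature tensors
and for the trainable parameter.

```python
from typing import List, Sequence


class LerpMerge:
    """Blend skip-connection features with upsampled features via one scalar.

    In a real model the weight would be a trainable parameter; here it is a
    plain float. The initial value 0.5 is an illustrative assumption.
    """

    def __init__(self, weight: float = 0.5) -> None:
        self.weight = weight

    def __call__(self, skip: Sequence[float], upsampled: Sequence[float]) -> List[float]:
        if len(skip) != len(upsampled):
            raise ValueError("skip and upsampled must have the same length")
        w = self.weight
        # Linear interpolation: w = 0 returns skip, w = 1 returns upsampled.
        return [s + w * (u - s) for s, u in zip(skip, upsampled)]


if __name__ == "__main__":
    merge = LerpMerge()
    print(merge([0.0, 2.0], [4.0, 2.0]))
```

Compared with plain addition or concatenation, a single interpolation
weight keeps the merged features on the same scale as the two inputs
while letting training decide how much each path contributes.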

Files

We provide the 50k generated samples used for FID computation for our
557M ImageNet model without CFG (part 1, 2, 3, 4, 5, 6, 7, 8), with
CFG = 1.3 (part 1, 2, 3, 4, 5, 6, 7, 8), and for our FFHQ-1024^2
model without CFG (part 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21).

BibTeX

@misc{crowson2024hourglass,
    title = {{S}calable {H}igh-{R}esolution {P}ixel-{S}pace {I}mage {S}ynthesis with {H}ourglass {D}iffusion {T}ransformers},
    author = {Katherine Crowson and Stefan Andreas Baumann and Alex Birch and Tanishq Mathew Abraham and Daniel Z Kaplan and Enrico Shippole},
    year = {2024}
}

This page was built using the Academic Project Page Template which
was adopted from the Nerfies project page. You are free to borrow the
source code of this website; we just ask that you link back to this
page in the footer.
This website is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License.