https://iterative-refinement.github.io/

Image Super-Resolution via Iterative Refinement

Denoising diffusion models for image super-resolution and cascaded image generation.

Paper

* Chitwan Saharia
* Jonathan Ho
* William Chan
* Tim Salimans
* David Fleet
* Mohammad Norouzi

Google Research, Brain Team

Summary

We present SR3, an approach to image Super-Resolution via Repeated Refinement. SR3 adapts denoising diffusion probabilistic models to conditional image generation and performs super-resolution through a stochastic denoising process. Inference starts with pure Gaussian noise and iteratively refines the noisy output using a U-Net model trained on denoising at various noise levels (a minimal sketch of this refinement loop is given below, after the results). SR3 exhibits strong performance on super-resolution tasks at different magnification factors, on faces and natural images. We conduct a human evaluation on a standard 8x face super-resolution task on CelebA-HQ, comparing with state-of-the-art GAN methods. SR3 achieves a confusion rate close to 50%, suggesting photo-realistic outputs, while the GANs do not exceed a confusion rate of 34%. We further show the effectiveness of SR3 in cascaded image generation, where generative models are chained with super-resolution models, yielding a competitive FID score of 11.3 on ImageNet.

Super-Resolution Results

We demonstrate the performance of SR3 on the tasks of face and natural image super-resolution. We perform face super-resolution at 16x16 -> 128x128 and 64x64 -> 512x512. We also train face super-resolution models for 64x64 -> 256x256 and 256x256 -> 1024x1024, which effectively allows us to perform 16x super-resolution through cascading. We also explore 64x64 -> 256x256 super-resolution on natural images.

Figure: Super-resolution results. (Above) 64x64 -> 512x512 face super-resolution. (Below) 64x64 -> 256x256 natural image super-resolution.

Figure: Human evaluation results. We conduct a two-alternative forced-choice (2AFC) human evaluation experiment. Subjects are asked to choose between the reference high-resolution image and the model output. We measure the performance of the model through the confusion rate (the percentage of the time raters choose the model output over the reference image). (Above) We achieve close to a 50% confusion rate on the task of 16x16 -> 128x128 face super-resolution, outperforming state-of-the-art face super-resolution methods. (Below) We also achieve a 40% confusion rate on the more difficult task of 64x64 -> 256x256 natural image super-resolution, outperforming a regression baseline by a large margin.

Unconditional Generation Results

We generate unconditional 1024x1024 face images using a cascade of an unconditional diffusion model at 64x64 resolution followed by two 4x super-resolution models. We also generate 256x256 class-conditional natural images using a cascade of a class-conditional diffusion model at 64x64 resolution followed by a 4x super-resolution model. Cascaded generation allows the individual models to be trained in parallel, and inference is also efficient: lower-resolution models can use more refinement iterations, while higher-resolution models use fewer.

Figure: Cascaded generation of unconditional 1024x1024 faces.

Figure: Selected example generations of unconditional 1024x1024 faces.

Figure: Selected example generations of class-conditional 256x256 natural images. Each row contains examples from a particular class.
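To make the inference procedure described in the Summary concrete, here is a minimal sketch of the stochastic refinement loop: sampling starts from pure Gaussian noise and repeatedly denoises with a network conditioned on the low-resolution input. The function name `sr3_sample`, the epsilon-predicting `denoise_fn`, the linear beta schedule, and the tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of conditional iterative refinement (ancestral diffusion sampling).
# All names, shapes, and the noise schedule below are assumptions for illustration.
import torch


@torch.no_grad()
def sr3_sample(denoise_fn, x_lr, num_steps=1000, shape=(1, 3, 128, 128)):
    """Iteratively refine Gaussian noise into a high-resolution sample.

    denoise_fn(y_t, x_lr, t) is assumed to be a U-Net that predicts the noise
    present in y_t at step t, conditioned on the low-resolution image x_lr.
    """
    betas = torch.linspace(1e-4, 0.02, num_steps)   # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    y = torch.randn(shape)                          # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = denoise_fn(y, x_lr, t)                # predicted noise at this level
        # Remove the estimated noise (posterior mean of the less-noisy image).
        y = (y - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                   # re-inject noise except at the final step
            y = y + torch.sqrt(betas[t]) * torch.randn_like(y)
    return y
```

In practice the conditioning is implemented by upsampling x_lr to the target resolution and feeding it to the U-Net together with the noisy image; denoise_fn here simply stands in for that trained network, and even a dummy denoiser such as `lambda y, x, t: torch.zeros_like(y)` lets the loop run end to end.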
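The cascading strategy from the Unconditional Generation Results can likewise be summarized as chaining a base generative model with one or more super-resolution stages. The sketch below assumes model objects exposing a `sample` interface; these interfaces and the per-stage step counts are illustrative, not part of the released code.

```python
# Hedged sketch of cascaded generation: a low-resolution base diffusion model
# followed by a chain of super-resolution diffusion models. The sample()
# interfaces are assumptions for illustration.
def cascade_sample(base_model, sr_stages):
    """Generate a high-resolution image by chaining diffusion models.

    base_model.sample() is assumed to return a low-resolution image (e.g. 64x64);
    each (sr_model, num_steps) pair in sr_stages is assumed to upscale its input,
    with fewer refinement steps typically used at higher resolutions.
    """
    image = base_model.sample()
    for sr_model, num_steps in sr_stages:
        image = sr_model.sample(condition=image, num_steps=num_steps)
    return image
```

For the 1024x1024 face results above, sr_stages would hold the 64x64 -> 256x256 and 256x256 -> 1024x1024 models; because the stages are independent, they can be trained in parallel as described.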
Related projects

* Cascaded Diffusion Models
* Palette: Image-to-Image Diffusion Models

Citation

For more details and additional results, read the full paper.

@article{saharia2021image,
  title={Image super-resolution via iterative refinement},
  author={Saharia, Chitwan and Ho, Jonathan and Chan, William and Salimans, Tim and Fleet, David J and Norouzi, Mohammad},
  journal={arXiv:2104.07636},
  year={2021}
}