https://yashkant.github.io/spad/

SPAD: Spatially Aware Multiview Diffusers

* Yash Kant (University of Toronto & Vector Institute)
* Ziyi Wu (University of Toronto & Vector Institute)
* Michael Vasilkovsky (Snap Research)
* Guocheng Qian (KAUST)
* Jian Ren (Snap Research)
* Riza Alp Guler (Snap Research)
* Bernard Ghanem (KAUST)
* Sergey Tulyakov* (Snap Research)
* Igor Gilitschenski* (University of Toronto & Vector Institute)
* Aliaksandr Siarohin* (Snap Research)

+ Paper
+ Slides

-----------------------------------------------------------------

Citation: @article{spad2023, ...}

-----------------------------------------------------------------

Overview

[Figure: eight views generated by SPAD]

Given a text prompt, our method synthesizes 3D-consistent views of the same object. Our model can generate many images from arbitrary camera viewpoints, while being trained on only four views. Here, we show eight views sampled uniformly at a fixed elevation, generated in a single forward pass.

-----------------------------------------------------------------

Method

[Figure: SPAD pipeline]

Model pipeline. (a) We fine-tune a pre-trained text-to-image diffusion model on multi-view renderings of 3D objects. (b) Our model jointly denoises noisy multi-view images conditioned on text and relative camera poses. To enable cross-view interaction, we apply 3D self-attention by concatenating all views, and enforce epipolar constraints on the attention map. (c) We further add Plücker Embeddings to the attention layers as positional encodings to enhance camera control.

[Figure: sub-modules]

Epipolar Attention (left). For each source point S on a feature map, we compute its epipolar lines on all other views. S attends only to features along these lines, plus all points on its own view (blue points).

Illustration of one block in SPAD (right). We add Plücker Embeddings to the feature maps in the self-attention layer by inflating the original QKV projection layers with zero-initialized projections.
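To make the epipolar constraint concrete, below is a minimal sketch of how such a pairwise attention mask could be computed. It assumes square feature maps of size `res`, a shared pinhole intrinsic matrix `K` expressed in feature-grid pixels, world-to-camera extrinsics [R | t] per view, and a one-pixel distance threshold around each epipolar line; the function names and the thresholding rule are illustrative assumptions, not the released implementation.

```python
# Sketch: epipolar attention mask between a source view and a target view.
# Conventions assumed here: x_cam = R @ x_world + t (world-to-camera extrinsics),
# shared intrinsics K in feature-grid pixel units.
import torch

def fundamental_matrix(K, R_src, t_src, R_tgt, t_tgt):
    """F such that a homogeneous source pixel x maps to the epipolar line l = F @ x in the target view."""
    # Relative pose taking source-camera coordinates to target-camera coordinates.
    R_rel = R_tgt @ R_src.T
    t_rel = t_tgt - R_rel @ t_src
    tx = torch.zeros(3, 3)                       # skew-symmetric cross-product matrix [t_rel]_x
    tx[0, 1], tx[0, 2] = -t_rel[2], t_rel[1]
    tx[1, 0], tx[1, 2] = t_rel[2], -t_rel[0]
    tx[2, 0], tx[2, 1] = -t_rel[1], t_rel[0]
    E = tx @ R_rel                               # essential matrix
    K_inv = torch.linalg.inv(K)
    return K_inv.T @ E @ K_inv

def epipolar_mask(K, R_src, t_src, R_tgt, t_tgt, res, thresh=1.0):
    """Boolean mask of shape [res*res, res*res]; entry (s, t) is True if target
    location t lies within `thresh` pixels of the epipolar line of source location s."""
    ys, xs = torch.meshgrid(torch.arange(res), torch.arange(res), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float().reshape(-1, 3)
    F = fundamental_matrix(K, R_src, t_src, R_tgt, t_tgt)
    lines = pix @ F.T                                              # epipolar line (a, b, c) per source pixel
    dist = (lines[:, None, :] * pix[None, :, :]).sum(-1).abs()     # |a*x + b*y + c| for every pair
    dist = dist / lines[:, :2].norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return dist <= thresh
```

In the full 3D self-attention over the concatenated views, one such mask would be computed per ordered view pair and assembled into a block-structured attention bias, with the diagonal (self-view) blocks left fully unmasked so that each point also attends to every point on its own view.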
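Similarly, here is a sketch of the Plücker-ray positional encoding and of widening a pretrained Q/K/V projection with zero-initialized columns, under the same camera conventions as above. Concatenating the 6-channel Plücker map to the attention input, and the helper names, are our assumptions rather than the paper's exact implementation.

```python
# Sketch: per-pixel Plücker ray embedding and zero-initialized inflation of a
# pretrained projection layer (e.g. the to_q / to_k / to_v linears of an attention block).
import torch
import torch.nn as nn

def plucker_embedding(K, R, t, res):
    """Per-pixel Plücker ray coordinates (d, o x d), shape [res, res, 6].
    R, t are world-to-camera, so the camera center in world coordinates is o = -R^T t."""
    ys, xs = torch.meshgrid(torch.arange(res, dtype=torch.float32),
                            torch.arange(res, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # [res, res, 3]
    dirs = pix @ torch.linalg.inv(K).T @ R        # K^-1 then R^T, applied per row-vector pixel
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = (-R.T @ t).expand_as(dirs)           # camera center, repeated for every pixel
    return torch.cat([dirs, torch.cross(origin, dirs, dim=-1)], dim=-1)

def inflate_linear(layer, extra_in):
    """Widen a pretrained nn.Linear to accept `extra_in` additional input channels,
    zero-initializing the new weight columns so the output is unchanged at init."""
    new = nn.Linear(layer.in_features + extra_in, layer.out_features,
                    bias=layer.bias is not None)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :layer.in_features].copy_(layer.weight)
        if layer.bias is not None:
            new.bias.copy_(layer.bias)
    return new
```

Because the new weight columns start at zero, the inflated layers reproduce the pretrained model exactly at the beginning of fine-tuning, and the camera-dependent behavior is learned gradually.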
-----------------------------------------------------------------

Text-to-3D Generation: Multi-view Triplane Generator

[Figure: 3D-Triplane-NeRF]

Similar to Instant3D, we train a multi-view-to-NeRF generator. We use four multi-view generations from SPAD (shown in the bottom half) as input to the generator to create 3D assets. The entire generation takes ~10 seconds.

Text-to-3D Generation: Multi-view SDS

Thanks to our 3D-consistent multi-view generation, we can leverage multi-view Score Distillation Sampling (SDS) for 3D asset generation. We integrate SPAD into threestudio and follow the training setting of MVDream to train a NeRF.

-----------------------------------------------------------------

Quantitative Result: Novel View Synthesis

To evaluate the 3D consistency of our method, we adapt SPAD to the image-conditioned novel view synthesis task. We test on 1,000 unseen Objaverse objects and all objects from the Google Scanned Objects (GSO) dataset.

[Figure: NVS results]

SPAD preserves structural and perceptual details faithfully. Our method achieves competitive results on PSNR and SSIM, and sets a new state of the art on LPIPS.

-----------------------------------------------------------------

Qualitative Result: Comparison with MVDream

[Figure: comparison with MVDream]

Qualitative Result: Close View Generation

We demonstrate smooth transitions between views by generating close viewpoints, each varying by 10 degrees in azimuth.

[Figure: close views, 10 degrees apart in azimuth]

Qualitative Result: Ablation Study

We ablate various design choices of our method to demonstrate their importance.

[Figure: ablation]

Epipolar Attention promotes better camera control in SPAD. Directly applying 3D self-attention across all the views leads to content copying between generated images, as highlighted by the blue circles. Plücker Embeddings help prevent the generation of flipped views. Without Plücker Embeddings, the model sometimes predicts image regions that are rotated by 180 degrees, as highlighted by the red circles, due to the ambiguity of epipolar lines.

-----------------------------------------------------------------

The website template was borrowed from Michael Gharbi.