https://yashkant.github.io/spad/

SPAD: Spatially Aware Multiview Diffusers

* Yash Kant (University of Toronto & Vector Institute)
* Ziyi Wu (University of Toronto & Vector Institute)
* Michael Vasilkovsky (Snap Research)
* Guocheng Qian (KAUST)
* Jian Ren (Snap Research)
* Riza Alp Guler (Snap Research)
* Bernard Ghanem (KAUST)
* Sergey Tulyakov* (Snap Research)
* Igor Gilitschenski* (University of Toronto & Vector Institute)
* Aliaksandr Siarohin* (Snap Research)

+ Paper
+ Slides

-----------------------------------------------------------------

Citation: @article{spad2023, ...}

-----------------------------------------------------------------

Overview

[Figure: eight views generated by SPAD]

Given a text prompt, our method synthesizes 3D-consistent views of the same object. Our model can generate many images from arbitrary camera viewpoints, while being trained on only four views. Here, we show eight views sampled uniformly at a fixed elevation, generated in a single forward pass.

-----------------------------------------------------------------

Method

[Figure: SPAD pipeline]

Model pipeline. (a) We fine-tune a pre-trained text-to-image diffusion model on multi-view renderings of 3D objects. (b) Our model jointly denoises noisy multi-view images conditioned on text and relative camera poses. To enable cross-view interaction, we apply 3D self-attention by concatenating all views, and enforce epipolar constraints on the attention map. (c) We further add Plücker Embeddings to the attention layers as positional encodings to enhance camera control.

[Figure: sub-modules]

Epipolar Attention (left). For each source point S on a feature map, we compute its epipolar lines on all other views. S attends only to features along these lines, plus all points on its own view (blue points).

Illustration of one block in SPAD (right). We add Plücker Embeddings to the feature maps in the self-attention layer by inflating the original QKV projection layers with zero-initialized projections.
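To make the epipolar constraint concrete, below is a minimal sketch of how such a pairwise attention mask could be computed. It assumes square feature maps of size `res`, a shared pinhole intrinsic matrix `K` expressed in feature-grid pixels, world-to-camera extrinsics [R | t] per view, and a one-pixel distance threshold around each epipolar line; the function names and the thresholding rule are illustrative assumptions, not the released implementation.

```python
# Sketch: epipolar attention mask between a source view and a target view.
# Conventions assumed here: x_cam = R @ x_world + t (world-to-camera extrinsics),
# shared intrinsics K in feature-grid pixel units.
import torch

def fundamental_matrix(K, R_src, t_src, R_tgt, t_tgt):
    """F such that a homogeneous source pixel x maps to the epipolar line l = F @ x in the target view."""
    # Relative pose taking source-camera coordinates to target-camera coordinates.
    R_rel = R_tgt @ R_src.T
    t_rel = t_tgt - R_rel @ t_src
    tx = torch.zeros(3, 3)                       # skew-symmetric cross-product matrix [t_rel]_x
    tx[0, 1], tx[0, 2] = -t_rel[2], t_rel[1]
    tx[1, 0], tx[1, 2] = t_rel[2], -t_rel[0]
    tx[2, 0], tx[2, 1] = -t_rel[1], t_rel[0]
    E = tx @ R_rel                               # essential matrix
    K_inv = torch.linalg.inv(K)
    return K_inv.T @ E @ K_inv

def epipolar_mask(K, R_src, t_src, R_tgt, t_tgt, res, thresh=1.0):
    """Boolean mask of shape [res*res, res*res]; entry (s, t) is True if target
    location t lies within `thresh` pixels of the epipolar line of source location s."""
    ys, xs = torch.meshgrid(torch.arange(res), torch.arange(res), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float().reshape(-1, 3)
    F = fundamental_matrix(K, R_src, t_src, R_tgt, t_tgt)
    lines = pix @ F.T                                              # epipolar line (a, b, c) per source pixel
    dist = (lines[:, None, :] * pix[None, :, :]).sum(-1).abs()     # |a*x + b*y + c| for every pair
    dist = dist / lines[:, :2].norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return dist <= thresh
```

In the full 3D self-attention over the concatenated views, one such mask would be computed per ordered view pair and assembled into a block-structured attention bias, with the diagonal (self-view) blocks left fully unmasked so that each point also attends to every point on its own view.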
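Similarly, here is a sketch of the Plücker-ray positional encoding and of widening a pretrained Q/K/V projection with zero-initialized columns, under the same camera conventions as above. Concatenating the 6-channel Plücker map to the attention input, and the helper names, are our assumptions rather than the paper's exact implementation.

```python
# Sketch: per-pixel Plücker ray embedding and zero-initialized inflation of a
# pretrained projection layer (e.g. the to_q / to_k / to_v linears of an attention block).
import torch
import torch.nn as nn

def plucker_embedding(K, R, t, res):
    """Per-pixel Plücker ray coordinates (d, o x d), shape [res, res, 6].
    R, t are world-to-camera, so the camera center in world coordinates is o = -R^T t."""
    ys, xs = torch.meshgrid(torch.arange(res, dtype=torch.float32),
                            torch.arange(res, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # [res, res, 3]
    dirs = pix @ torch.linalg.inv(K).T @ R        # K^-1 then R^T, applied per row-vector pixel
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = (-R.T @ t).expand_as(dirs)           # camera center, repeated for every pixel
    return torch.cat([dirs, torch.cross(origin, dirs, dim=-1)], dim=-1)

def inflate_linear(layer, extra_in):
    """Widen a pretrained nn.Linear to accept `extra_in` additional input channels,
    zero-initializing the new weight columns so the output is unchanged at init."""
    new = nn.Linear(layer.in_features + extra_in, layer.out_features,
                    bias=layer.bias is not None)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :layer.in_features].copy_(layer.weight)
        if layer.bias is not None:
            new.bias.copy_(layer.bias)
    return new
```

Because the new weight columns start at zero, the inflated layers reproduce the pretrained model exactly at the beginning of fine-tuning, and the camera-dependent behavior is learned gradually.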
-----------------------------------------------------------------

Text-to-3D Generation: Multi-view Triplane Generator

[Figure: 3D-Triplane-NeRF]

Similar to Instant3D, we train a multi-view-to-NeRF generator. We use four multi-view generations from SPAD (shown in the bottom half) as input to the generator to create 3D assets. The entire generation takes ~10 seconds.

Text-to-3D Generation: Multi-view SDS

Thanks to our 3D-consistent multi-view generation, we can leverage multi-view Score Distillation Sampling (SDS) for 3D asset generation. We integrate SPAD into threestudio and follow the training setting of MVDream to train a NeRF.

-----------------------------------------------------------------

Quantitative Result: Novel View Synthesis

To evaluate the 3D consistency of our method, we adapt SPAD to the image-conditioned novel view synthesis task. We test on 1,000 unseen Objaverse objects and all objects from the Google Scanned Objects (GSO) dataset.

[Figure: NVS results]

SPAD preserves structural and perceptual details faithfully. Our method achieves competitive results on PSNR and SSIM, and sets a new state of the art on LPIPS.

-----------------------------------------------------------------

Qualitative Result: Comparison with MVDream

[Figure: comparison with MVDream]

Qualitative Result: Close View Generation

We demonstrate smooth transitions between views by generating close viewpoints, each varying by 10 degrees in azimuth.

[Figure: close views, 10 degrees apart in azimuth]

Qualitative Result: Ablation Study

We ablate various design choices of our method to demonstrate their importance.

[Figure: ablation]

Epipolar Attention promotes better camera control in SPAD. Directly applying 3D self-attention across all the views leads to content copying between generated images, as highlighted by the blue circles. Plücker Embeddings help prevent the generation of flipped views. Without Plücker Embeddings, the model sometimes predicts image regions that are rotated by 180 degrees, as highlighted by the red circles, due to the ambiguity of epipolar lines.

-----------------------------------------------------------------

The website template was borrowed from Michael Gharbi.