https://zero123.cs.columbia.edu/

Zero-1-to-3: Zero-shot One Image to 3D Object

Ruoshi Liu (Columbia University), Rundi Wu (Columbia University), Basile Van Hoorick (Columbia University), Pavel Tokmakov (Toyota Research Institute), Sergey Zakharov (Toyota Research Institute), Carl Vondrick (Columbia University)

TL;DR: We learn to control the camera perspective in large-scale diffusion models, enabling zero-shot novel view synthesis and 3D reconstruction from a single image.

Links: Paper (arXiv) | Code | Pretrained models

Abstract

We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls over the relative camera viewpoint, which allow new images of the same object to be generated under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.

Method

We learn a view-conditioned diffusion model that can subsequently control the viewpoint of an image containing a novel object (left). Such a diffusion model can also be used to train a NeRF for 3D reconstruction (right). Please refer to our paper for more details, or check out our code for the implementation. A minimal, illustrative sketch of the viewpoint conditioning is given at the end of this page.

Novel View Synthesis

Here are some uncurated inference results from in-the-wild images we tried, along with images from the Google Scanned Objects and RTMV datasets. Note that the demo only offers rotation angles quantized to 30-degree increments due to the limited storage space of the hosting server. If you want to try a fully custom demo running on a GPU server that allows you to upload your own image, please check out our code!

Text to Image to Novel Views

Here are results of applying Zero-1-to-3 to images generated by DALL-E 2.

Single-View 3D Reconstruction

Here are results of applying Zero-1-to-3 to obtain a full 3D reconstruction from the input image shown on the left. We compare our reconstruction with state-of-the-art models in single-view 3D reconstruction (columns: Input Image | Ground Truth | Ours | Point-E | MCC).

Related Work

Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation
DreamFusion: Text-to-3D using 2D Diffusion
SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction
NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views
NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors
RealFusion: 360° Reconstruction of Any Object from a Single Image

BibTeX

@misc{liu2023zero1to3,
  title={Zero-1-to-3: Zero-shot One Image to 3D Object},
  author={Ruoshi Liu and Rundi Wu and Basile Van Hoorick and Pavel Tokmakov and Sergey Zakharov and Carl Vondrick},
  year={2023},
  eprint={2303.11328},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
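
Sketch: viewpoint conditioning

As described in the Method section, the diffusion model is conditioned on the input image together with the relative camera transformation between the input and target views. The PyTorch snippet below is a minimal sketch of how such a conditioning vector could be assembled, assuming (as in the paper) an image embedding concatenated with the relative pose (change in polar angle, azimuth, and radius); the class and argument names (ViewConditioner, clip_embed, etc.) are illustrative and are not the actual identifiers from the Zero-1-to-3 codebase.

# Minimal sketch of the relative-viewpoint conditioning described above.
# Assumption: the model is conditioned on a CLIP image embedding of the input
# view concatenated with the relative camera pose; names here are illustrative.
import math
import torch
import torch.nn as nn

class ViewConditioner(nn.Module):
    """Maps (image embedding, relative camera pose) to a conditioning vector."""
    def __init__(self, clip_dim: int = 768):
        super().__init__()
        # Pose encoded as [d_polar, sin(d_azimuth), cos(d_azimuth), d_radius];
        # sin/cos handles the wrap-around of the azimuth angle.
        self.proj = nn.Linear(clip_dim + 4, clip_dim)

    def forward(self, clip_embed, d_polar, d_azimuth, d_radius):
        pose = torch.stack(
            [d_polar, torch.sin(d_azimuth), torch.cos(d_azimuth), d_radius], dim=-1)
        # Concatenate image semantics with the relative viewpoint and project
        # back to the dimensionality expected by the denoiser's cross-attention.
        return self.proj(torch.cat([clip_embed, pose], dim=-1))

# Example: request a view rotated 30 degrees in azimuth around the object.
conditioner = ViewConditioner()
clip_embed = torch.randn(1, 768)  # stand-in for a CLIP image embedding
cond = conditioner(clip_embed,
                   d_polar=torch.zeros(1),
                   d_azimuth=torch.full((1,), math.radians(30.0)),
                   d_radius=torch.zeros(1))
print(cond.shape)  # torch.Size([1, 768])

In the full system, this conditioning vector would be fed to the view-conditioned diffusion model at every denoising step, so that sampling produces the same object under the requested camera transformation; see the paper and code for the actual architecture.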