[HN Gopher] Zero-1-to-3: Zero-shot One Image to 3D Object
___________________________________________________________________
Zero-1-to-3: Zero-shot One Image to 3D Object
Author : GaggiX
Score : 497 points
Date : 2023-03-21 03:24 UTC (19 hours ago)
(HTM) web link (zero123.cs.columbia.edu)
(TXT) w3m dump (zero123.cs.columbia.edu)
| nico wrote:
| This is feeling almost like thought-to-launch.
|
| In the last week, a lot of the ideas I've read about in the HN
| comments have then shown up as full-blown projects on the front
| page.
|
| As if people are building at an insane speed from idea to
| launch/release.
| nmfisher wrote:
| Just yesterday I was literally musing to myself "I wonder if
| NeRFs would help with 3D object synthesis", and here we are.
|
| It's definitely a fun time to be involved.
| regegrt wrote:
| It's not based on the NeRF concept though, is it?
|
| Its outputs can provide the inputs for NeRF training, which
| is why they mention NeRFs. But it's not NeRF technology.
| [deleted]
| popinman322 wrote:
| NeRFs are a form of inverse renderer; this paper uses Score
| Jacobian Chaining[0] instead. Model reconstruction from NeRFs
| is also an active area of research. Check out the "Model
| Reconstruction" section of Awesome NeRF[1].
|
| From the SJC paper:
|
| > We introduce a method that converts a pretrained 2D
| diffusion generative model on images into a 3D generative
| model of radiance fields, without requiring access to any 3D
| data. The key insight is to interpret diffusion models as
| function f with parameters θ, i.e., x = f(θ). Applying the
| chain rule through the Jacobian ∂x/∂θ converts a
| gradient on image x into a gradient on the parameter θ.
|
| > Our method uses differentiable rendering to aggregate 2D
| image gradients over multiple viewpoints into a 3D asset
| gradient, and lifts a generative model from 2D to 3D. We
| parameterize a 3D asset θ as a radiance field stored on
| voxels and choose f to be the volume rendering function.
|
| Interpretation: they take multiple input views, then optimize
| parameters (a voxel grid in this case) to a differentiable
| renderer (the volume rendering function for voxels) such that
| they can reproduce the input views.
|
| [0]: https://pals.ttic.edu/p/score-jacobian-chaining [1]:
| https://github.com/awesome-NeRF/awesome-NeRF
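|
| A minimal, purely illustrative PyTorch sketch of that lifting
| idea (not the actual SJC code): treat rendering as a
| differentiable function f(θ) and push per-view 2D gradients
| back onto one shared 3D parameter grid. The "renderer" below is
| a toy transmittance integral along an axis, and the random
| target images are placeholders for the 2D guidance; in SJC the
| image-space gradient comes from the diffusion model's score
| rather than a pixel loss.
|
|       import torch
|
|       def render_silhouette(density, axis):
|           # Toy differentiable "volume rendering": integrate
|           # density along one axis and map it to opacity,
|           # giving a 2D view of the 3D grid.
|           sigma = torch.nn.functional.softplus(density)
|           return 1.0 - torch.exp(-sigma.sum(dim=axis))
|
|       density = torch.zeros(32, 32, 32, requires_grad=True)
|       targets = {0: torch.rand(32, 32), 2: torch.rand(32, 32)}
|       opt = torch.optim.Adam([density], lr=1e-1)
|
|       for step in range(200):
|           opt.zero_grad()
|           # Aggregate 2D image gradients from several views
|           # into a gradient on the single 3D asset (theta).
|           loss = sum((render_silhouette(density, ax) - img)
|                      .pow(2).mean()
|                      for ax, img in targets.items())
|           loss.backward()
|           opt.step()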
| noduerme wrote:
| it's actually a really fun time to know how to sculpt in
| ZBrush and print out models.
| nmfisher wrote:
| If I had any artistic talent whatsoever, I'd probably agree
| with you!
| noduerme wrote:
| I won't lie... ZBrush is brutally hard. I got a
| subscription for work and only used it for one paid job,
| ever. But it's super satisfying if you just want to spend
| Sunday night making a clay elephant or rhinoceros, and
| drop $20 to have the file printed out and shipped to you
| by Thursday. I've fed lots of my sculpture renderings to
| Dali and gotten some pretty cool 2D results... but
| nothing nearly as cool as the little asymmetrical epoxy
| sculptures I can line up on the bookshelf...
| intelVISA wrote:
| GPT4 + Python... the product basically writes itself!
|
| Until the oceans boil...
| junon wrote:
| I know this is a joke, but the heat dissipated by the
| electronics themselves is unmeasurably small. It's how we
| generate the power that's the problem.
| taneq wrote:
| Or what answers we ask the electronics for... "Univac, how
| do I increase entropy?" _distant rumble of cooling fans_
| arthurcolle wrote:
| You mean decrease entropy?
| robertlagrant wrote:
| ChatGPT-5 will be written by ChatGPT4? :)
| kindofabigdeal wrote:
| Doubt
| knodi123 wrote:
| If I've been reading it correctly, the power of ChatGPT is
| in the training and data, not necessarily the algorithm.
|
| And I'm not sure if it's technically possible for one AI to
| train another AI _with the same algorithm_ and have better
| performance. Although I could be wrong about any and
| everything. :-)
| BizarroLand wrote:
| I know that NVidia is using AI that is running on NVidia
| chips to create new chips that they then run AI on.
|
| All you have left to do is to AI the process of training
| AI, kind of like how building a lathe by hand makes a so-so
| lathe, but that so-so lathe can then be used to build a
| better and more accurate lathe.
| digdugdirk wrote:
| I actually love this analogy. People tend to not
| appreciate just how precise modern manufacturing
| equipment is.
|
| All of that modern machinery was essentially bootstrapped
| off a couple of relatively flat rocks. It's going to be
| interesting to see where this LLM stuff goes when the
| feedback loop is this quick and so much brainpower is
| focused on it.
|
| One of my sneaking suspicions is that
| Facebook/Google/Amazon/Microsoft/etc. would have been
| better off keeping employees on the books, if for no other
| reason than to keep thousands of skilled developers
| occupied, rather than cutting loose thousands of people who
| now have an axe to grind during a time of _rapid_
| technological progress.
| visarga wrote:
| An LLM by itself could generate data and code and iterate on
| its own training process, and thus create another LLM from
| scratch. There is a path to improve LLMs without organic
| text: connect them to real systems and allow them feedback.
| They can learn from the feedback on their actions. It could
| be as simple as a Python execution environment, a game, a
| simulator, other chat bots, or a more complex system like
| real-world tests.
| amelius wrote:
| Is image classification at the point yet where you can train it
| with one or a few examples (plus perhaps some textual
| explanation)?
| f38zf5vdt wrote:
| Image classification is still a difficult task, especially if
| there are only a few examples. Training a high-resolution,
| 1,000-class ImageNet classifier on 1M+ images from scratch is
| a drag involving hundreds or thousands of GPU hours. You can
| do low-resolution classifiers more easily, but they're less
| accurate.
|
| There are tricks to do it faster, but they all involve using
| other vision models that were themselves trained for just as
| long.
| amelius wrote:
| But can't something like GPT help here? For example you
| show it a picture of a cat, then you say "this is a cat;
| cats are furry creatures with claws, etc." and then you
| show it another image and ask if it is also a cat.
| aleph_infinity wrote:
| This paper
| https://cv.cs.columbia.edu/sachit/classviadescr/ (from
| the same lab as the main post, funnily) does something
| along those lines with GPT. It shows that for things that are
| easy to describe, like Wordle ("tiled letters, some are
| yellow and green"), you can recognize them with zero
| training. For things that are harder to describe we'll
| probably need new approaches, but it's an interesting
| direction.
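|
| A rough sketch of that descriptor trick with the OpenAI `clip`
| package (not the paper's exact scoring): score an image against
| a few short descriptions per class and pick the class whose
| descriptions match best. The descriptor strings and the
| "photo.jpg" path below are hand-written stand-ins for what
| you'd get from GPT and your own data.
|
|       import clip, torch
|       from PIL import Image
|
|       model, preprocess = clip.load("ViT-B/32", device="cpu")
|
|       # Hand-written stand-ins for GPT-generated descriptors.
|       descriptors = {
|           "wordle": ["a grid of tiled letters",
|                      "letter tiles colored yellow and green"],
|           "cat": ["a furry animal with whiskers and claws",
|                   "a small pet with pointed ears"],
|       }
|
|       image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
|       with torch.no_grad():
|           img = model.encode_image(image)
|           img = img / img.norm(dim=-1, keepdim=True)
|           scores = {}
|           for label, descs in descriptors.items():
|               txt = clip.tokenize(
|                   [f"a photo of a {label}, which has {d}"
|                    for d in descs])
|               txt = model.encode_text(txt)
|               txt = txt / txt.norm(dim=-1, keepdim=True)
|               # Mean cosine similarity over the descriptors.
|               scores[label] = (img @ txt.T).mean().item()
|
|       print(max(scores, key=scores.get))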
| f38zf5vdt wrote:
| You are humanizing token prediction. The multimodal
| models for text-vision were all established using a
| scaffold of architectures that unified text-token and
| vision-token similarity, e.g. BLIP-2 [1]. It's possible
| that a model using unified representations might be able
| to establish that the set of visual tokens you are
| searching for corresponds to some set of text tokens, but
| only if the pretrained weights for the vision encoder are
| able to extract the features corresponding to the object
| you are describing to the vision model.
|
| And the pretrained vision encoder will have at some point
| been trained to maximize text-visual token cosine similarity
| on matched pairs in some training set, so it really depends
| on what exactly that training set had in it.
|
| [1] https://arxiv.org/pdf/2301.12597.pdf
| GaggiX wrote:
| If you have a few examples you can use an already trained
| encoder (like the CLIP image encoder) and train an SVM on the
| embeddings; no need to train a neural network.
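|
| As a rough sketch, assuming the OpenAI `clip` package,
| scikit-learn, and a handful of labelled example images (the
| file paths below are hypothetical); only the SVM is trained,
| the encoder stays frozen:
|
|       import clip, torch
|       from PIL import Image
|       from sklearn.svm import SVC
|
|       model, preprocess = clip.load("ViT-B/32", device="cpu")
|
|       def embed(path):
|           # Frozen CLIP image embedding for one image file.
|           with torch.no_grad():
|               x = preprocess(Image.open(path)).unsqueeze(0)
|               return model.encode_image(x).squeeze(0).numpy()
|
|       # A few labelled examples per class (hypothetical paths).
|       paths = ["cat1.jpg", "cat2.jpg", "dog1.jpg", "dog2.jpg"]
|       labels = ["cat", "cat", "dog", "dog"]
|
|       clf = SVC(kernel="linear")
|       clf.fit([embed(p) for p in paths], labels)
|       print(clf.predict([embed("mystery.jpg")]))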
| dimatura wrote:
| People are definitely building at a high pace, but for what
| it's worth, this isn't the first work to tackle this problem,
| as you can see from the references. The results are impressive
| though!
| noduerme wrote:
| yeah, the road to hell is paved with a desperate need for
| upvotes (and angel investment).
| lofatdairy wrote:
| This is insanely impressive, looking at the 3D reconstruction
| results. If I'm not mistaken, occlusions are where a lot of
| attention is being placed in pose estimation problems, and if there
| are enough annotated environmental spaces to create ground truths,
| you could probably add environment reconstruction to pose
| reconstruction. What's nice there is that if you have multiple
| angles of an environment from a moving camera in a video, you can
| treat each previous frame as a prior which helps with prediction
| time and accuracy.
| jonplackett wrote:
| Would this be useful for a robot / car trying to navigate to be
| able to do this?
| elif wrote:
| unlikely. the front bumper of a car you are following has zero
| value for your ego's safety. most of the optimization of FSD is
| in removing extra data to improve latency of the mapping loop.
| eternalban wrote:
| Great idea. Processing latency may be an issue. It has to be
| fast, small, and energy efficient.
| HopenHeyHi wrote:
| 3D reconstruction from a single image. They stress the examples
| are not curated, appears to... well, gosh darnit, it appears to
| work.
|
| If it runs fast enough I wonder whether one could just drive
| around with a webcam and generate these 3d models on the fly and
| even import them into a sort of GTA type simulation/game engine
| in real time. (To generate a novel view, Zero-1-to-3 takes
| only 2 seconds on an RTX A6000 GPU.)
|
|       This research is based on work partially supported by:
|       - Toyota Research Institute
|       - DARPA MCS program under Federal Agreement No.
|         N660011924032
|       - NSF NRI Award #1925157
|
| Oh, huh. Interesting.
|
|       Future Work
|       From objects to scenes: Generalization to scenes with
|       complex backgrounds remains an important challenge for
|       our method.
|       From scenes to videos: Being able to reason about
|       geometry of dynamic scenes from a single view would
|       open novel research directions -- such as understanding
|       occlusions and dynamic object manipulation. A few
|       approaches for diffusion-based video generation have
|       been proposed recently and extending them to 3D would
|       be key to opening up these opportunities.
| TylerE wrote:
| Seems like there is a bit of a gap between "runs at 0.5 fps on
| a $7000 workstation-grade GPU with 48GB of VRAM" and consumer
| applications.
|
| With the fairly shallow slope of the GPU performance curve
| over time, I don't see them just Moore's Law-ing out of it either.
| This would need two, maybe three orders of magnitude more
| performance.
| HopenHeyHi wrote:
| Of course there is a gap. This is at the exploratory proof of
| concept stage. The fact that it works at all is what is
| interesting.
|
| Furthermore, once you've identified the make and model of the
| car, its relative position in 3D, any anomalies -- that ain't
| just a Ford pickup, it is loaded with cargo that overhangs in
| a particular way -- its velocity, etc. -- I'm quite sure that
| extrapolating additional information from the subsequent
| frames will be significantly cheaper, as you don't have to
| generate a 3D model from scratch each time.
|
| I think this is a viable exploratory path forward.
| Make it work <- you are here Make it work correctly
| Make it work fast
|
| Edit: Scotty does know ;)
| scotty79 wrote:
| I prefer:
|
|       Make it work <- you are here
|       Make it work correctly
|       Make it work fast
| [deleted]
| frozenport wrote:
| >> fairly shallow slope of the GPU performance curve overtime
|
| Not true.
| jiggawatts wrote:
| Computer power goes up exponentially thanks to Moore's law.
| Sprinkle some software optimisations on top, and it's
| conceivable for that to be running at interactive framerates
| on consumer GPUs within 5-10 years.
| ffitch wrote:
| The processing may as well shift to the cloud. With a
| subscription fee, of course : )
| TylerE wrote:
| Until we break the speed of light, I'm very bearish on
| cloud gaming. It just feels so bad. You've got like 9
| layers of latency between you and the screen.
| fooker wrote:
| You don't have to break the speed of light, just have the
| ping below human perception.
|
| ~20ms is that threshold, but even 40ms latency is barely
| noticeable for single player games.
| enlyth wrote:
| It's quite noticeable actually, and it adds up, it's not
| just an extra 20ms.
|
| For casual gamers and turn based games maybe it could
| work, as a niche. For FPS, multiplayer, ARPG, and so on,
| it's a dealbreaker, anything over 100ms feels too
| sluggish.
|
| We should be happy we have so much autonomy with our own
| hardware, I don't want some big cloud company to be able
| to tell me what I can play and render, unless we want the
| "you will own nothing and be happy" meme to become
| reality.
| TylerE wrote:
| Actually, in my testing, JRPGs and other turn-based games
| were amongst the worst, because there is so much
| "management" (inventory, loot, gear, etc.) and the extra
| lag really throws you off.
| TylerE wrote:
| A wireless controller ALONE is already over 20ms, and
| that's before you touch the network, actually do anything
| with that input, or wait for the display to redraw...
|
| At a 20ms total round trip, that only buys you about a
| 1500 mile radius, again completely ignoring all other
| latencies.
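|
| Back-of-envelope, assuming signals travel at roughly 2/3 c in
| fibre (the exact figure depends on routing and medium):
|
|       C_KM_S = 299_792      # speed of light in vacuum, km/s
|       FIBRE = 2 / 3         # typical fraction of c in fibre
|       round_trip_s = 0.020  # the 20 ms budget
|
|       one_way_km = round_trip_s / 2 * C_KM_S * FIBRE
|       print(one_way_km, one_way_km * 0.621)   # km, miles
|       # ~2000 km, ~1240 miles in fibre, ~1860 miles in vacuum,
|       # so "about 1500 miles" is the right ballpark -- before
|       # controller, encode/decode, render and display latency
|       # eat into the same 20 ms budget.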
| jimmySixDOF wrote:
| One possible definition of Edge Compute is GPU capacity
| at every last mile POP
| jlokier wrote:
| I agree, though my last mile latency to the nearest POP
| is about 85ms. Still a bit on the high side for action
| games compared with playing locally.
| kijiki wrote:
| 85ms, holy crap, who is your ISP?
|
| On Sonic fiber internet in San Francisco, I get 1.5ms to
| the POP. It is only 4.5ms to my VM in Hurricane Electric's
| Fremont DC.
| nitwit005 wrote:
| > Computer power goes up exponentially thanks to Moore's
| law
|
| If you look at a graph, that stopped being true well over a
| decade ago.
| ilaksh wrote:
| I wonder if this type of thing could be adapted to a vision
| system for a robot? So it would locate the camera and reconstruct
| an entire scene from a series of images as the robot moves
| around.
|
| Probably a ways to go to get there, but being able to do robust
| SLAM etc. with just a single camera would make things much less
| expensive.
| guyomes wrote:
| You might be interested in this related recent work [1] that
| fits simple ellipsoids to images and then uses them for the pose
| estimation of a camera.
|
| [1]: https://ieeexplore.ieee.org/document/9127873
| lefrancais wrote:
| Same ref [1], but open access:
| https://hal.science/hal-02886633/document
| qikInNdOutReply wrote:
| What happens if you build a loop? As in, this creates a 3D
| object from an image and another AI creates an image from the
| 3D object?
|
| https://www.youtube.com/watch?v=zPqJUrfKuqs
|
| Does it stabilize, or refine prejudices, or go on a fractal
| journey of errors over the weight landscape?
| throwaway4aday wrote:
| If you can produce any view angle you want of an object then
| can't you use photogrammetry to construct a 3D object?
| nwoli wrote:
| See the "Single-View 3D Reconstruction" section at the bottom
| where they do precisely that
| throwaway4aday wrote:
| Cool, I missed that.
| gs17 wrote:
| For anyone else who tried to download the weights and got Google
| Drive throwing a quota error at you, they're working on it:
| https://github.com/cvlab-columbia/zero123/issues/2
| King-Aaron wrote:
| That's honestly extremely impressive. I do hope that the 'in the
| wild' examples aren't completely curated and are actually being
| rendered on the fly (They appear to be, but it's hard for me to
| tell if that's truly the case). Pretty cool to see, however.
| GaggiX wrote:
| >and are actually being rendered on the fly
|
| They are precomputed: "Note that the demo allows a limited
| selection of rotation angles quantized by 30 degrees due to
| limited storage space of the hosting server." But I don't think
| they are curated; the seeds probably correspond to the seeds of
| the live demo you can host (they released the code and the
| models).
| [deleted]
| desmond373 wrote:
| Would it be possible to generate CAD files with this? As a base
| for part construction, this could be game-changing.
| gs17 wrote:
| If you look at the example meshes, it doesn't seem very likely
| that it would be better than manually creating them, unless
| you're okay with lumpy parts that aren't exactly the right
| size. This is too early for it to not require a lot of cleanup
| to be usable.
| flangola7 wrote:
| In other words we just need to wait 6 more months
| [deleted]
| mitthrowaway2 wrote:
| > We compare our reconstruction with state-of-the-art models in
| single-view 3D reconstruction.
|
| Here they list "GT Mesh", "Ours", "Point-E", and "MCC". Does
| anyone know what technique "GT mesh" refers to? Is it simply the
| original mesh that generated the source image?
| haykmartiros wrote:
| Ground truth
| EGreg wrote:
| Well honestly the "Ground truth" algorithm seems a lot
| superior to their method, it has higher fidelity in ALL the
| examples
| Thorrez wrote:
| Ground truth means the original model that the image was
| generated from.
| sophiebits wrote:
| "Ground truth" doesn't refer to a particular algorithm; it
| refers to the ideal benchmark of what a perfect performance
| would look like, which they're grading against.
| razemio wrote:
| Haha, I am sorry. I spit my coffee reading this. It is ofc
| totally OK to not know what ground truth means but the
| irony was too funny. Yes, ground truth will always be
| superior compared to anything else :)!
| yorwba wrote:
| Ground truth will always be superior on the "does this
| match the ground truth?" metric, but that's often just a
| proxy for output quality and the model will be judged
| differently once deployed (e.g. "do human users like
| this?")
|
| That's something to be aware of, especially when you're
| using convenience data of unknown quality to evaluate
| your model - many research datasets scraped off the
| internet with little curation and labeled in a rush by
| low-paid workers contain a lot of SEO garbage and
| labeling errors.
| simlevesque wrote:
| Ground truth means that a human person created the model.
| DarthNebo wrote:
| Not necessarily, could also be synthetic. Google did the
| same for hand poses in BlazePalm
| chaboud wrote:
| I read that with the sarcasm that I _hope_ was intended and
| had a good laugh.
| GaggiX wrote:
| "Ground Truth", the actual mesh
| hypertexthero wrote:
| Brings to mind the Blade Runner enhance scene:
| https://www.youtube.com/watch?v=hHwjceFcF2Q
| Sakos wrote:
| Reminds me of this at-the-time fantastical scene in Enemy of
| the State https://youtu.be/3EwZQddc3kY
| BiteCode_dev wrote:
| Given the data is (credible and beautiful) BS, I think it's
| closer to Red Dwarf:
|
| https://www.youtube.com/watch?v=6i3NWKbBaaU
| ar9av wrote:
| It's hard to tell for certain from the paper, without going deep
| into the code, but it seems they created the new model the same
| way the depth-conditioned SD models were made, i.e. a normal
| finetune.
|
| It might be possible to create an "original view + new angle"
| conditioned model much more easily by taking the
| ControlNet/T2I-Adapter/GLIDE route, where you freeze the original
| model.
|
| Text-to-3D seems close to being solved.
|
| It also makes me think an "original character image + new pose"
| conditioned model would work quite well.
| hiccuphippo wrote:
| Can you obtain the 3d object from this or only an image with the
| new perspective? This could revolutionize indie gamedev.
| jxf wrote:
| You can obtain a 3D object, but it's more useful for the novel
| views than the object, because the object isn't very good and
| probably needs some processing. See the bottom of the paper.
| echelon wrote:
| Super cool results.
|
| This is what my startup is getting into. So I'm very interested.
|
| These aren't "game ready" - the sculpts are pretty gross. But
| we're clearly getting somewhere. It's only going to keep getting
| better.
|
| I expect we'll be building all new kinds of game engines, render
| pipelines, and 3D animation tools shortly.
| nico wrote:
| And 3D printing. So quickly building physical tools too.
| skybrian wrote:
| For printing parts, precision matters since they likely need
| to fit with something else. You'll want to be able to edit
| dimensions on the model to get the fit right.
|
| So maybe someday, but I think it would have to be a project
| that targets CAD.
| regularfry wrote:
| I'd be interested in zero-shot _two_ images to 3d object. You
| can see how a stereo pair ought to improve the amount of
| information it has available.
| redox99 wrote:
| While this is cool, this is not meant to target "game ready".
| For games and CGI, there's no reason to limit yourself to a
| single image. Photogrammetry is already extensively used, and
| it involves using tens or hundreds of images of the object to
| scan. Using many images as an input will obviously always be
| superior to a single one, as a single image means it has to
| literally make up the back side, and it has no parallax
| information.
| oefrha wrote:
| You appear to be thinking about scanning a physical object,
| whereas zero-shot one image to 3D object would be vastly more
| useful with a single (possibly AI-generated or AI-assisted)
| illustration. You get a 3D model in seconds at essentially
| zero cost, can iterate hundreds of times in a single day.
| redox99 wrote:
| I agree that for stylized, painting-like 3D models it could
| be very cool. I was indeed thinking of the typical pipeline
| for a photorealistic asset.
| digilypse wrote:
| What if I have a dynamically generated character description
| in my game's world, generate a portrait for them using
| StableDiffusion and then turn that into a 3d model that can
| be posed and re-used?
| flangola7 wrote:
| This has DARPA and NSF behind it.
|
| They're not building this for games; they're building it for
| autonomous weapons.
| bredren wrote:
| How do these kinds of tools complement actual 3d scanning?
|
| For example, Apple supposedly has put some time into 3d asset
| building (presumably in support of AR world building content).
|
| Can these inference techniques stack or otherwise help more
| detailed object data collection?
| yawnxyz wrote:
| Are there any models that take an image to SVG?
| noduerme wrote:
| Is there some kind of symmetry at work here in the deductive
| process?
| xotom20390 wrote:
| [dead]
| bmitc wrote:
| What if you give it a picture of a cardboard cutout or billboard?
| noduerme wrote:
| it'll build Angelyne for you, to distract your pathetic carbon-
| based intelligence.
|
| https://www.hollywoodreporter.com/wp-content/uploads/2017/07...
| mov wrote:
| People plugging it as output of Midjourney in 3, 2, 1...
| wslh wrote:
| I keep thinking about my project, where we take multiple photos
| from the same angle with moving lights to rebuild the 3D
| model. We are not using AI, just optics research like in [1]. We
| applied that to art at [2].
|
| [1] Methods for 3D digitization of Cultural Heritage:
| http://www.ipet.gr/~akoutsou/docs/M3DD.pdf
|
| [2] https://sublim.art
| bogwog wrote:
| So the business model there is: scanner + paper shredder + NFT
| = $$$?
|
| How many people have taken you up on that offer? Unless it's a
| shitty/low-effort painting, it seems insane to me that anyone
| would destroy their artwork in exchange for an NFT of that same
| artwork.
| wslh wrote:
| What is insane for you could be completely different for
| others: we were at the last Miami Art Week and Art Basel
| and didn't have enough time for the number of artists that
| wanted to be in the process. I will expand more later (working
| now) but you can see AP coverage here [1].
|
| It is also important to highlight that we are doing this
| project at our own risk, with our own money, have built the
| hardware and software, and not charging artists for the
| process. Just the primary market sale is split between 85%
| for artists and the rest for the project. Pretty generous in
| this risky market.
|
| [1] https://youtu.be/ajDEHSLi0iE
| bogwog wrote:
| > we have been in the last Miami Art Week and Art Basel and
| we don't have enough time for the number of artists that
| wanted to be in the process. Will expand more later
|
| Please also include the number of those people who actually
| understand what an NFT is. As a native Miamian, I can
| guarantee you not a single one does. This city has always
| been a magnet for the _get rich quick scheme_ types, and
| crypto is a good match for that because it's harder for a
| layman to grasp the scam part.
| tough wrote:
| It's Banksy as a Service
| brokensegue wrote:
| how is this different from the previous NeRF work? does it build
| a 3D model?
| GaggiX wrote:
| NeRF models are trained on several views with known location
| and viewing direction. This model takes one image (and you
| don't need to train a model for each object).
| amelius wrote:
| But if it takes only one image, isn't it likely to
| hallucinate information?
| gs17 wrote:
| Not just likely, it does. Try out the demo and see, e.g.
| what the backside of their Pikachu toy looks like. Or a
| little simpler, the paper has an example (the demo also has
| this) of the back of a car under different seeds.
| fooker wrote:
| Not hotdog.
| hombre_fatal wrote:
| Aside, I really like the UI indicators on the draggable models at
| the bottom that let you know you can rotate them.
___________________________________________________________________
(page generated 2023-03-21 23:02 UTC)