[HN Gopher] DUSt3R: Geometric 3D Vision Made Easy
___________________________________________________________________
DUSt3R: Geometric 3D Vision Made Easy
Author : smusamashah
Score : 95 points
Date : 2024-03-03 14:33 UTC (8 hours ago)
(HTM) web link (dust3r.europe.naverlabs.com)
(TXT) w3m dump (dust3r.europe.naverlabs.com)
| smusamashah wrote:
| People have been posting some really interesting and useful use
| cases of this tech:
|
| Getting a 3D view from a few pictures of an apartment listing
| https://twitter.com/JeromeRevaud/status/1764035510236758096
|
| Two pictures of a kitchen
| https://x.com/janusch_patas/status/1764025964915302400
|
| Two pictures of an office without any overlap
| https://x.com/JeromeRevaud/status/1763495315389165963
| reactordev wrote:
| This is awesome. Kudos. You have way more respect in my eyes
| since you posted not only your paper but also the source.
|
| Too many times I've read claims without source so no one can
| reproduce and verify results. Now I can, and have, verified the
| results. Top notch.
| carbocation wrote:
| Agreed. And not just the source, but a fully functional local
| demo! Runs great on my M1 pro using:
|
| PYTORCH_ENABLE_MPS_FALLBACK=1 python3.10 demo.py \
|   --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth \
|   --device 'mps'
| amelius wrote:
| Does this method also work when a different camera (or different
| camera settings, such as zoom) is used for every input image?
| carbocation wrote:
| I played around with DUSt3R last night using an iPhone with
| different lenses (whether you consider this a different camera
| or not, I defer to you). It worked well. Note that the camera
| intrinsics aren't used here, so it makes sense that it would
| tolerate different lenses or cameras. I did not test wildly
| divergent lens types (e.g., a fisheye lens).
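One way to see why known intrinsics may not be needed: if a method predicts a 3D point for every pixel (a "pointmap") in the camera's own frame, the focal length can be recovered from that prediction afterwards. The sketch below is a hedged illustration using plain least squares under a pinhole model with the principal point at the image centre. It is not DUSt3R's code; the function names `estimate_focal` and `synthetic_pointmap` are hypothetical, and the authors' actual solver may differ.

```python
def estimate_focal(pointmap, width, height):
    """Least-squares focal length (in pixels) from a camera-frame pointmap.

    pointmap[v][u] is the (X, Y, Z) point seen at pixel (u, v).
    Assumption: pinhole camera, principal point at the image centre,
    so u - cx = f * X / Z and v - cy = f * Y / Z.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    num = 0.0
    den = 0.0
    for v in range(height):
        for u in range(width):
            X, Y, Z = pointmap[v][u]
            if Z <= 0:
                continue  # skip points behind the camera
            a, b = X / Z, Y / Z
            num += (u - cx) * a + (v - cy) * b
            den += a * a + b * b
    if den == 0.0:
        raise ValueError("pointmap carries no information about focal length")
    return num / den


def synthetic_pointmap(focal, width, height, depth_fn):
    """Build a test pointmap by back-projecting pixels with a known focal."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    rows = []
    for v in range(height):
        row = []
        for u in range(width):
            z = depth_fn(u, v)
            row.append(((u - cx) * z / focal, (v - cy) * z / focal, z))
        rows.append(row)
    return rows


# Usage: recover the focal length that generated a synthetic pointmap.
pm = synthetic_pointmap(500.0, 8, 6, lambda u, v: 2.0 + 0.1 * u + 0.05 * v)
recovered = estimate_focal(pm, 8, 6)
```

Because the focal length falls out of the predicted geometry, each image can in principle have its own value, which is consistent with the mixed-lens experiment described above.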
| markisus wrote:
| Pretty impressive. I wonder though why it was necessary to put
| the point map of the second image into the coordinate frame of
| the first. Isn't it all the same from the point of the neural
| net?
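To make the question concrete, here is a minimal sketch of what "putting the second point map into the coordinate frame of the first" means geometrically: a rigid transform X1 = R * X2 + t applied to every point. In a classical pipeline R and t would come from a separate pose-estimation step; a network that outputs the second pointmap directly in the first frame has implicitly solved for that relative pose. The function name `to_first_frame` and the example pose are hypothetical, chosen only for illustration.

```python
def to_first_frame(points, rotation, translation):
    """Map 3D points from camera-2 coordinates into camera-1 coordinates.

    rotation is a 3x3 matrix (list of rows) and translation a 3-vector,
    together describing camera 2's pose in camera 1's frame.
    """
    out = []
    for x, y, z in points:
        out.append(tuple(
            rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z
            + translation[i]
            for i in range(3)
        ))
    return out


# Illustrative pose: camera 2 is rotated 90 degrees about the Y axis and
# shifted 1 unit along X relative to camera 1.
R = [[0, 0, 1],
     [0, 1, 0],
     [-1, 0, 0]]
t = (1.0, 0.0, 0.0)

# A point 2 units in front of camera 2, expressed in camera 1's frame.
merged = to_first_frame([(0.0, 0.0, 2.0)], R, t)
```

Once both pointmaps live in one frame, merging them is plain concatenation of the two point lists, with no extra alignment step.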
| Lichtso wrote:
| Am I imagining things or is there a trend here?
|
| Seems like we get more and more generalist approaches which are
| less specific and combine a lot of what used to be individual
| steps and techniques. In doing so they don't only become
| conceptually simpler but surprisingly more accurate as well.
| Possibly because a unified approach is more integrated and thus
| better at filling the gaps in one sub-problem with information
| from other sub-problems.
| bonoboTP wrote:
| That's the trend since 2012 basically, when deep learning took
| over from hand-tuned feature extraction for image
| classification.
|
| The fiddly, brittle and multi-step nature of 3D vision endured
| longer but is going through the same transformation.
| xanderlewis wrote:
| Somewhat related to _The Bitter Lesson_ (though perhaps you've
| already read it):
|
| > One thing that should be learned from the bitter lesson is
| the great power of general purpose methods
|
| http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| monkeydust wrote:
| Can this be used for body measurement, e.g., 4 shots in different
| poses combined together? If so, what kind of accuracy might you
| get? Just curious.
| carbocation wrote:
| Different poses? I don't think so, at least not with the
| current setup.
|
| For example, I tried this with a dog (walking around the seated
| dog, taking photos as I did so). The dog turned her head while
| I was taking photos. The portion of the head that moved was not
| represented in the final output.
| fxtentacle wrote:
| If I understand things correctly, this relies extremely heavily
| on learned prior shapes, meaning it'll guess depth from a
| monocular image and then go from there. In line with that, it
| uses a Vision Transformer like MiDaS (Towards Robust Monocular
| Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset
| Transfer).
|
| That's why it can still reconstruct a scene even if the images do
| not overlap at all:
| https://twitter.com/JeromeRevaud/status/1763495315389165963
|
| But what that also means is that this is closer to generative AI
| than to objective measurements. If the image to depth estimation
| goes very wrong, it might hallucinate shapes that aren't there.
| krasin wrote:
| > But what that also means is that this is closer to generative
| AI than to objective measurements. If the image to depth
| estimation goes very wrong, it might hallucinate shapes that
| aren't there.
|
| But people do that all the time too. Relying on priors is fine
| for many practical applications and sometimes there's no way
| around it.
___________________________________________________________________
(page generated 2024-03-03 23:00 UTC)