[HN Gopher] DUSt3R: Geometric 3D Vision Made Easy
       ___________________________________________________________________
        
       DUSt3R: Geometric 3D Vision Made Easy
        
       Author : smusamashah
       Score  : 95 points
       Date   : 2024-03-03 14:33 UTC (8 hours ago)
        
 (HTM) web link (dust3r.europe.naverlabs.com)
 (TXT) w3m dump (dust3r.europe.naverlabs.com)
        
       | smusamashah wrote:
       | People have been posting some really interesting and useful
       | applications of this tech.
       | 
       | Getting a 3D view from a few pictures in an apartment listing:
       | https://twitter.com/JeromeRevaud/status/1764035510236758096
       | 
       | Two pictures of a kitchen:
       | https://x.com/janusch_patas/status/1764025964915302400
       | 
       | Two pictures of an office without any overlap:
       | https://x.com/JeromeRevaud/status/1763495315389165963
        
       | reactordev wrote:
       | This is awesome. Kudos. You have way more respect in my eyes
       | since, not only did you post your paper, you posted the source.
       | 
       | Too many times I've read claims without source so no one can
       | reproduce and verify results. Now I can, and have, verified the
       | results. Top notch.
        
         | carbocation wrote:
         | Agreed. And not just the source, but a fully functional local
         | demo! Runs great on my M1 pro using:
         | 
         | PYTORCH_ENABLE_MPS_FALLBACK=1 python3.10 demo.py \
         |   --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth \
         |   --device 'mps'
        
       | amelius wrote:
       | Does this method also work when different cameras (or different
       | camera zoom etc. settings) are used for every input image?
        
         | carbocation wrote:
         | I played around with DUSt3R last night using an iPhone with
         | different lenses (whether you consider this a different camera
         | or not, I defer to you). It worked well. Note that the camera
         | intrinsics aren't used here, so it makes sense that it would
         | tolerate different lenses or cameras. I did not test wildly
         | divergent lens types (e.g., a fisheye lens).
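
A rough illustration of why known intrinsics may not be needed: once a network predicts a 3D point for every pixel in the camera's own frame, a focal length can be fitted to those points after the fact. The sketch below is not DUSt3R's code. It assumes a pinhole camera with square pixels and a principal point at the image center, and it uses a plain least-squares fit. The function name `estimate_focal` is hypothetical.

```python
import numpy as np

def estimate_focal(pointmap, principal_point=None):
    """Fit a focal length (in pixels) to an H x W x 3 point map.

    The point map holds one (X, Y, Z) point per pixel, expressed in the
    camera's own frame. Pinhole model: u - cx = f * X / Z, v - cy = f * Y / Z.
    """
    h, w, _ = pointmap.shape
    if principal_point is None:
        # Assumption: the principal point sits at the image center.
        principal_point = ((w - 1) / 2.0, (h - 1) / 2.0)
    cx, cy = principal_point

    # Pixel coordinate grids: v indexes rows, u indexes columns.
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    x, y, z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]

    # Ignore points at or behind the camera plane.
    valid = z > 1e-9
    a = np.stack([(u - cx)[valid], (v - cy)[valid]])        # centered pixels
    b = np.stack([x[valid] / z[valid], y[valid] / z[valid]])  # normalized rays

    # Closed-form least squares for the scalar f in a ~= f * b.
    return float((a * b).sum() / (b * b).sum())
```

Because the fit runs per image, two photos taken with different lenses would simply yield two different focal lengths, which fits the observation that mixed lenses worked.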
        
       | markisus wrote:
       | Pretty impressive. I wonder though why it was necessary to put
       | the point map of the second image into the coordinate frame of
       | the first. Isn't it all the same from the point of view of the
       | neural net?
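
One way to read the question above: if each point map stayed in its own camera's frame, merging the two would require a separate relative pose. The sketch below shows that extra step with a hypothetical known pose (`R_12`, `t_12`); these names and the helper `to_first_frame` are assumptions for illustration, not part of DUSt3R. When the second point map is already expressed in the first camera's frame, this step disappears and the two maps can be concatenated directly.

```python
import numpy as np

def to_first_frame(pointmap_cam2, R_12, t_12):
    """Map an H x W x 3 point map from camera 2's frame into camera 1's frame.

    R_12 (3 x 3 rotation) and t_12 (3,) are a hypothetical known rigid pose
    such that X_cam1 = R_12 @ X_cam2 + t_12 for every 3D point.
    """
    # Row-vector form of R @ x + t, broadcast over all H x W points.
    return pointmap_cam2 @ R_12.T + t_12

def merge_point_maps(pointmap_cam1, pointmap_cam2, R_12, t_12):
    """Return a single N x 3 cloud with both views in camera 1's frame."""
    moved = to_first_frame(pointmap_cam2, R_12, t_12)
    return np.concatenate([pointmap_cam1.reshape(-1, 3), moved.reshape(-1, 3)])
```

With an identity rotation and zero translation, `merge_point_maps` reduces to stacking the two maps, which is the situation when both are predicted in a shared frame.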
        
       | Lichtso wrote:
       | Am I imagining things or is there a trend here?
       | 
       | Seems like we get more and more generalist approaches which are
       | less specific and combine a lot of what used to be individual
       | steps and techniques. In doing so they don't only become
       | conceptually simpler but surprisingly more accurate as well.
       | Possibly because a unified approach is more integrated and thus
       | better at filling the gaps in one sub-problem with information
       | from other sub-problems.
        
         | bonoboTP wrote:
         | That's the trend since 2012 basically, when deep learning took
         | over from hand-tuned feature extraction for image
         | classification.
         | 
         | The fiddly, brittle and multi-step nature of 3D vision endured
         | longer but is going through the same transformation.
        
         | xanderlewis wrote:
         | Somewhat related to _The Bitter Lesson_ (though perhaps you've
         | already read it):
         | 
         | > One thing that should be learned from the bitter lesson is
         | the great power of general purpose methods
         | 
         | http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
       | monkeydust wrote:
       | Can this be used for body measurement, e.g. 4 shots of different
       | poses combined together? If so, what kind of accuracy might you
       | get? Just curious.
        
         | carbocation wrote:
         | Different poses? I don't think so, at least not with the
         | current setup.
         | 
         | For example, I tried this with a dog (walking around the seated
         | dog, taking photos as I did so). The dog turned her head while
         | I was taking photos. The portion of the head that moved was not
         | represented in the final output.
        
       | fxtentacle wrote:
       | If I understand things correctly, this relies extremely heavily
       | on learned prior shapes, meaning it'll guess depth from a
       | monocular image and then go from there. In line with that, it
       | uses a Vision Transformer like MiDaS (Towards Robust Monocular
       | Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset
       | Transfer).
       | 
       | That's why it can still reconstruct a scene even if the images do
       | not overlap at all:
       | https://twitter.com/JeromeRevaud/status/1763495315389165963
       | 
       | But what that also means is that this is closer to generative AI
       | than to objective measurements. If the image-to-depth estimation
       | goes very wrong, it might hallucinate shapes that aren't there.
        
         | krasin wrote:
         | > But what that also means is that this is closer to generative
         | AI than to objective measurements. If the image-to-depth
         | estimation goes very wrong, it might hallucinate shapes that
         | aren't there.
         | 
         | But people do that all the time too. Relying on priors is fine
         | for many practical applications and sometimes there's no way
         | around it.
        
       ___________________________________________________________________
       (page generated 2024-03-03 23:00 UTC)