[HN Gopher] Depth Pro: Sharp monocular metric depth in less than...
       ___________________________________________________________________
        
       Depth Pro: Sharp monocular metric depth in less than a second
        
       Author : L_
       Score  : 101 points
       Date   : 2024-10-04 05:09 UTC (17 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | isoprophlex wrote:
       | The example images look convincing, but the sharp hairs of the
       | llama and the cat are pictured against an out-of-focus
       | background...
       | 
        | In real life, you'd use these models for synthetic depth-of-
        | field, adding fake bokeh to a very sharp image that's in focus
        | everywhere. So this seems too easy?
       | 
       | Impressive latency tho.
        
         | JBorrow wrote:
         | I don't think the only utility of a depth model is to provide
         | synthetic blurring of backgrounds. There are many things you'd
         | like to use them for, including feeding into object detection
         | pipelines.
        
         | dagmx wrote:
         | On visionOS 2, there's functionality to convert 2D images to 3D
         | images for stereo viewing.
         | 
         | https://youtu.be/pLfCdI0mjkI?si=8K7rPHu558P-Hf-Z
         | 
         | I assume the first pass is the depth inference here.
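          | 
          | A rough sketch of what the second pass could look like,
          | assuming a made-up depth_to_stereo helper: push each pixel
          | sideways by a disparity inversely proportional to its
          | predicted depth to synthesize the second eye's view (a real
          | pipeline would also inpaint the occlusion holes this leaves).
          | 
          |     import numpy as np
          | 
          |     def depth_to_stereo(image, depth, strength=12.0):
          |         # Hypothetical sketch: nearer pixels (small depth)
          |         # get a larger horizontal shift; 'strength' is an
          |         # arbitrary knob, not a calibrated stereo baseline.
          |         h, w = depth.shape
          |         disparity = strength / np.maximum(depth, 1e-6)
          |         right = np.zeros_like(image)
          |         xs = np.arange(w)
          |         for y in range(h):
          |             shift = np.round(xs - disparity[y]).astype(int)
          |             new_x = np.clip(shift, 0, w - 1)
          |             # naive forward warp; collisions and holes ignored
          |             right[y, new_x] = image[y, xs]
          |         return right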
        
         | amluto wrote:
         | I'm not convinced that this type of model is the right solution
         | to fake bokeh, at least not if you use it as a black box.
         | Imagine you have the letter A in the background behind some
          | hair. You should end up with a blurry A and mostly in-focus hair.
         | Instead you end up with an erratic mess, because a fuzzy depth
         | map doesn't capture the relevant information.
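          | 
          | For concreteness, a naive depth-gated blur of the kind a
          | black-box pipeline might apply (depth_bokeh, the focus plane
          | and the tolerance are all made up here): every pixel gets a
          | single depth, so a pixel mixing hair and the background A
          | lands entirely on one side of the gate.
          | 
          |     import numpy as np
          |     from scipy.ndimage import gaussian_filter
          | 
          |     def depth_bokeh(image, depth, focus_depth, tol=0.2):
          |         # Hypothetical sketch: blur everything whose depth
          |         # is far from the focus plane. Mixed hair/background
          |         # pixels are either all sharp or all blurred.
          |         blurred = gaussian_filter(image, sigma=(6, 6, 0))
          |         near = np.abs(depth - focus_depth) < tol * focus_depth
          |         return np.where(near[..., None], image, blurred)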
         | 
         | Of course, lots of text-to-image models generate a mess,
         | because their training sets are highly contaminated by the
         | messes produced by "Portrait mode".
        
       | habitue wrote:
        | Does Apple actually name their research papers "Pro" too? Like
        | is there an iLearning paper out there?
        
       | sockaddr wrote:
        | So what happens in the far future when we send autonomous
        | machines equipped with models trained on Earth life and
        | structures to other planets? Are they going to have a hard
        | time detecting and measuring things? What happens when the
        | model is tasked with estimating the depth of an object that's
        | made of triangular glowing scales and whose head has three
        | eyes?
        
         | adolph wrote:
         | Assembly Theory
        
       | modeless wrote:
       | False color depth maps are extremely misleading. The way to judge
       | the quality of a depth map is to use it to reproject the image
       | into 3D and rotate it around a bit. Papers almost never do this
       | because it makes their artifacts extremely obvious.
       | 
       | I'd bet that if you did that on these examples you'd see that the
       | hair, rather than being attached to the animal, is floating
       | halfway between the animal and the background. Of course, depth
       | mapping is an ill-posed problem. The hair is not completely
       | opaque and the pixels in that region have contributions from both
       | the hair and the background, so the neural net is doing the best
       | it can. To really handle hair correctly you would have to output
       | a list of depths (and colors) per pixel, rather than a single
       | depth, so pixels with contributions from multiple objects could
       | be accurately represented.
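        | 
        | A minimal back-projection sketch for anyone who wants to run
        | that check, assuming a pinhole model and a focal length in
        | pixels (both placeholders here, not something this code pulls
        | from the model):
        | 
        |     import numpy as np
        | 
        |     def depth_to_points(depth, fx, fy, cx, cy):
        |         # Back-project a metric depth map into a 3D point
        |         # cloud under an assumed pinhole camera. Orbiting the
        |         # cloud makes floating boundary pixels easy to spot.
        |         h, w = depth.shape
        |         u, v = np.meshgrid(np.arange(w), np.arange(h))
        |         x = (u - cx) * depth / fx
        |         y = (v - cy) * depth / fy
        |         return np.stack([x, y, depth], axis=-1).reshape(-1, 3)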
        
         | jonas21 wrote:
         | > _The way to judge the quality of a depth map is to use it to
         | reproject the image into 3D and rotate it around a bit._
         | 
         | They do this. See figure 4 in the paper. Are the results
         | cherry-picked to look good? Probably. But so is everything
         | else.
        
           | threeseed wrote:
           | > Figure 4
           | 
           | We plug depth maps produced by Depth Pro, Marigold, Depth
           | Anything v2, and Metric3D v2 into a recent publicly available
           | novel view synthesis system.
           | 
           | We demonstrate results on images from AM-2k. Depth Pro
           | produces sharper and more accurate depth maps, yielding
           | cleaner synthesized views. Depth Anything v2 and Metric3D v2
           | suffer from misalignment between the input images and
           | estimated depth maps, resulting in foreground pixels bleeding
           | into the background.
           | 
           | Marigold is considerably slower than Depth Pro and produces
           | less accurate boundaries, yielding artifacts in synthesized
           | images.
        
         | incrudible wrote:
         | > I'd bet that if you did that on these examples you'd see that
         | the hair, rather than being attached to the animal, is floating
         | halfway between the animal and the background.
         | 
         | You're correct about that, but for something like matte/depth-
          | threshold, that's exactly what you want to get a smooth and
         | controllable transition within the limited amount of resolution
         | you have. For that use case, especially with the fuzzy hair,
         | it's pretty good.
        
           | modeless wrote:
           | It's not exactly what you want because you will get both
           | background bleeding into the foreground and clipping of the
           | parts of the foreground that fall under your threshold. What
           | you want is for the neural net to estimate the different
           | color contributions of background and foreground at each
           | pixel so you can separate them without bleeding or clipping.
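            | 
            | In compositing terms that means recovering per-pixel alpha
            | and foreground color instead of one depth. A sketch of
            | just the un-mixing step, assuming some matting model has
            | already produced alpha and a background estimate (both
            | inputs here, not outputs of a depth network):
            | 
            |     import numpy as np
            | 
            |     def recover_foreground(image, alpha, background):
            |         # Invert I = a*F + (1-a)*B for F, given alpha and
            |         # an estimated background.
            |         a = np.clip(alpha[..., None], 1e-3, 1.0)
            |         return (image - (1 - a) * background) / a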
        
             | incrudible wrote:
             | It's what you'd want out of a _depth map_ used for that
              | purpose. What you're describing is not a depth map.
        
               | zardo wrote:
               | Maybe the depthMap should only accept images that have
               | been typed as hairless.
        
             | Stedag wrote:
              | I work on time-of-flight cameras that need to handle the
              | kind of data that you're referring to.
              | 
              | Each pixel takes multiple measurements over time of the
              | intensity of reflected light that matches the emission
              | pulse encodings. The result is essentially a vector of
              | intensity over a set of distances.
             | 
             | A low depth resolution example of reflected intensity by
             | time (distance):
             | 
              |     i: _ _ ^ _ ^ - _ _
              |     d: 0 1 2 3 4 5 6 7
             | 
             | In the above example, the pixel would exhibit an ambiguity
             | between distances of 2 and 4.
             | 
             | The simplest solution is to select the weighted average or
             | median distance, which results in "flying pixels" or "mixed
             | pixels" for which there are existing efficient techniques
             | for filtration. The bottom line is that for applications
             | like low-latency obstacle detection on a cost-constrained
             | mobile robot, there's some compression of depth information
             | required.
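              | 
              | A toy version of that selection step on one pixel's
              | intensity-by-distance vector (bin values invented to
              | match the two-peak example above):
              | 
              |     import numpy as np
              | 
              |     i = np.array([0, 0, 5, 0, 5, 2, 0, 0], dtype=float)
              |     d = np.arange(8, dtype=float)
              | 
              |     # strongest return: picks 2, silently drops the
              |     # second return at 4
              |     strongest = d[np.argmax(i)]
              | 
              |     # weighted average: ~3.3, a "flying pixel" hanging
              |     # between the two real surfaces
              |     mean_depth = np.sum(i * d) / np.sum(i)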
             | 
              | For the sake of inferring a highly realistic model from an
              | image, neural radiance fields or Gaussian splats may best
              | generate the representation that you might be envisioning,
              | where there would be a volumetric representation of
              | material properties like hair. This comes with higher
              | compute costs, however, and doesn't factor in semantic
              | interpretation of a scene. The top-performing results in
              | photogrammetry have tended to use a combination of less
              | expensive techniques like this one to better handle
              | sparsity of scene coverage, and then refine the result
              | using more expensive techniques [1].
             | 
             | 1: https://arxiv.org/pdf/2404.08252
        
       | tedunangst wrote:
       | Funny that Apple uses bash for a shell script that just runs
       | wget. https://github.com/apple/ml-depth-
       | pro/blob/main/get_pretrain...
        
         | threeseed wrote:
         | It would be pulling from an internal service in a development
         | branch.
         | 
         | So this just makes it easier to swap it out without making any
         | other changes.
        
       | brcmthrowaway wrote:
       | Does this take lens distortion into account?
        
       | dguest wrote:
       | What does this look like on an M. C. Escher drawing, e.g.
       | 
       | https://i.pinimg.com/originals/00/f4/8c/00f48c6b443c0ce14b51...
       | 
       | ?
        
         | yunohn wrote:
          | Looks like a screenshot from the Monument Valley games, full
          | of such Escher-like levels.
        
       | cpgxiii wrote:
       | The monodepth space is full of people insisting that their models
       | can produce metric depth with no explanation other than "NN does
       | magic" for why metric depth is possible from generic mono images.
        | If you provide a single arbitrary image, you can't generate depth
        | that is immune from scale error (e.g. produce accurate depth for
        | both an image of a real car and a scale model of the same car).
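        | 
        | The ambiguity is easy to see under a pinhole model: scaling
        | the scene and its distance by the same factor leaves every
        | pixel coordinate unchanged, so the pixels alone cannot pin
        | down metric scale (numbers below are arbitrary):
        | 
        |     import numpy as np
        | 
        |     f = 1000.0                          # focal length, pixels
        |     point = np.array([0.9, 0.6, 4.0])   # point on a car, meters
        | 
        |     for s in (1.0, 0.1):                # real car vs 1:10 model
        |         X, Y, Z = s * point
        |         u, v = f * X / Z, f * Y / Z
        |         print(s, (u, v))                # identical both times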
       | 
       | Plausibly, you can train a model that encodes sufficient
       | information about a specific set of imager+lens combinations such
       | that the lens distortion behavior of images captured by those
       | imagers+lenses provides the necessary information to resolve the
       | scale of objects, but that is a much weaker claim than what
       | monodepth researchers generally make.
       | 
       | Two notable cases where something like monodepth does reliably
       | work are actually ones where considerably more information is
       | present: in animal eyes there is considerable information about
        | focus available, not to mention that eyes are nothing like a
       | planar imager; and phase-detection autofocus uses an entirely
       | different set of data (phase offsets via special lenses) than is
       | used by monodepth models (and, arguably, is mostly a relative
       | incremental process rather than something that produces absolute
       | depth).
        
       ___________________________________________________________________
       (page generated 2024-10-04 23:01 UTC)