[HN Gopher] Depth Pro: Sharp monocular metric depth in less than...
___________________________________________________________________
Depth Pro: Sharp monocular metric depth in less than a second
Author : L_
Score : 101 points
Date : 2024-10-04 05:09 UTC (17 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| isoprophlex wrote:
| The example images look convincing, but the sharp hairs of the
| llama and the cat are pictured against an out-of-focus
| background...
|
| In real life, you'd use these models for synthetic depth-of-
| field, adding fake bokeh to a very sharp image that's in focus
| everywhere. So this seems too easy?
|
| Impressive latency tho.
| JBorrow wrote:
| I don't think the only utility of a depth model is to provide
| synthetic blurring of backgrounds. There are many things you'd
| like to use them for, including feeding into object detection
| pipelines.
| dagmx wrote:
| On visionOS 2, there's functionality to convert 2D images to 3D
| images for stereo viewing.
|
| https://youtu.be/pLfCdI0mjkI?si=8K7rPHu558P-Hf-Z
|
| I assume the first pass is the depth inference here.
| amluto wrote:
| I'm not convinced that this type of model is the right solution
| to fake bokeh, at least not if you use it as a black box.
| Imagine you have the letter A in the background behind some
| hair. You should end up with a blurry A and mostly in-focus hair.
| Instead you end up with an erratic mess, because a fuzzy depth
| map doesn't capture the relevant information.
|
| Of course, lots of text-to-image models generate a mess,
| because their training sets are highly contaminated by the
| messes produced by "Portrait mode".
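|
| A minimal sketch of the black-box approach in question (fake_bokeh
| and focus_depth are illustrative names, and depth is assumed
| normalized to [0, 1]): blur strength is a pure function of
| per-pixel depth, so wherever the depth map smears hair and
| background into one fuzzy value, the blur follows that mistake.
|
|       # Naive depth-driven bokeh: blur radius depends only on the
|       # depth map, so depth errors at hair become blur errors.
|       import numpy as np
|       from scipy.ndimage import gaussian_filter
|
|       def fake_bokeh(image, depth, focus_depth, max_sigma=8.0,
|                      n_levels=6):
|           """image: HxWx3 float, depth: HxW in [0, 1]."""
|           # A small stack of progressively blurred copies.
|           sigmas = np.linspace(0.0, max_sigma, n_levels)
|           stack = np.stack([gaussian_filter(image, sigma=(s, s, 0))
|                             for s in sigmas])
|
|           # Blur strength grows with distance from the focal plane.
|           strength = np.clip(np.abs(depth - focus_depth), 0.0, 1.0)
|           level = strength * (n_levels - 1)
|           lo = np.floor(level).astype(int)
|           hi = np.minimum(lo + 1, n_levels - 1)
|           w = (level - lo)[..., None]
|
|           # Per-pixel blend between the two nearest blur levels.
|           yy, xx = np.indices(depth.shape)
|           return (1 - w) * stack[lo, yy, xx] + w * stack[hi, yy, xx]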
| habitue wrote:
| Does Apple actually name their research papers "Pro" too? Like, is
| there an iLearning paper out there?
| sockaddr wrote:
| So what happens in the far future when we send autonomous
| machines equipped with models trained on Earth life and
| structures to other planets? Are they going to have a hard time
| detecting and measuring things? What happens when the model is
| tasked with estimating the depth of an object that's made of
| triangular glowing scales, whose head has three eyes?
| adolph wrote:
| Assembly Theory
| modeless wrote:
| False color depth maps are extremely misleading. The way to judge
| the quality of a depth map is to use it to reproject the image
| into 3D and rotate it around a bit. Papers almost never do this
| because it makes their artifacts extremely obvious.
|
| I'd bet that if you did that on these examples you'd see that the
| hair, rather than being attached to the animal, is floating
| halfway between the animal and the background. Of course, depth
| mapping is an ill-posed problem. The hair is not completely
| opaque and the pixels in that region have contributions from both
| the hair and the background, so the neural net is doing the best
| it can. To really handle hair correctly you would have to output
| a list of depths (and colors) per pixel, rather than a single
| depth, so pixels with contributions from multiple objects could
| be accurately represented.
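|
| A rough sketch of that reprojection check, assuming a simple
| pinhole camera with known intrinsics (fx, fy, cx, cy and the
| function names are illustrative, not from the paper):
|
|       # Back-project a depth map into a 3D point cloud so it can
|       # be rotated and inspected for floating-hair artifacts.
|       import numpy as np
|
|       def depth_to_points(depth, fx, fy, cx, cy):
|           """depth: HxW metric depth -> Nx3 camera-space points."""
|           h, w = depth.shape
|           u, v = np.meshgrid(np.arange(w), np.arange(h))
|           z = depth
|           x = (u - cx) * z / fx      # pinhole back-projection
|           y = (v - cy) * z / fy
|           return np.stack([x, y, z], axis=-1).reshape(-1, 3)
|
|       def rotate_y(points, angle_rad):
|           # Rotating the cloud a few degrees and re-rendering makes
|           # errors at thin structures like hair obvious.
|           c, s = np.cos(angle_rad), np.sin(angle_rad)
|           r = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
|           return points @ r.T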
| jonas21 wrote:
| > _The way to judge the quality of a depth map is to use it to
| reproject the image into 3D and rotate it around a bit._
|
| They do this. See figure 4 in the paper. Are the results
| cherry-picked to look good? Probably. But so is everything
| else.
| threeseed wrote:
| > Figure 4
|
| We plug depth maps produced by Depth Pro, Marigold, Depth
| Anything v2, and Metric3D v2 into a recent publicly available
| novel view synthesis system.
|
| We demonstrate results on images from AM-2k. Depth Pro
| produces sharper and more accurate depth maps, yielding
| cleaner synthesized views. Depth Anything v2 and Metric3D v2
| suffer from misalignment between the input images and
| estimated depth maps, resulting in foreground pixels bleeding
| into the background.
|
| Marigold is considerably slower than Depth Pro and produces
| less accurate boundaries, yielding artifacts in synthesized
| images.
| incrudible wrote:
| > I'd bet that if you did that on these examples you'd see that
| the hair, rather than being attached to the animal, is floating
| halfway between the animal and the background.
|
| You're correct about that, but for something like a matte or
| depth threshold, that's exactly what you want in order to get a
| smooth, controllable transition within the limited resolution
| you have. For that use case, especially with the fuzzy hair,
| it's pretty good.
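|
| For instance, a depth threshold with a soft transition band (a
| sketch; near and far are illustrative parameters) turns that
| halfway-floating hair into intermediate matte values rather
| than a hard cut:
|
|       import numpy as np
|
|       def depth_matte(depth, near, far):
|           # Soft foreground matte: 1 in front of near, 0 behind
|           # far, a ramp in between; fuzzy hair lands on the ramp.
|           return np.clip((far - depth) / max(far - near, 1e-6),
|                          0.0, 1.0)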
| modeless wrote:
| It's not exactly what you want because you will get both
| background bleeding into the foreground and clipping of the
| parts of the foreground that fall under your threshold. What
| you want is for the neural net to estimate the different
| color contributions of background and foreground at each
| pixel so you can separate them without bleeding or clipping.
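|
| What's being described is essentially alpha matting rather than
| a depth map: per pixel, estimates of foreground color F,
| background color B, and coverage alpha. A sketch of that
| compositing model (illustrative, not any particular network's
| output):
|
|       # Observed pixel = alpha * F + (1 - alpha) * B. With
|       # per-pixel (F, B, alpha) you can blur B alone and
|       # re-composite, instead of blurring one color by one depth.
|       def composite(fg, bg, alpha):
|           return alpha * fg + (1.0 - alpha) * bg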
| incrudible wrote:
| It's what you'd want out of a _depth map_ used for that
| purpose. What you're describing is not a depth map.
| zardo wrote:
| Maybe the depthMap should only accept images that have
| been typed as hairless.
| Stedag wrote:
| I work on time-of-flight cameras that need to handle the
| kind of data you're referring to.
|
| Each pixel takes multiple measurements over time of the
| intensity of reflected light that matches the emission
| pulse encodings. The result is essentially a vector of
| intensity over a set of distances.
|
| A low depth resolution example of reflected intensity by
| time (distance):
|
| i: _ _ ^ _ ^ - _ _
| d: 0 1 2 3 4 5 6 7
|
| In the above example, the pixel would exhibit an ambiguity
| between distances of 2 and 4.
|
| The simplest solution is to select the weighted average or
| median distance, which results in "flying pixels" or "mixed
| pixels", for which there are efficient filtering techniques.
| The bottom line is that for applications like low-latency
| obstacle detection on a cost-constrained mobile robot, some
| compression of depth information is required.
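|
| A small sketch of that compression step, assuming each pixel
| carries a vector of returned intensity over discrete distance
| bins as in the example above (collapse_depth is an illustrative
| name, not a real API):
|
|       import numpy as np
|
|       # intensities: (H, W, D) reflected intensity per distance bin
|       # distances:   (D,) bin centers in meters
|       def collapse_depth(intensities, distances, mode="peak"):
|           if mode == "peak":
|               # Strongest return wins; an ambiguous pixel snaps to
|               # one of the two surfaces.
|               return distances[np.argmax(intensities, axis=-1)]
|           # Weighted average: an ambiguous pixel lands between the
|           # two returns, i.e. the "flying pixels" mentioned above.
|           w = intensities / np.clip(
|               intensities.sum(axis=-1, keepdims=True), 1e-9, None)
|           return (w * distances).sum(axis=-1)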
|
| For the sake of inferring a highly realistic model from an
| image, neural radiance fields or Gaussian splats may best
| generate the representation you might be envisioning, with a
| volumetric representation of material properties like hair.
| This comes with higher compute costs, however, and doesn't
| factor in semantic interpretation of a scene. The top-
| performing results in photogrammetry have tended to combine
| less expensive techniques like this one, which better handle
| sparsity of scene coverage, with refinement of the result
| using more expensive techniques [1].
|
| 1: https://arxiv.org/pdf/2404.08252
| tedunangst wrote:
| Funny that Apple uses bash for a shell script that just runs
| wget. https://github.com/apple/ml-depth-
| pro/blob/main/get_pretrain...
| threeseed wrote:
| It would be pulling from an internal service in a development
| branch.
|
| So this just makes it easier to swap it out without making any
| other changes.
| brcmthrowaway wrote:
| Does this take lens distortion into account?
| dguest wrote:
| What does this look like on an M. C. Escher drawing, e.g.
|
| https://i.pinimg.com/originals/00/f4/8c/00f48c6b443c0ce14b51...
|
| ?
| yunohn wrote:
| Looks like a screenshot from the Monument Valley games, full of
| such Escher like levels.
| cpgxiii wrote:
| The monodepth space is full of people insisting that their models
| can produce metric depth with no explanation other than "NN does
| magic" for why metric depth is possible from generic mono images.
| If you provide a single arbitrary image, you can't generate
| depth that is immune to scale error (e.g. produce accurate depth
| for both an image of a real car and an image of a scale model of
| the same car).
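|
| The ambiguity is just pinhole projection: scaling object size
| and distance by the same factor leaves the image unchanged (a
| sketch with illustrative numbers):
|
|       # pixel_extent = f_px * object_size / distance
|       f_px = 1500.0                   # focal length in pixels
|
|       real_car = (4.5, 30.0)          # (size in m, distance in m)
|       scale_model = (0.45, 3.0)       # 1:10 model, 10x closer
|
|       for size, dist in (real_car, scale_model):
|           print(f_px * size / dist)   # both print 225.0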
|
| Plausibly, you can train a model that encodes sufficient
| information about a specific set of imager+lens combinations such
| that the lens distortion behavior of images captured by those
| imagers+lenses provides the necessary information to resolve the
| scale of objects, but that is a much weaker claim than what
| monodepth researchers generally make.
|
| Two notable cases where something like monodepth does reliably
| work are actually ones where considerably more information is
| present: animal eyes have substantial information about focus
| available, not to mention that eyes are nothing like a planar
| imager; and phase-detection autofocus uses an entirely different
| kind of data (phase offsets via special lenses) than monodepth
| models do (and, arguably, it is mostly a relative, incremental
| process rather than something that produces absolute depth).
___________________________________________________________________
(page generated 2024-10-04 23:01 UTC)