[HN Gopher] Splatt3R: Zero-Shot Gaussian Splatting from Uncalibr...
___________________________________________________________________
Splatt3R: Zero-Shot Gaussian Splatting from Uncalibrated Image
Pairs
Author : jasondavies
Score : 105 points
Date : 2024-08-27 10:23 UTC (12 hours ago)
(HTM) web link (splatt3r.active.vision)
(TXT) w3m dump (splatt3r.active.vision)
| jonhohle wrote:
| The mirror in the example with the washing machine is amazing.
| Obviously the model doesn't understand that it's a mirror so
| renders it as if it were a window with volume behind the wall.
| But it does it so realistically that it produces the same effect
| as a mirror when viewed from different angles. This feels like
| something out of a sci-fi detective movie.
| scoopdewoop wrote:
| Ha, reminds me of Duke Nukem mirrors which are essentially the
| same thing, looking through a window to mirrored geometry
| kridsdale3 wrote:
| Damn, I'm lookin' good.
| recursive wrote:
| I think most video game mirrors use basically the same
| technique.
| HappMacDonald wrote:
| Would love to see it try to handle a scene where the real
| volume behind the mirror were also available then. :9
| S0y wrote:
| This is really awesome. A question for someone who knows more
| about this: How much harder would it be to make this work using
| any number of photos? I'm assuming this is the end goal for a
| model like this.
|
| Imagine being able to create an accurate enough 3D rendering of
| any interior with just a bunch of snapshots anyone can take with
| their phone.
| dagmx wrote:
| That's already how Gaussian splats work.
|
| The novelty of Splatt3R (though I contest that they're the
| first to do so) is that it needs fewer images than usual.
| Arkanum wrote:
| I think the novelty is that they don't have to optimise the
| splats at all, they're directly predicted in a single forward
| pass.
| dagmx wrote:
| That's not really novel either imho, though google search
| is escaping me on the specific papers I saw at siggraph.
|
| Imho it's an interesting combination of technologies but
| not novel in and of itself.
| GaggiX wrote:
| The novelty here is that it does work on uncalibrated images.
| milleramp wrote:
| Not really, it is using Mast3r to determine camera poses.
| dagmx wrote:
| A lot of splat systems do work on uncalibrated images, so
| that's not novel either. They all just do a camera solve,
| which arguably isn't terrible for a stereo pair with low
| divergence.
| Arkanum wrote:
| Probably not much harder, but you wouldn't get the same massive
| jump in quality that you get going from 1 image to 2.
| NeRF/Gaussian Splatting in general is what you're describing,
| but from the looks of it, this just does it in a single forward
| pass rather than optimising the gaussian/network weights.
| rkagerer wrote:
| What is a splat?
| llm_nerd wrote:
| You have a car in real life and want to visualize it in
| software. You take some pictures of the car from various angles
| -- each picture a 2D array of pixel data -- and process it
| through software, transforming it into effectively 3D pixels:
| Splats.
|
| The splats are individual elongated 3D spheres -- thousands to
| millions of them -- floating in a 3D coordinate space. 3D
| pixels, essentially. Each with radiance colour properties so
| they might have different appearances from different angles
| (e.g. environmental lighting or reflection, etc.)
|
| The magic is obviously figuring out how each pixel in a set of
| pictures correlates when translated to a 3D space filled with
| splats. Traditionally it took a load of pictures for it to
| rationalize, so doing it with two pictures is pretty amazing.
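A minimal sketch of what one such splat carries, going by the description above (elongated ellipsoid plus view-dependent colour). The names and the `Splat` container are illustrative, not taken from any particular implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Splat:
    mean: np.ndarray      # (3,) centre in world space -- the "3D pixel" position
    scale: np.ndarray     # (3,) per-axis extent, giving the elongated shape
    rotation: np.ndarray  # (4,) quaternion (w, x, y, z) orienting the ellipsoid
    opacity: float        # how solid the splat is when composited
    sh_coeffs: np.ndarray # spherical-harmonic colour coefficients, so the
                          # colour can vary with viewing angle

def covariance(s: Splat) -> np.ndarray:
    """Build the 3x3 covariance Sigma = R S S^T R^T from scale + rotation."""
    w, x, y, z = s.rotation / np.linalg.norm(s.rotation)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(s.scale)
    return R @ S @ S.T @ R.T
```

The covariance is what makes a splat more than a point: it encodes both how stretched the ellipsoid is (scale) and which way it points (rotation).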
| CamperBob2 wrote:
| So, voxels, then...?
| llm_nerd wrote:
| Similar, but with the significant difference that splats
| are elongated spheres with variable orientation and
| elongation. Voxels are fixed sized, fixed orientation
| cubes. Splatting can be much more efficient for many
| scenarios than voxels.
| kridsdale3 wrote:
| Fuzzy, round-ish voxels.
| bredren wrote:
| What do you call the processing after a splat, that
| identifies what's in the model and generates what should
| exist on the other side?
| boltzmann64 wrote:
| When you throw a balloon of colored water at a wall, the
| impression it makes on the wall is called a splat. Say you have
| a function which takes a point in 3D and outputs a density
| value which goes to zero as you move away from the function's
| location (mean), like a bell curve (literally), and you throw
| (project) that function onto a plane (your camera film): you
| get a splat.
|
| Note: I've made some simplifying assumptions in the above
| explanation.
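The "throw it at the wall" picture above can be sketched in a few lines. This is a toy orthographic version (one of the simplifying assumptions: a real renderer projects through the camera model), using the fact that dropping the depth axis of a 3D Gaussian leaves you with a 2D Gaussian:

```python
import numpy as np

def gaussian_3d(p, mean, cov):
    """Unnormalised bell-curve density at 3D point p."""
    d = p - mean
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

def splat_2d(mean, cov):
    """Orthographic 'throw at the wall': marginalise out the depth axis.
    The footprint on the plane is again a Gaussian, whose mean is the
    first two coordinates and whose covariance is the top-left 2x2 block."""
    return mean[:2], cov[:2, :2]
```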
| dimatura wrote:
| I'm not a computer graphics expert, but traditionally (since
| long before the latest 3D gaussian splatting) I've seen
| splatting used in computer graphics to describe a way of
| rendering 3D elements onto a 2D canvas with some
| "transparency", similar to 2D alpha compositing. I think the
| word derives from "splatter" - like what happens when you throw
| a tomato against a wall, except here you're throwing some 3D
| entity onto the camera plane. In the current context of 3D
| gaussian splatting, the entities that are splatted are 3D
| gaussians, and the parameters of those 3D gaussians are
| inferred with optimization at run time and/or predicted from a
| trained model.
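The "transparency" mentioned above is classic front-to-back alpha compositing: each splatted element contributes its colour weighted by its alpha and by the transmittance left over by everything in front of it. A minimal sketch (single colour channel for brevity):

```python
def composite(splats_front_to_back):
    """Front-to-back alpha compositing of (color, alpha) pairs,
    sorted nearest-first. Each element adds alpha * color scaled by
    the remaining transmittance, then reduces that transmittance."""
    color = 0.0
    transmittance = 1.0
    for c, a in splats_front_to_back:
        color += transmittance * a * c
        transmittance *= (1.0 - a)
    return color
```

For example, two half-opaque white splats in a row give 0.5 + 0.5 * 0.5 = 0.75, not 1.0, because the second one is seen through the first.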
| teqsun wrote:
| Just to check my understanding, the novel part of this is the
| fact that it generates it from two pictures from any camera
| without custom hand-calibration for that particular camera, and
| everything else involved is existing technology?
| refibrillator wrote:
| Novel view synthesis via 3DGS requires knowledge of the camera
| pose for every input image, ie the cam position and orientation
| in 3D space.
|
| Historically camera poses have been estimated via 2D image
| matching techniques like SIFT [1], through software packages like
| COLMAP.
|
| These algorithms work well when you have many images that
| methodically cover a scene. However they often struggle to
| produce accurate estimates in the few image regime, or "in the
| wild" where photos are taken casually with less rigorous scene
| coverage.
|
| To address this, a major trend in the field is to move away from
| classical 2D algorithms, instead leveraging methods that
| incorporate 3D "priors" or knowledge of the world.
|
| To that end, this paper builds heavily on MASt3R [2], which is a
| vision transformer model that has been trained to reconstruct a
| 3D scene from 2D image pairs. The authors added another
| projection head to output the initial parameters for each
| gaussian primitive. They further optimize the gaussians through
| some clever use of the original image pair and randomly selected
| and rendered novel views, which is basically the original 3DGS
| algorithm but using synthesized target images instead (hence
| "zero-shot" in the title).
|
| I do think this general approach will dominate the field in the
| coming years, but it brings its own unique challenges.
|
| In particular, the quadratic time complexity of transformers is
| the main computational bottleneck preventing this technique from
| being scaled up to more than two images at a time, and to
| resolutions beyond 512 x 512.
|
| Also, naive image matching itself has quadratic time complexity,
| which is really painful with large dense latent vectors and can't
| be accelerated with kd-trees due to the curse of dimensionality.
| That's why the authors use a hierarchical coarse to fine
| algorithm that approximates the exact computation and achieves
| linear time complexity wrt image resolution.
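To see where the quadratic cost in naive matching comes from: every descriptor in one image is compared against every descriptor in the other, keeping mutual nearest neighbours. A toy sketch (not MASt3R's actual coarse-to-fine scheme) with N x M distance computations:

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """Naive dense matching between two images' descriptor sets.
    Builds the full (N, M) distance matrix -- the quadratic cost --
    and keeps pairs that are each other's nearest neighbour."""
    d = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(-1)
    nn_ab = d.argmin(axis=1)  # best B-match for each A descriptor
    nn_ba = d.argmin(axis=0)  # best A-match for each B descriptor
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

With one descriptor per pixel at 512 x 512, N and M are ~260k each, so the distance matrix alone is ~7e10 entries; hence the hierarchical approximation.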
|
| [1] https://en.m.wikipedia.org/wiki/Scale-
| invariant_feature_tran...
|
| [2] https://github.com/naver/mast3r
___________________________________________________________________
(page generated 2024-08-27 23:00 UTC)