[HN Gopher] Splatt3R: Zero-Shot Gaussian Splatting from Uncalibr...
       ___________________________________________________________________
        
       Splatt3R: Zero-Shot Gaussian Splatting from Uncalibrated Image
       Pairs
        
       Author : jasondavies
       Score  : 105 points
       Date   : 2024-08-27 10:23 UTC (12 hours ago)
        
 (HTM) web link (splatt3r.active.vision)
 (TXT) w3m dump (splatt3r.active.vision)
        
       | jonhohle wrote:
       | The mirror in the example with the washing machine is amazing.
       | Obviously the model doesn't understand that it's a mirror so
       | renders it as if it were a window with volume behind the wall.
       | But it does it so realistically that it produces the same effect
       | as a mirror when viewed from different angles. This feels like
       | something out of a sci-fi detective movie.
        
         | scoopdewoop wrote:
         | Ha, reminds me of Duke Nukem mirrors which are essentially the
         | same thing, looking through a window to mirrored geometry
        
           | kridsdale3 wrote:
           | Damn, I'm lookin' good.
        
           | recursive wrote:
           | I think most video game mirrors use basically the same
           | technique.
        
         | HappMacDonald wrote:
         | Would love to see it try to handle a scene where the real
         | volume behind the mirror were also available then. :9
        
       | S0y wrote:
       | This is really awesome. A question for someone who knows more
       | about this: How much harder would it be to make this work using
       | any number of photos? I'm assuming this is the end goal for a
       | model like this.
       | 
       | Imagine being able to create an accurate enough 3D rendering of
       | any interior with just a bunch of snapshots anyone can take with
       | their phone.
        
         | dagmx wrote:
         | That's already how Gaussian splats work.
         | 
          | The novelty of Splatt3R (though I contest that they're the
          | first to do so) is that it needs fewer images than usual.
        
           | Arkanum wrote:
           | I think the novelty is that they don't have to optimise the
           | splats at all, they're directly predicted in a single forward
           | pass.
        
             | dagmx wrote:
              | That's not really novel either imho, though Google is
              | failing me on the specific papers I saw at SIGGRAPH.
              | 
              | Imho it's an interesting combination of technologies but
              | not novel in and of itself.
        
           | GaggiX wrote:
           | The novelty here is that it does work on uncalibrated images.
        
             | milleramp wrote:
             | Not really, it is using Mast3r to determine camera poses.
        
             | dagmx wrote:
              | A lot of splat systems do work on uncalibrated images so
              | that's not novel either. They all just do a camera solve,
              | which arguably isn't terrible for a stereo pair with low
              | divergence.
        
         | Arkanum wrote:
         | Probably not much harder, but you wouldn't get the same massive
         | jump in quality that you get going from 1 image to 2.
         | NeRF/Gaussian Splatting in general is what you're describing,
         | but from the looks of it, this just does it in a single forward
         | pass rather than optimising the gaussian/network weights.
        
       | rkagerer wrote:
       | What is a splat?
        
         | llm_nerd wrote:
         | You have a car in real life and want to visualize it in
         | software. You take some pictures of the car from various angles
         | -- each picture a 2D array of pixel data -- and process it
         | through software, transforming it into effectively 3D pixels:
         | Splats.
         | 
          | The splats are individual elongated 3D ellipsoids -- thousands to
         | millions of them -- floating in a 3D coordinate space. 3D
         | pixels, essentially. Each with radiance colour properties so
         | they might have different appearances from different angles
         | (e.g. environmental lighting or reflection, etc.)
         | 
         | The magic is obviously figuring out how each pixel in a set of
         | pictures correlates when translated to a 3D space filled with
         | splats. Traditionally it took a load of pictures for it to
         | rationalize, so doing it with two pictures is pretty amazing.
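          | 
          | As a rough sketch, one splat might carry parameters like
          | these (illustrative Python, not Splatt3R's actual storage
          | format):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Splat:
    """One 3D Gaussian "splat" (illustrative layout only)."""
    mean: np.ndarray      # (3,) center in world space
    scale: np.ndarray     # (3,) per-axis extent -> elongated ellipsoid
    rotation: np.ndarray  # (4,) unit quaternion (w, x, y, z) orienting it
    opacity: float        # alpha in [0, 1]
    color: np.ndarray     # (3,) RGB; real systems use spherical harmonics
                          # so appearance can vary with viewing angle

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T, the 3x3 covariance of the Gaussian."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T
```

          | The covariance is what gets "flattened" onto the screen when
          | the splat is rendered.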
        
           | CamperBob2 wrote:
           | So, voxels, then...?
        
             | llm_nerd wrote:
             | Similar, but with the significant difference that splats
             | are elongated spheres with variable orientation and
              | elongation. Voxels are fixed-size, fixed-orientation
             | cubes. Splatting can be much more efficient for many
             | scenarios than voxels.
        
             | kridsdale3 wrote:
             | Fuzzy, round-ish voxels.
        
           | bredren wrote:
           | What do you call the processing after a splat, that
           | identifies what's in the model and generates what should
           | exist on the other side?
        
         | boltzmann64 wrote:
         | When you throw a balloon of colored water at a wall, the
         | impression it makes on the wall is called a splat. Say you have
          | a function which takes a point in 3D and outputs a density
          | value which goes to zero as you move away from the function's
          | location (mean), like a bell curve (literally a Gaussian),
          | and you throw (project) that function onto a plane (your
          | camera film): you get a splat.
         | 
          | Note: I've made some simplifying assumptions in the above
         | explanation.
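          | 
          | The projection step can be sketched in a few lines, assuming
          | the simplest case of an orthographic camera looking down the
          | z-axis (a real renderer uses the Jacobian of the perspective
          | projection instead):

```python
import numpy as np

def splat_gaussian_orthographic(mean3, cov3):
    """Project a 3D Gaussian onto the z = 0 image plane, assuming an
    orthographic camera looking down the z-axis. Marginalizing out
    depth keeps the top-left 2x2 block of the covariance; perspective
    cameras need the projection Jacobian instead."""
    return mean3[:2], cov3[:2, :2]

def splat_footprint(p, mean2, cov2):
    """Unnormalized bell-curve density of the projected splat at pixel p."""
    d = p - mean2
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov2) @ d))
```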
        
         | dimatura wrote:
         | I'm not a computer graphics expert, but traditionally (since
         | long before the latest 3D gaussian splatting) I've seen
          | splatting used in computer graphics to describe a way of
         | rendering 3D elements onto a 2D canvas with some
         | "transparency", similar to 2D alpha compositing. I think the
         | word derives from "splatter" - like what happens when you throw
         | a tomato against a wall, except here you're throwing some 3D
         | entity onto the camera plane. In the current context of 3D
         | gaussian splatting, the entities that are splatted are 3D
         | gaussians, and the parameters of those 3D gaussians are
         | inferred with optimization at run time and/or predicted from a
         | trained model.
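          | 
          | The "transparency" part is plain front-to-back alpha
          | compositing of the splatted footprints, e.g. (minimal
          | sketch, one scalar color channel for brevity):

```python
def composite_front_to_back(samples):
    """Blend splatted footprints with the "over" operator, nearest
    splat first. samples: [(color, alpha), ...] sorted front to back."""
    out = 0.0
    transmittance = 1.0  # how much light still passes through
    for color, alpha in samples:
        out += transmittance * alpha * color
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:  # early exit once effectively opaque
            break
    return out
```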
        
       | teqsun wrote:
        | Just to check my understanding: the novel part of this is that
        | it generates the scene from two pictures taken with any
        | camera, without custom hand-calibration for that particular
        | camera, and everything else involved is existing technology?
        
       | refibrillator wrote:
       | Novel view synthesis via 3DGS requires knowledge of the camera
        | pose for every input image, i.e. the camera position and
        | orientation in 3D space.
       | 
       | Historically camera poses have been estimated via 2D image
       | matching techniques like SIFT [1], through software packages like
       | COLMAP.
       | 
       | These algorithms work well when you have many images that
       | methodically cover a scene. However they often struggle to
       | produce accurate estimates in the few image regime, or "in the
       | wild" where photos are taken casually with less rigorous scene
       | coverage.
       | 
       | To address this, a major trend in the field is to move away from
       | classical 2D algorithms, instead leveraging methods that
       | incorporate 3D "priors" or knowledge of the world.
       | 
       | To that end, this paper builds heavily on MASt3R [2], which is a
       | vision transformer model that has been trained to reconstruct a
       | 3D scene from 2D image pairs. The authors added another
       | projection head to output the initial parameters for each
       | gaussian primitive. They further optimize the gaussians through
       | some clever use of the original image pair and randomly selected
       | and rendered novel views, which is basically the original 3DGS
       | algorithm but using synthesized target images instead (hence
       | "zero-shot" in the title).
       | 
       | I do think this general approach will dominate the field in the
       | coming years, but it brings its own unique challenges.
       | 
       | In particular, the quadratic time complexity of transformers is
       | the main computational bottleneck preventing this technique from
       | being scaled up to more than two images at a time, and to
       | resolutions beyond 512 x 512.
       | 
       | Also, naive image matching itself has quadratic time complexity,
       | which is really painful with large dense latent vectors and can't
       | be accelerated with kd-trees due to the curse of dimensionality.
       | That's why the authors use a hierarchical coarse to fine
       | algorithm that approximates the exact computation and achieves
        | linear time complexity w.r.t. image resolution.
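        | 
        | To make the quadratic matching cost concrete, here's a naive
        | dense mutual-nearest-neighbour matcher (a sketch, not MASt3R's
        | actual coarse-to-fine algorithm); the all-pairs similarity
        | matrix is what blows up as O(N*M):

```python
import numpy as np

def mutual_nearest_matches(feat_a, feat_b):
    """Naive dense matching between two images' per-pixel descriptors.
    Building the (N, M) all-pairs similarity matrix is the quadratic
    step; hierarchical coarse-to-fine matching approximates it.
    Returns index pairs that are mutual nearest neighbours."""
    sim = feat_a @ feat_b.T        # (N, M) dot-product similarity
    best_b = sim.argmax(axis=1)    # for each a-feature, nearest b
    best_a = sim.argmax(axis=0)    # for each b-feature, nearest a
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]
```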
       | 
        | [1] https://en.m.wikipedia.org/wiki/Scale-invariant_feature_tran...
       | 
       | [2] https://github.com/naver/mast3r
        
       ___________________________________________________________________
       (page generated 2024-08-27 23:00 UTC)