[HN Gopher] MVSplat: Efficient 3D Gaussian Splatting from Sparse...
       ___________________________________________________________________
        
       MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View
       Images
        
       Author : jasondavies
       Score  : 130 points
       Date   : 2024-08-12 09:23 UTC (2 days ago)
        
 (HTM) web link (donydchen.github.io)
 (TXT) w3m dump (donydchen.github.io)
        
       | axoltl wrote:
       | I'm having a hard time finding a reference to the hardware the
       | inference is run on. The paper mentions training was done on a
       | single A100 GPU so I'm going to assume inference was run on that
       | same platform. The 22fps result is somewhat meaningless without
       | that information.
       | 
       | It does feel like we're getting closer and closer to being able
       | to synthesize novel views in realtime from a small set of images
       | at a framerate and quality high enough for use in AR, which is an
       | interesting concept. I'd love to be able to 'walk around' in my
       | photo library.
        
         | tomp wrote:
         | Once the Gaussian Splats are computed (whether via ML or
         | classical optimisation), they're _very_ efficient to render
          | (similar to 3D meshes used in games). High fps alone isn't
          | remarkable.
         | 
          | Having said that (I have yet to read the paper), "efficiency"
          | probably refers to the first part (calculating the Gaussians in
          | the first place), not rendering.
        
           | axoltl wrote:
           | You are correct. I was confusing this technique with Novel
           | View Synthesis through diffusion (recent paper:
           | https://arxiv.org/abs/2408.06157) where inference means
           | generating frames rather than points.
        
           | dagmx wrote:
            | They're not "very efficient". They have a significant amount
            | of overdraw due to their transparency and will be a lot less
            | efficient if you're only considering material-less surface
            | representation.
            | 
            | They're more efficient to capture, however. They're also
            | more constant in their render time, whereas meshes will
            | easily be faster in most scenes but scale worse with
            | complexity.
           | 
           | The "efficiency" of splats is more about the material
           | response and capturing complexity there, than it is about
           | geometric representation.
        
         | iforgotpassword wrote:
         | > I'd love to be able to 'walk around' in my photo library.
         | 
         | Yes this. I've been dreaming about this since I digitized my
         | childhood photos a few years ago. There should be more than
         | enough photos to reconstruct the entire apartment. Or my
          | grandparents' house. Not sure though what happens if items and
          | furniture move around between shots.
         | 
          | I haven't looked much into this yet and just assumed it will
          | need a bit more time until there's a batteries-included
          | solution I can just download and run without reading ten pages
          | of instructions and buying a GPU cluster.
        
       | programjames wrote:
       | Where would you use 3D Gaussian splatting? Static environments
       | for video games?
        
         | halfbreed wrote:
         | I still wonder this myself, but the most obvious area that
         | comes to mind is real estate virtual tours. Once a splat can
          | render in the browser at high fps, then I see this replacing
          | nearly all other technologies currently being used.
        
         | 55555 wrote:
         | Virtual tours for real estate
        
           | littlestymaar wrote:
           | Are there businesses doing it already or is the tech too
           | immature to be used IRL right now?
        
             | apinstein wrote:
             | I started and ran a real estate photography platform from
              | 2004-2018. We started R&D on this in ~2016, when consumer VR
             | first came out. At the time we used photogrammetry and it
             | was "dreadful" to try to capture due to mirrors, glass,
             | etc.
             | 
             | So I have been following GS tech for a while. I've not yet
             | seen anything (open source / papers) that quite gets there
             | yet. I do think it will.
             | 
              | In my opinion, there are two useful things GS can bring
              | to this industry.
             | 
              | The first is the ability to use photo capture to re-render
              | as a high production-quality video, similar to what people
              | do with Luma AI today. While this is a really cool
              | capability,
             | it's also not really that hard to do anymore with drones
             | and gimbals. So, the experience of creating the same thing
              | via GS has to be better and easier, and it's not clear
              | when that will happen, given how painful the capture side
              | is. You really need good real-time capture feedback to
              | make sure you have good coverage. Finding out there's a
              | hole once you're off location is a deal breaker.
             | 
              | The second is to create VR-capable experiences. I think
              | the first really useful thing for consumers will be being
              | able to walk around in a small three- or four-foot area
              | and get a stereo sense of what it's like to be there. This
              | is an
             | amazing consumer experience. But the practicality of
             | scaling this depends on VR hardware and adoption, and that
             | hasn't yet become commonplace enough to make consumer use
             | "adjacent possible" for broad deployment.
             | 
              | I could see it being used on the super high end to start.
        
         | noduerme wrote:
         | Could be very useful for prototyping camera moves and lighting
         | for film / commercial shoots on location. You might not even
         | need to send a scout, just get a few pictures and be able to
         | thumbnail a whole scene.
         | 
         | I could also see a market for people who want to recreate
         | virtual environments from old photos.
         | 
         | Also, load the model on a single-lens 360 camera and infer
         | stereoscopic output.
        
         | dagmx wrote:
         | No, Gaussian splats are pretty poor for video games. There's a
         | significant amount of overdraw and they're not art directable
         | or dynamic.
         | 
         | Gaussian splats are much better suited for capturing things
         | where you don't have artists available and don't have a ton of
         | performance requirements with regards to frame time.
         | 
            | So things like capturing real estate, historical venues,
            | etc.
        
           | vlovich123 wrote:
            | Isn't that a "for now" problem rather than something
            | intractable, at least on the performance front? Presumably
            | HW and SW algorithms will continue to improve. Art direction
            | may be a problem, but it feels like Gaussian splats + genAI
            | models could be a match made in heaven, with the genAI model
            | generating the starting image and splats generating the 3D
            | scene from it.
        
             | dagmx wrote:
             | Sure, given an unlimited amount of time and resources, it's
             | possible that Gaussian splats could be performant. But
             | that's just too vague a discussion point to be meaningful.
             | 
             | It's definitely not in the cards in the near term without a
             | dramatic breakthrough. Splats have been a thing for decades
             | so I'm not holding my breath.
        
               | vlovich123 wrote:
               | I mean here it is running at 22fps. In another 5 years
                | it's reasonable to conservatively expect hardware and
                | software to be 3x as powerful, which gets you to a
                | smooth 60fps.
               | 
               | What am I missing on the performance front?
        
               | dagmx wrote:
                | Well, my critique of your comment is just that it's
                | unbounded. Yes, eventually all compute will get better
                | and we can use once-slow technologies. But that's not a
                | very valuable discussion, because nobody is saying it'll
                | never be useful, just that it isn't for games today.
                | 
                | It also ignores that everything else will be faster then
                | as well, and ignores needing to target different
                | baselines of hardware.
               | 
                | Either way, 5 years for a 3x improvement seems
                | unrealistic. The last 4 years saw a little over a
                | doubling of performance at the highest end, with a
                | significant increase in power requirements as well, and
                | we're now hitting realistic power limits.
               | 
                | Taking the 2080 vs 4080 as their respective tiers:
                | 
                |   153% performance increase
                |   50% more power consumption
                |   50% price increase
               | 
                | So yes, performance at the high end will increase, but
                | it's scaling pretty poorly with cost and power. And the
                | lower end isn't scaling even that well.
               | 
               | On the lower/mid end (1060 Ti vs 2060 Super) we saw only
               | a 53% increase in that same time period.
        
               | vlovich123 wrote:
                | I guess to me that's still just a pessimistic
                | perception. Ray tracing was also extremely slow for a
                | long time until Nvidia built dedicated HW to accelerate
                | it. Is there reason to believe that splats are already
                | so well served by generic GPU compute that dedicated HW
                | wouldn't accelerate them in a meaningful way?
               | 
               | Here's splats from 2020 working at 50-60fps [1]. I think
               | my overall point is I don't think it's performance that's
               | holding it back in games but tooling & whether it saves
               | meaningful costs elsewhere in the game development
               | pipeline.
               | 
               | [1] https://x.com/8Infinite8/status/1699460316529090568
        
           | hansworst wrote:
           | > they're not art directable or dynamic
           | 
            | I don't believe this is true. There are plenty of papers out
           | there revolving around dynamic/animated splat-based models,
           | some using generative models for that aspect too.
           | 
           | There are also some tools out there that let you touch up/rig
           | splat models. Still not near what you can do with meshes but
           | I think fundamentally it's not impossible.
        
             | dagmx wrote:
             | You can touch up a splat in the same way you can apply
              | gross edits to an image (cropping, color corrections,
              | etc.), but you can't easily change it in a way like "make
              | this bicycle handlebar more rounded". Ergo it's not art
              | directable.
             | 
              | With regards to dynamism, there are some papers, yes, but
              | with heavy limitations. Rigging is doable but relighting is
              | still hit and miss, while most complex rigs require a mesh
              | underneath to drive a splat's surface. There's also the
             | issue of making sure the splats are tight to the surface
             | boundary, which is difficult without significant other
             | input.
             | 
              | Other dynamics like animation operate at a very gross
              | level, but you can't, for example, easily do a Voronoi
              | fracture for destruction along a surface. And again, even
              | for large-scale motion, you still have the issue of splat
              | isolation and fitting to contend with.
             | 
             | The neural motion papers you mention are interesting, but
             | have a significant overhead currently outside of small use
             | cases.
             | 
              | Meshes are much more straightforward, and with
              | advancements in neural materials and micropolygons (Nanite
              | etc.) it's really difficult for a splat scene that isn't
              | first represented as a mesh to have the quality and
              | performance needed. And if you're creating splats from a
              | captured real-world scene, they need significant cleanup
              | first.
        
           | nox101 wrote:
           | Are they good for that either? I haven't seen one where the
            | data isn't huge.
        
             | dagmx wrote:
              | The data is definitely an issue, but they do make for
              | fairly convenient alternatives to something like
              | Matterport, where you need to rent their cameras, etc.
              | 
              | Though I think Matterport will just start using them,
              | since the other half of their product is the user
              | experience on the web.
        
         | deckar01 wrote:
         | Photography. A small cheap camera array could produce higher
         | resolution, alternate angles, and arbitrary lens parameters
         | that would otherwise require expensive or impossible lenses.
         | Then you can render an array of angles for holographic
         | displays.
        
         | kersplody wrote:
         | Volumetric live action performance capture. Basically a video
         | you can walk around in. Currently requires a large synchronized
         | camera array. Plays back on most mobile devices. Several major
         | industry efforts in this space ongoing.
        
         | twelvechairs wrote:
          | Basically when you don't want to spend time pre-processing,
          | e.g. through traditional photogrammetry. So near-real-time
          | events, or where there are huge amounts of point-cloud
          | capture and comparatively little visualisation.
         | 
          | Edit: others are mentioning real estate; I'd think that will
          | prefer some pre-processing, but YMMV.
        
           | tomp wrote:
           | Not really.
           | 
            | First of all, most GS take _posed_ images as input, so you
            | need to run a traditional photogrammetry pipeline (COLMAP)
            | anyway.
           | 
           | The purpose of GS is that the result is far beyond anything
           | that traditional photogrammetry (dense mesh reconstruction)
           | can manage, _especially_ when it comes to "weird" stuff
           | (semi-transparent objects).
        
         | t43562 wrote:
         | What about virtual tourism? See the pyramids without the
         | expense of going there.
        
         | littlestymaar wrote:
          | Have you watched the basketball games in the Olympics? Every
          | once in a while, they showed a replay of a key play with some
          | effect of the camera moving between two views in the middle
          | of the shot.
          | 
          | It was likely not GS, since there were tons of artifacts that
          | didn't look like the ones GS produces, but they could have
          | used it for such stuff.
          | 
          | For instance, with some kind of 4D GS we could even remap the
          | camera view entirely to have a virtual camera, allowing us to
          | see the shot from the eyes of Steph Curry with Batum and
          | Fournier double-teaming him.
        
         | jorgemf wrote:
          | Gaussian splatting transforms images into a point cloud. GPUs
          | can render these points, but it is a very slow process. You
          | need to transform the point cloud into meshes. So basically
          | it is the initial process to capture environments before
          | converting them to 3D meshes that the GPUs can use for
          | anything you want. It is much cheaper to use pictures to get
          | a 3D representation of an object or environment than buying
          | professional equipment.
        
           | andybak wrote:
            | > Gaussian splatting transforms images into a point cloud.
           | 
            | Not exactly. The "splats" are both spread out in space (big
            | ellipsoids), partially transparent (what you end up seeing is
            | the composite of all the splats you can see in a given
            | direction) AND view-dependent (they render differently
            | depending on the direction you are looking).
           | 
           | Also - there's not a simple spatial relationship between
           | splats and solid objects. The resulting surfaces are a kind
           | of optical illusion based on all the splats you're seeing in
            | a specific direction. (Some methods have attempted to lock
            | splats more closely to the surfaces they are meant to
            | represent, but I don't know what the tradeoffs are.)
           | 
           | Generating a mesh from splats is possible but then you've
           | thrown away everything that makes a splat special. You're
           | back to shitty photogrammetry. All the clever stuff (which is
           | a kind of radiance capture) is gone.
           | 
            | Splats are a lot faster to render than NeRFs - which is
            | their appeal. But they're heavier than triangles due to
            | having to sort them every frame (because transparent objects
            | don't composite correctly without depth sorting).
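            | 
            | To make the sorting point concrete, here's a minimal NumPy
            | sketch of front-to-back compositing for the splats covering
            | a single pixel (toy data, not any particular renderer's
            | code):
            | 
            |   import numpy as np
            | 
            |   def composite_pixel(depths, colors, alphas):
            |       # Blend the splats covering one pixel,
            |       # nearest first (front-to-back).
            |       order = np.argsort(depths)
            |       out = np.zeros(3)
            |       T = 1.0  # transmittance: light not yet blocked
            |       for i in order:
            |           out += T * alphas[i] * colors[i]
            |           T *= 1.0 - alphas[i]
            |           if T < 1e-4:  # effectively opaque
            |               break
            |       return out
            | 
            |   depths = np.array([2.0, 1.0, 3.0])
            |   colors = np.array([[1., 0., 0.],
            |                      [0., 1., 0.],
            |                      [0., 0., 1.]])
            |   alphas = np.array([0.6, 0.5, 0.8])
            |   print(composite_pixel(depths, colors, alphas))
            | 
            | The sort key is view-space depth, so it changes whenever
            | the camera moves - that per-frame re-sort is the cost
            | meshes don't pay.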
        
             | vessenes wrote:
              | Minor nit -- in what way do splats render differently
              | depending on the viewing direction? To my mind these are
             | probabilistic ellipsoids in 3D (or 4D for motion splats)
             | space, and so while any novel view will see a slightly
             | different shape, that's an artifact of the view changing,
             | not the splat. Do I understand it (or you) correctly?
        
               | refibrillator wrote:
               | In 3DGS, spherical harmonics are used to model view-
               | dependent changes in color.
               | 
               | https://en.m.wikipedia.org/wiki/Spherical_harmonics
               | 
               | Basically for each Gaussian there is a set of
               | coefficients and those are used to calculate what color
               | should be rendered depending on the viewing angle of the
               | camera. And the SH coeffs are optimized through gradient
               | descent just like the other parameters including position
               | and shape.
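                | 
                | As a rough illustration, here's a degree-0/degree-1
                | evaluation in NumPy, following the sign conventions of
                | the original 3DGS code (the coefficients below are made
                | up):
                | 
                |   import numpy as np
                | 
                |   C0 = 0.2820947917738781  # l=0 constant
                |   C1 = 0.4886025119029199  # l=1 constant
                | 
                |   def sh_to_rgb(sh, d):
                |       # sh: (4, 3) coeffs, d: view direction
                |       x, y, z = d / np.linalg.norm(d)
                |       rgb = (C0 * sh[0]
                |              - C1 * y * sh[1]
                |              + C1 * z * sh[2]
                |              - C1 * x * sh[3])
                |       return np.clip(rgb + 0.5, 0.0, 1.0)
                | 
                |   sh = np.random.default_rng(0).normal(
                |       size=(4, 3)) * 0.1
                |   print(sh_to_rgb(sh, np.array([0., 0., 1.])))
                |   print(sh_to_rgb(sh, np.array([1., 0., 0.])))
                | 
                | The l=0 term is the view-independent base color; the
                | higher-order terms shift it with viewing direction,
                | which is what approximates glossy or reflective
                | appearance.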
        
               | vessenes wrote:
                | Ah, thank you. Taking into account, say,
                | reflection/refraction.
        
         | two_handfuls wrote:
          | Good question. One thing I know they are good for is 3D
          | photos, because they solve a fundamental issue with the
          | current tech: IPD (interpupillary distance).
         | 
         | The current tech (Apple Vision Pro included) uses two photos:
         | one per eye. If the photos were taken from a distance that
         | matches the distance between your eyes, then the effect is
         | convincing. Otherwise, it looks a bit off.
         | 
         | The other problem is that a big part of the 3D perception comes
         | from parallax: how the image changes with head motions (even
         | small motions).
         | 
         | Techniques that are not limited to two fixed images, but
         | instead allow us to create new views for small motions, are
         | great for much more impressive 3D photos.
         | 
         | With more input photos you get a "walkable photo": a photo that
         | you can take a few steps in, say if you are wearing a VR
         | headset.
         | 
         | I'm sure 3D Gaussian splatting is good for other things too,
         | given the excitement around them. Backgrounds in movies maybe?
        
         | praveen9920 wrote:
         | One application I can think of is Google Street View. Gaussian
          | splatting can potentially "smooth out" the transition between
          | the images and make it look more realistic.
        
         | lawlessone wrote:
         | >Where would you use 3D Gaussian splatting?
         | 
         | The primary purpose of Gaussian splatting is to frontpage here
         | every two weeks.
        
       | rebuilder wrote:
       | The indoor example with the staircase and railing was really
       | surprising - there's only one view of much of what's behind the
        | doorframe and it still seems to reconstruct a pretty good 3D
       | scene there.
        
       | petargyurov wrote:
       | Someone help me understand inference here.
       | 
        | None of the Gaussian splat repos I have looked at mention how
        | to use the pre-trained models to "simply" take MY images as
        | input and output a GS. They all talk about evaluation, but the
        | command-line interface requires the eval datasets as input.
       | 
       | Is training/fine-tuning on my data the only way to get the
       | output?
        
         | littlestymaar wrote:
          | Is there really such a thing as a pre-trained model when it
          | comes to Gaussian splatting?
         | 
         | I'm not familiar at all with the topic (nor have I read this
          | particular paper), but I remember that the original 3DGS
          | paper took pride in the fact that this was not "AI" or "deep
          | learning". There's still a gradient-descent process to get the
         | Gaussian splats from the data, but as I understood it, there is
         | no "training on a large dataset then inference", building the
         | GS from your data is the "training phase" and then rendering it
         | is the equivalent of inference.
         | 
         | Maybe I understood it all wrong though, or maybe new variants
         | of Gaussian splatting use a deep learning network in addition
         | to what was done in the original work, so I'll be happy to be
         | corrected/clarified by someone with actual knowledge here.
        
         | jorgemf wrote:
          | Basically you train a model for each set of images. The model
          | is a neural network able to render the final image. Different
          | images will require different trained models. Initial
          | Gaussian splatting models took hours to train; last year's
          | models took minutes. I am not sure how long this one takes,
          | but it should be between minutes and hours (and probably
          | closer to minutes than hours).
        
           | petargyurov wrote:
           | Thank you, that explains it.
        
           | tomp wrote:
           | No, what you're describing is NeRF, the predecessor
           | technology.
           | 
            | The output of Gaussian splat "training" is a set of 3D
            | Gaussians, which can be rendered very quickly. No ML
            | involved at all (only optimisation)!
           | 
            | They usually require running COLMAP first (to get the
            | relative camera locations between different images), but
            | NVIDIA's InstantSplat doesn't (it however _does_ use an ML
            | model instead!)
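            | 
            | As a toy sketch of the "optimisation, not ML" point, you
            | can fit a few 2D isotropic Gaussians to a target image by
            | gradient descent on their parameters. Real 3DGS optimises
            | 3D means, covariances, opacities and SH coefficients the
            | same way, just through a differentiable rasteriser instead
            | of this toy renderer:
            | 
            |   import torch
            | 
            |   H = W = 32
            |   ys, xs = torch.meshgrid(torch.arange(H),
            |                           torch.arange(W),
            |                           indexing="ij")
            |   grid = torch.stack([ys, xs], -1).float()
            | 
            |   # Target: a blob "photographed" at (16, 16).
            |   target = torch.exp(
            |       -((grid - 16.) ** 2).sum(-1) / 20.)
            | 
            |   N = 8  # number of Gaussians
            |   means = (torch.rand(N, 2) * 32).requires_grad_()
            |   log_s = torch.zeros(N, requires_grad=True)
            |   logit_a = torch.zeros(N, requires_grad=True)
            |   opt = torch.optim.Adam([means, log_s, logit_a],
            |                          lr=0.5)
            | 
            |   def render():
            |       diff = grid[None] - means[:, None, None]
            |       d2 = (diff ** 2).sum(-1)  # (N, H, W)
            |       var = torch.exp(log_s)[:, None, None] ** 2
            |       a = torch.sigmoid(logit_a)[:, None, None]
            |       return (a * torch.exp(-d2 / (2 * var))).sum(0)
            | 
            |   for step in range(300):
            |       loss = (render() - target).abs().mean()
            |       opt.zero_grad()
            |       loss.backward()
            |       opt.step()
            |   print(f"final L1 loss: {loss.item():.4f}")
            | 
            | No network anywhere: the "model" is just the Gaussian
            | parameters themselves.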
        
             | dagmx wrote:
              | Nit: splats are significantly older than NeRFs. They just
              | had a resurgence after NeRFs.
             | 
              | We've been using pretty similar technology for decades in
              | areas like RenderMan radiance caches before RIS.
        
       | vessenes wrote:
       | The tech stack in the splat world is still really young. For
       | instance, I was thinking to myself: "Cool, MVSplat is pretty
       | fast. Maybe I'll use it to get some renderings of a field by my
       | house."
       | 
       | As far as I can tell, I will need to offer a bunch of photographs
       | with camera pose data added -- okay, fair enough, the splat
       | architecture exists to generate splats.
       | 
       | Now, what's the best way to get camera pose data from arbitrary
       | outdoor photos? ... Cue a long wrangle through multiple papers.
       | Maybe, as of today... FAR? (https://crockwell.github.io/far/).
       | That claims up to 80% pose accuracy depending on source data.
       | 
       | I have no idea how MVSplat will deal with 80% accurate camera
       | pose data... And I also don't understand if I should use a pre-
       | trained model from them or train my own or fine tune one of their
       | models on my photos... This is sounding like a long project.
       | 
       | I don't say this to complain, only to note where the edges are
       | right now, and think about the commercialization gap. There are
       | iPhone apps that will get (shitty) splats together for you right
       | now, and there are higher end commercial projects like Skydio
        | that will work with a drone to fill in a three-dimensional
        | representation of an object (or maybe some land, not sure about
        | the outdoor support), but those are like multi-thousand-dollar-
        | per-month subscriptions + hardware as far as I can tell.
       | 
       | Anyway, interesting. I expect that over the next few years we'll
        | have push-button stacks based on 'good enough' open models, and
       | those will iterate and go through cycles of being upsold /
       | improved / etc. We are still a ways away from a trawl through an
       | iPhone/gphoto library and a "hey, I made some environments for
       | you!" Type of feature. But not infinitely far away.
        
         | ryandamm wrote:
         | I think the barrier to commercialization is the lack of
          | demonstrated economic value to having push-button splats.
         | There's no shortage of small teams wiring together open source
         | splats / NeRF / whatever papers; there's a dearth of valuable,
         | repeatable businesses that could make use of what those small
         | teams are building.
         | 
         | Would it be cool to just have content in 3D? Undoubtedly. But
         | figuring out a use case, that's where people need to be
         | focusing. I think there are a lot of opportunities, but it's
         | still early days -- and not just for the technology.
        
           | vessenes wrote:
           | Yes - agreed. There's a clear use case for indie content, but
           | tooling around editing/modifying/color/lighting has to
           | improve, and rendering engines or converters need to get
           | better. FWIW it doesn't seem like a dead-end tech to me
           | though; more likely a gateway tech to cost improvements.
           | We'll see.
        
         | algebra-pretext wrote:
          | COLMAP generates pose data using structure-from-motion; if
          | you use Nerfstudio to make your splat (using the Splatfacto
          | method), it includes a command that will do the COLMAP
          | alignment. This is definitely a weak spot though, and a lot
          | goes wrong in the alignment process unless you have a smooth
          | walkthrough video of your subject with no other moving
          | objects.
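          | 
          | With Nerfstudio that pipeline is roughly two commands (a
          | sketch - exact flags vary by version):
          | 
          |   # run COLMAP / SfM to pose the images
          |   ns-process-data images --data ./photos \
          |       --output-dir ./posed
          | 
          |   # optimise a splat from the posed images
          |   ns-train splatfacto --data ./posed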
         | 
         | On iPhone, Scaniverse (owned by Niantic) produces splats far
         | more accurately than splatting from 2D video/images, because it
         | uses LiDAR to gather the depth information needed for good
         | alignment. I think even on older iPhones without LiDAR, it's
         | able to estimate depth if the phone has multiple camera lenses.
         | Like ryandamm said above, the main issue seems to be low
         | value/demand for novel technology like this. Most of the use
         | cases I can think of (real estate? shopping?) are usually
         | better served with 2D videos and imagery.
        
       ___________________________________________________________________
       (page generated 2024-08-14 23:01 UTC)