[HN Gopher] MVSplat: Efficient 3D Gaussian Splatting from Sparse...
___________________________________________________________________
MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View
Images
Author : jasondavies
Score : 130 points
Date : 2024-08-12 09:23 UTC (2 days ago)
(HTM) web link (donydchen.github.io)
(TXT) w3m dump (donydchen.github.io)
| axoltl wrote:
| I'm having a hard time finding a reference to the hardware the
| inference is run on. The paper mentions training was done on a
| single A100 GPU so I'm going to assume inference was run on that
| same platform. The 22fps result is somewhat meaningless without
| that information.
|
| It does feel like we're getting closer and closer to being able
| to synthesize novel views in realtime from a small set of images
| at a framerate and quality high enough for use in AR, which is an
| interesting concept. I'd love to be able to 'walk around' in my
| photo library.
| tomp wrote:
| Once the Gaussian Splats are computed (whether via ML or
| classical optimisation), they're _very_ efficient to render
| (similar to 3D meshes used in games), so high fps isn't
| surprising.
|
| Having said that (I have yet to read the paper), "efficiency"
| probably refers to the first part (calculating the Gaussians
| in the first place), not rendering.
| axoltl wrote:
| You are correct. I was confusing this technique with Novel
| View Synthesis through diffusion (recent paper:
| https://arxiv.org/abs/2408.06157) where inference means
| generating frames rather than points.
| dagmx wrote:
| They're not "very efficient". They have a significant amount
| of overdraw due to their transparency and will be a lot more
| inefficient if you're only considering material-less surface
| representation.
|
| They're more efficient to capture, however. They're also more
| consistent in their render time, but meshes will easily be
| faster in most scenes, though they scale worse with complexity.
|
| The "efficiency" of splats is more about the material
| response and capturing complexity there, than it is about
| geometric representation.
| iforgotpassword wrote:
| > I'd love to be able to 'walk around' in my photo library.
|
| Yes this. I've been dreaming about this since I digitized my
| childhood photos a few years ago. There should be more than
| enough photos to reconstruct the entire apartment. Or my
| grandparents' house. Not sure though what happens if items and
| furniture move around between shots.
|
| I haven't looked much into this yet and just assumed it will
| need a bit more time until there is a batteries-included
| solution I can just download and run without reading ten pages
| of instructions and buying a GPU cluster.
| programjames wrote:
| Where would you use 3D Gaussian splatting? Static environments
| for video games?
| halfbreed wrote:
| I still wonder this myself, but the most obvious area that
| comes to mind is real estate virtual tours. Once a splat can
| render in the browser at high fps, I can see this replacing
| nearly all other technologies currently in use.
| 55555 wrote:
| Virtual tours for real estate
| littlestymaar wrote:
| Are there businesses doing it already or is the tech too
| immature to be used IRL right now?
| apinstein wrote:
| I started and ran a real estate photography platform from
| 2004-2018. We started r&d on this in ~2016 when consumer VR
| first came out. At the time we used photogrammetry and it
| was "dreadful" to try to capture due to mirrors, glass,
| etc.
|
| So I have been following GS tech for a while. I've not yet
| seen anything (open source / papers) that quite gets there.
| I do think it will.
|
| In my opinion, there are two useful things GS can bring to
| this industry.
|
| The first is the ability to use photo capture to re-render
| as a high-production-quality video, similar to what people do
| with Luma AI today. While this is a really cool capability,
| it's also not really that hard to do anymore with drones
| and gimbals. So, the experience of creating the same thing
| via GS has to be better and easier, and it's not clear when
| that is likely to happen given how painful the capture side
| is. You really need good real-time capture feedback to make
| sure you have good coverage. Finding out there's a hole
| once you're off location is a deal breaker.
|
| The second is to create VR-capable experiences. I think the
| first really useful thing for consumers will be being able to
| walk around in a small three- or four-foot area and get a
| stereo sense of what it's like to be there. This is an
| amazing consumer experience. But the practicality of
| scaling this depends on VR hardware and adoption, and that
| hasn't yet become commonplace enough to make consumer use
| "adjacent possible" for broad deployment.
|
| I could see it being used at the super high end to start out.
| noduerme wrote:
| Could be very useful for prototyping camera moves and lighting
| for film / commercial shoots on location. You might not even
| need to send a scout, just get a few pictures and be able to
| thumbnail a whole scene.
|
| I could also see a market for people who want to recreate
| virtual environments from old photos.
|
| Also, load the model on a single-lens 360 camera and infer
| stereoscopic output.
| dagmx wrote:
| No, Gaussian splats are pretty poor for video games. There's a
| significant amount of overdraw and they're not art directable
| or dynamic.
|
| Gaussian splats are much better suited for capturing things
| where you don't have artists available and don't have a ton of
| performance requirements with regards to frame time.
|
| So things like capturing real estate, historical venues,
| etc.
| vlovich123 wrote:
| Isn't that a "for now" problem rather than something
| intractable, at least performance-wise? Presumably HW and SW
| algorithms will continue to improve. Art directability may be
| a problem, but it feels like Gaussian splats + genAI models
| could be a match made in heaven, with the genAI model
| generating the starting image and splats generating the 3d
| scene from it.
| dagmx wrote:
| Sure, given an unlimited amount of time and resources, it's
| possible that Gaussian splats could be performant. But
| that's just too vague a discussion point to be meaningful.
|
| It's definitely not in the cards in the near term without a
| dramatic breakthrough. Splats have been a thing for decades
| so I'm not holding my breath.
| vlovich123 wrote:
| I mean here it is running at 22fps. In another 5 years
| it's reasonable to conservatively believe hardware and
| software will be 3x as powerful, which gets you to a
| smooth 60fps (3 x 22 = 66).
|
| What am I missing on the performance front?
| dagmx wrote:
| Well my critique of your comment is just that it's
| unbounded. Yes, eventually all compute will get better
| and we can use once-slow technologies. But that's not a
| very valuable discussion because nobody is saying it'll
| never be useful, just that it isn't for games today.
|
| It also ignores that everything else will be faster by
| then as well, and that you need to target different
| baselines of hardware.
|
| Either way 5 years for a 3x improvement seems
| unrealistic. 4 years saw a little over a doubling of
| performance at the highest end with a significant
| increase in power requirements as well, where we're now
| hitting realistic power limits.
|
| Taking the 2080 vs 4080 as their respective tiers: a
| 153% performance increase, 50% more power consumption,
| and a 50% price increase.
|
| So yes performance at the high end will increase, but
| it's scaling pretty poorly with cost and power. And the
| lower end isn't scaling as linearly.
|
| On the lower/mid end (1060 Ti vs 2060 Super) we saw only
| a 53% increase in that same time period.
| vlovich123 wrote:
| I guess to me that's still just a pessimistic
| perception. Ray tracing was also extremely slow for a
| long time until Nvidia built dedicated HW to accelerate
| it. Is there reason to believe that splats are already
| so well served by generic GPU compute that dedicated HW
| won't accelerate them in a meaningful way?
|
| Here are splats from 2020 working at 50-60fps [1]. I think
| my overall point is I don't think it's performance that's
| holding it back in games but tooling & whether it saves
| meaningful costs elsewhere in the game development
| pipeline.
|
| [1] https://x.com/8Infinite8/status/1699460316529090568
| hansworst wrote:
| > they're not art directable or dynamic
|
| I don't believe this is true. There are plenty of papers out
| there revolving around dynamic/animated splat-based models,
| some using generative models for that aspect too.
|
| There are also some tools out there that let you touch up/rig
| splat models. Still not near what you can do with meshes but
| I think fundamentally it's not impossible.
| dagmx wrote:
| You can touch up a splat in the same way you can apply
| gross edits to an image (cropping, color corrections etc),
| but you can't easily change it in a way like "make this
| bicycle handle bar more rounded". Ergo it's not art
| directable.
|
| With regards to dynamism, there are some papers, yes, but
| with heavy limitations. Rigging is doable but relighting is
| still hit or miss, while most complex rigs require a mesh
| underneath to drive a splat's surface. There's also the
| issue of making sure the splats are tight to the surface
| boundary, which is difficult without significant other
| input.
|
| Other dynamics like animation operate at a very gross
| level, but you can't for example do a voronoi fracture for
| destruction along a surface easily. And again, even at a
| large scale motion, you still have the issue of splat
| isolation and fitting to contend with.
|
| The neural motion papers you mention are interesting, but
| have a significant overhead currently outside of small use
| cases.
|
| Meshes are much more straightforward, and with advancements
| in neural materials and micropolygons (Nanite etc) it's
| really difficult to make a splat scene that isn't first
| represented as a mesh have the quality and performance
| needed. And if you're creating splats from a captured real
| world scene, they need significant cleanup first.
| nox101 wrote:
| Are they good for that either? I haven't seen one where the
| data isn't huge.
| dagmx wrote:
| The data is definitely an issue, but they do make for
| fairly convenient alternative to something like Matterport,
| where you need to rent their cameras, etc.
|
| Though I think Matterport will just start using them since
| the other half of their product is the user experience on
| the web.
| deckar01 wrote:
| Photography. A small cheap camera array could produce higher
| resolution, alternate angles, and arbitrary lens parameters
| that would otherwise require expensive or impossible lenses.
| Then you can render an array of angles for holographic
| displays.
| kersplody wrote:
| Volumetric live action performance capture. Basically a video
| you can walk around in. Currently requires a large synchronized
| camera array. Plays back on most mobile devices. Several major
| industry efforts are ongoing in this space.
| twelvechairs wrote:
| Basically when you don't want to spend time on pre-processing,
| e.g. through traditional photogrammetry. So near-real-time
| events, or where there are huge amounts of point cloud capture
| and comparatively little visualisation.
|
| Edit: others are mentioning real estate; I'd think that will
| prefer some pre-processing, but YMMV.
| tomp wrote:
| Not really.
|
| First of all, most GS take _posed_ images as input, so you
| need to run a traditional photogrammetry pipeline (COLMAP)
| anyway.
|
| The point of GS is that the result is far beyond anything
| that traditional photogrammetry (dense mesh reconstruction)
| can manage, _especially_ when it comes to "weird" stuff
| (semi-transparent objects).
| t43562 wrote:
| What about virtual tourism? See the pyramids without the
| expense of going there.
| littlestymaar wrote:
| Have you watched the basketball games in the Olympics? Every
| once in a while, they showed a replay of a key play with the
| camera moving between two views in the middle of the shot.
|
| It likely wasn't GS, since there were tons of artifacts that
| didn't look like the ones GS produces, but it could have been
| used for such things.
|
| For instance, with some kind of 4D GS we could even remap the
| camera view entirely, to have a virtual camera letting us see
| the shot from the eyes of Steph Curry with Batum and Fournier
| double-teaming him.
| jorgemf wrote:
| Gaussian splatting transforms images into a point cloud. GPUs
| can render these points, but it is a very slow process; you
| need to convert the point cloud into meshes. So basically it
| is the initial step for capturing environments before
| converting them to 3D meshes that GPUs can use for anything
| you want. It is much cheaper to use pictures to get a 3D
| representation of an object or environment than to buy
| professional equipment.
| andybak wrote:
| > Gaussian splatting transform images to a cloud points.
|
| Not exactly. The "splats" are spread out in space (big
| ellipsoids), partially transparent (what you end up seeing is
| the composite of all the splats you can see in a given
| direction) AND view-dependent (they render differently
| depending on the direction you are looking).
|
| Also - there's not a simple spatial relationship between
| splats and solid objects. The resulting surfaces are a kind
| of optical illusion based on all the splats you're seeing in
| a specific direction. (Some methods have attempted to lock
| splats more closely to the surfaces they are meant to
| represent, but I don't know what the tradeoffs are.)
|
| Generating a mesh from splats is possible but then you've
| thrown away everything that makes a splat special. You're
| back to shitty photogrammetry. All the clever stuff (which is
| a kind of radiance capture) is gone.
|
| Splats are a lot faster to render than NeRFs - which is their
| appeal. But they're heavier than triangles, due to having to
| sort them every frame (because transparent objects don't
| composite correctly without depth sorting).
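|
| As a minimal sketch (NumPy, illustrative only, not any
| particular renderer's implementation), the per-pixel work
| looks roughly like this:
|
|     import numpy as np
|
|     def composite_splats(depths, colors, alphas):
|         # depths: (N,), colors: (N, 3), alphas: (N,) for the
|         # splats overlapping one pixel. Real renderers redo
|         # the sort every frame, since order depends on view.
|         order = np.argsort(depths)        # nearest splat first
|         out = np.zeros(3)
|         transmittance = 1.0               # light not yet absorbed
|         for i in order:
|             out += transmittance * alphas[i] * colors[i]
|             transmittance *= 1.0 - alphas[i]
|             if transmittance < 1e-4:      # early exit when opaque
|                 break
|         return out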
| vessenes wrote:
| Minor nit -- in what way do splats render differently
| depending on the viewing direction? To my mind these are
| probabilistic ellipsoids in 3D (or 4D for motion splats)
| space, and so while any novel view will see a slightly
| different shape, that's an artifact of the view changing,
| not the splat. Do I understand it (or you) correctly?
| refibrillator wrote:
| In 3DGS, spherical harmonics are used to model view-
| dependent changes in color.
|
| https://en.m.wikipedia.org/wiki/Spherical_harmonics
|
| Basically for each Gaussian there is a set of
| coefficients and those are used to calculate what color
| should be rendered depending on the viewing angle of the
| camera. And the SH coeffs are optimized through gradient
| descent just like the other parameters including position
| and shape.
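|
| For low degrees it's just a few constants. A minimal sketch
| following the common 3DGS convention (the coefficient layout
| here is illustrative):
|
|     import numpy as np
|
|     SH_C0 = 0.28209479177387814  # 1 / (2*sqrt(pi))
|     SH_C1 = 0.4886025119029199   # sqrt(3) / (2*sqrt(pi))
|
|     def sh_to_rgb(coeffs, view_dir):
|         # coeffs: (4, 3) -- one degree-0 plus three degree-1
|         # coefficients per color channel. view_dir: unit
|         # vector from the camera to the Gaussian's center.
|         x, y, z = view_dir
|         rgb = (SH_C0 * coeffs[0]         # view-independent base
|                - SH_C1 * y * coeffs[1]   # degree-1 terms add
|                + SH_C1 * z * coeffs[2]   # a directional tint
|                - SH_C1 * x * coeffs[3])
|         return np.clip(rgb + 0.5, 0.0, 1.0)  # 3DGS offsets by 0.5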
| vessenes wrote:
| Ah, thank you. Taking into account, say,
| reflection/refraction.
| two_handfuls wrote:
| Good question. One thing I know they're good for is 3D photos,
| because they solve a fundamental issue with the current tech:
| IPD (interpupillary distance).
|
| The current tech (Apple Vision Pro included) uses two photos:
| one per eye. If the photos were taken from a distance that
| matches the distance between your eyes, then the effect is
| convincing. Otherwise, it looks a bit off.
|
| The other problem is that a big part of the 3D perception comes
| from parallax: how the image changes with head motions (even
| small motions).
|
| Techniques that are not limited to two fixed images, but
| instead allow us to create new views for small motions, are
| great for much more impressive 3D photos.
|
| With more input photos you get a "walkable photo": a photo that
| you can take a few steps in, say if you are wearing a VR
| headset.
|
| I'm sure 3D Gaussian splatting is good for other things too,
| given the excitement around them. Backgrounds in movies maybe?
| praveen9920 wrote:
| One application I can think of is Google Street View. Gaussian
| splatting can potentially "smoothen" the transition between the
| images and make it look more realistic.
| lawlessone wrote:
| >Where would you use 3D Gaussian splatting?
|
| The primary purpose of Gaussian splatting is to frontpage here
| every two weeks.
| rebuilder wrote:
| The indoor example with the staircase and railing was really
| surprising - there's only one view of much of what's behind the
| doorframe and it still seems to reconstruct a pretty good 3d
| scene there.
| petargyurov wrote:
| Someone help me understand inference here.
|
| None of the Gaussian splat repos I have looked at mention how
| to use the pre-trained models to "simply" take MY images as
| input and output a GS. They all talk about evaluation, but the
| CLI requires the eval datasets as input.
|
| Is training/fine-tuning on my data the only way to get the
| output?
| littlestymaar wrote:
| Is there really such a thing as a pre-trained model when it
| comes to Gaussian splatting?
|
| I'm not familiar at all with the topic (nor have I read this
| particular paper) but I remember that the original 3DGS paper
| took pride in the fact that this was not "AI" or "deep
| learning". There's still a gradient descent process to get the
| Gaussian splats from the data, but as I understood it, there is
| no "training on a large dataset then inference", building the
| GS from your data is the "training phase" and then rendering it
| is the equivalent of inference.
|
| Maybe I understood it all wrong though, or maybe new variants
| of Gaussian splatting use a deep learning network in addition
| to what was done in the original work, so I'll be happy to be
| corrected/clarified by someone with actual knowledge here.
| jorgemf wrote:
| Basically you train a model for each set of images. The model
| is a neural network able to render the final image; different
| images will require different trained models. Initial Gaussian
| splatting models took hours to train; last year's models took
| minutes. I am not sure how long this one takes, but it should
| be between minutes and hours (and probably closer to minutes
| than hours).
| petargyurov wrote:
| Thank you, that explains it.
| tomp wrote:
| No, what you're describing is NeRF, the predecessor
| technology.
|
| The output of Gaussian Splat "training" is a set of 3d
| gaussians, which can be rendered very quickly. No ML involved
| at all (only optimisation)!
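|
| Concretely, each splat is just a small bundle of optimised
| parameters; a hypothetical layout (real implementations
| store flat arrays of these):
|
|     from dataclasses import dataclass
|     import numpy as np
|
|     @dataclass
|     class Splat:
|         position: np.ndarray  # (3,) center in world space
|         scale: np.ndarray     # (3,) ellipsoid axis lengths
|         rotation: np.ndarray  # (4,) orientation quaternion
|         opacity: float        # alpha in [0, 1]
|         sh_coeffs: np.ndarray # spherical-harmonic color terms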
|
| They usually require running COLMAP first (to get the
| relative camera poses across the different images), but
| NVIDIA's InstantSplat doesn't (it however _does_ use an ML
| model instead!)
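|
| The COLMAP step is just a few CLI calls. A rough sketch with
| hypothetical paths, assuming the colmap binary is on PATH:
|
|     import subprocess
|     from pathlib import Path
|
|     images, work = Path("photos"), Path("colmap_out")
|     (work / "sparse").mkdir(parents=True, exist_ok=True)
|     db = str(work / "database.db")
|
|     # 1. detect features, 2. match across images, 3. run SfM
|     subprocess.run(["colmap", "feature_extractor",
|                     "--database_path", db,
|                     "--image_path", str(images)], check=True)
|     subprocess.run(["colmap", "exhaustive_matcher",
|                     "--database_path", db], check=True)
|     subprocess.run(["colmap", "mapper",
|                     "--database_path", db,
|                     "--image_path", str(images),
|                     "--output_path", str(work / "sparse")],
|                    check=True)
|     # work/sparse/0 now holds the camera poses that most
|     # splatting pipelines consume.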
| dagmx wrote:
| Nit: splats are significantly older than NeRFs. They just
| had a resurgence after NeRFs.
|
| We've been using pretty similar technology for decades in
| areas like RenderMan radiance caches before RIS.
| vessenes wrote:
| The tech stack in the splat world is still really young. For
| instance, I was thinking to myself: "Cool, MVSplat is pretty
| fast. Maybe I'll use it to get some renderings of a field by my
| house."
|
| As far as I can tell, I will need to offer a bunch of photographs
| with camera pose data added -- okay, fair enough, the splat
| architecture exists to generate splats.
|
| Now, what's the best way to get camera pose data from arbitrary
| outdoor photos? ... Cue a long wrangle through multiple papers.
| Maybe, as of today... FAR? (https://crockwell.github.io/far/).
| That claims up to 80% pose accuracy depending on source data.
|
| I have no idea how MVSplat will deal with 80% accurate camera
| pose data... And I also don't understand if I should use a pre-
| trained model from them, train my own, or fine-tune one of their
| models on my photos... This is sounding like a long project.
|
| I don't say this to complain, only to note where the edges are
| right now, and think about the commercialization gap. There are
| iPhone apps that will get (shitty) splats together for you right
| now, and there are higher-end commercial products like Skydio
| that will work with a drone to fill in a three-dimensional
| representation of an object (or maybe some land, not sure about
| the outdoor support), but those are like multiple-thousand-
| dollar-per-month subscriptions + hardware as far as I can tell.
|
| Anyway, interesting. I expect that over the next few years we'll
| have push-button stacks based on 'good enough' open models, and
| those will iterate and go through cycles of being upsold /
| improved / etc. We are still a ways away from a trawl through an
| iPhone/gphoto library and a "hey, I made some environments for
| you!" Type of feature. But not infinitely far away.
| ryandamm wrote:
| I think the barrier to commercialization is the lack of
| demonstrated economic value to having push-button splats.
| There's no shortage of small teams wiring together open source
| splats / NeRF / whatever papers; there's a dearth of valuable,
| repeatable businesses that could make use of what those small
| teams are building.
|
| Would it be cool to just have content in 3D? Undoubtedly. But
| figuring out a use case, that's where people need to be
| focusing. I think there are a lot of opportunities, but it's
| still early days -- and not just for the technology.
| vessenes wrote:
| Yes - agreed. There's a clear use case for indie content, but
| tooling around editing/modifying/color/lighting has to
| improve, and rendering engines or converters need to get
| better. FWIW it doesn't seem like a dead-end tech to me
| though; more likely a gateway tech to cost improvements.
| We'll see.
| algebra-pretext wrote:
| COLMAP can generate pose data using structure-from-motion; if
| you use Nerfstudio to make your splat (with the Splatfacto
| method), it includes a command that does the COLMAP alignment
| for you, as sketched below. This definitely is a weak spot
| though, and a lot goes wrong in the alignment process unless
| you have a smooth walkthrough video of your subject with no
| other moving objects.
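|
| A minimal sketch of that flow (hypothetical folder names,
| assuming Nerfstudio's CLI tools are installed):
|
|     import subprocess
|
|     # ns-process-data runs COLMAP under the hood and writes
|     # poses in Nerfstudio's own format.
|     subprocess.run(["ns-process-data", "images",
|                     "--data", "walkthrough_frames",
|                     "--output-dir", "processed"], check=True)
|
|     # then train a splat with the Splatfacto method
|     subprocess.run(["ns-train", "splatfacto",
|                     "--data", "processed"], check=True)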
|
| On iPhone, Scaniverse (owned by Niantic) produces splats far
| more accurately than splatting from 2D video/images, because it
| uses LiDAR to gather the depth information needed for good
| alignment. I think even on older iPhones without LiDAR, it's
| able to estimate depth if the phone has multiple camera lenses.
| Like ryandamm said above, the main issue seems to be low
| value/demand for novel technology like this. Most of the use
| cases I can think of (real estate? shopping?) are usually
| better served with 2D videos and imagery.
___________________________________________________________________
(page generated 2024-08-14 23:01 UTC)