[HN Gopher] World Labs: Generate 3D worlds from a single image
___________________________________________________________________
World Labs: Generate 3D worlds from a single image
Author : dmarcos
Score : 249 points
Date : 2024-12-02 16:18 UTC (6 hours ago)
(HTM) web link (www.worldlabs.ai)
(TXT) w3m dump (www.worldlabs.ai)
| Minor49er wrote:
| Apparently a "world" is about the size of your average walk-in
| closet these days
| dmarcos wrote:
| Don't know the exact details, but I imagine the further you get
| from the original input image, the more the system needs to make
| up. It's the same reason generative video models are limited to
| a few seconds. It will improve.
| anticorporate wrote:
| Can you point at some data that would indicate it will
| improve? There are lots of statements today about GenAI akin
| to "that will get fixed later", but we don't actually seem to
| know what will improve and what will just get incrementally
| prettier without fixing the underlying issue.
| dmarcos wrote:
| Image generation has improved a lot in just 2 years: no
| more seven-fingered hands, better text rendering, better
| general image quality...
|
| We're just getting started with 3D and incentives for it to
| improve are strong
| marcsj wrote:
| The boundaries make it pretty obvious how flat this world still
| is, and even with the edges blurred it's clear there really
| isn't much to the models. This is cool for sure, and I can see
| it being useful for better photogrammetry and for assisting in
| building out worlds, but it isn't going to suddenly be used to
| make entire game worlds on its own.
| mattlondon wrote:
| Yeah I thought the same and was immediately disappointed that
| you could only step a tiny bit forwards.
|
| BUT, you can turn around and see something that I presume was
| entirely generated. So I don't think it is just doing some
| clever tricks to make the photo look 3D, but also "infilling"
| what is behind the camera too. That is kinda cool.
|
| I'd love to see this improved so I can walk around some more
| though, to see what is down those alleys etc.
| dmarcos wrote:
| I'm sure it'll improve. I imagine the further from the input
| image, the more the model has to make up. Gen-AI video
| models are limited to a few seconds; in 3D you're constrained
| to a volume.
| dyauspitr wrote:
| Is this just a perspective trick or are 3D models generated?
| dmarcos wrote:
| From some angles you can see artifacts that resemble those
| of gaussian splats. I'd bet it's a 3D representation, but not
| your traditional mesh. Very cool.
| echoangle wrote:
| Is that important? If the perspective trick works, you can
| probably do some variation of photogrammetry and get a 3D model
| chrsw wrote:
| I don't see any mention of being able to export 3d models from
| this.
| dag11 wrote:
| It's generating gaussian splats, so not quite 3D worlds. In
| short, they're pseudo 3D in that there is in fact a 3D point
| cloud, but the points are elliptical angular-dependent colored
| splats that get projected into 2D space[1]. They're better for
| reconstructing source-realistic renderings from a constrained
| view box, but break down outside of that.
|
| They're a really cool way to capture spatial memories though!
| Friends and I occasionally use Polycam or Luma Labs apps. But
| there's not _too much_ you can do with them due to the above
| limitations.
|
| From a brief look at the OP link, World Labs seems to be
| generating a 360° gaussian splat (for a limited view box) from
| a still photo, which is cool as hell! But we still have the
| same problem of "what do we do with gaussian splats".
|
| [1] This description is hand-wavey as I'm a relative layman
| when it comes to how these work. I'm sure someone can reply
| with a more precise answer if this one is bad.
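|
| (If it helps, here's a hand-wavey toy sketch in JavaScript of
| what "projected into 2D space" means. Every helper name here is
| made up -- this is not any real library's API, just the shape of
| the idea:
|
|     // Toy sketch: splat one gaussian into screen space.
|     // Real renderers sort thousands of these back-to-front
|     // and alpha-blend them.
|     function projectSplat(splat, camera) {
|       // Transform the splat's 3D center into view space.
|       const p = camera.worldToView(splat.mean); // [x, y, z]
|       // Perspective-project the center to pixel coordinates.
|       const px = camera.fx * p[0] / p[2] + camera.cx;
|       const py = camera.fy * p[1] / p[2] + camera.cy;
|       // The 3D covariance (the "elliptical" shape) gets squashed
|       // into a 2D covariance via the projection's Jacobian.
|       const cov2d = projectCovariance(splat.cov3d, camera, p);
|       // Color depends on view direction (spherical harmonics),
|       // which is the "angular-dependent" part.
|       const dir = viewDir(camera.position, splat.mean);
|       const color = evalSphericalHarmonics(splat.sh, dir);
|       return { px, py, cov2d, color, opacity: splat.opacity };
|     }
|
| projectCovariance, viewDir and evalSphericalHarmonics do the
| heavy lifting; see the 3D Gaussian Splatting paper for the real
| math.)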
| byteknight wrote:
| I feel like a lot of the good that projects like this do gets
| muddied by overly ambitious claims.
| echelon wrote:
| The overly ambitious claims are what led them to raise $230M+
| without a product.
|
| Fei-Fei Li is a luminary in the field, and she's assembled a
| stellar team of some of the best researchers in the space.
|
| Their gamble is that they'll be able to move faster than the
| open research and other companies looking to productionize that
| research.
|
| Time will tell if this can become an ElevenLabs or if it'll
| fizzle out like Character.ai.
|
| My worry is that without a product, they'll misallocate their
| research into cool problems that don't satisfy market demand.
| There's nothing like the north star of customers. They'll also
| have a tough time with hiring going forward with that
| valuation.
|
| The market of open research and models is producing a lot of
| neat stuff in 3D. But there's no open pool of data yet, despite
| HuggingFace and others trying.
|
| We'll see what happens.
| mclau156 wrote:
| I believe Fei-Fei is focused on physical-world interaction;
| https://behavior.stanford.edu/ is a project she works on
| involving more physical interaction with AI.
| bko wrote:
| I thought you were kidding, but World Labs really did raise
| $230M after being founded less than 1 year ago. Andreessen
| Horowitz is in, too.
|
| What would have to be true for this to be worth $100 billion
| after 5-10 years?
|
| https://www.crunchbase.com/organization/world-labs
| xnx wrote:
| I don't know enough about venture funding. Did $230M really
| get transferred into World Labs' bank account, or is this a
| "commitment" of $230M that gets trickled out a few million
| at a time?
| whiplash451 wrote:
| Part of it is often dedicated to compute through credit
| commitments at Azure/Google/AWS. Probably not all of it is
| cash available at t0.
|
| That money will go fast though, given GPU costs and
| salary ranges in the Bay Area.
| corysama wrote:
| The general theme is...
|
| Before: You and your cofounders share 100% of the stock
| in your company that's valued at $X.
|
| After: Your company now has everything it had before, plus
| $Y worth of "something" from the VCs. Your company is now
| valued at $X plus $Y. The VCs now hold stock in your
| company worth $Y. You and your cofounders still hold
| stock in your company worth $X.
|
| "Something" might be anything. Cash, stocks, commitments
| to resources, whatever.
| whiplash451 wrote:
| It doesn't have to be $100 billion. They probably raised at a
| few billion valuation -- which I do agree is still a lot.
| kfarr wrote:
| Not my project, but another recently published approach uses
| Depth Anywhere to create a virtual depth map for a given 360°
| equirectangular image, then applies it to a point cloud and
| renders it using three.js / A-Frame.
|
| It appears to offer a similar capability to the OP for creating
| scene depth from 2D, but uses a point cloud instead of gaussian
| splatting for rendering, so it looks more pixelated:
| https://github.com/akbartus/360-Depth-in-WebXR
|
| Also, unlike the World Labs example, you have the ability to go
| further outside the bounds of the point cloud and inspect the
| deficiencies of the approach. It's getting there but still
| needs work.
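|
| For flavor, the core unprojection is something like this (rough
| sketch; the variable names are mine, not code from the repo):
|
|     import * as THREE from 'three';
|
|     // Turn an equirectangular depth map into a point cloud.
|     // `depth`: Float32Array of per-pixel depths; `w`, `h`: its size.
|     function depthToPointCloud(depth, w, h) {
|       const positions = new Float32Array(w * h * 3);
|       for (let y = 0; y < h; y++) {
|         for (let x = 0; x < w; x++) {
|           // Pixel -> spherical direction (equirectangular unwrap).
|           const lon = (x / w) * 2 * Math.PI - Math.PI;
|           const lat = Math.PI / 2 - (y / h) * Math.PI;
|           const d = depth[y * w + x];
|           const i = (y * w + x) * 3;
|           positions[i]     = d * Math.cos(lat) * Math.sin(lon);
|           positions[i + 1] = d * Math.sin(lat);
|           positions[i + 2] = d * Math.cos(lat) * Math.cos(lon);
|         }
|       }
|       const geometry = new THREE.BufferGeometry();
|       geometry.setAttribute('position',
|         new THREE.BufferAttribute(positions, 3));
|       return new THREE.Points(geometry,
|         new THREE.PointsMaterial({ size: 0.01 }));
|     }
|
| Each pixel becomes one point pushed out along its viewing ray by
| the predicted depth, which is why holes appear as soon as you
| leave the capture position.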
| dmarcos wrote:
| Yeah A-Frame! It makes me happy to see my many years of
| maintaining it paying off
| kfarr wrote:
| Yes, and this is a great example of how open A-Frame is
| compared to the OP example. You can inspect every part of the
| experience, from the code to the actual runtime inspector, to
| see how Akbartus achieved the effect -- and then help to make
| it even better! :)
|
| I do think it may eventually be possible to use something like
| this to do all the processing in the browser -- Depth Anywhere
| plus splat reconstruction to fill in the holes of the current
| point cloud approach:
| https://github.com/ArthurBrussee/brush
| lastdong wrote:
| My first reaction after trying it was a bit of surprise when I
| got an "Out of bounds" message -- not what I expected from "3D
| worlds". Scrolling down to the "Looking Ahead" section, they
| are working on improving both size and fidelity.
| latexr wrote:
| Once you try the demos, the animated image at the top feels
| misleading. Each segment cuts at just the right point to make you
| think you'd be able to continue exploring these vast worlds, but
| in practice you can only walk a couple of steps before hitting an
| invisible wall, which becomes more frustrating than not being
| able to move at all. It feels like being trapped in a box. My
| reaction went from impressed to disappointed _fast_.
|
| I get these are early steps, but they oversold it.
| dmarcos wrote:
| In the looking ahead section of the post it says:
|
| "We are hard at work improving the size and fidelity of our
| generated worlds"
|
| I imagine the further you move from the input image, the more
| the model has to make up information and the harder to keep it
| consistent. Similar problem with video generation.
| latexr wrote:
| > I imagine the further you move from the input image, the
| more the model has to make up information and the harder to
| keep it consistent. Similar problem with video generation.
|
| Which is the same thing as saying this may turn out to be a
| dud, like so many other things in tech and the current crop
| of what we're calling AI.
|
| Like I said, I get this is an early demo, but don't oversell
| it. They could've started by being honest and clarifying
| they're generating _scenes_ (or whatever you want to call
| them, but they're _definitely_ not "worlds"), letting you
| play a bit, then explain the potential and progress. As it
| is, it just sounds like they want to immediately wow people
| with a fantasy and it detracts from what they do have.
| dmarcos wrote:
| Fair criticism. I'm also not a fan of hyperbole. I still find
| World Labs' stuff super intriguing, and I'm optimistic about
| their ability to fulfill the vision.
| add-sub-mul-div wrote:
| Maybe they think it's a good deal, producing some oversold
| tech demos in exchange for a decade's worth of funding and
| not having to produce anything more than an "Our Incredible
| Journey" letter at the end. The prospect of replacing all
| human labor has made it easier than ever to run the grift
| on investors in this time of peak FOMO.
| modeless wrote:
| Models are really great at making stuff up though. And video
| models already have very good consistency over thousands of
| frames. It seems like larger worlds shouldn't be a huge
| hurdle. I wonder why they launched without that, as this
| doesn't seem much better than previous work.
| dmarcos wrote:
| To be fair, they haven't launched; they're showing progress
| and laying out the vision.
|
| What previous work are you referring to?
| TeMPOraL wrote:
| In general, it depends on how much the model ends up
| "understanding" the input. (I use "understand" here in the
| sense some would claim SOTA LLMs do.)
|
| You can imagine this as a spectrum. On the one end you have
| models that, at each output pixel, try to predict pixels that
| are locally similar to ones in the previous frame; on the other
| end, you could imagine models that "parse" the initial input
| image to understand the scene - objects (buildings, doors,
| people, etc.) and their relationships, and separately, the
| style with which they're painted, and use that to extrapolate
| further frames[0]. The latter would obviously fare better,
| remaining stylistically consistent for longer.
|
| (This model claims to be of the second kind.)
|
| The way I see it: a human could do it[1], so there's no
| reason an ML model wouldn't be able to.
|
| --
|
| [0] - Brute-force approach: 1) "style-untransfer" the input,
| i.e. style-transfer to some common style, e.g. photorealistic
| or sketch, 2) extrapolate the style-untransfered image, and
| 3) style-transfer result back using original input as style
| reference. Feels like it should work somewhat okay-ish;
| wonder if anyone tried that.
|
| [1] - And the hard part wouldn't be extrapolating the scene,
| but rather keeping the style.
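|
| (Footnote [0] as a code sketch, in case it's clearer that way.
| Every function here is hypothetical, just naming the stages:
|
|     // 1) "untransfer" to a common style, 2) extrapolate there,
|     // 3) re-apply the original style. All names are made up.
|     async function extrapolateStylized(inputImage) {
|       const neutral  = await styleTransfer(inputImage, 'photorealistic');
|       const extended = await outpaint(neutral);
|       return styleTransfer(extended, { styleRef: inputImage });
|     }
|
| No idea whether any of the three stages is robust enough in
| practice.)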
| dmarcos wrote:
| This indeed looks more like photogrammetry than a diffusion
| model predicting the next frame. There's 3D information
| extracted from the input image and likely additional
| generated poses that allow reconstructing the scene with
| gaussian splats. Not sure how much segmentation
| (understanding of each part of the scene) is going on.
| Probably not much, if I had to guess.
| Hakkin wrote:
| You can bypass the "Out of bounds" message by setting a
| JavaScript breakpoint after `let t =
| JSON.parse(d[e].config_str)` and then running
| `Object.values(t.camera.presets).forEach(o => o.max_distance = 50)`
| in the console.
|
| It breaks down pretty quickly once you get outside the default
| bounds, as expected, though.
| dmarcos wrote:
| Good hack!
| jfactorial wrote:
| I wonder how much of the remaining work boils down to
| generating a new scene based on the camera's POV when the
| player hits one of the bounds, and keeping these generated
| scenes in a tree structure, joining scenes at boundaries.
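|
| Hypothetically, something like this (all of these names are
| made up, just sketching the idea):
|
|     // Lazily generate a neighboring scene when the player hits a
|     // boundary, and cache it in a graph keyed by (scene, edge) so
|     // walking back revisits the same scene instead of regenerating.
|     const scenes = new Map();
|
|     async function onBoundaryHit(scene, exitPose) {
|       const key = `${scene.id}:${exitPose.boundary}`;
|       if (!scenes.has(key)) {
|         // Render what the camera sees at the boundary and feed it
|         // back in as the "single input image" for the next scene.
|         const snapshot = renderView(scene, exitPose);
|         const next = await generateScene(snapshot); // expensive model call
|         next.neighbors = { [opposite(exitPose.boundary)]: scene };
|         scenes.set(key, next);
|       }
|       return scenes.get(key);
|     }
|
| Retracing the same edge would at least be consistent; loops
| through multiple scenes seem like the hard part.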
| lukev wrote:
| Yes, and you wouldn't even need to do it in realtime as a
| user walks around.
|
| Generate incrementally using a pathfinding system for a bot
| to move around and "create the world" as it goes, as if a
| Google street view car followed the philosophy of George
| Berkeley.
| tayistay wrote:
| I suspect the problem there is that the multiple paths to a
| new location will not yield consistent results.
| kbutler wrote:
| Yes, infinite exploration, but inconsistent
| idunnoman1222 wrote:
| The same as a dream
| SubiculumCode wrote:
| It seems you could take the image of the location near the
| boundary, then create a new 3D world from that, continually.
| jsheard wrote:
| You could try, but it would quickly devolve into non-
| euclidean nonsense without global knowledge of the areas it's
| already generated.
| boringg wrote:
| I mean, it's marketing hype for their product. It's a pretty
| good starting step though -- assuming they can build on it and
| expand that world space, as opposed to just converting an image
| to 3D.
|
| It certainly has some value: marketing, hiring, fundraising
| (assuming it's a private company).
|
| My take is that it's a good start, and 3-4 years from now it
| will have a lot of potential value in world creation if they
| can make the next steps.
| dmarcos wrote:
| It's definitely a balancing act. World Labs was in stealth for
| a bit. Without a brand, a stated mission, and examples / demos
| of what you are capable of, it's harder to hire, fundraise, or
| get the attention and mind-share you need once you are ready
| to ship product.
|
| The risk is setting expectations that can't be fulfilled.
|
| I'm in the 3D space and I'm optimistic about World Labs.
| qwertox wrote:
| I was a bit irritated by this at first as well, but then the
| game Myst came to mind.
|
| So I'm willing to accept the limitation, and at this point we
| know that this can only get better. Next I thought about the
| likelihood of Nvidia releasing an AI game engine, or more of a
| renderer, fully AI based. It should be happening within the
| next 10 years.
|
| Imagine creating a game by describing scenes, like the ones in
| the article, with a good morphing technology between scenes, so
| that the transitions between them are like auto-generated
| scenes which are just as playable.
|
| The effects shown in the article were very interesting, like
| the ripple, sonar or wave. The wave made me think about how
| trippy games could get in the future, more extreme versions of
| the Subnautica video [0] which was released last month.
|
| We could generate video games which would periodically slip
| into hallucinations, a thing that is barely doable today, akin
| to shader effects in Far Cry or other games when the player
| gets poisoned.
|
| A "Fiebertraum" (fever dream) engine.
|
| [0] https://www.youtube.com/watch?v=AJaV92DXN0s&t=218s
| dmarcos wrote:
| Yeah. That's the attitude! It's all about playing around the
| constraints. Tech has limitations? Yes, but also opens tons
| of new possibilities.
| latexr wrote:
| You're describing a pie in the sky. A vision. Not reality. We
| have been burned many times already, nothing in this field is
| a given.
|
| > at this point we know that this can only get better.
|
| We don't _know_ that. It will probably get better, but will
| it be better _enough_? No one knows.
|
| > It should be happening within the next 10 years.
|
| Every revolution in tech is always ten years away. By now
| that's a meme. Saying something is ten years away is about as
| valuable as saying one has no idea how doable it is.
|
| > Imagine
|
| Yes, I understand the goal. Everyone does, it's not
| complicated. We can all imagine Star Trek technology, we all
| know where the compass is pointed, that doesn't make it a
| given.
|
| In fact, the one thing we can say for sure about imagining
| how everything will be great in ten years is that we
| routinely fail to predict the bad parts. We don't live in
| fantasy land, advancements in tech are routinely used for
| detrimental reasons.
| Jach wrote:
| It's "old news" I guess at this point, but the AI Minecraft
| demo (every frame generated from the previous frame, no
| traditional "engine") is still the most impressive thing to
| me in this space https://oasis.us.decart.ai/welcome There are
| some interesting "speed runs" people have been doing like
| https://www.youtube.com/watch?v=3UaVQ5_euw8
|
| We might all be dead in 10 years, but with big tech companies
| making their plays, all the VC money flowing into new
| startups, and nuclear plants being brought online to power
| the next base model training runs, there's room for a little
| mild entertainment like these sorts of gimmicks in the next 3
| years or so. I doubt anything that comes of it will top even
| my top 15 video games list though.
| latexr wrote:
| > We might all be dead in 10 years, but with big tech
| companies making their plays, all the VC money flowing into
| new startups, and nuclear plants being brought online to
| power the next base model training runs, there's room for a
| little mild entertainment like these sorts of gimmicks in
| the next 3 years or so. I doubt anything that comes of it
| will top even my top 15 video games list though.
|
| That's a contender for the most depressing tradeoff ever.
| "Yeah, we'll all die in agony way before our time, but at
| least we got to play with a neat but ultimately
| underwhelming tool for a bit".
| idunnoman1222 wrote:
| Obviously the generation has to stop at some point, and
| obviously from any key image you could continue generating if
| you had unlimited GPU -- which, I'm sorry, they didn't provide
| for you.
| billconan wrote:
| what would be the business model?
| swframe2 wrote:
| To build an LLM that can reason about the 3D world. I suspect
| they will add the ability to reason about the physics of the
| world next. It's just another attempt to get closer to AGI.
|
| They most likely will have to pivot a few times but once they
| show their LLM solving problems that others can't, the others
| will quickly add these features too. Right now, it is cheaper
| to wait for World Labs to go first. The others are not that far
| behind: https://cat-4d.github.io/
| Uehreka wrote:
| I've been trying to get into this sort of 3D Gaussian Splatting
| stuff, particularly with this focus on environments as opposed to
| just individual objects or characters. Does anyone know of a
| model that's good at doing that and is openly distributed/locally
| runnable?
| evan_ wrote:
| When watching 3D movies with a VR headset you have to keep your
| head perfectly still, or the lack of parallax destroys the 3D
| illusion. Compare that to a 3D game, where moving your head
| actually moves you through space and lets you look around
| objects.
|
| Something like this applied to every frame of the movie would
| allow you to move around a little and preserve the perspective
| shifts. The limitation that you can only move about 4 feet in any
| direction would not matter for this use case.
|
| Of course this comes at the expense of the director and
| cinematographer's intention, which is no small thing.
| dmarcos wrote:
| Definitely. If there's a future for 3D and immersive video, it
| depends on adding more cues beyond stereo. Lack of parallax is
| one of the main causes of discomfort for many.
| ChicagoBoy11 wrote:
| Have you ever seen the Google Lightfields demo? They have a rig
| they concocted to essentially capture a "volume" of video to
| allow for the stereoscopic effect in VR AND which then cleverly
| presents a different combination of the footage it captured
| based on your precise head position, so it makes up for these
| distortions. I found it absolutely breathtaking... first time
| seeing VR for a space that actually made me feel like I was in
| it. This was A LONG time ago and I suspected I'd be seeing a
| lot more of that content, but I was... very wrong, it seems.
|
| Your point is completely correct. Even Apple's awesome new
| stereoscopic 3D short film for the AVP immediately loses what
| could be its total awesomeness to this basic fact. A perfectly
| fixed perspective will never quite manage to fool brains so
| used to dealing with these micro-movements.
| dmarcos wrote:
| Yeah, parallax, reflections, and shadows are as important as
| stereo. We've always been sold that stereo = 3D, but it's just
| one among many cues the brain relies on.
| Stevvo wrote:
| Each frame was between 200 and 300 MB, at a much lower
| resolution than the AVP. The storage and bandwidth required
| are a bit wild.
| evan_ wrote:
| I have seen that, and I came close to buying one of those
| Lytro light field cameras so many times (but thankfully
| restrained myself). Light field seemed like a huge obvious
| "way of the future" thing in the 2010s but with the benefit
| of hindsight it did not exactly seem to have changed the
| world.
| julianeon wrote:
| I was interested to see that a co-founder is Stanford CS prof
| Fei-Fei Li. I'm reading her nonfiction book now, "The Worlds I
| See," about her experience with AI; she testified before Congress
| about it.
| doctorpangloss wrote:
| Their bet is that XYZ can generalize from Unreal and NVIDIA Isaac
| recordings.
|
| Is XYZ diffusion-transformers? Or is XYZ Chameleon? Or some novel
| architecture?
|
| It seems to take even the absolute fastest teams 7 months to
| develop a first version of a model. And it also seems that
| models are like babies: 9 moms do not produce a model in 1
| month.
|
| The tough thing is that it may be possible to develop a great
| video model with DiTs for $220m; or it may be possible to develop
| a great video model with Chameleon for $1b; but if it's 3D +
| time, will it be too expensive for them to do?
|
| The craziest thing to me is that these guys are super talented,
| but they might not have _enough_ money!
| byearthithatius wrote:
| Then they need to sell _billions_ of dollars worth of these
| worlds to ... game studios? In order for this valuation to make
| sense they need to convince the majority of major game studios
| to spend all their world creation budget solely on this
| company. Seems unrealistic but I guess only time can tell.
| dmarcos wrote:
| If they fulfill the mission, it will apply to domains beyond
| games: movies, robotics, architecture...
| byearthithatius wrote:
| Good point, that's fair, there are more use cases. I don't know
| about architecture; typically you want structure and
| determinism there rather than probabilistic generation. Overall
| this is very cool. I like the consistency it has and it does
| generally amaze me nonetheless. But you've got to admit selling
| a billion dollars of anything is really hard. That is three
| times the budget of the highest-budget movie ever created
| (Avengers: Endgame, at $356 million)! It is almost the ENTIRE
| budget of the biggest game ever, Grand Theft Auto VI.
| recursive wrote:
| I couldn't get the "tap to interact" panels to work. No mouse
| events had any effect. I had to take it very literally: first I
| had to drag my browser to my laptop's screen, which did enable
| me to literally tap the screen.
| jcjohns wrote:
| That's weird, what device are you using?
|
| (I'm part of World Labs)
| recursive wrote:
| Firefox on Windows 11 on a Lenovo Thinkpad with a touch
| screen.
| FergusArgyll wrote:
| Not working for me either, Chrome win 11
| xnx wrote:
| Cool, but not as impressive as https://cat-4d.github.io/ to me.
| wordpad25 wrote:
| Does this also work for wide shots, like a landscape? All the
| examples are focused on specific objects.
| robblbobbl wrote:
| The idea is good but the result must be better
| tnolet wrote:
| This is more like the moving "still" pictures in a Harry Potter
| movie. Not a 3D world.
| dmarcos wrote:
| From some angles you can see there's a gaussian splat / point
| cloud representation underneath, so there's definitely a 3D
| representation. But yeah, the navigable volume is limited at
| the moment. It will improve.
| vinkelhake wrote:
| This is neat, I guess. Maybe I'm just blasé from seeing yet
| another AI demo where I'm supposed to fill in the blanks and
| come up with ways to make the tech actually _useful_.
|
| The "Step into Paintings" section cracked me up. As soon as you
| pan away from the source material, the craziness of the model is
| on full display. So sure, I can experience iconic pieces of art
| in a new way, it's just not a _good experience_.
| jsheard wrote:
| Who knew that Hopper's _Nighthawks_ had a biblically-accurate
| table and chairs just out of frame?
| lacoolj wrote:
| So there's a bunch of potential here, but how long did each of
| these take the model to generate, and what hardware was used
| for it?
| thrance wrote:
| Yet another of those AI image-to-grotesque-interactive-video
| models, marketed as a "3D world from scratch!".
|
| Can you use this "3D world" with Blender, Unity, or anything
| else? Can you even do anything remotely useful with it?
| dmarcos wrote:
| You can definitely mix gaussian splats and "traditional"
| meshes.
|
| Splats are very new, and there are still many things to figure
| out: relighting, animation, interactivity.
| ValentinA23 wrote:
| WASD isn't accessible for those of us who have the unfortunate
| disability of not using a QWERTY keyboard. If your project
| isn't a competitive FPS, arrows are fine.
| aleph_minus_one wrote:
| You can install multiple keyboard layouts in your OS. Many
| users do this.
| jcjohns wrote:
| Arrow keys also work now, thanks for the feedback!
| PeterCorless wrote:
| Ugh. The AI-generated rear views are nowhere near the level of
| detail of even the uncanny-valley foreground images. This is
| not really generating a "3D 'world'" so much as extrapolating a
| 360° view from a single scene. There's no sense to the
| architecture or flora. Staircases that lead nowhere.
|
| It's more hype-cycle nonsense. But they'll pour billions into
| this rather than pay human artists what they're worth.
| Falimonda wrote:
| Too many demos loading on the site at the same time makes it
| unusable
| wkat4242 wrote:
| This is amazing. 3D content generation is so time-consuming...
| Vanit wrote:
| I'm keen to drop in a few PSX-era Final Fantasy backgrounds to
| see what it does!
| bastloing wrote:
| What a great start, it's only going to get better from here!
| iamleppert wrote:
| Is a 2D image really the best input primitive for 3D world
| construction? As a user, I'd prefer to have 3D primitives (plane,
| sphere, mesh) as tools when building my worlds.
| cchance wrote:
| People complaining that it's a small area: lol, my man, this is
| fucking insane. I know AI is starting to get normalized, but
| they converted an image into a 3D world! Even if it's 1ft x
| 1ft, it's still amazing.
| albtaiuti wrote:
| It looks like they're basing the infilling on 360° photos /
| videos. That's why you can't walk around freely: the inpainting
| must be done from the center of the sphere.
| amelius wrote:
| Did anyone try it on famous paintings?
___________________________________________________________________
(page generated 2024-12-02 23:00 UTC)