[HN Gopher] AI video you can watch and interact with, in real-time
___________________________________________________________________
AI video you can watch and interact with, in real-time
Author : olivercameron
Score : 161 points
Date : 2025-05-28 18:33 UTC (3 days ago)
(HTM) web link (experience.odyssey.world)
(TXT) w3m dump (experience.odyssey.world)
| bkmeneguello wrote:
| This is amazing! I think AI will completely replace the way
| we currently create and consume media. A well-written story,
| paired with an amazing graphics-generation AI, can be both
| interactive and surprising every time you watch it again.
| lwo32k wrote:
| Wait till the bills arrive.
| qingcharles wrote:
| Note that it isn't being created from whole cloth; it is trained
| on videos of the places and then generates the frames:
|
| "To improve autoregressive stability for this research preview,
| what we're sharing today can be considered a narrow distribution
| model: it's pre-trained on video of the world, and post-trained
| on video from a smaller set of places with dense coverage. The
| tradeoff of this post-training is that we lose some generality,
| but gain more stable, long-running autoregressive generation."
|
| https://odyssey.world/introducing-interactive-video
| thetoon wrote:
| Could probably be (semi?)automated to run on 3D models of
| places that don't exist. Even AI-built 3D models.
| mvdtnz wrote:
| What's the point? You already have the 3d models. If you want
| an interactive video just use the 3d models.
| jowday wrote:
| The paper they're basing this off already does this.
|
| https://diamond-wm.github.io/
| doug_durham wrote:
| I recognized the Santa Cruz Beach Boardwalk channel. It was
| exactly as I remember.
| netsharc wrote:
| Well, that felt like entering a dream on my phone. Fuzzy virtual
| environments generated by "a mind" based on its memory of real
| environments...
|
| I wonder if it'd break our brains more if the environment changes
| as the viewpoint changes, but doesn't change back (e.g. if
| there's a horse, you pan left, pan back right, and the horse is
| now a tiger).
| afro88 wrote:
| Our minds are used to that: dreams
| mortenjorck wrote:
| I kept expecting that to happen, but it apparently has some
| mechanism to persist context outside the user's FOV.
|
| In a way, that almost makes it more dreamlike, in that you have
| what feels like high local coherence (just enough not to
| immediately tip you off that it's a dream) that de-coheres over
| time as you move through it.
|
| Fascinatingly strange demo.
| mensetmanusman wrote:
| Exploring babel's library!
| Daisywh wrote:
| It's super cool. I keep thinking it kind of feels like dream
| logic. It looks amazing at first but I'm not sure I'd want to
| stay in a world like that for too long. I actually like when
| things have limits. When the world pushes back a bit and gives
| you rules to work with.
| aswegs8 wrote:
| Doesn't it have rules? I couldn't move past a certain point and
| hitting a wall made you teleport. Maybe I was just
| rationalizing random events, though.
| Tade0 wrote:
| I kept getting teleported to the start when I picked the
| world channel that showed some sort of well-lit catacombs.
|
| Eventually managed to leave the first room, but then got
| teleported somewhere else.
| booleandilemma wrote:
| Sounds like something a deity would say before creating a
| universe such as ours.
| spzb wrote:
| This seems like a staggeringly inefficient way to develop what is
| essentially an FPS engine.
| xpl wrote:
| Only at first glance. It can easily render things that would be
| very hard to implement in an FPS engine.
|
| What AI can dream up in milliseconds could take hundreds of
| human hours to encode using traditional tech (meshes, shaders,
| ray tracing, animation, logic scripts, etc.), and it still
| wouldn't look as natural and smooth as AI renderings -- I refer
| to the latest developments in video synthesis like Google's Veo
| 3. Imagine it as a game engine running in real time.
| ivape wrote:
| Why do you think this is so hard, even for technical people
| here, to make the inductive leap on this one? Is it that
| close to magic? The AI is rendering pillars and also
| determining collision detection on it. As in, no one went in
| there and selected a bunch of pillars and marked it as a
| barrier. That means in the long run, I'll be able to take
| some video or pictures of the real world and have it be game
| level.
| ffsm8 wrote:
| Because that's been a thing for years already - and works
| way better than this research does.
|
| Unreal Engine 5 has been demoing these features for a while
| now; I heard about it in early 2020 IIRC, but techniques
| like Gaussian splatting predate it.
|
| I have no experience in either of these, but I believe
| MegaScans and RealityCapture are two examples doing this.
| And the last Nanite demo touched on it, too.
| ivape wrote:
| I'm sorry, what's a thing? Unreal Engine 5 does those
| things with machine learning? Imagine someone shows me
| Claude generating a full React app, and I say "well, you
| see, React apps have always been a thing". The thing
| we're talking about is AI, nothing else. That there is no
| other _thing_ is the whole point of the AI hype.
| frotaur wrote:
| What they meant is that 3D scanning real places, and
| translating them into 3D worlds with collision already
| exists, and provides much, much better results than the
| AI videos here. Additionally, it does not need what is
| likely hours of random footage wandering in the space,
| just a few minutes of scans.
| whamlastxmas wrote:
| I think an actual 3D engine with AI that can make new high
| quality 3D models and environments on the fly would be the
| pinnacle. And maybe even add new game and control mechanics on
| the fly.
| johanyc wrote:
| Yeah. Why throw away a perfectly fine 3d engine
| qwerty59 wrote:
| This is cool. I think there is a good chance that this is the
| future of videogames.
| yieldcrv wrote:
| Now this is an Assassin's Creed memory machine that I can get
| behind
| Traubenfuchs wrote:
| Thank you for this experience. Feels like you are exploring a
| dream.
|
| I LOVE dreamy AI content. That stuff where everything turned into
| dogs for example.
|
| As AI is maturing, we are slowly losing that in favor of boring
| realism and coherence.
| akomtu wrote:
| Interactive ads and interactive porn are the AI killer apps we
| miss so much.
| hx8 wrote:
| I found an interesting glitch where you could never actually
| reach a parked car: as you moved forward, the car also moved. It
| looked a lot like traffic moving through Google Street View.
| Hobadee wrote:
| Yeah. I found the same thing. Cars would disappear in front of
| me, then I reached the end of the world and it reset me. I'm
| not sure I believe this is AI rather than some crappy Street
| View interface.
| gcanyon wrote:
| Am I the only one stuck with a black screen as the audio plays?
| vouaobrasil wrote:
| I think this step towards a more immersive virtual reality can
| actually be dangerous. A lot of intellectual types might disagree
| but I do think that creating such immersion is a dangerous thing
| because it will reduce the value people place on the real world
| and especially the natural world, making them even less likely to
| care if big corporations screw it up with biospheric degradation.
|
| It seems like it has a high chance of leading to even more
| narcissism as well because we are reducing our dependence on
| others to such a degree that we will care about others less and
| less, which is something that has already started happening with
| increasingly advanced interactive technology like AI.
| TaupeRanger wrote:
| This is why we never see any alien life. When they reach a
| sufficient level of technology, they realize the virtual/mental
| universe is much more compelling and fun than the boring rule-
| bound physical one.
| mvdtnz wrote:
| People were saying literally the exact same thing when those
| crappy VR headsets were all the rage. I think we're ok.
| dragonwriter wrote:
| > I think this step towards a more immersive virtual reality can
| actually be dangerous
|
| I don't think it's a step toward that; I think this is literally
| trained using techniques for generating more immersive virtual
| reality that already exist _and_ take less compute, to
| produce a more computationally expensive and less accurate AI
| version.
|
| At least, that's what every other demo of a real-time
| interactive AI world model has been, and they aren't trumpeting
| any clear new distinction.
| Morizero wrote:
| I feel like we're so close to remaking the classic Rob Schneider
| full motion video game "A Fork in the Tale"
|
| https://m.youtube.com/watch?v=YXPIv7pS59o
| godelski wrote:
| That felt so wrong AND someone is cheating here. This felt really
| suspicious...
|
| I got to the graffiti world and there were some stairs right next
| to me. So I started going up them. It felt like I was walking
| forward and the stairs were pushing under me until I just got
| stuck. So I turned to go back down, and halfway around everything
| morphed and I ended up back down at the ground level where I
| originally was. I was teleported. That's why I feel like
| something is cheating here. If we had mode collapse, I'm not sure
| how we should be able to completely recover our entire
| environment. Not unless the model is building mini worlds with
| boundaries. It was like the out-of-bounds teleportation you get
| in some games, but way more fever-dream-like. That's not what we
| want from these systems; we don't want to just build a giant,
| poorly compressed videogame, we want continuous generation. If
| you have mode collapse and recover, it should recover to
| somewhere new, not where you've been. At least this is what makes
| me highly suspicious.
| AnotherGoodName wrote:
| Yes, the thing that got me was that I went through the channels
| multiple times (multiple browser sessions). The channels are
| the same every time (the numbers don't align to any navigation,
| though - flip back and forth between two numbers and you'll
| just hit a random channel every time - don't be fooled by that).
| Every object is in the same position and the layout is the
| same.
|
| What makes this AI generated over just rendering a generated 3D
| scene?
|
| It may seem impressive to have no glitches (often in AI-
| generated works you can turn around a full rotation and what's
| in front of you isn't what was there originally), but here it
| just acts as a fully modelled 3D scene rendering at low
| resolution. I can't even walk outside of certain bounds, which
| doesn't make sense if this really is generated on the fly.
|
| This needs a lot of skepticism, and I'm surprised you're the
| first to comment on the lack of actual generation here. It's a
| series of static scenes rendered at low fidelity with limited
| bounds.
| AnotherGoodName wrote:
| Ok, playing with this more, there are very subtle differences
| between sessions. As in, there is some hallucination here, with
| certain small differences.
|
| I think what's happening is that this is AI generated but is
| very, very overfitted to real-world 3D scenes. The AI is
| almost rendering exactly a real-world scene and not much
| more. You can't travel out of bounds or the model stops
| working, since it's so overfitted to these scenes. The
| overfitting solves hallucinations, but it also makes it almost
| indistinguishable from pre-modelled 3D scenes.
| echelon wrote:
| Odyssey Systems is six months behind way more impressive
| demos. They're following in the footsteps of this work:
|
| - Open Source Diamond WM that you can run on consumer
| hardware [1]
|
| - Google's Genie 2 (way better than this) [2]
|
| - Oasis [3]
|
| [1] https://diamond-wm.github.io/
|
| [2] https://deepmind.google/discover/blog/genie-2-a-large-
| scale-...
|
| [3] https://oasis.decart.ai/welcome
|
| There are a lot of papers and demos in this space. They
| have the same artifacts.
| olivercameron wrote:
| All of this is really great work, and I'm excited to see
| great labs pushing this research forward.
|
| From our perspective, what separates our work is two
| things:
|
| 1. Our model is able to be experienced by anyone today,
| and in real-time at 30 FPS.
|
| 2. Our data domain is real-world, meaning learning life-
| like pixels and actions. This is, from our perspective,
| more complex than learning from a video game.
| ollin wrote:
| I think the most likely explanation is that they trained a
| diffusion WM (like DIAMOND) on video rollouts recorded from
| within a 3D scene representation (like NeRF/GS), with some
| collision detection enabled.
|
| This would explain:
|
| 1. How collisions / teleportation work and why they're so
| rigid (the WM is mimicking hand-implemented scene-bounds
| logic)
|
| 2. Why the scenes are static and, in the case of should-be-
| dynamic elements like water/people/candles, blurred (the WM
| is mimicking artifacts from the 3D representation)
|
| 3. Why they are confident that "There's no map or explicit
| 3D representation in the outputs. This is a diffusion
| model, and video in/out"
| https://x.com/olivercameron/status/1927852361579647398 (the
| final product is indeed a diffusion WM trained on videos,
| they just have a complicated pipeline for getting those
| training videos)
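|
| If that's right, the data side would look roughly like this
| hypothetical sketch: roll a camera through a pre-built scene,
| apply hand-coded bounds logic, and record (frame, action) pairs
| for the world model to imitate (all names here are made up):
|
|     import random
|     import numpy as np
|
|     BOUNDS = np.array([[-5.0, 5.0], [-5.0, 5.0]])  # walkable x/z
|     START = np.array([0.0, 0.0])
|     MOVES = {"forward": (0.0, 0.2), "back": (0.0, -0.2),
|              "left": (-0.2, 0.0), "right": (0.2, 0.0)}
|
|     def render_view(pos):
|         # Stand-in for rendering the NeRF/GS scene at this pose.
|         return np.zeros((360, 640, 3), dtype=np.uint8)
|
|     def rollout(steps=300):
|         pos, pairs = START.copy(), []
|         for _ in range(steps):
|             name = random.choice(list(MOVES))
|             nxt = pos + np.array(MOVES[name])
|             # Hand-implemented bounds logic: stepping outside the
|             # scene teleports back to start -- exactly the rigid
|             # behavior a WM trained on these videos would mimic.
|             if np.any(nxt < BOUNDS[:, 0]) or np.any(nxt > BOUNDS[:, 1]):
|                 nxt = START.copy()
|             pairs.append((render_view(pos), name, render_view(nxt)))
|             pos = nxt
|         return pairs  # (prev_frame, action, next_frame) tuples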
| christianqchung wrote:
| Is it possible that this behavior is a result of training
| on Google Maps or something similar? I tried to walk off a
| bridge and got completely stuck, which is the only explanation
| I can think of for that, other than them not having first-
| person video of people walking off bridges.
| bufferoverflow wrote:
| Same. Got to the corner of the house, turned around, and got
| teleported back to the starting point.
|
| I call BS.
| magnat wrote:
| It feels like interpolated Street View imagery. There is one
| scene with two people between cars in a parking lot. It is the
| only one I have found that has objects you would expect to
| change over time. When exploring the scene, those people
| sometimes disappear altogether and sometimes teleport around,
| as they would when exploring Street View panoramas. You can
| clearly tell when you are switching between photos taken a few
| seconds apart.
| jowday wrote:
| It's essentially this paper, but applied to a bunch of video
| recordings of different real-world locations instead of
| Counter-Strike maps. Each channel just changes the
| location.
|
| https://diamond-wm.github.io/
| olivercameron wrote:
| Hi! CEO of Odyssey here. Thanks for giving this a shot.
|
| To clarify: this is a diffusion model trained on lots of video,
| that's learning realistic pixels and actions. This model takes
| in the prior video frame and a user action (e.g. move forward),
| with the model then generating a new video frame that resembles
| the intended action. This loop happens every ~40ms, so real-
| time.
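|
| For intuition, here's a minimal Python sketch of that frame-by-
| frame loop (every name below is an illustrative stand-in, not
| our actual model or API):
|
|     import time
|     import numpy as np
|
|     FRAME_SHAPE = (360, 640, 3)  # illustrative resolution
|     FRAME_BUDGET = 0.040         # ~40ms per frame for real-time
|
|     def next_frame(prev_frame, action):
|         # Stand-in for the diffusion world model: it generates a
|         # new frame conditioned on the prior frame + user action.
|         return prev_frame  # placeholder
|
|     def read_action():
|         return "move_forward"  # placeholder for user input
|
|     frame = np.zeros(FRAME_SHAPE, dtype=np.uint8)  # seed frame
|     for _ in range(300):
|         start = time.monotonic()
|         frame = next_frame(frame, read_action())  # output fed back
|         # display(frame) would go here
|         time.sleep(max(0.0, FRAME_BUDGET - (time.monotonic() - start)))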
|
| The reason you're seeing similar worlds with this production
| model is that one of the greatest challenges of world models is
| maintaining coherence of video over long time periods,
| especially with diverse pixels (i.e. not a single game). So, to
| increase reliability for this research preview--meaning
| multiple minutes of coherent video--we post-trained this model
| on video from a smaller set of places with dense coverage. With
| this, we lose generality, but increase coherence.
|
| We share a lot more about this in our blog post here
| (https://odyssey.world/introducing-interactive-video), and
| share outputs from a more generalized model.
|
| > One of the biggest challenges is that world models require
| autoregressive modeling, predicting future state based on
| previous state. This means the generated outputs are fed back
| into the context of the model. In language, this is less of an
| issue due to its more bounded state space. But in world models
| --with a far higher-dimensional state--it can lead to
| instability, as the model drifts outside the support of its
| training distribution. This is particularly true of real-time
| models, which have less capacity to model complex latent
| dynamics.
|
| > To improve autoregressive stability for this research
| preview, what we're sharing today can be considered a narrow
| distribution model: it's pre-trained on video of the world, and
| post-trained on video from a smaller set of places with dense
| coverage. The tradeoff of this post-training is that we lose
| some generality, but gain more stable, long-running
| autoregressive generation.
|
| > To broaden generalization, we're already making fast progress
| on our next-generation world model. That model--shown in raw
| outputs below--is already demonstrating a richer range of
| pixels, dynamics, and actions, with noticeably stronger
| generalization.
|
| Let me know any questions. Happy to go deeper!
| jowday wrote:
| Why are you going all in on world models instead of basing
| everything on top of a 3D engine that could be manipulated /
| rendered with separate models? If a world model was truly
| managing to model a manifold of a 3D scene, it should be
| pretty easy to extract a mesh or SDF from it and drop that
| into an engine where you could then impose more concrete
| rules or sanity check the output of the model. Then you could
| actually model player movement inside of the 3D engine
| instead of trying to train the world model to accept any kind
| of player input you might want to do now or in the future.
|
| Additionally, I'm curious what exactly the difference is
| between the new mode of storytelling you're describing and
| something like a CRPG or visual novel - is your hope that
| you can just bake absolutely everything into the world model
| instead of having to implement systems for dialogue/camera
| controls/rendering/everything else that's difficult about
| working with a 3D engine?
| olivercameron wrote:
| Great questions!
|
| > Why are you going all in on world models instead of
| basing everything on top of a 3D engine that could be
| manipulated / rendered with separate models?
|
| I absolutely think there's going to be super cool startups
| that accelerate film and game dev as it is today, inside
| existing 3D engines. Those workflows could be made much
| faster with generative models.
|
| That said, our belief is that model-imagined experiences
| are going to become a totally new form of storytelling, and
| that these experiences might not be free to be as weird and
| whacky as they could because of heuristics or limitations
| in existing 3D engines. This is our focus, and why the
| model is video-in and video-out.
|
| Plus, you've got the very large challenge of learning a
| rich, high-quality 3D representation from a very small pool
| of 3D data. The volume of 3D data is just so small,
| compared to the volumes generative models really need to
| begin to shine.
|
| > Additionally, I'm curious what exactly the difference is
| between the new mode of storytelling you're describing and
| something like a CRPG or visual novel
|
| To be clear, we don't yet know what shape these new
| experiences will take. I'm hoping we can avoid an awkward
| initial phase where these experiences resemble traditional
| game mechanics too much (although we have much to learn
| from them), and just fast-forward to enabling totally new
| experiences that just aren't feasible with existing
| technologies and budgets. Let's see!
|
| > is your hope that you can just bake absolutely everything
| into the world model instead of having to implement systems
| for dialogue/camera controls/rendering/everything else
| that's difficult about working with a 3D engine?
|
| Yes, exactly. The model just learns better this way
| (instead of breaking it down into discrete components) and
| I think the end experience will be weirder and more
| wonderful for it.
| jowday wrote:
| > Plus, you've got the very large challenge of learning a
| rich, high-quality 3D representation from a very small
| pool of 3D data. The volume of 3D data is just so small,
| compared to the volumes generative models really need to
| begin to shine.
|
| Isn't the entire aim of world models (at least, in this
| particular case) to learn a very high quality 3D
| representation from 2D video data? My point is that if
| you manage to train a navigable world model for a
| particular location, that model has managed to fit a very
| high quality 3D representation of that location. There's
| lots of research dealing with NeRFs that demonstrates how
| you can extract these 3D scenes as meshes once a model
| has managed to fit the scene. (NeRFs are another great
| example of learning a high-quality 3D representation from
| sparse 2D data.)
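|
| (For reference, that extraction step is usually just sampling
| the fitted density field on a grid and running marching cubes
| on it; a rough sketch follows, with a placeholder sphere
| standing in for a trained NeRF's density network:)
|
|     import numpy as np
|     from skimage import measure
|
|     def density(x, y, z):
|         # Placeholder for querying a trained NeRF's density MLP.
|         return 1.0 - np.sqrt(x**2 + y**2 + z**2)  # unit sphere
|
|     grid = np.linspace(-1.5, 1.5, 64)
|     xs, ys, zs = np.meshgrid(grid, grid, grid, indexing="ij")
|     volume = density(xs, ys, zs)
|
|     # Marching cubes turns the sampled field into a triangle mesh.
|     verts, faces, normals, _ = measure.marching_cubes(volume, 0.0)
|     print(f"{len(verts)} vertices, {len(faces)} triangles")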
|
| >That said, our belief is that model-imagined experiences
| are going to become a totally new form of storytelling,
| and that these experiences might not be free to be as
| weird and whacky as they could because of heuristics or
| limitations in existing 3D engines. This is our focus,
| and why the model is video-in and video-out.
|
| There's a lot of focus in the material on your site about
| the models learning physics by training on real world
| video - wouldn't that imply that you're trying to
| converge on a physically accurate world model? I imagine
| that would make weirdness and wackiness rather difficult.
|
| > To be clear, we don't yet know what shape these new
| experiences will take. I'm hoping we can avoid an awkward
| initial phase where these experiences resemble
| traditional game mechanics too much (although we have
| much to learn from them), and just fast-forward to
| enabling totally new experiences that just aren't
| feasible with existing technologies and budgets. Let's
| see!
|
| I see! Do you have any ideas about the kinds of
| experiences that you would want to see or experience
| personally? For me it's hard to imagine anything that
| substantially deviates from navigating and interacting
| with a 3D engine, especially given it seems like you want
| your world models to converge to be physically realistic.
| Maybe you could prompt it to warp to another scene?
| huvarda wrote:
| > one of the greatest challenges of world models is
| maintaining coherence of video over long time periods
|
| To be honest, most of the appeal to me of this type of thing
| _is_ the fact that it gets incoherent and morph-y, and
| rotating 360 degrees can completely change the scenery. It's
| a trippy, dreamlike experience, whereas this kind of felt like
| a worse version of existing stuff.
| wewewedxfgdf wrote:
| What's the business model?
| Animats wrote:
| > Not unless the model is building mini worlds with boundaries.
|
| Right. I was never able to get very far from the starting
| point, and kept getting thrown back to the start. It looks like
| they generated a little spherical image, and they're able to
| extrapolate a bit from that. Try to go through a door or reach
| a distant building, and you don't get there.
| throwaway314155 wrote:
| I mean... https://news.ycombinator.com/item?id=44121671
| informed you of exactly why this happens a whole hour before
| you posted this comment and the creator is chatting with people
| in the comments. I get that you feel personally cheated, but I
| really don't think anyone was deliberately trying to cheat you.
| In light of that, your comment (and I only say this because
| it's the top comment on this post) is effectively a
| stereotypical "who needs Dropbox" level of shallow dismissal.
| arvindh-manian wrote:
| Related (and quite cool) -- Minecraft generated on the fly, which
| you can interact with:
| https://news.ycombinator.com/item?id=42014650
| lerp-io wrote:
| going outside breaks the ai lol
| deadbabe wrote:
| It's pointless to do this with real world places. Why not do it
| for TV shows or a photograph? You could walk around inside and
| explore the scenes.
| abe94 wrote:
| very cool - what was the hardest part of building this?
| olivercameron wrote:
| If I had to choose one, I'd easily say maintaining video
| coherence over long periods of time. The typical failure case
| of world models attempting to generate diverse pixels
| (i.e. beyond a single video game) is that they degrade to a
| mush of incoherent pixels after 10-20 seconds of video.
|
| We talk about this challenge in our blog post here
| (https://odyssey.world/introducing-interactive-video). There's
| specifics in there on how we improved coherence for this
| production model, and our work to improve this further with our
| next-gen model. I'm really proud of our work here!
|
| > Compared to language, image, or video models, world models
| are still nascent--especially those that run in real-time. One
| of the biggest challenges is that world models require
| autoregressive modeling, predicting future state based on
| previous state. This means the generated outputs are fed back
| into the context of the model. In language, this is less of an
| issue due to its more bounded state space. But in world models
| --with a far higher-dimensional state--it can lead to
| instability, as the model drifts outside the support of its
| training distribution. This is particularly true of real-time
| models, which have less capacity to model complex latent
| dynamics. Improving this is an area of research we're deeply
| invested in.
|
| In second place would absolutely be model optimization to hit
| real-time. That's a gnarly problem, where you're delicately
| balancing model intelligence, resolution, and frame-rate.
| jowday wrote:
| This is pretty much the same thing as those models that baked
| dust2 into a diffusion model then used the last few frames as
| context to continue generating - same failure modes and
| everything.
|
| https://diamond-wm.github.io/
| amelius wrote:
| Would be more interesting with people in it.
| olivercameron wrote:
| I agree! Check out outputs from our next-gen world model
| here (https://odyssey.world/introducing-interactive-video),
| featuring richer pixels and dynamics.
| chenxi9649 wrote:
| Do you personally feel like scaling this approach is going to be
| the end game for generating navigable worlds?
|
| I.e., as opposed to first generating a 3D env and then doing some
| sort of img2img on top of it?
| andrewstuart wrote:
| Love the atmosphere.
| fortran77 wrote:
| I'm unable to navigate anywhere. I'm on a laptop with a
| touchscreen and a trackpad. I clicked, double clicked, scrolled,
| and tried everything I could think of and the views just hovered
| around the same spot.
| lolinder wrote:
| This is similar to the Minecraft version of this from a few
| months back [0], but it does seem to have a better time keeping a
| memory of what you've already seen, at least for a bit. Spinning
| in circles doesn't lose your position quite as easily, but I did
| find that exiting a room and then turning back around and re-
| entering leaves you with a totally different room than you
| exited.
|
| [0] _Minecraft with object impermanence_ (229 points, 146
| comments) https://news.ycombinator.com/item?id=42762426
| jedberg wrote:
| In playing with this it was unclear to me how this differs from a
| pre-programmed 3D world with bitmapped walls. What is the AI
| adding that I wouldn't get otherwise?
| exe34 wrote:
| This reminds me of the scapes in Diaspora (by Greg Egan).
| squiffy wrote:
| Can you say where the underground cellar with the red painting
| is? It's compelling.
___________________________________________________________________
(page generated 2025-05-31 23:00 UTC)