[HN Gopher] Oasis: A Universe in a Transformer
       ___________________________________________________________________
        
       Oasis: A Universe in a Transformer
        
       Author : ChadNauseam
       Score  : 190 points
       Date   : 2024-11-01 07:02 UTC (15 hours ago)
        
 (HTM) web link (oasis-model.github.io)
 (TXT) w3m dump (oasis-model.github.io)
        
       | gunalx wrote:
        | Really cool tech demo. What impressed me most is the inference
        | speed. But I don't really see any use for this unless there's a
        | way to store world state to avoid the issue of it forgetting what
        | it just generated.
        
       | jmartin2683 wrote:
       | Why? Seems like a very expensive way to vaguely clone a game.
        
         | haccount wrote:
         | It's an early demo of interactive realtime inference but it
         | appears to have a promise of promptable game worlds and
         | mechanisms. Or "scriptable dynamic imagination" if you will.
         | 
         | The answer to "why?" when DeepDream demoed hallucinated dog
         | faces in 2015 was contemporary diffusion models.
        
           | dartos wrote:
           | > it appears to have a promise of promptable game worlds
           | 
           | Is it a world if there's no permanence?
           | 
           | We've seen demos like this for a while now (granted not as
           | fast) but the core problem is continuity. (Or some kind of
           | object permanence)
           | 
           | It's a problem for image generators as well.
           | 
            | It'd be more interesting if that were any closer to being
            | solved than it is to have a real-time Minecraft screenshot
            | generator.
           | 
           | I may have missed it, but I didn't see anything about
           | prompting. I'd be surprised if this model could generalize
           | beyond Minecraft at all.
        
             | thirdacc wrote:
             | >It's a problem for image generators as well.
             | 
             | It was, about a year ago. It's a solved problem.
        
           | hnlmorg wrote:
           | The dog demo was introducing something new. This isn't.
           | 
           | I don't want to be negative about someone else's project but
           | I can completely understand why people are underwhelmed by
           | this.
           | 
           | What I think will be the real application for AI in gaming
           | isn't creating poorer versions of native code, it will be
           | creating virtual novels that evolve with the player. Where
           | characters are actual AI rather than following a predefined
           | script. Where you, as the player, can change and shape the
            | story as you wish. Think Star Trek Holodeck "holo-novels", or
            | MMORPGs that can be played fully offline.
           | 
            | Rendering the pixels is possibly the worst application for AI
            | at this stage because AI lacks reliable frame-by-frame
            | continuity, rendering speed, and an understanding of basic
            | physics, all of which are the bare minimum for any modern game
            | engine.
        
         | int_19h wrote:
         | It also seems like a pretty decent way to investigate emergent
         | world models in NNs.
        
       | blixt wrote:
       | Super cool, and really nice to see the continuous rapid progress
       | of these models! I have to wonder how long-term state (building a
       | base and coming back later) as well as potentially guided state
       | (e.g. game rules that are enforced in traditional code, or
       | multiplayer, or loading saved games, etc) will work.
       | 
       | It's probably not by just extending the context window or making
       | the model larger, though that will of course help, because
       | fundamentally external state and memory/simulation are two
       | different things (right?).
       | 
       | Either way it seems natural that these models will soon be used
       | for goal-oriented imagination of a task - e.g. imagine a computer
       | agent that needs to find a particular image on a computer, it
       | would continuously imagine the path between what it currently
       | sees and its desired state, and unlike this model which takes
       | user input, it would imagine that too. In some ways, to the best
       | of my understanding, this already happens with some robot control
       | networks, except without pixels.
        
         | aithrowawaycomm wrote:
         | There's not even the slightest hint of state in this demo: if
         | you hold "turn left" for a full rotation you don't end up where
         | you started. After a few rotations the details disappear and
         | you're left in the middle of a blank ocean. There's no way this
         | tech will ever make a playable version of Mario, let alone
         | Minecraft.
        
           | blixt wrote:
           | There's plenty of evidence of state, just a very short-term
           | memory. Examples:
           | 
            | - The inventory bar is mostly consistent throughout play
           | 
           | - State transitions in response to key presses
           | 
           | - Block breakage over time is mostly consistent
           | 
           | - Toggling doors / hatches works as expected
           | 
           | - Jumping progresses with correct physics
           | 
           | Turning around and seeing approximately the same thing you
           | saw a minute ago is probably just a matter of extending a
           | context window, but it will inherently have limits when you
           | get to the scale of an entire world even if we somehow can
           | make context windows have excellent compression of redundant
            | data (which would greatly help LLM transformers too). And I
            | guess what I'm mostly wondering about is how you would
            | synchronize this state with a ground truth so that it can be
            | shared between different instances of the agent, or other
            | non-agent entities.
           | 
           | And again, I think it's important to remember games are just
           | great for training this type of technology, but it's probably
           | more useful in non-game fields such as computer automation,
           | robot control, etc.
        
             | bongodongobob wrote:
             | Nah, it doesn't even track which direction you're looking.
             | Looking straight ahead, walk into some sugar cane so your
             | whole screen is green. Now look up. It thinks you were
             | looking down.
        
               | blixt wrote:
               | I guess it comes down to your definition of state. I'm
               | not saying there's enough state for this to be playable,
                | but there is clearly state, and I think it's worth
                | pointing out how impressive this model's temporal
                | consistency and coherence are, considering that not long
                | ago the state of the art here rapidly decohered into
                | completely noisy pixels.
        
               | FeepingCreature wrote:
               | In other words: there's enough state now that the lack of
               | state _stands out._ It works well enough for its failures
               | to be notable.
        
               | bongodongobob wrote:
               | I guess if you consider knowing what color the pixels
               | were in the last frame "state". That's not a definition
                | anyone would use though. Spinning around and having the
                | world continuously regenerate, or looking at the sky and
                | back down and having it regenerate randomly, is the
                | opposite of state. It's complete incoherence.
        
             | naed90 wrote:
              | Hey, developer of Oasis here! You are very correct. A few
              | points:
              | 
              | 1. We trained the model on a context window of up to 30
              | seconds. The problem? It barely pays any attention to frames
              | beyond the last few. This makes sense given the loss
              | function used during training. We are now running many
              | different training runs to experiment with a better loss
              | function (and data mix) to solve this issue. You'll see
              | newer versions soon!
              | 
              | 2. In the long term, we believe the "ultimate" solution is
              | two models: one model that maintains game state, plus one
              | model that turns that state into pixels. Think of the first
              | model as something resembling an LLM that takes the current
              | state and user action and produces the new state, and the
              | second model as a diffusion model that maps from this state
              | to pixels. This would give the best of both worlds.
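              | 
              | A minimal sketch of that two-model split (all names and
              | shapes here are hypothetical, not the actual Oasis code;
              | the diffusion renderer is reduced to a plain decoder for
              | brevity):
              | 
              |     import torch
              |     import torch.nn as nn
              | 
              |     class StateModel(nn.Module):
              |         """Dynamics model: (state, action) -> next state."""
              |         def __init__(self, state_dim=512, action_dim=32):
              |             super().__init__()
              |             self.net = nn.Sequential(
              |                 nn.Linear(state_dim + action_dim, 1024),
              |                 nn.GELU(),
              |                 nn.Linear(1024, state_dim),
              |             )
              | 
              |         def forward(self, state, action):
              |             return self.net(torch.cat([state, action], dim=-1))
              | 
              |     class PixelDecoder(nn.Module):
              |         """Render world state -> RGB frame."""
              |         def __init__(self, state_dim=512, h=64, w=64):
              |             super().__init__()
              |             self.h, self.w = h, w
              |             self.net = nn.Linear(state_dim, 3 * h * w)
              | 
              |         def forward(self, state):
              |             return self.net(state).view(-1, 3, self.h, self.w)
              | 
              |     # One step: the dynamics model owns persistence, the
              |     # decoder only has to draw the current state.
              |     state_model, decoder = StateModel(), PixelDecoder()
              |     state = torch.zeros(1, 512)          # persistent world state
              |     action = torch.zeros(1, 32)          # e.g. keys + mouse deltas
              |     state = state_model(state, action)   # update state from input
              |     frame = decoder(state)               # render this frame's pixels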
        
               | throwaway314155 wrote:
               | This stuff is all fascinating to me from a computer
               | vision perspective. I'm curious - if you have a second
               | model tasked with learning just the game state - does
               | that mean you would be using info from the game itself
               | (say, via a mod or with the developer console) as
               | training data? Or is the idea that the model somehow
               | learns the state (and only the state) on its own as it
               | does here?
        
               | naed90 wrote:
               | That's a great question -- lots of experiments will be
                | going into the future versions of Oasis. There are quite a
               | few different possibilities here and we'll have to
               | experiment with them a lot.
               | 
               | The nice thing is that we can run tons of experiments at
               | once. For Oasis v1, we ran over 1000 experiments (end-to-
               | end training a 500M model) on the model arch, datamix,
               | etc., before we created the final checkpoint that's
               | deployed on the site. At Decart (we just came out of
               | stealth yesterday:
                | https://www.theinformation.com/articles/why-sequoias-shaun-m...)
                | we have 2 teams: Decart Infrastructure and
                | Decart Experiences. The first team provides insanely fast
                | infra for training/inference (writing everything from
                | scratch, from CUDA to redoing the Python garbage
                | collector) -- we are able to get a 500M model to converge
                | during training in ~20h instead of 1-2 weeks. Then,
                | Decart Experiences uses this infra to create these new
                | types of end-to-end "Generated Experiences".
        
           | golol wrote:
            | Between the first half and the last sentence of your post
            | there is a giant leap in reasoning.
        
             | blixt wrote:
             | Yeah probably, it remains to be seen if these models can
             | actually help guide a live session towards the goal. At
             | least it's been shown that these types of world models can
             | help a model become better at achieving a goal, in lieu of
             | a hard coded simulation environment or the real world, for
             | when those options are not tractable.
             | 
             | My favorite example is: https://worldmodels.github.io/ (Not
             | least of all because they actually simulate these
             | simplified world models in your browser!)
        
           | GaggiX wrote:
           | >There's no way this tech will ever make a playable version
           | of Mario
           | 
            | Wait a few months; if someone is willing to use their 4090 to
            | train the model, the technology is already here. If you can
            | play a level of Doom, then Mario should be even easier.
        
       | whism wrote:
       | Allow the user to draw into the frame buffer during play and feed
       | that back, and you could have something very interesting.
        
         | dartos wrote:
         | It'd probably break wildly since it's really hard to draw
         | Minecraft by hand.
        
       | thrance wrote:
       | > It's a video game, but entirely generated by AI
       | 
        | I ctrl-F'ed the webpage and saw 0 occurrences of "Minecraft".
        | Why? This isn't a video game, this is a poor copy of a real video
        | game whose name you didn't even bother to mention, let alone
        | credit.
        
         | dartos wrote:
          | Yeah, it is strange how they make the model sound like it can
          | generate any environment, but only show demos of the game with
          | the most available training data ever.
        
         | armchairhacker wrote:
         | It seems like an interesting attempt to get around legal
         | issues.
         | 
         | They can't say "Minecraft" because that's a Microsoft
         | trademark, but they _can_ use Minecraft images as training
         | data, because everyone (including Microsoft) is using
          | proprietary data to train diffusion models. There's the issue
          | that the outputs _obviously_ resemble Minecraft, but Microsoft
         | has its own problems with Bing and DALL-E generating images
         | that obviously resemble trademarked things (despite
         | guardrails).
        
           | chefandy wrote:
            | Avoiding legal _and ethical_ issues. This stuff was made by a
            | bunch of real people, and people still get their name in
            | movie and game credits even if they got a paycheck for it.
            | Microsoft shamelessly vacuuming up proprietary content didn't
            | change the norms of how people get credited in these mediums.
            | It's sad to see how thoughtlessly so many people using
            | generative AI disregard the models' source material as "data"
            | while the models (and therefore their creators) almost always
            | get prominently credited for putting in a tiny fraction of
            | the effort. The dubious ethical defense against crediting
            | source works -- that models learn about media the same way
            | humans do and adapt it to suit their purposes -- is
            | obliterated when a model is trained on one work to reproduce
            | that work. Equating this to generating an image on Midjourney
            | is a blatant example of a common practice: people want to get
            | credit for other people's work, but when it's time to take
            | responsibility the way a human artist would have to, "The
            | machine did it! It's not my responsibility!"
        
             | woah wrote:
             | The point of this paper is to demonstrate a method of
             | training an AI to output interactive game footage. They
             | could have trained it for similar results with DOOM videos.
             | Presumably the footage they trained on was OK to train on
             | (I don't think a video of someone playing a video game is
             | copyrighted by the video game's author), but they could
             | have used a variety of other games.
        
               | stared wrote:
               | Algorithms trained on DOOM, or using DOOM as a showcase,
               | mention DOOM.
        
               | rodiger wrote:
               | Surprisingly, Nintendo has a long history of copyright-
               | striking videos of folks playing their games.
               | 
                | https://www.ign.com/articles/2013/05/16/nintendo-enforces-co...
        
             | FeepingCreature wrote:
             | Do you really think this would be materially different if
             | they used Minetest? To be frank, nothing in Minecraft as a
             | _game_ (rather than the code) deserves intellectual
             | property protection; it copies games that came before and
             | was copied by games that came after. It is an excellent,
             | well-designed implementation of a very basic concept.
        
               | chefandy wrote:
               | > nothing in Minecraft as a game (rather than the code)
               | deserves intellectual property protection
               | 
               | > excellent, well-designed implementation
               | 
                | And there we see the problem laid bare. Excellent designs
                | that are well-executed are not worthless facets of the
                | _real product_. As we can see from Minecraft's success,
                | that is the _real product._ People play video games for
                | the experience, not to execute some logical construct
                | with a formal proof showing that it's fun. The reason
                | this demo uses Minecraft, as opposed to a Minecraft
                | knockoff, is that Minecraft is better, and they are
                | capitalizing on that. Even if that game is based on a
                | well-worn concept, the many varied design problems you
                | solve when making a game are harder than the development,
                | which is why damn near every game that starts open source
                | is a knockoff of something other people already designed.
                | It's not like Mojang was some marketing powerhouse that
                | knocked Infiniminer off its perch without merit.
        
         | kiloDalton wrote:
         | There is one mention of Minecraft in the second paragraph of
         | the Architecture section, "...We train on a subset of open-
         | source Minecraft video data collected by OpenAI[9]." I can't
         | say whether this was added after your comment.
        
         | stared wrote:
          | It is weird - compare and contrast with
          | https://diamond-wm.github.io/, which explicitly mentions
          | Counter-Strike.
         | 
         | When a scientific work uses some work and does not credit it,
         | it is academic dishonesty.
         | 
         | Sure, they _could_ have trained the model on a different
         | dataset. No matter which source was used, it should be cited.
        
       | brap wrote:
        | The waiting line is too long, so I gave up. Can anyone tell me:
        | are the pixels themselves generated by the model, or does it just
        | generate the environment, which is then rendered by "classical"
        | means?
        
         | yorwba wrote:
         | If it were to generate an environment rendered by classical
         | means, it would most likely have object permanence instead of
         | regenerating something new after briefly looking away:
         | https://oasis-model.github.io/3_second_memory.webp
        
         | yokto wrote:
         | It generates the pixels, including the blurry UI at the bottom.
        
         | naed90 wrote:
         | Every pixel is generated! User actions go in, pixels come out
         | -- and there is only a transformer in the middle :)
         | 
          | Why is this interesting? Today, not too interesting (Oasis v1
          | is just a POC). In the future (and by future I literally mean a
          | few months away -- wait for future versions of Oasis coming out
          | soon), imagine that every single pixel you see will be
          | generated, including the pixels you see as you're reading this
          | message. Why is that interesting? It's a new interface for
          | communication between humans and machines. It's the same reason
          | LLMs are interesting for chat: they give humans and machines a
          | way to interact in a form humans are used to (chat) -- here,
          | computers will be able to see the world as we do and show
          | things back to us in a way we are used to. TL;DR: imagine
          | telling your computer "create a pink elephant" and just seeing
          | it pop up in a game you're playing.
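          | 
          | A minimal sketch of what "actions in, pixels out" means at
          | inference time (hypothetical stand-in code, not the real Oasis
          | model):
          | 
          |     import torch
          | 
          |     def next_frame(recent_frames, action):
          |         # Stand-in for the trained transformer: Oasis runs a
          |         # diffusion transformer here; this stub just emits a
          |         # random 360x640 RGB frame.
          |         return torch.rand(3, 360, 640)
          | 
          |     frames = [torch.zeros(3, 360, 640)]   # seed frame
          |     actions = ["w", "w", "mouse_left"]    # made-up user inputs
          |     for action in actions:
          |         # Each frame is conditioned only on the last few frames
          |         # plus the current input -- there is no separate world
          |         # state anywhere in the loop.
          |         frames.append(next_frame(frames[-4:], action))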
        
       | GaggiX wrote:
       | Kinda hyped to see how this model (or a much bigger one) will run
       | on Etched's transformer ASIC, Sohu, if it ever comes out.
        
       | sigh_again wrote:
        | Millions of hours of the world's most popular game, to output a
        | blurry render with features matching a pre-alpha build, zero
        | object permanence, and a non-working inventory. Same trash as the
        | recent DOOM renders.
       | DOOM renders.
       | 
        | You could have rewritten the exact same game in less time and
        | with less energy, but seemingly empowering talentless hacks is a
        | better business model.
        
         | kookamamie wrote:
         | Agreed. Looks like utter garbage, while the tech could be
         | "groundbreaking".
        
         | dartos wrote:
         | It's a cool concept, but I'd be very surprised if it gets much
         | better than this.
         | 
         | "Playable" statistical models seem to miss the point of games
         | entirely
        
       | petersonh wrote:
       | Very cool - has a very dreamlike quality to it
        
       | 0xTJ wrote:
        | Seems like a neat idea, but too bad that the demo doesn't work
        | on Firefox.
        
         | naed90 wrote:
          | We really wanted it too! But WebRTC was giving us lots of
          | trouble on FF :( Trust me, most of the team here is on FF too,
          | and we're bummed we can't play it there haha
        
       | redblacktree wrote:
       | "If you were dreaming in Minecraft" is the impression that I get.
       | It feels very much like a dream with the lack of object
       | permanence. Also interesting is light level. If you stare at
       | something dark for a while or go "underwater" and get to the
       | point where the screen is black, it's difficult to get back to
       | anything but a black screen. (I didn't manage it in my one
       | playthrough)
       | 
       | Very odd sensation indeed.
        
       | vannevar wrote:
       | If anyone has ever read Tad Williams' Otherland series, this is
       | basically the core idea. "The dream that is dreaming us."
        
       | jiwidi wrote:
        | So they basically trained a model on Minecraft. This is not
        | generalist at all. It's not like the game comes from a prompt;
        | it probably comes from a bunch of fine-tuning and giant datasets
        | of Minecraft gameplay.
       | 
       | Would love to see some work like this but with world/games coming
       | from a prompt.
        
         | naed90 wrote:
         | wait for Oasis v2, coming out soon :) (Disclaimer: I'm from the
         | Oasis team)
        
       | xyzal wrote:
       | Maybe we should train models on Mario games to make Nintendo
       | fight for the "Good Cause".
        
       | duendefm wrote:
        | It's not a video game, it's a fast Minecraft screenshot simulator
        | where the prompt between each frame is the state of the input and
        | the previous frames, with something resembling coherence.
        
       | joshdavham wrote:
       | Incredible work! I think once we're able to solidly emulate these
       | tiny universes, we can then train agents within them to make even
       | more intelligent AI.
        
       | th0ma5 wrote:
       | This feels like a nice preview at the bottom of the kinds of
       | unsolvable issues these things will always have to some degree.
        
       | robotresearcher wrote:
       | I don't see how you design and ship a game like this. You can't
       | design a game by setting model weights directly. I do see how you
        | might clone a game, eventually even fixing the currently missing
        | things like object permanence and other long-term state. But the
       | inference engine is probably more expensive to run than the game
       | engine it (somewhat) emulates.
       | 
       | What is this tech useful for? Genuine question from a long-time
       | AI person.
        
         | naed90 wrote:
          | Yep! Which is why a key goal for our next models is to get to
          | the point where you can "code" a new world using "prompting". I
          | agree that these tools become insanely useful only once there
          | is a very good way for creators to "develop" new worlds/games
          | on top of these systems that users can then interact with.
         | 
          | At the end of the day, it should provide the same "API" as a
          | game engine does: creators develop worlds, users interact with
          | those worlds. The nice thing is that if AI can actually fill
          | this role, then:
          | 
          | 1. It would potentially be much easier to create worlds/games
          | (you could just "talk" to the AI -- "add a flying pink elephant
          | here").
          | 
          | 2. Users could interact with a world that changes to fit each
          | game session -- truly infinite worlds.
         | 
         | Last point: are we there yet? Ofc not! Oasis v1 is a first POC.
         | Wait just a bit more for v2 ;)
        
         | andoando wrote:
          | I suppose you could potentially take a movie like Avatar and
          | create a somewhat interactive experience with it?
        
         | notfed wrote:
         | Obviously this tool is not going to generate a "ship"pable game
         | for you. AI is a long way off from that. As for "design", I
         | don't find it very hard to see how incredibly useful being able
         | to rapidly prototype a game would be, even if it requires
         | massive GPU usage. And papers like these are only stepping
         | stones to getting there.
        
         | sangnoir wrote:
          | I found the visual artifacts annoying. I wonder if anyone has
          | trained models on pre-rasterization game engine output like
          | meshes/materials, the camera, or even just the raw OpenGL
          | calls. An AI that generates inputs to an actual renderer/engine
          | would solve the visual fidelity problem.
        
       | gessha wrote:
       | I find this extremely disappointing. A diffusion transformer
       | trained on Minecraft frames and accelerated on an ASIC... Okay?
       | 
        | From the demo (which doesn't work on Firefox) you can see that
        | it's overfit to the training set and doesn't have consistent
        | state transitions.
        | 
        | If you define it as a Markov decision process with states being
        | images, actions being keyboard/mouse inputs, and the transition
        | probability being the transformer model, the model is a very poor
        | one. Turning the mouse around shouldn't result in a completely
        | different world; it should result in the exact same point in
        | space seen from a different camera orientation. You can fake it
        | by fudging the training data and augmenting it with walking a
        | bit, doing a 360-degree camera rotation, and continuing the
        | exploration, but that will just overfit to that specific seed.
       | 
        | The page says their ASIC-accelerated inference supports 60+
        | players. Where are they shown playing together? What's the point
        | of touting multiplayer performance when, realistically, the poor
        | state transitions will mean those 60+ players are playing single-
        | player DeepDream Minecraft?
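        | 
        | A minimal sketch of the consistency probe described above (the
        | transition function is a hypothetical stand-in for the model, and
        | the action names are made up):
        | 
        |     import torch
        | 
        |     def transition(frames, action):
        |         # Plays the role of the MDP transition kernel p(s'|s, a);
        |         # stubbed here with a random 360x640 RGB frame.
        |         return torch.rand(3, 360, 640)
        | 
        |     def rollout(start_frame, actions):
        |         frames = [start_frame]
        |         for a in actions:
        |             frames.append(transition(frames[-4:], a))
        |         return frames
        | 
        |     # A full 360-degree turn should bring the view back to
        |     # (roughly) the start; large drift here is the "poor
        |     # transition model" complaint in practice.
        |     start = torch.rand(3, 360, 640)
        |     frames = rollout(start, ["turn_right_10deg"] * 36)
        |     drift = (frames[-1] - start).abs().mean().item()
        |     print(f"mean pixel drift after a full turn: {drift:.3f}")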
        
       | mrtnl wrote:
        | Very cool tech demo! Curious to see whether we continue to
        | generate environments at this level or move more toward
        | generating the physics.
        
       | therein wrote:
       | Queue makes it untestable. It isn't running client-side? What's
       | with the queueing?
        
       | robblbobbl wrote:
        | I like it!
        
       | amiramer wrote:
       | So cool! Curious to see how it evolves.. seems like a portal into
       | fully generated content, 0 applications. So exciting. Will it
       | also be promptable at some point?
        
       | aaladdin wrote:
        | How would you verify that real-world physics actually holds here?
        | Otherwise, such breaches could be maliciously and unfairly
        | exploited.
        
       | TalAvner wrote:
       | This is next level! I can't believe it's all AI generated in real
       | time. Can't wait to see what's next.
        
       | goranim wrote:
        | Love it! This virtual world looks so good, and it is also
        | changing really fast, so it seems like a very powerful model!
        
       | keidartom wrote:
       | So cool!
        
       | shanim_ wrote:
        | Could you explain how the interaction between the spatial
        | autoencoder (ViT-based) and the latent diffusion backbone (DiT-
        | based) enables both rapid response to real-time input and
        | temporal stability across long gameplay sequences? Specifically,
        | how does dynamic noising integrate with these components to
        | mitigate error compounding over time in an autoregressive setup?
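        | 
        | For readers who haven't seen the term: dynamic noising (as I
        | understand it) means corrupting the context-frame latents with a
        | random amount of noise during training, so the denoiser cannot
        | rely on a pixel-perfect past and therefore tolerates its own
        | slightly wrong outputs at inference. A minimal sketch of that
        | idea (hypothetical shapes, not Oasis's code):
        | 
        |     import torch
        | 
        |     def add_context_noise(context_latents, max_sigma=0.3):
        |         # Randomly noise past-frame latents during training so the
        |         # model learns not to trust them blindly -- at inference
        |         # the "past" is the model's own imperfect output.
        |         sigma = torch.rand(context_latents.shape[0], 1, 1, 1) * max_sigma
        |         return context_latents + sigma * torch.randn_like(context_latents)
        | 
        |     # e.g. a batch of 8 sequences, 4 context frames, 8x16x16 latents
        |     latents = torch.randn(8, 4, 8, 16, 16)
        |     noisy = add_context_noise(latents.flatten(0, 1)).view_as(latents)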
        
       | jhonj wrote:
        | Tried their not-a-game and it was SICK to play knowing it's not a
        | game engine. Really sick. When did these Decart ppl start working
        | on that? Must be f genius ppl.
        
       | hesyechter wrote:
        | Very very cool, I love it. Good luck!
        
       | djhworld wrote:
       | I think this is really cool as a sort of art piece? It's very
       | dreamlike and unsettling, especially with the music
        
       | drdeca wrote:
        | This apparently only supports Chrome at the moment. I hope it
        | will support non-Chrome browsers in the future.
        
       ___________________________________________________________________
       (page generated 2024-11-01 23:00 UTC)