[HN Gopher] Oasis: A Universe in a Transformer
___________________________________________________________________
Oasis: A Universe in a Transformer
Author : ChadNauseam
Score : 190 points
Date : 2024-11-01 07:02 UTC (15 hours ago)
(HTM) web link (oasis-model.github.io)
(TXT) w3m dump (oasis-model.github.io)
| gunalx wrote:
| Really cool tech demo. What impressed me most is the inference
| speed. But I don't really see any use for this unless there's a
| way to store world state, to avoid the issue of it forgetting
| what it just generated.
| jmartin2683 wrote:
| Why? Seems like a very expensive way to vaguely clone a game.
| haccount wrote:
| It's an early demo of interactive realtime inference but it
| appears to have a promise of promptable game worlds and
| mechanisms. Or "scriptable dynamic imagination" if you will.
|
| The answer to "why?" when DeepDream demoed hallucinated dog
| faces in 2015 was contemporary diffusion models.
| dartos wrote:
| > it appears to have a promise of promptable game worlds
|
| Is it a world if there's no permanence?
|
| We've seen demos like this for a while now (granted not as
| fast) but the core problem is continuity. (Or some kind of
| object permanence)
|
| It's a problem for image generators as well.
|
| It'd be more interesting to see that get closer to being
| solved than to have a real-time Minecraft screenshot
| generator.
|
| I may have missed it, but I didn't see anything about
| prompting. I'd be surprised if this model could generalize
| beyond Minecraft at all.
| thirdacc wrote:
| >It's a problem for image generators as well.
|
| It was, about a year ago. It's a solved problem.
| hnlmorg wrote:
| The dog demo was introducing something new. This isn't.
|
| I don't want to be negative about someone else's project but
| I can completely understand why people are underwhelmed by
| this.
|
| What I think will be the real application for AI in gaming
| isn't creating poorer versions of native code, it will be
| creating virtual novels that evolve with the player. Where
| characters are actual AI rather than following a predefined
| script. Where you, as the player, can change and shape the
| story as you wish. Think Star Trek Holodeck "holo-novels", or
| MMORPGs that can be played fully offline.
|
| Rendering the pixels is possibly the worst application for AI
| at this stage, because AI lacks reliable frame-by-frame
| continuity, rendering speed, and an understanding of basic
| physics, which are all the bare minimum for any modern game
| engine.
| int_19h wrote:
| It also seems like a pretty decent way to investigate emergent
| world models in NNs.
| blixt wrote:
| Super cool, and really nice to see the continuous rapid progress
| of these models! I have to wonder how long-term state (building a
| base and coming back later) as well as potentially guided state
| (e.g. game rules that are enforced in traditional code, or
| multiplayer, or loading saved games, etc) will work.
|
| It's probably not by just extending the context window or making
| the model larger, though that will of course help, because
| fundamentally external state and memory/simulation are two
| different things (right?).
|
| Either way it seems natural that these models will soon be used
| for goal-oriented imagination of a task - e.g. imagine a computer
| agent that needs to find a particular image on a computer, it
| would continuously imagine the path between what it currently
| sees and its desired state, and unlike this model which takes
| user input, it would imagine that too. In some ways, to the best
| of my understanding, this already happens with some robot control
| networks, except without pixels.
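|
| A toy sketch of that "imagine, then act" loop (everything
| here -- names, dynamics, scoring -- is made up purely for
| illustration; Python):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     W = rng.normal(size=(8, 8)) * 0.3  # toy latent dynamics
|     A = rng.normal(size=(8, 3))        # toy per-action effects
|
|     def world_model(state, action):
|         # Stand-in for a learned model imagining the next state.
|         return np.tanh(W @ state + A[:, action])
|
|     def plan(state, goal, horizon=4):
|         # Pick the action whose imagined rollout ends nearest
|         # the goal.
|         def rollout(action):
|             s = state
|             for _ in range(horizon):
|                 s = world_model(s, action)
|             return s
|         return min(range(3),
|                    key=lambda a: np.linalg.norm(rollout(a) - goal))
|
|     print(plan(np.zeros(8), np.ones(8)))  # index of best action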
| aithrowawaycomm wrote:
| There's not even the slightest hint of state in this demo: if
| you hold "turn left" for a full rotation you don't end up where
| you started. After a few rotations the details disappear and
| you're left in the middle of a blank ocean. There's no way this
| tech will ever make a playable version of Mario, let alone
| Minecraft.
| blixt wrote:
| There's plenty of evidence of state, just a very short-term
| memory. Examples:
|
| - The inventory bar is mostly consistent throughout the play
| session
|
| - State transitions in response to key presses
|
| - Block breakage over time is mostly consistent
|
| - Toggling doors / hatches works as expected
|
| - Jumping progresses with correct physics
|
| Turning around and seeing approximately the same thing you
| saw a minute ago is probably just a matter of extending a
| context window, but it will inherently have limits when you
| get to the scale of an entire world even if we somehow can
| make context windows have excellent compression of redundant
| data (which would greatly help LLM transformers too). And I
| guess what I'm mostly wondering about is how you would
| synchronize this state with a ground truth so that it can be
| shared between different instances of the agent, or other
| non-agent entities.
|
| And again, I think it's important to remember games are just
| great for training this type of technology, but it's probably
| more useful in non-game fields such as computer automation,
| robot control, etc.
| bongodongobob wrote:
| Nah, it doesn't even track which direction you're looking.
| Looking straight ahead, walk into some sugar cane so your
| whole screen is green. Now look up. It thinks you were
| looking down.
| blixt wrote:
| I guess it comes down to your definition of state. I'm
| not saying there's enough state for this to be playable,
| but there is clearly state, and I think it's important
| to point out how impressive the temporal consistency and
| coherence of this model are, considering that not long
| ago the state of the art here rapidly decohered into
| completely noisy pixels.
| FeepingCreature wrote:
| In other words: there's enough state now that the lack of
| state _stands out._ It works well enough for its failures
| to be notable.
| bongodongobob wrote:
| I guess if you consider knowing what color the pixels
| were in the last frame "state". That's not a definition
| anyone would use though. Spinning around and have the
| world continuously regenerate or looking at the sky and
| back down regenerating randomly is the opposite of state.
| It's complete incoherence.
| naed90 wrote:
| Hey, developer of Oasis here! You are very correct. Here
| are a few points:
|
| 1. We trained the model on context windows of up to 30
| seconds. What's the problem? It barely pays any attention
| to frames beyond the past few. This makes sense given the
| loss function of the model during training. We are now
| running many different training runs to experiment with a
| better loss function (and data mix) to solve this issue.
| You'll see newer versions soon!
|
| 2. In the long term, we believe the "ultimate" solution
| is two models: one model that maintains game state, plus
| one model that turns that state into pixels. Think of the
| first model as something resembling an LLM that takes the
| current state plus the user action and produces the new
| state, and the second model as a diffusion model that
| maps from this state to pixels. This would get the best
| of both worlds.
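|
| To make the split concrete, here's a toy sketch (the
| stand-in models and names below are purely illustrative,
| not our actual architecture or code):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     # Toy stand-ins: in the real design, model 1 would be a
|     # transformer over game state, model 2 a diffusion model.
|     W_dyn = rng.normal(size=(64, 64)) * 0.1  # state dynamics
|     W_act = rng.normal(size=(64, 4)) * 0.1   # action embedding
|     W_dec = rng.normal(size=(16 * 16, 64))   # state -> pixels
|
|     def step_state(state, action):
|         # Model 1: (current state, user action) -> new state.
|         return np.tanh(W_dyn @ state + W_act @ action)
|
|     def decode_frame(state):
|         # Model 2: state -> pixels (a 16x16 grayscale frame).
|         return (W_dec @ state).reshape(16, 16)
|
|     state = np.zeros(64)      # persistent world state
|     for key in [0, 0, 1, 2]:  # e.g. forward, forward, left, jump
|         state = step_state(state, np.eye(4)[key])
|         frame = decode_frame(state)  # re-rendered every tick
|     print(frame.shape)        # (16, 16)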
| throwaway314155 wrote:
| This stuff is all fascinating to me from a computer
| vision perspective. I'm curious - if you have a second
| model tasked with learning just the game state - does
| that mean you would be using info from the game itself
| (say, via a mod or with the developer console) as
| training data? Or is the idea that the model somehow
| learns the state (and only the state) on its own as it
| does here?
| naed90 wrote:
| That's a great question -- lots of experiments will be
| going into future versions of Oasis. There are quite a
| few different possibilities here, and we'll have to
| experiment with them a lot.
|
| The nice thing is that we can run tons of experiments at
| once. For Oasis v1, we ran over 1000 experiments (end-to-
| end training a 500M model) on the model architecture,
| data mix, etc., before we created the final checkpoint
| that's deployed on the site. At Decart (we just came out
| of stealth yesterday:
| https://www.theinformation.com/articles/why-sequoias-
| shaun-m...) we have 2 teams: Decart Infrastructure and
| Decart Experiences. The first team provides insanely fast
| infra for training/inference (it writes everything from
| scratch, from CUDA kernels to redoing the Python garbage
| collector) -- we are able to get a 500M model to converge
| during training in ~20h instead of 1-2 weeks. Then,
| Decart Experiences uses this infra to create these new
| types of end-to-end "Generated Experiences".
| golol wrote:
| Between the first half of your post and the last sentence
| there is a giant leap to the conclusion.
| blixt wrote:
| Yeah probably, it remains to be seen if these models can
| actually help guide a live session towards the goal. At
| least it's been shown that these types of world models can
| help a model become better at achieving a goal, in lieu of
| a hard coded simulation environment or the real world, for
| when those options are not tractable.
|
| My favorite example is: https://worldmodels.github.io/ (Not
| least of all because they actually simulate these
| simplified world models in your browser!)
| GaggiX wrote:
| >There's no way this tech will ever make a playable version
| of Mario
|
| Wait a few months: if someone is willing to use their 4090 to
| train the model, the technology is already here. If you can
| play a level of Doom, then Mario should be even easier.
| whism wrote:
| Allow the user to draw into the frame buffer during play and feed
| that back, and you could have something very interesting.
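|
| Something like this, as a rough sketch (the helper below is
| hypothetical, just to illustrate the feedback idea):
|
|     import numpy as np
|
|     def feed_back(last_frame, strokes):
|         # Composite user-drawn pixels into the frame that
|         # conditions the next generation step (0 = not drawn).
|         out = last_frame.copy()
|         mask = strokes != 0
|         out[mask] = strokes[mask]
|         return out
|
|     frame = np.zeros((64, 64))   # last generated frame (toy)
|     strokes = np.zeros((64, 64))
|     strokes[30:34, 10:50] = 1.0  # user draws a bright bar
|     context = feed_back(frame, strokes)  # next model input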
| dartos wrote:
| It'd probably break wildly since it's really hard to draw
| Minecraft by hand.
| thrance wrote:
| > It's a video game, but entirely generated by AI
|
| I ctrl-F'ed the webpage and saw 0 occurrences of "Minecraft".
| Why? This isn't a video game, this is a poor copy of a real
| video game whose name you didn't even bother to mention, let
| alone credit.
| dartos wrote:
| Yeah, it is strange how they make the model sound like it can
| generate any environment, but only show demos of the most
| data-available game ever.
| armchairhacker wrote:
| It seems like an interesting attempt to get around legal
| issues.
|
| They can't say "Minecraft" because that's a Microsoft
| trademark, but they _can_ use Minecraft images as training
| data, because everyone (including Microsoft) is using
| proprietary data to train diffusion models. There's the issue
| that the outputs _obviously_ resemble Minecraft, but Microsoft
| has its own problems with Bing and DALL-E generating images
| that obviously resemble trademarked things (despite
| guardrails).
| chefandy wrote:
| Avoiding legal _and ethical_ issues. This stuff was made by a
| bunch of real people, and people still get their name in
| movie and game credits even if they got a paycheck for it.
| Microsoft shamelessly vacuuming up proprietary content didn't
| change the norms of the way people get credited in these
| mediums. It's sad to see how thoughtlessly so many people
| using generative AI disregard the models' source material as
| "data", while the models (and therefore their creators)
| almost always get prominently credited for putting in a tiny
| fraction of the effort. The dubious ethical defense against
| crediting source works -- that models learn about media the
| same way humans do and adapt it to suit their purposes -- is
| obliterated when a model is trained on one work to reproduce
| that work. That this is equated to generating an image on
| Midjourney is a blatant example of a common practice: people
| want to get credit for other people's work, but when it's
| time to take responsibility the way a human artist would have
| to, "The machine did it! It's not my responsibility!"
| woah wrote:
| The point of this paper is to demonstrate a method of
| training an AI to output interactive game footage. They
| could have trained it for similar results with DOOM videos.
| Presumably the footage they trained on was OK to train on
| (I don't think a video of someone playing a video game is
| copyrighted by the video game's author), but they could
| have used a variety of other games.
| stared wrote:
| Algorithms trained on DOOM, or using DOOM as a showcase,
| mention DOOM.
| rodiger wrote:
| Surprisingly, Nintendo has a long history of copyright-
| striking videos of folks playing their games.
|
| https://www.ign.com/articles/2013/05/16/nintendo-
| enforces-co...
| FeepingCreature wrote:
| Do you really think this would be materially different if
| they used Minetest? To be frank, nothing in Minecraft as a
| _game_ (rather than the code) deserves intellectual
| property protection; it copies games that came before and
| was copied by games that came after. It is an excellent,
| well-designed implementation of a very basic concept.
| chefandy wrote:
| > nothing in Minecraft as a game (rather than the code)
| deserves intellectual property protection
|
| > excellent, well-designed implementation
|
| And there we see the problem laid bare. Excellent designs
| that are well executed are not worthless facets of the
| _real product_. As we can see from Minecraft's success,
| they _are_ the real product. People play video games for
| the experience, not to execute some logical construct or
| a formal proof showing that it's fun. The reason this
| demo uses Minecraft as opposed to a Minecraft knockoff is
| that Minecraft is better, and they are capitalizing on
| that. Even if the game is based on a well-worn concept,
| the many varied design problems you solve when making a
| game are harder than the development, which is why damn
| near every game that starts open source is a knockoff of
| something other people already designed. It's not as if
| Mojang were some marketing powerhouse that knocked
| Infiniminer off its perch without merit.
| kiloDalton wrote:
| There is one mention of Minecraft in the second paragraph of
| the Architecture section, "...We train on a subset of open-
| source Minecraft video data collected by OpenAI[9]." I can't
| say whether this was added after your comment.
| stared wrote:
| It is weird - compare and contrast with
| https://diamond-wm.github.io/, which explicitly mentions
| Counter-Strike.
|
| When a scientific work uses some work and does not credit it,
| it is academic dishonesty.
|
| Sure, they _could_ have trained the model on a different
| dataset. No matter which source was used, it should be cited.
| brap wrote:
| The waiting line is too long, so I gave up. Can anyone tell
| me: are the pixels themselves generated by the model, or does
| it just generate the environment, which is then rendered by
| "classical" means?
| yorwba wrote:
| If it were to generate an environment rendered by classical
| means, it would most likely have object permanence instead of
| regenerating something new after briefly looking away:
| https://oasis-model.github.io/3_second_memory.webp
| yokto wrote:
| It generates the pixels, including the blurry UI at the bottom.
| naed90 wrote:
| Every pixel is generated! User actions go in, pixels come out
| -- and there is only a transformer in the middle :)
|
| Why is this interesting? Today, not too interesting (Oasis v1
| is just a POC). In the future (and by future I literally mean a
| few months away -- wait for future versions of Oasis coming out
| soon), imagine that every single pixel you see will be
| generated, including the pixels you see as you're reading this
| message. Why is that interesting? It's a new interface for
| communication between humans and machines. It's like why LLMs
| are interesting for chat, because they provide humans and
| machines an ability to interact in a way humans are used to
| (chat) -- here, computers will be able to see the world as we
| do and show back stuff to us in a way we are used to. TLDR:
| imagine telling your computer "create a pink elephant" and just
| seeing it popup in a game you're playing.
| GaggiX wrote:
| Kinda hyped to see how this model (or a much bigger one) will run
| on Etched's transformer ASIC, Sohu, if it ever comes out.
| sigh_again wrote:
| Millions of hours of the world's most popular game, to output
| a blurry render with features matching a pre-alpha, zero
| object permanence, and a non-working inventory. Same trash as
| the recent DOOM renders.
|
| You could have rewritten the exact same game in less time and
| with less energy, but seemingly empowering talentless hacks
| is a better business model.
| kookamamie wrote:
| Agreed. Looks like utter garbage, while the tech could be
| "groundbreaking".
| dartos wrote:
| It's a cool concept, but I'd be very surprised if it gets much
| better than this.
|
| "Playable" statistical models seem to miss the point of games
| entirely
| petersonh wrote:
| Very cool - has a very dreamlike quality to it
| 0xTJ wrote:
| Seems like a neat idea, but too bad that the demo doesn't
| work on Firefox.
| naed90 wrote:
| we really wanted it to! but WebRTC was giving us lots of
| trouble on FF :( trust me, most of the team here is on FF
| too, and we're bummed we can't play it there haha
| redblacktree wrote:
| "If you were dreaming in Minecraft" is the impression that I get.
| It feels very much like a dream with the lack of object
| permanence. Also interesting is light level. If you stare at
| something dark for a while or go "underwater" and get to the
| point where the screen is black, it's difficult to get back to
| anything but a black screen. (I didn't manage it in my one
| playthrough)
|
| Very odd sensation indeed.
| vannevar wrote:
| If anyone has ever read Tad Williams' Otherland series, this is
| basically the core idea. "The dream that is dreaming us."
| jiwidi wrote:
| So basically they trained a model on Minecraft. This is not
| generalist at all. It's not like the game comes from a
| prompt; it probably comes from a bunch of fine-tuning and
| giga-datasets of Minecraft play.
|
| Would love to see some work like this but with world/games coming
| from a prompt.
| naed90 wrote:
| wait for Oasis v2, coming out soon :) (Disclaimer: I'm from the
| Oasis team)
| xyzal wrote:
| Maybe we should train models on Mario games to make Nintendo
| fight for the "Good Cause".
| duendefm wrote:
| It's not a video game, it's a fast Minecraft screenshot
| simulator where the prompt for each frame is the state of the
| input and the previous frames, with something resembling
| coherence.
| joshdavham wrote:
| Incredible work! I think once we're able to solidly emulate these
| tiny universes, we can then train agents within them to make even
| more intelligent AI.
| th0ma5 wrote:
| This feels like a nice preview at the bottom of the kinds of
| unsolvable issues these things will always have to some degree.
| robotresearcher wrote:
| I don't see how you design and ship a game like this. You can't
| design a game by setting model weights directly. I do see how you
| might clone a game, eventually without all the missing things
| like object permanence and other long-term state. But the
| inference engine is probably more expensive to run than the game
| engine it (somewhat) emulates.
|
| What is this tech useful for? Genuine question from a long-time
| AI person.
| naed90 wrote:
| Yep! Which is why a key goal for our next models is to get to
| a state where you can "code" a new world using prompting. I
| agree that these tools become insanely useful only once there
| is a very good way for creators to "develop" new worlds/games
| on top of these systems, and then users can interact with
| those worlds.
|
| At the end of the day, it should provide the same "API" as a
| game engine does: creators develop worlds, users interact
| with those worlds. The nice thing is that if AI can actually
| fill this role, then:
|
| 1. It would potentially be much easier to create worlds/games
| (you could just "talk" to the AI -- "add a flying pink
| elephant here").
|
| 2. Users could interact with a world that changes to fit each
| game session -- this is truly infinite worlds.
|
| Last point: are we there yet? Ofc not! Oasis v1 is a first
| POC. Wait just a bit more for v2 ;)
| andoando wrote:
| I suppose you could potentially take a movie like Avatar and
| create a somewhat interactive experience with it?
| notfed wrote:
| Obviously this tool is not going to generate a "ship"pable game
| for you. AI is a long way off from that. As for "design", I
| don't find it very hard to see how incredibly useful being able
| to rapidly prototype a game would be, even if it requires
| massive GPU usage. And papers like these are only stepping
| stones to getting there.
| sangnoir wrote:
| I found the visual artifacts annoying. I wonder if anyone has
| trained models on pre-rasterization game engine output like
| meshes/materials, the camera, or even just the raw OpenGL
| calls. An AI that generates inputs to an actual
| renderer/engine would solve visual fidelity.
| gessha wrote:
| I find this extremely disappointing. A diffusion transformer
| trained on Minecraft frames and accelerated on an ASIC... Okay?
|
| From the demo (which doesn't work on Firefox), you can see
| that it's overfit to the training set and that it doesn't
| have consistent state transitions.
|
| If you define it as a Markov decision process, with states
| being images, actions being keyboard/mouse inputs, and the
| transition probability being the transformer model, the model
| is a very poor one. Turning the mouse around shouldn't result
| in a completely different world; it should result in the
| exact same point in space from a different camera
| orientation. You can fake it by fudging the training data,
| augmenting it with walking a bit, doing a 360 camera
| rotation, and continuing the exploration, but that will just
| overfit to that specific seed.
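|
| Concretely, the MDP view I mean, as a toy sketch (shapes,
| names, and the drift model are illustrative only):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     def transition(frames, action):
|         # The transformer plays p(s' | s, a): past frames plus
|         # an input -> next frame. Toy stand-in: a noisy copy
|         # of the last frame.
|         return frames[-1] + 0.05 * rng.normal(size=(16, 16))
|
|     # A camera rotation should be a deterministic
|     # re-projection of the same scene; here (as in the demo)
|     # it is not, so a full 360-degree turn drifts away from
|     # the starting frame instead of returning to it.
|     frames = [np.zeros((16, 16))]
|     for _ in range(36):  # hold "turn left" through 360 degrees
|         frames.append(transition(frames, "turn_left"))
|     print(np.abs(frames[-1] - frames[0]).mean())  # nonzero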
|
| The page says their ASIC's model inference supports 60+
| players. Where are they shown playing together? What's the
| point of touting multiplayer performance when, realistically,
| the poor state transitions will mean those 60+ players are
| playing single-player DeepDream Minecraft?
| mrtnl wrote:
| Very cool tech demo! Curious to see whether we continue to
| generate environments at this level or move more toward
| generating the physics.
| therein wrote:
| The queue makes it untestable. It isn't running client-side?
| What's with the queueing?
| robblbobbl wrote:
| Me gusta!
| amiramer wrote:
| So cool! Curious to see how it evolves... seems like a portal
| into fully generated content, 0 applications. So exciting.
| Will it also be promptable at some point?
| aaladdin wrote:
| How would you verify that real world physics actually hold here?
| Otherwise, such breaches could be maliciously and unfairly
| exploited.
| TalAvner wrote:
| This is next level! I can't believe it's all AI generated in real
| time. Can't wait to see what's next.
| goranim wrote:
| Love it! This virtual world looks so good, and it is also
| changing really fast, so it seems like a very powerful model!
| keidartom wrote:
| So cool!
| shanim_ wrote:
| Could you explain how the interaction between the spatial
| autoencoder (ViT-based) and the latent diffusion backbone (DiT-
| based) enables both rapid response to real-time input and
| temporal stability across long gameplay sequences?
| Specifically, how does dynamic noising integrate with these
| components to mitigate error compounding over time in an
| autoregressive setup?
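|
| For reference, my mental model of the noising step is the
| usual context-noise augmentation trick (a rough sketch of the
| general technique, not necessarily what Oasis actually does):
|
|     import torch
|
|     def noise_context(frames, max_level=0.7):
|         # Corrupt the past frames used as conditioning with a
|         # random noise level (also fed to the model), so that
|         # at rollout time it tolerates its own imperfect
|         # generations instead of compounding their errors.
|         level = torch.rand(frames.shape[0], 1, 1, 1) * max_level
|         noise = torch.randn_like(frames)
|         return (1 - level) * frames + level * noise, level
|
|     ctx = torch.zeros(8, 3, 64, 64)  # 8 past RGB frames (toy)
|     noisy_ctx, level = noise_context(ctx)
|     # The DiT would be conditioned on noisy_ctx and level.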
| jhonj wrote:
| Tried their not-a-game, and it was SICK to play knowing it's
| not a game engine. Really sick. When did these Decart ppl
| start working on that? Must be f genius ppl.
| hesyechter wrote:
| Very, very cool, I love it. Good luck!
| djhworld wrote:
| I think this is really cool as a sort of art piece? It's very
| dreamlike and unsettling, especially with the music.
| drdeca wrote:
| This apparently currently only supports Chrome. I hope it
| will support non-Chrome browsers in the future.
___________________________________________________________________
(page generated 2024-11-01 23:00 UTC)