[HN Gopher] Diffusion models are real-time game engines
___________________________________________________________________
Diffusion models are real-time game engines
Author : jmorgan
Score : 999 points
Date : 2024-08-28 02:59 UTC (20 hours ago)
(HTM) web link (gamengen.github.io)
(TXT) w3m dump (gamengen.github.io)
| vessenes wrote:
| So, this is surprising. Apparently there's more cause, effect,
| and sequencing in diffusion models than what I expected, which
| would be roughly 'none'. Google here uses SD 1.4 as the core of
| the diffusion model, which is a nice reminder that open models
| are useful to even giant cloud monopolies.
|
| The two main things of note I took away from the summary were: 1)
| they got infinite training data using agents playing doom (makes
| sense), and 2) they added Gaussian noise to source frames and
| rewarded the agent for 'correcting' sequential frames back, and
| said this was critical to get long range stable 'rendering' out
| of the model.
|
| That last is intriguing -- they explain the intuition as teaching
| the model to do error correction / guide it to be stable.
|
| Finally, I wonder if this model would be easy to fine tune for
| 'photo realistic' / ray traced restyling -- I'd be super curious
| to see how hard it would be to get a 'nicer' rendering out of
| this model, treating it as a doom foundation model of sorts.
|
| Anyway, a fun idea that worked! Love those.
| refibrillator wrote:
| Just want to clarify a couple possible misconceptions:
|
| The diffusion model doesn't maintain any state itself, though
| its weights may encode some notion of cause/effect. It just
| renders one frame at a time (after all it's a text to image
| model, not text to video). Instead of text, the previous states
| and frames are provided as inputs to the model to predict the
| next frame.
|
| Noise is added to the previous frames before being passed into
| the SD model, so the RL agents were not involved with
| "correcting" it.
|
| De-noising objectives are widespread in ML, intuitively it
| forces a predictive model to leverage context, ie surrounding
| frames/words/etc.
|
| In this case it helps prevent auto-regressive drift due to the
| accumulation of small errors from the randomness inherent in
| generative diffusion models. Figure 4 shows such drift
| happening when a player is standing still.
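|
| Roughly, in PyTorch-ish pseudocode (all names here are
| hypothetical, not the paper's actual code):
|
|     import torch
|     import torch.nn.functional as F
|
|     def training_step(model, frames, actions, max_noise=0.7):
|         # frames: [B, T+1, C, H, W] = T context frames + target
|         # actions: [B, T] integer player inputs
|         past, target = frames[:, :-1], frames[:, -1]
|
|         # Corrupt the *conditioning* frames so the model learns
|         # to cope with its own imperfect outputs at inference
|         # time; this is what fights autoregressive drift.
|         sigma = torch.rand(past.size(0), 1, 1, 1, 1) * max_noise
|         noisy_past = past + sigma * torch.randn_like(past)
|
|         # Standard diffusion objective on the target frame,
|         # conditioned on noisy past frames and the actions.
|         t = torch.randint(0, 1000, (past.size(0),))
|         eps = torch.randn_like(target)
|         noisy_target = add_noise(target, eps, t)  # hypothetical forward process
|         pred = model(noisy_target, t, noisy_past, actions)
|         return F.mse_loss(pred, eps)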
| rvnx wrote:
| The concept is that you train a Diffusion model by feeding it
| all the possible frames seen in the game.
|
| The training was over almost 1 billion frames, 20 days of
| full-time play-time, taking a screenshot of every single inch
| of the map.
|
| Now you show it N frames as input and ask it "give me frame
| N+1", and it gives you frame N+1 back based on how it was
| originally seen during training.
|
| But it is not frame N+1 from a mysterious intelligence, it's
| simply frame N+1 given back from the past database.
|
| The drift you mentioned is actually a clear (but sad) proof
| that the model does not work at inventing new frames, and can
| only spit out an answer from the past dataset.
|
| It's a bit like if you train stable diffusion on Simpsons
| episodes, and it outputs the next frame of an existing
| episode that was in the training set, but a few frames
| later goes wild and buggy.
| jetrink wrote:
| I don't think you've understood the project completely. The
| model accepts player input, so frame 601 could be quite
| different if the player decided to turn left rather than
| right, or chose that moment to fire at an exploding barrel.
| rvnx wrote:
| 1 billion frames in memory... With such a dataset, you have
| seen practically all realistic possibilities in the
| short-term.
|
| If it were able to invent actions and maps and let the
| user play "infinite doom", then it would be very
| different (and impressive!).
| OskarS wrote:
| > 1 billion frames in memory... With such a dataset, you
| have seen practically all realistic possibilities in the
| short-term.
|
| I mean... no? Not even close? Multiplying the number of game
| states by the number of inputs at any given frame gives
| you a number vastly bigger than 1 billion, not even
| comparable. Even with 20 days of play time to train on,
| it's entirely likely that at no point did someone stop at
| a certain location and look to the left from that angle.
| They might have done so from similar angles, but the model
| then has to reconstruct some sense of the geometry of the
| level to synthesize the frame. They might also not have
| arrived there from the same direction, which again the
| model needs some smarts to understand.
|
| I get your point, it's very overtrained on these
| particular levels of Doom, which means you might as well
| just play Doom. But this is not a hash table lookup we're
| talking about, it's pretty impressive work.
| rvnx wrote:
| This was the basis for the reasoning:
|
| Map 1 has 2'518 walkable map units, and there are 65'536
| possible angles.
|
| 2'518*65'536=165'019'648
|
| If you capture 165M frames, you already cover all the
| possibilities in terms of camera / player view, but
| probably the diffusion models don't even need to have all
| the frames (the same way that LLMs don't).
| znx_0 wrote:
| I think enemies and effects are probably in there
| bee_rider wrote:
| Do you have to be exactly on a tile in Doom? I thought
| the guy walked smoothly around the map.
| commodoreboxer wrote:
| There's also enemy motion, enemy attacks, shooting, and
| UI considerations, which make the combinatorics explode.
|
| And Doom movement isn't tile based. The map may be, but
| you can be in many many places on a tile.
| TeMPOraL wrote:
| Like many people in case of LLMs, you're just
| demonstrating unawareness of - or disbelief in - the fact
| that the model doesn't record training data verbatim, but
| smears it out in high-dimensional space, from which it
| then samples. The model then doesn't recall past inputs
| (which are effectively under extreme lossy compression),
| but samples from that high-dimensional space to produce
| output. The high-dimensional representation by necessity
| captures semantic understanding of the training data.
|
| Generating "infinite Doom" is exactly what this model is
| doing, as it does not capture the larger map layout well
| enough to stay consistent with it.
| Workaccount2 wrote:
| Whether or not a judge understands this will probably
| form the basis of any precedent set about the legality of
| image models and copyright.
| znx_0 wrote:
| I like "conditioned brute force" as a better term.
| mensetmanusman wrote:
| Research is the acquisition of knowledge that may or may
| not have practical applications.
|
| They succeeded in the research, gained knowledge, and might
| be able to do something awesome with it.
|
| It's a success even if they don't sell anything.
| nine_k wrote:
| But it's not a game. It's a memory of a game video, predicting
| the next frame based on the few previous frames, like "I can
| imagine what happened next".
|
| I would call it the world's least efficient video compression.
|
| What I would like to see is the actual _predictive_ strength,
| aka imagination, which I did not notice mentioned in the
| abstract. The model is trained on a set of classic maps. What
| would it do, given a few frames of gameplay on an unfamiliar
| map as input? How well could it imagine what happens next?
| WithinReason wrote:
| If it's trained on absolute player coordinates then it would
| likely just morph into the known map at those coordinates.
| nine_k wrote:
| But it's trained on the actual screen pixel data, AFAICT.
| It's literally a visual imagination model, not a gameplay /
| geometry imagination model. They had to make special
| provisions for the pixel data of the HUD, which is by its
| nature different from the pictures of a 3D world.
| PoignardAzur wrote:
| > _But it's not a game. It's a memory of a game video,
| predicting the next frame based on the few previous frames,
| like "I can imagine what happened next"._
|
| It's not super clear from the landing page, but I _think_
| it's an engine? Like, its input is both previous images
| _and_ input for the next frame.
|
| So as a player, if you press "shoot", the diffusion engine
| need to output an image where the monster in front of you
| takes damage/dies.
| bergen wrote:
| How is what they say not clear?
|
| We present GameNGen, the first game engine powered entirely
| by a neural model that enables real-time interaction with a
| complex environment over long trajectories at high quality.
| taneq wrote:
| It's more like the Tetris Effect, where the model has seen so
| much Doom that it confabulates gameplay.
| mensetmanusman wrote:
| They could down convert the entire model to only utilize the
| subset of matrix components from stable diffusion. This
| approach may be able to improve internet bandwidth efficiency
| assuming consumers in the future have powerful enough
| computers.
| Sharlin wrote:
| No, it's predicting the next frame conditioned on past frames
| _AND player actions!_ This is clear from the article. Mere
| video generation would be nothing new.
| TeMPOraL wrote:
| It's a memory of a video looped to controls, so frame 1 is "I
| wonder how would it look if the player pressed D instead of
| W", then the frame 2 is based on frame 1, etc. and couple
| frames in, it's already not remembering, but _imagining_ the
| gameplay on the fly. It's not prerecorded, it responds to
| inputs during generation. That's what makes it a game engine.
| wavemode wrote:
| > Apparently there's more cause, effect, and sequencing in
| diffusion models than what I expected
|
| To temper this a bit, you may want to pay close attention to
| the demo videos. The player rarely backtracks, and for good
| reason - the few times the character does turn around and look
| back at something a second time, it has changed significantly
| (the most noticeable I think is the room with the grey wall and
| triangle sign).
|
| This falls in line with how we'd expect a diffusion model to
| behave - it's trained on many billions of frames of gameplay,
| so it's very good at generating a plausible -next- frame of
| gameplay based on some previous frames. But it doesn't deeply
| understand logical gameplay constraints, like remembering level
| geometry.
| mensetmanusman wrote:
| That is kind of cool though, I would play like being lost in
| a dream.
|
| If on the backend you could record the level layouts in
| memory you could have exploration teams that try to find new
| areas to explore.
| debo_ wrote:
| It would be cool for dream sequences in games to feel more
| like dreams. This is probably an expensive way to do it,
| but it would be neat!
| Groxx wrote:
| There's an example right at the beginning too - the ammo drop
| on the right changes to something green (I think that's a
| body?)
| codeflo wrote:
| Even purely going forward, specks on wall textures morph into
| opponents and so on. All the diffusion-generated videos I've
| seen so far have this kind of unsettling feature.
| bee_rider wrote:
| It's like some kind of weird dream Doom.
| whiteboardr wrote:
| But does it need to be frame-based?
|
| What if you combine this with an engine in parallel that
| provides all geometry including characters and objects with
| their respective behavior, recording changes made through
| interactions the other model generates, talking back to it?
|
| A dialogue between two parties with different functionality
| so to speak.
|
| (Non technical person here - just fantasizing)
| bee_rider wrote:
| In that case, the title of the article wouldn't be true
| anymore. It seems like a better plan, though.
| beepbooptheory wrote:
| What would the model provide if not what we see on the
| screen?
| whiteboardr wrote:
| The environment and everything in it.
|
| "Everything" would mean all objects and the elements
| they're made of, their rules on how they interact and
| decay.
|
| A modularized ecosystem i guess, comprised of "sub-
| systems" of sorts.
|
| The other model, that provides all interaction (cause for
| effect) could either be run artificially or be used
| interactively by a human - opening up the possibility for
| being a tree : )
|
| This all would need an interfacing agent that in
| principle would be an engine simulating the second law of
| thermodynamics and at the same time recording every state
| that has changed and diverged off the driving actor's
| vector in time.
|
| Basically the "effects" model keeping track of everyones
| history.
|
| In the end a system with an "everything" model (that can
| grow overtime), a "cause" model messing with it, brought
| together and documented by the "effect" model.
|
| (Again ... non technical person, just fantasizing) : )
| mplewis wrote:
| What you're asking for doesn't make sense.
| HappMacDonald wrote:
| So you're basically just talking about upgrading "enemy
| AI" to a more complex form of AI :)
| robotresearcher wrote:
| In that scheme what is the NN providing that a classical
| renderer would not? DOOM ran great on an Intel 486, which
| is not a lot of computer.
| whiteboardr wrote:
| An experience that isn't asset- but rule-based.
| Sohcahtoa82 wrote:
| > DOOM ran great on an Intel 486
|
| It always blew my mind how well it worked on a 33 MHz
| 486. I'm fairly sure it ran at 30 fps in 320x200. That
| gives it just over 17 clock cycles per pixel, and that
| doesn't even include time for game logic.
|
| My memory could be wrong, but even if it required a 66 MHz
| chip to reach 30 fps, that's still only 34 clocks per
| pixel on an architecture that required multiple clocks
| for a simple integer add instruction.
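|
| (The arithmetic, as a quick Python sanity check:)
|
|     pixels_per_second = 320 * 200 * 30        # 1,920,000
|     print(33_000_000 / pixels_per_second)     # ~17.2 cycles/pixel
|     print(66_000_000 / pixels_per_second)     # ~34.4 cycles/pixel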
| dewarrn1 wrote:
| Great observation. And not entirely unlike normal human
| visual perception which is notoriously vulnerable to missing
| highly salient information; I'm reminded of the "gorillas in
| our midst" work by Dan Simons and Christopher Chabris [0].
|
| [0]: https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...
| bamboozled wrote:
| Are you saying if I turn around, I'll be surprised at what
| I find? I don't feel like this is accurate at all.
| matheusd wrote:
| If a generic human glances at an unfamiliar
| screen/wall/room, can they accurately, pixel-perfectly
| reconstruct every single element of it? Can they do it
| for every single screen they have seen in their entire
| lives?
| bamboozled wrote:
| I never said pixel perfect, but I would be surprised if
| whole objects, like flaming lanterns, suddenly appeared.
|
| What this demo demonstrates to me is how incredibly
| willing we are to accept what seems familiar to us as
| accurate.
|
| I bet if you look closely and objectively you will see
| even more anomalies. But at first watch, I didn't see
| most errors because I think accepting something is more
| efficient for the brain.
| ben_w wrote:
| You'd likely be surprised by a flaming lantern unless you
| were in Flaming Lanterns 'R Us, but if you were watching
| a video of a card trick and the two participants changed
| clothes while the camera wasn't focused on them, you may
| well miss that and the other five changes that came with
| that.
| dewarrn1 wrote:
| Not exactly, but our representation of what's behind us
| is a lot more sparse than we would assume. That is, I
| might not be surprised by what I see when I turn around,
| but it could have changed pretty radically since I last
| looked, and I might not notice. In fact, an observer
| might be quite surprised that I missed the change.
|
| Objectively, Simons and Chabris (and many others) have a
| lot of data to support these ideas. Subjectively, I can
| say that these types of tasks (inattentional blindness,
| change blindness, etc.) are humbling.
| jerf wrote:
| Well, it's a bit of a spoiler to encounter this video in
| this context, but this is a very good video:
| https://www.youtube.com/watch?v=LRFMuGBP15U
|
| Even having a clue why I'm linking this, I virtually
| guarantee you won't catch everything.
|
| And even if you do catch everything... the _real_ thing
| to notice is that you had to _look_. Your brain does not
| flag these things naturally. Dreams are notorious for
| this sort of thing, but even in the waking world your
| model of the world is much less rich than you think.
| Magic tricks like to hide in this space, for instance.
| dewarrn1 wrote:
| Yup, great example! Simons's lab has done some things
| along exactly these lines [0], too.
|
| [0]: https://www.youtube.com/watch?v=wBoMjORwA-4
| ajuc wrote:
| The opposite - if you turn around and there's something
| that wasn't there the last time - you'll likely not
| notice if it's not out of place. You'll just assume it
| was there and you weren't paying attention.
|
| We don't memorize things that the environment remembers
| for us if they aren't relevant for other reasons.
| lawlessone wrote:
| It reminds me of dreaming. When you do something and turn
| back to check it has turned into something completely
| different.
|
| edit: someone should train it on MyHouse.wad
| robotresearcher wrote:
| Not noticing a gorilla that 'shouldn't' be there is not
| the same thing as object permanence. Even quite young
| babies are surprised by objects that go missing.
| dewarrn1 wrote:
| That's absolutely true. It's also well-established by
| Simons et al. and others that healthy normal adults
| maintain only a very sparse visual representation of
| their surroundings, anchored but not perfectly predicted
| by attention, and this drives the unattended gorilla
| phenomenon (along with many others). I don't work in this
| domain, but I would suggest that object permanence
| probably starts with attending and perceiving an object,
| whereas the inattentional or change blindness phenomena
| mostly (but not exclusively) occur when an object is not
| attended (or only briefly attended) _or_ attention is
| divided by some competing task.
| throwway_278314 wrote:
| Work which exaggerates the blindness.
|
| The people were told to focus very deeply on a certain
| aspect of the scene. Maintaining that focus means
| explicitly blocking things not related to that focus. Also,
| there is social pressure at the end to have performed well
| at the task; evaluating them on a task which is
| intentionally completely different than the one explicitly
| given is going to bias people away from reporting gorillas.
|
| And also, "notice anything unusual" is a pretty vague
| prompt. No-one in the video thought the gorillas were
| unusual, so if the PEOPLE IN THE SCENE thought gorillas
| were normal, why would I think they were strange? Look at
| any TV show, they are all full of things which are pretty
| crazy unusual in normal life, yet not unusual in terms of
| the plot.
|
| Why would you think the gorillas were unusual?
| dewarrn1 wrote:
| I understand what you mean. I believe that the authors
| would contend that what you're describing is a typical
| attentional state for an awake/aware human: focused
| mostly on one thing, and with surprisingly little
| awareness of most other things (until/unless they are in
| turn attended).
|
| Furthermore, even what we attend to isn't always
| represented with all that much detail. Simons has a whole
| series of cool demonstration experiments where they show
| that they can swap out someone you're speaking with (an
| unfamiliar conversational partner like a store clerk or
| someone asking for directions), and you may not even
| notice [0]. It's rather eerie.
|
| [0]: https://www.youtube.com/watch?v=FWSxSQsspiQ&t=5s
| alickz wrote:
| is that something that can be solved with more
| memory/attention/context?
|
| or do we believe it's an inherent limitation in the approach?
| noiv wrote:
| I think the real question is does the player get shot from
| behind?
| alickz wrote:
| great question
|
| tangentially related but Grand Theft Auto speedrunners
| often point the camera behind them while driving so cars
| don't spawn "behind" them (aka in front of the car)
| nmstoker wrote:
| I saw a longer video of this that Ethan Mollick posted and in
| that one, the sequences are longer and they do appear to
| demonstrate a fair amount of consistency. The clips don't
| backtrack in the summary video on the paper's home page
| because they're showing a number of distinct environments but
| you only get a few seconds of each.
|
| If I studied the longer one more closely, I'm sure
| inconsistencies would be seen but it seemed able to recall
| presence/absence of destroyed items, dead monsters etc on
| subsequent loops around a central obstruction that completely
| obscured them for quite a while. This did seem pretty odd to
| me, as I expected it to match how you'd described it.
| wavemode wrote:
| Yes it definitely is very good for simulating gameplay
| footage, don't get me wrong. Its input for predicting the
| next frame is not just the previous frame, it has access to
| a whole sequence of prior frames.
|
| But to say the model is simulating actual gameplay (i.e.
| that a person could actually play Doom in this) is far
| fetched. It's definitely great that the model was able to
| remember that the gray wall was still there after we turned
| around, but it's untenable for actual gameplay that the
| wall completely changed location and orientation.
| dr_dshiv wrote:
| It's an empirical question, right? But they didn't do
| it...
| TeMPOraL wrote:
| > _it's untenable for actual gameplay that the wall
| completely changed location and orientation._
|
| It would in an SCP-themed game. Or dreamscape/Inception
| themed one.
|
| Hell, "you're trapped in Doom-like dreamscape, escape
| before you lose your mind" is a very interesting pitch
| for a game. Basically take this Doom thing and make
| walking through a specific, unique-looking doorway from
| the original game to be the victory condition - the
| player's job would be to coerce the model to generate it,
| while also not dying in the Doom fever dream game itself.
| I'd play the hell out of this.
|
| (Implementation-wise, just loop in a simple recognition
| model to continuously evaluate the victory condition from
| the last few frames, and some OCR to detect when the
| player's hit points indicator on the HUD drops to zero.)
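|
| (A sketch of that outer loop; every helper name below is
| made up:)
|
|     def play_escape_the_dream(model, see_goal, read_hud_health):
|         frames, actions = [model.first_frame()], []
|         while True:
|             action = get_player_input()     # hypothetical I/O
|             frame = model.next_frame(frames[-8:],
|                                      actions[-8:] + [action])
|             frames.append(frame)
|             actions.append(action)
|             render(frame)                   # hypothetical I/O
|             if see_goal(frames[-4:]):       # recognition model
|                 return "escaped"
|             if read_hud_health(frame) <= 0: # OCR on the HUD
|                 return "lost to the dream"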
|
| (I'll happily pay $100 this year to the first project
| that gets this to work. I bet I'm not the only one.
| Doesn't have to be Doom specifically, just has to be
| interesting.)
| wavemode wrote:
| To be honest, I agree! That would be an interesting
| gameplay concept for sure.
|
| Mainly just wanted to temper expectations I'm seeing
| throughout this thread that the model is actually
| simulating Doom. I don't know what will be required to
| get from here to there, but we're definitely not there
| yet.
| ValentinA23 wrote:
| What you're pointing at mirrors the same kind of
| limitation in using LLMs for role-play/interactive
| fictions.
| lawlessone wrote:
| Maybe a hybrid approach would work. Certain things like
| inventory being stored as variables, lists etc.
|
| Wouldn't be as pure though.
| crooked-v wrote:
| Give it state by having a rendered-but-offscreen pixel
| area that's fed back in as byte data for the next frame.
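|
| Something like this, say (names invented for illustration,
| not any real API):
|
|     import numpy as np
|
|     def state_plane(health, ammo, keys, h=240, w=320):
|         # Serialize game variables into an off-screen pixel
|         # plane that is fed back in (and re-predicted) with
|         # every frame, giving the model explicit state.
|         plane = np.zeros(h * w, dtype=np.uint8)
|         payload = bytes([health, ammo]) + bytes(keys)
|         plane[:len(payload)] = np.frombuffer(payload, np.uint8)
|         return plane.reshape(h, w)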
| KajMagnus wrote:
| Or what if the model were trained on many FPS games? Surviving in
| one nightmare that morphs into another, into another,
| into another ...
| kridsdale1 wrote:
| Check out the actual modern DOOM WAD MyHouse which
| implements these ideas. It totally breaks our
| preconceptions of what the DOOM engine is capable of.
|
| https://en.wikipedia.org/wiki/MyHouse.wad
| jsheard wrote:
| MyHouse is excellent, but it mostly breaks our perception
| of what the Doom engine is capable of by not _really_
| using the Doom engine. It leans heavily on engine
| features which were embellishments by the GZDoom project,
| and never existed in the original Doom codebase.
| hoosieree wrote:
| Small objects like powerups appear and disappear as the
| player moves (even without backtracking), the ammo count is
| constantly varying, getting shot doesn't deplete health or
| armor, etc.
| TeMPOraL wrote:
| So for the next iteration, they should add a minimap overlay
| (perhaps on a side channel) - it should help the model give
| more consistent output in any given location. Right now, the
| game is very much like a lucid dream - the universe makes
| sense from moment to moment, but without outside reference,
| everything that falls out of short-term memory (a few frames
| here) gets reimagined.
| Workaccount2 wrote:
| I don't see this as something that would be hard to overcome.
| Sora for instance has already shown the ability for a
| diffusion model to maintain object permanence. Flux recently
| too has shown the ability to render the same person in many
| different poses or images.
| idunnoman1222 wrote:
| Where does a sora video turn around backwards? I can't
| maintain such consistency in my own dreams.
| Workaccount2 wrote:
| I don't know of an example (not to say it doesn't exist)
| but the problem is fundamentally the same as things
| moving out of sight/out of frame and coming back again.
| Jensson wrote:
| > the problem is fundamentally the same as things moving
| out of sight/out of frame and coming back again
|
| Maybe it is, but doing that with the entire scene instead
| of just a small part of it makes the problem massively
| harder, as the model needs to grow exponentially to
| remember more things. It isn't something that we will
| manage anytime soon, maybe 10-20 years with current
| architecture and same compute progress.
|
| Then you make that even harder by remembering a whole
| game level? No, ain't gonna happen in our lifetimes
| without massive changes to the architecture. They would
| need to make a different model keep track of level state
| etc, not just an image to image model.
| Workaccount2 wrote:
| 10 to 20 years sounds wildly pessimistic
|
| In this sora video the dragon covers half the scene, and
| it's basically identical when it is revealed again ~5
| seconds later, or about 150 frames later. There is lots of
| evidence (and some studies) that these models are in fact
| building internal world models.
|
| https://www.youtube.com/watch?v=LXJ-yLiktDU
|
| Buckle in, the train is moving way faster. I don't think
| there would be much surprise if this is solved in the
| next few generations of video generators. The first
| generation is already doing very well.
| Jensson wrote:
| Did you watch the video, it is completely different after
| the dragon goes past? Its still a flag there, but
| everything else changed. Even the stores in the
| background changed, the mass of people is completely
| different with no hint of anyone moving there etc.
|
| You always get this from AI enthusiast, they come and
| post "proof" that disproves their own point.
| HappMacDonald wrote:
| I'm not GP, but going over that video I'm actually having
| a hard time finding any detail present before the dragon
| obscures it that doesn't either exit frame right when the
| camera pans left slightly near the end, or re-appear with
| reasonably crisp detail after the dragon gets out of the
| way.
|
| Most of the mob of people are indistinct, but there is a
| woman in a lime green coat who is visible, and then
| obstructed by the dragon twice (beard and ribbon) and
| reappears fine. Unfortunately when dragon fully moves
| past she has been lost to frame right.
|
| There is another person in black holding a red satchel
| which is visible both before and after the dragon has
| passed.
|
| Nothing about the storefronts appear to change. The
| complex sign full of Chinese text (which might be
| gibberish text: it's highly stylized and I don't know
| Chinese) appears to survive the dragon passing without
| even any changes to the individual ideograms.
|
| There is also a red box shaped like a Chinese paper
| lantern with a single gold ideogram on it at the store
| entrance which spends most of the video obscured by the
| dragon and is still in the same location after it passes
| (though video artifacting makes it more challenging to
| verify that that ideogram is unchanged it certainly does
| not appear substantially different)
|
| What detail are you seeing that is different before and
| after the obstruction?
| nielsbot wrote:
| You can also notice in the first part of the video the ammo
| numbers fluctuate a bit randomly.
| raghavbali wrote:
| Nicely summarised. Another important thing that clearly
| stands out (not to undermine the efforts and work gone into
| this) is the fact that more and more we are now seeing
| larger and more complex building blocks emerging (first it
| was embedding models, then encoder-decoder layers, and now
| whole models are being duct-taped together into even more
| powerful pipelines). The AI/DL ecosystem is growing on a
| nice trajectory.
|
| Though I wonder if 10 years down the line folks wouldn't even
| care about underlying model details (no more than a current day
| web-developer needs to know about network packets).
|
| PS: Not great examples, but I hope you get the idea ;)
| pradn wrote:
| > Google here uses SD 1.4, as the core of the diffusion model,
| which is a nice reminder that open models are useful to even
| giant cloud monopolies.
|
| A mistake people make all the time is that massive companies
| will put all their resources toward every project. This paper
| was written by four co-authors. They probably got a good amount
| of resources, but they still had to share in the pool allocated
| to their research department.
|
| Even Google only has one Gemini (in a few versions).
| zzanz wrote:
| The quest to run doom on everything continues. Technically
| speaking, isn't this the greatest possible anti-Doom, the Doom
| with the highest possible hardware requirement? I just find it
| funny that on a linear scale of hardware specification, Doom now
| finds itself on both ends.
| fngjdflmdflg wrote:
| >Technically speaking, isn't this the greatest possible anti-
| Doom
|
| When I read this part I thought you were going to say because
| you're technically _not_ running Doom at all. That is, instead
| of running Doom without Doom's original hardware/software
| environment (by porting it), you're running Doom without Doom
| itself.
| bugglebeetle wrote:
| Pierre Menard, Author of Doom.
| el_memorioso wrote:
| I applaud your erudition.
| 1attice wrote:
| that took a moment, thank you
| airstrike wrote:
| OK, this is the single most perfect comment someone could
| make on this thread. Diffusion me impressed.
| jl6 wrote:
| Knee Deep in the Death of the Author.
| ynniv wrote:
| It's _dreaming_ Doom.
| birracerveza wrote:
| We made machines dream of Doom. Insane.
| daemin wrote:
| Time to make a sheep mod for Doom.
| qingcharles wrote:
| _Do Robots Dream of E1M1?_
| x-complexity wrote:
| > Technically speaking, isn't this the greatest possible anti-
| Doom, the Doom with the highest possible hardware requirement?
|
| Not really? The greatest anti-Doom would be an infinite nest of
| these types of models predicting models predicting Doom at the
| very end of the chain.
|
| The next step of anti-Doom would be a model generating the
| model, generating the Doom output.
| nurettin wrote:
| Isn't this technically a model (training step) generating a
| model (a neural network) generating Doom output?
| yuchi wrote:
| "...now it can _implement_ Doom!"
| Vecr wrote:
| It's the No-Doom.
| WithinReason wrote:
| Undoom?
| riwsky wrote:
| It's a mood.
| jeffhuys wrote:
| Bliss
| Terr_ wrote:
| > the Doom with the highest possible hardware requirement?
|
| Isn't that possible by setting arbitrarily high goals for ray-
| cast rendering?
| danjl wrote:
| So, diffusion models are game engines as long as you already
| built the game? You need the game to train the model. Chicken.
| Egg?
| billconan wrote:
| maybe the next step is adding text guidance and generating non-
| existing games.
| kragen wrote:
| here are some ideas:
|
| - you could build a non-real-time version of the game engine
| and use the neural net as a real-time approximation
|
| - you could edit videos shot in real life to have huds or
| whatever and train the neural net to simulate reality rather
| than doom. (this paper used 900 million frames which i think is
| about a year of video if it's 30fps, but maybe algorithmic
| improvements can cut the training requirements down) and a year
| of video isn't actually all that much--like, maybe you could
| recruit 500 people to play paintball while wearing gopro
| cameras with accelerometers and gyros on their heads and
| paintball guns, so that you could get a year of video in a
| weekend?
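|
| (rough numbers, if you want to check me:)
|
|     frames = 900_000_000
|     hours = frames / 30 / 3600          # ~8,333 hours of video
|     print(hours / 24)                   # ~347 days, so ~a year
|     people, weekend = 500, 2 * 9        # two 9-hour gopro days
|     print(hours / (people * weekend))   # ~0.93, so one weekend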
| w_for_wumbo wrote:
| That feels like the endgame of video game generation. You
| select an art style, a video and the type of game you'd like
| to play. The game is then generated in real-time responding
| to each action with respect to the existing rule engine.
|
| I imagine a game like that could get so convincing in its
| details and immersiveness that one could forget they're
| playing a game.
| THBC wrote:
| Holodeck is just around the corner
| amelius wrote:
| Except for haptics.
| omegaworks wrote:
| EXISTENZ IS PAUSED!
| numpad0 wrote:
| IIRC, both _2001_ (1968) and _Solaris_ (1972) depict that
| kind of thing as part of an alien euthanasia process, not
| as happy endings
| hypertele-Xii wrote:
| Also The Matrix, Oblivion, etc.
| catanama wrote:
| Well, 2001 is actually a happy ending, as Dave is reborn
| as a cosmic being. Solaris, at least in the book, is an
| attempt by the sentient ocean to communicate with
| researchers through mimics.
| aithrowaway1987 wrote:
| Have you ever played a video game? This is unbelievably
| depressing. This is a future where games like Slay the
| Spire, with a unique art style and innovative gameplay
| simply are not being made.
|
| Not to mention this childish nonsense about "forget they're
| playing a game," as if every game needs to be lifelike VR
| and there's no room for stylization or imagination. I am
| worried for the future that people think they want these
| things.
| idiotsecant wrote:
| Its a good thing. When the printing press was invented
| there were probably monks and scribes who thought that
| this new mechanical monster that took all the individual
| flourish out of reading was the end of literature.
| Instead it became a tool to make literature better and
| just removed a lot of drudgery. Games with individual
| style and design made by people will of course still
| exist. They'll just be easier to make.
| Workaccount2 wrote:
| The problem is quite the opposite: AI will be able to
| generate so many games with so many play styles that it
| will totally dilute the value of all games.
|
| Compare it to music gen algo's that can now produce music
| that is 100% indiscernible from generic crappy music.
| Which is insane given that 5 years ago it could maybe
| create the sound of something that maybe someone would
| describe as "sort of guitar-like". At this rate of
| progress it's probably not going to be long before AI is
| making better music than humans. And it's infinitely
| available too.
| troupo wrote:
| There are thousands of games that mimic each other, and
| only a handful of them are any good.
|
| What makes you think a mechanical "predict next frame based
| on existing games" will be any good?
| injidup wrote:
| Why games? I will train it on a year's worth of me attending
| Microsoft Teams meetings. Then I will go surfing.
| akie wrote:
| Ready to pay for this
| ccozan wrote:
| most underrated comment here!
| kqr wrote:
| Even if you spend 40 hours a week in video conferences,
| you'll have to work for over four years to get one year's
| worth of footage. Of course, by then the models will be
| even better and so you might actually have a chance of
| going surfing.
|
| I guess I should start hoarding video of myself now.
| kragen wrote:
| the neural net doesn't need a year of video to train to
| simulate your face; it can do that from a single photo.
| the year of video is to learn how to play the game, and
| in most cases lots of people are playing the same game,
| so you can dump all their video in the same training set
| qznc wrote:
| The Cloud Gaming platforms could record things for training
| data.
| modeless wrote:
| If you train it on multiple games then you could produce new
| games that have never existed before, in the same way image
| generation models can produce new images that have never
| existed before.
| lewhoo wrote:
| From what I understand that could make the engine much less
| stable. The key here is repetitiveness.
| jsheard wrote:
| It's unlikely that such a procedurally generated mashup would
| be perfectly coherent, stable and most importantly _fun_
| right out of the gate, so you would need some way to reach
| into the guts of the generated game and refine it. If
| properties as simple as "how much health this enemy type
| has" are scattered across an enormous inscrutable neural
| network, and may not even have a single consistent definition
| in all contexts, that's going to be quite a challenge.
| Nevermind if the game just catastrophically implodes and you
| have to "debug" the model.
| slashdave wrote:
| Well, yeah. Image diffusion models only work because you can
| provide large amounts of training data. For Doom it is even
| simpler, since you don't need to deal with compositing.
| attilakun wrote:
| If only there were a rich 3-dimensional physical environment we
| could draw training data from.
| passion__desire wrote:
| Maybe, in future, techniques of Scientific Machine Learning
| which can encode physics and other known laws into a model
| would form a base model. And then other models on top could
| just fine tune aspects to customise a game.
| ravetcofx wrote:
| There is going to be a flood of these dreamlike "games" in the
| next few years. This feels like a bit of a breakthrough in the
| engineering of these systems.
| wkcheng wrote:
| It's insane that this works, and that it works fast enough
| to render at 20 fps. It seems like they almost made a cross
| between a diffusion model and an RNN, since they had to encode
| the previous frames and actions and feed it into the model at
| each step.
|
| Abstractly, it's like the model is dreaming of a game that it
| played a lot of, and real time inputs just change the state of
| the dream. It makes me wonder if humans are just next moment
| prediction machines, with just a little bit more memory built in.
| Teever wrote:
| Also recursion and nested virtualization. We can dream about
| dreaming and imagine different scenarios, some completely
| fictional or simply possible future scenarios all while doing
| day to day stuff.
| lokimedes wrote:
| It makes good sense for humans to have this ability. If we flip
| the argument, and see the next frame as a hypothesis for what
| is expected as the outcome of the current frame, then comparing
| this "hypothesis" with what is sensed makes it easier to
| process the differences, rather than the totality of the
| sensory input.
|
| As Richard Dawkins recently put it in a podcast[1], our genes
| are great prediction machines, as their continued survival
| rests on it. Being able to generate a visual prediction fits
| perfectly with the amount of resources we dedicate to sight.
|
| If that is the case, what does aphantasia tell us?
|
| [1] https://podcasts.apple.com/dk/podcast/into-the-impossible-wi...
| quickestpoint wrote:
| "As Richard Dawkins theorized" would be more accurate and
| less LLM-like :)
| jonplackett wrote:
| What's the aphantasia link? I've got aphantasia. I'm
| convinced though that the bit of my brain that should be
| making images is used for letting me 'see' how things are
| connected together very easily in my head. Also I still love
| games like Pictionary and can somehow draw things onto paper
| that I don't really know what they look like in my head. It's
| often a surprise when pen meets paper.
| lokimedes wrote:
| I agree, it is my own experience as well. Craig Venter, in
| one of his books, also credits this way of representing
| knowledge as abstractions as his strength in inventing new
| concepts.
|
| The link may be that we actually see differences between
| "frames", rather than the frames directly. That in itself
| would imply that a from of sub-visual representation is
| being processed by our brain. For aphantasia, it could be
| that we work directly on this representation instead of
| recalling imagery through the visual system.
|
| Many people with aphantasia reports being able to visualize
| in their dreams, meaning that they don't lack the ability
| to generate visuals. So it may be that the brain has an
| affinity to rely on the abstract representation when
| "thinking", while dreaming still uses the "stable diffusion
| mode".
|
| I'm nowhere near qualified to speak of this with
| certainty, but it seems plausible to me.
| dbspin wrote:
| Worth noting that aphantasia doesn't necessarily extend to
| dreams. Anecdotally - I have pretty severe aphantasia (I can
| conjure millisecond glimpses of barely tangible imagery that I
| can't quite perceive before it's gone - but only since
| learning that visualisation wasn't a linguistic metaphor). I
| can't really simulate object rotation. I can't really
| 'picture' how things will look before they're drawn / built
| etc. However I often have highly vivid dream imagery. I also
| have excellent recognition of faces and places (e.g.: can't
| get lost in a new city). So there clearly is a lot of
| preconscious visualisation and image matching going on in
| some aphantasia cases, even where the explicit visual screen
| is all but absent.
| zimpenfish wrote:
| Pretty much the same for me. My aphantasia is total (no
| images at all) but still ludicrously vivid dreams and not
| too bad at recognising people and places.
| lokimedes wrote:
| I fabulate about this in another comment below:
|
| > Many people with aphantasia reports being able to
| visualize in their dreams, meaning that they don't lack the
| ability to generate visuals. So it may be that the
| [aphantasia] brain has an affinity to rely on the abstract
| representation when "thinking", while dreaming still uses
| the "stable diffusion mode".
|
| (I obviously don't know what I'm talking about, just a
| fellow aphant)
| dbspin wrote:
| Obviously we're all introspecting here - but my guess is
| that there's some kind of cross talk in aphantasic brains
| between the conscious narrating semantic brain and the
| visual module. Such that default mode visualisation is
| impaired. It's specifically the loss of reflexive
| consciousness that allows visuals to emerge. Not sure if
| this is related, but I have pretty severe chronic
| insomnia, and I often wonder if this in part relates to
| the inability to drift off into imagery.
| drowsspa wrote:
| Yeah. In my head it's like I'm manipulating SVG paths
| instead of raw pixels
| slashdave wrote:
| Image is 2D. Video is 3D. The mathematical extension is
| obvious. In this case, low resolution 2D (pixels), and the
| third dimension is just frame rate (discrete steps). So rather
| simple.
| Sharlin wrote:
| This is not "just" video, however. It's interactive in real
| time. Sure, you can say that playing is simply video with
| some extra parameters thrown in to encode player input, but
| still.
| slashdave wrote:
| It is just video. There are no external interactions.
|
| Heck, it is far simpler than video, because the point of
| view and frame is fixed.
| raincole wrote:
| ?
|
| I highly suggest you briefly read the paper before
| commenting on the topic. The whole point is that it's not
| just generating a video.
| slashdave wrote:
| I did. It is generating a video, using latent information
| on player actions during the process (which it also
| predicts). It is not interactive.
| SeanAnderson wrote:
| I think you're mistaken. The abstract says it's
| interactive, "We present GameNGen, the first game engine
| powered entirely by a neural model that enables real-time
| interaction"
|
| Further - "a diffusion model is trained to produce the
| next frame, conditioned on the sequence of past frames
| and actions." specifically "and actions"
|
| User input is being fed into this system and subsequent
| frames take that into account. The user is "actually"
| firing a gun.
| nopakos wrote:
| Maybe it's so advanced, it knows the players' next moves,
| so it is a video!
| slashdave wrote:
| I guess you are being sarcastic, except this is precisely
| what it is doing. And it's not hard: player movement is
| low information and probably not the hardest part of the
| model.
| smusamashah wrote:
| It's interactive, but can it go beyond what it learned
| from the videos? As in, can the camera break free and
| roam around the map from different angles? I don't think
| it will be able to do that at all. There are still a few
| hallucinations in this rendering; it doesn't look like it
| understands 3D.
| Sharlin wrote:
| You might be surprised. Generating views from novel
| angles based on a single image is nothing new, and if
| anything, this model has more than a single frame as
| input. I'd wager that it's quite able to extrapolate
| DOOM-like corridors and rooms even if it hasn't seen the
| exact place during training. And sure, it's imperfect but
| on the other hand _it works in real time_ on a single
| TPU.
| hypertele-Xii wrote:
| Then why do monsters become blurry smudgy messes when
| shot? That looks like a video compression artifact of a
| neural network attempting to replicate a low-structure
| image (the source material contains guts exploding, a very
| unstructured visual).
| Sharlin wrote:
| Uh, maybe because monster death animations make up a
| small part of the training material (ie. gameplay) so the
| model has not learned to reproduce them very well?
|
| There cannot be "video compression artifacts" because it
| hasn't even seen any compressed video during training, as
| far as I can see.
|
| Seriously, how is this even a discussion? The article is
| clear that the novel thing is that this is real-time
| frame generation conditioned on the previous frame(s)
| _AND player actions._ Just generating video would be
| nothing new.
| psb217 wrote:
| In a sense, poorly reproducing rare content is a form of
| compression artifact. Ie, since this content occurs
| rarely in the training set, it will have less impact on
| the gradients and thus less impact on the final form of
| the model. Roughly speaking, the model is allocating
| fewer bits to this content, by storing less information
| about this content in its parameters, compared to content
| which it sees more often during training. I think this
| isn't too different from certain aspects of images,
| videos, music, etc., being distorted in different ways
| based on how a particular codec allocates its available
| bits.
| slashdave wrote:
| No, I am not. The interaction is part of the training,
| and is used during inference, but it is not included
| during the process of generation.
| SeanAnderson wrote:
| Okay, I think you're right. My mistake. I read through
| the paper more closely and I found the abstract to be a
| bit misleading compared to the contents. Sorry.
| slashdave wrote:
| Don't worry. The paper is not very well written.
| psb217 wrote:
| Academic authors are consistently better at editing away
| unclear and ambiguous statements which make their work
| seem less impressive compared to ones which make their
| work seem more impressive. Maybe it's just a coincidence,
| lol.
| InDubioProRubio wrote:
| Video is also higher resolution, as the pixels flip for the
| high-resolution world as you move through it. Swivelling
| your head without glasses, even the blurry world contains
| more information in the curve of pixel change.
| slashdave wrote:
| Correct, for the sprites. However, the walls in Doom are
| texture mapped, and so have the same issue as videos.
| Interesting, though, because I assume the antialiasing is
| something approximate, given the extreme demands on CPUs of
| the era.
| stevenhuang wrote:
| > It makes me wonder if humans are just next moment prediction
| machines, with just a little bit more memory built in.
|
| Yup, see https://en.wikipedia.org/wiki/Predictive_coding
| quickestpoint wrote:
| Umm, that's a theory.
| mind-blight wrote:
| So are gravity and friction. I don't know how well tested
| or accepted it is, but being just a theory doesn't tell you
| much about how true it is without more info
| richard___ wrote:
| Did they take in the entire history as context?
| nsbk wrote:
| We are. At least that's what Lisa Feldman Barrett [1] thinks.
| It is worth listening to this Lex Fridman podcast:
| Counterintuitive Ideas About How the Brain Works [2], where she
| explains among other ideas how constant prediction is the most
| efficient way of running a brain as opposed to reaction. I
| never get tired of listening to her, she's such a great science
| communicator.
|
| [1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett
|
| [2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s
| PunchTornado wrote:
| Interesting talk about the brain, but the stuff she says
| about free will is not a very good argument. Basically it
| is sort of the argument that the ancient Greeks made, which
| brings the discussion to a point where you can take both
| directions.
| mensetmanusman wrote:
| Penrose (Nobel prize in physics) stipulates that quantum
| effects in the brain may allow a certain amount of time travel
| and back propagation to accomplish this.
| wrsh07 wrote:
| You don't need back propagation to learn
|
| This is an incredibly complex hypothesis that doesn't really
| seem justified by the evidence
| wrsh07 wrote:
| Makes me wonder when an update to the world models paper comes
| out where they drop in diffusion models:
| https://worldmodels.github.io/
| dartos wrote:
| > It makes me wonder if humans are just next moment prediction
| machines, with just a little bit more memory built in.
|
| This, to me, seems extremely reductionist. Like you start with
| AI and work backwards until you frame all cognition as next
| something predictors.
|
| It's just the stochastic parrot argument again.
| bangaladore wrote:
| > It's insane that that this works, and that it works fast
| enough to render at 20 fps.
|
| It is running on an entire v5 TPU
| (https://cloud.google.com/blog/products/ai-machine-learning/i...)
|
| It's unclear how that compares to a high-end consumer GPU like
| a 3090, but they seem to have similar INT8 TFLOPS. The TPU has
| less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.
|
| Something doesn't add up, in my opinion, though. SD usually
| takes (at minimum) seconds to produce a high-quality result on
| a 3090, so I can't comprehend how they are like two orders of
| magnitude faster--indicating that the TPU vastly outperforms a
| GPU for this task. They seem to be producing low-res (320x240)
| images, but it still seems too fast.
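|
| The step count probably closes most of that gap. IIRC the
| paper runs only 4 DDIM denoising steps (vs ~50 typical for
| SD 1.4) at 320x240 (vs 512x512), so per image, roughly:
|
|     typical_sd = 50 * 512 * 512     # denoising steps x pixels
|     gamengen = 4 * 320 * 240        # 4 steps at low res
|     print(typical_sd / gamengen)    # ~42.7x less denoising work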
| masterspy7 wrote:
| There's been a ton of work to generate assets for games using AI:
| 3d models, textures, code, etc. None of that may even be
| necessary with a generative game engine like this! If you could
| scale this up, train on all games in existence, etc. I bet some
| interesting things would happen
| rererereferred wrote:
| But can you grab what this AI has learned and generate the 3D
| models, maps and code to turn it into an actual game that can
| run on a user's PC? That would be amazing.
| passion__desire wrote:
| Jensen Huang's vision that future games will be generated
| rather than rendered is coming true.
| kleiba wrote:
| What would be the point? This model has been trained on an
| existing game, so turning it back into assets, maps, and code
| would just give you a copy of the original game you started
| with. I suppose you could create variations of it then...
| but:
|
| You don't even need to do all of that - this trained model
| _already is_ the game, i.e., it's interactive, you can play
| the game.
| whamlastxmas wrote:
| I would absolutely love if they could take this demo, add a new
| door that isn't in the original, and see what it generates
| behind that door
| refibrillator wrote:
| There is no text conditioning provided to the SD model because
| they removed it, but one can imagine a near future where text
| prompts are enough to create a fun new game!
|
| Yes they had to use RL to learn what DOOM looks like and how it
| works, but this doesn't necessarily pose a chicken vs egg
| problem. In the same way that LLMs can write a novel story,
| despite only being trained on existing text.
|
| IMO one of the biggest challenges with this approach will be open
| world games with essentially an infinite number of possible
| states. The paper mentions that they had trouble getting RL
| agents to completely explore every nook and corner of DOOM.
| Factorio or Dwarf Fortress probably won't be simulated anytime
| soon...I think.
| mlsu wrote:
| With enough computation, your neural net weights would converge
| to some very compressed latent representation of the source
| code of DOOM. Maybe smaller even than the source code itself?
| Someone in the field could probably correct me on that.
|
| At which point, you effectively would be interpolating in
| latent space through the source code to actually "render" the
| game. You'd have an entire latent space computer, with an
| engine, assets, textures, a software renderer.
|
| With a sufficiently powerful computer, one could imagine
| what interpolating in this latent space between, say,
| Factorio and TF2 (2 of my favorites) would look like. And
| tweaking this latent space to your liking by conditioning
| it on any number of gameplay aspects.
|
| This future comes very quickly for subsets of the pipeline,
| like the very end stage of rendering -- DLSS is already in
| production, for example. Maybe Nvidia's revenue wraps back to
| gaming once again, as we all become bolted into a neural
| metaverse.
|
| God I love that they chose DOOM.
| energy123 wrote:
| The source code lacks information required to render the
| game. Textures for example.
| TeMPOraL wrote:
| Obviously assets would get encoded too, in some form. Not
| necessarily corresponding to the original bitmaps, if the
| game does some consistent post-processing, the encoded
| thing would more likely be (equivalent to) the post-
| processed state.
| hoseja wrote:
| Finally, the AI superoptimizing compiler.
| mistercheph wrote:
| That's just an artifact of the language we use to describe
| an implementation detail, in the sense GP means it, the
| data payload bits are not essentially distinct from the
| executable instruction bits
| electrondood wrote:
| The Holographic Principle is the idea that our universe is a
| projection of a higher dimensional space, which sounds an
| awful lot like the total simulation of an interactive
| environment, encoded in the parameter space of a neural
| network.
|
| The first thing I thought when I saw this was: couldn't my
| immediate experience be exactly the same thing? Including the
| illusion of a separate main character to whom events are
| occurring?
| Jensson wrote:
| > With enough computation, your neural net weights would
| converge to some very compressed latent representation of the
| source code of DOOM. Maybe smaller even than the source code
| itself? Someone in the field could probably correct me on
| that.
|
| Neural nets are not guaranteed to converge to anything even
| remotely optimal, so no that isn't how it works. Also even
| though neural nets can approximate any function they usually
| can't do it in a time or space efficient manner, resulting in
| much larger programs than the human written code.
| mlsu wrote:
| Could is certainly a better word, yes. There is no
| guarantee that it will happen, only that it could. The
| existence of LLMs is proof of that; imagine how large and
| inefficient a handwritten computer program to generate the
| next token would be. On the flipside, human beings very
| effectively predicting the next token, and much more, on 5
| watts is proof that LLMs in their current form certainly are
| not the most efficient method for generating the next token.
|
| I don't really know why everyone is piling on me here.
| Sorry for a bit of fun speculating! This model is on the
| continuum. There _is_ a latent representation of Doom in
| weights: _some_ weights, not _these_ weights. Therefore
| _some_ representation of doom in a neural net _could_
| become more efficient over time. That's really the point
| I'm trying to make.
| godelski wrote:
| > With enough computation, your neural net weights would
| converge to some very compressed latent representation of the
| source code of DOOM.
|
| You and I have very different definitions of compression
|
| https://news.ycombinator.com/item?id=41377398
| > Someone in the field could probably correct me on that.
|
| ^__^
| _hark wrote:
| The raw capacity of the network doesn't tell you how
| complex the weights actually are. The capacity is only an
| upper bound on the complexity.
|
| It's easy to see this by noting that you can often prune
| networks quite a bit without any loss in performance. I.e.
| the effective dimension of the manifold the weights live on
| can be much, much smaller than the total capacity allows
| for. In fact, good regularization is exactly that which
| encourages the model itself to be compressible.
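|
| As a rough sketch of the pruning point, with PyTorch's built-in
| utilities (illustrative numbers, nothing from the paper):
|
|     # Zero out the 80% smallest-magnitude weights per layer, then
|     # measure how much of the raw capacity was actually used.
|     import torch
|     import torch.nn as nn
|     import torch.nn.utils.prune as prune
|
|     model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
|                           nn.Linear(512, 10))
|
|     for m in model:
|         if isinstance(m, nn.Linear):
|             prune.l1_unstructured(m, name="weight", amount=0.8)
|
|     linears = [m for m in model if isinstance(m, nn.Linear)]
|     total = sum(m.weight.nelement() for m in linears)
|     zeros = sum(int((m.weight == 0).sum()) for m in linears)
|     # ~80% sparsity, often achievable on trained nets with
|     # little accuracy loss
|     print(f"sparsity: {zeros / total:.1%}")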
| godelski wrote:
| I think you're confusing capacity with the training
| dynamics.
|
| Capacity is autological: the amount of information it can
| express.
|
| Training dynamics are the way the model learns, the
| optimization process, etc. So this is where things like
| regularization come into play.
|
| There's also architecture which affects the training
| dynamics as well as model capacity. Which makes no
| guarantee that you get the most information dense
| representation.
|
| Fwiw, the authors did also try distillation.
| basch wrote:
| Similarly, you could run a very very simple game engine, that
| outputs little more than a low resolution wireframe, and
| upscale it. Put all of the effort into game mechanics and none
| into visual quality.
|
| I would expect something in this realm to be a little better at
| not being visually inconsistent when you look away and look
| back. A red monster turning into a blue friendly etc.
| slashdave wrote:
| > where text prompts are enough to create a fun new game!
|
| Not really. This is a reproduction of the first level of Doom.
| Nothing original is being created.
| radarsat1 wrote:
| Most games are conditioned on text, it's just that we call it
| "source code" :).
|
| (Jk of course I know what you mean, but you can seriously see
| text prompts as compressed forms of programming that leverage
| the model's prior knowledge)
| troupo wrote:
| > one can imagine a near future where text prompts are enough
| to create a fun new game
|
| Sit down and write down a text prompt for a "fun new game". You
| can start with something relatively simple like a Mario-like
| platformer.
|
| By page 300, when you're about halfway through describing what
| you mean, you might understand why this is wishful thinking
| reverius42 wrote:
| If it can be trained on (many) existing games, then it might
| work similarly to how you don't need to describe every
| possible detail of a generated image in order to get
| something that looks like what you're asking for (and looks
| like a plausible image for the underspecified parts).
| troupo wrote:
| Things that might work plausible in a static image will not
| look plausible when things are moving, especially in the
| game.
|
| Also: https://news.ycombinator.com/item?id=41376722
|
| Also: define "fun" and "new" in a "simple text prompt".
| Current image generators suck at properly reflecting what
| you want exactly, because they regurgitate existing things
| and styles.
| SomewhatLikely wrote:
| Video games are gonna be wild in the near future. You could
| have one person talking to a model producing something that's
| on par with a AAA title from today. Imagine the 2d sidescroller
| boom on Steam but with immersive photorealistic 3d games with
| hyper-realistic physics (water flow, fire that spreads,
| tornados) and full deformability and buildability because the
| model is pretrained with real world videos. Your game is just a
| "style" that tweaks some priors on look, settings, and story.
| user432678 wrote:
| Sorry, no offence, but you sound like those EA execs wearing
| expensive suits and never played a single video game in their
| entire life. There's a great documentary on how Half Life was
| made. Gabe Newell was interviewed by someone asking "why you
| did that and this, it's not realistic", where he answered
| "because it's more fun this way, you want realism -- just go
| outside".
| magicalhippo wrote:
| This got me thinking. Anyone tried using SD or similar to
| create graphics for the old classic text adventure games?
| throwmeaway222 wrote:
| You know how when you're dreaming and you walk into a room at
| your house and you're suddenly naked at school?
|
| I'm convinced this is the code that gives Data (ST TNG) his
| dreaming capabilities.
| dean2432 wrote:
| So in the future we can play FPS games given any setting? Pog
| darrinm wrote:
| So... is it interactive? Playable? Or just generating a video of
| gameplay?
| vunderba wrote:
| From the article: _We present GameNGen, the first game engine
| powered entirely by a neural model that enables real-time
| interaction with a complex environment over long trajectories
| at high quality_.
|
| The demo is actual gameplay at ~20 FPS.
| darrinm wrote:
| It confused me that their stated evaluations by humans are
| comparing video clips rather than evaluating game play.
| furyofantares wrote:
| Short clips are the only way a human will make any errors
| determining which is which.
| darrinm wrote:
| More relevant is if by _playing_ it they couldn't tell
| which is which.
| Jensson wrote:
| They obviously can within seconds, so it wouldn't be a
| result. Being able to generate gameplay that looks right
| even if it doesn't play right is one step.
| kcaj wrote:
| Take a bunch of videos of the real world and calculate the
| differential camera motion with optical flow or feature tracking.
| Call this the video's control input. Now we can play SORA.
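|
| A rough sketch of that control-input extraction with OpenCV's
| dense optical flow (parameters are the stock Farneback example
| values):
|
|     # Reduce each frame transition to a mean flow vector and
|     # treat it as the "pan" action between frames.
|     import cv2
|
|     def control_inputs(video_path):
|         cap = cv2.VideoCapture(video_path)
|         ok, prev = cap.read()
|         prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
|         while True:
|             ok, frame = cap.read()
|             if not ok:
|                 break
|             gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
|             flow = cv2.calcOpticalFlowFarneback(
|                 prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
|             # mean pixel motion ~ differential camera motion
|             yield flow[..., 0].mean(), flow[..., 1].mean()
|             prev_gray = gray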
| piperswe wrote:
| This is honestly the most impressive ML project I've seen
| since... probably O.G. DALL-E? Feels like a gem in a sea of AI
| shit.
| bufferoverflow wrote:
| That's probably how our reality is rendered.
| arduinomancer wrote:
| How does the model "remember" the whole state of the world?
|
| Like if I kill an enemy in some room and walk all the way across
| the map and come back, would the body still be there?
| a_e_k wrote:
| Watch closely in the videos and you'll see that enemies often
| respawn when offscreen and sometimes when onscreen. Destroyed
| barrels come back, ammo count and health fluctuate weirdly,
| etc. It's still impressive, but it's not perfect in that regard.
| Sharlin wrote:
| Not unlike in (human) dreams.
| raincole wrote:
| It doesn't. You need to put the world state in the input (the
| "prompt", even it doesn't look like prompt in this case).
| Whatever not in the prompt is lost.
| Jensson wrote:
| It doesn't even remember the state of the game you look at.
| Doors spawning right in front of you, particle effects turning
| into enemies mid flight etc, so just regular gen AI issues.
|
| Edit: Can see this in the first 10 seconds of the first video
| under "Full Gameplay Videos", stairs turning to corridor
| turning to closed door for no reason without looking away.
| csmattryder wrote:
| There's also the case in the video (0:59) where the player
| jumps into the poison but doesn't take damage for a few
| seconds then takes two doses back-to-back - they should've
| taken a hit of damage every ~500-1000ms(?)
|
| Guessing the model hasn't been taught enough about that,
| because most people don't jump into hazards.
| broast wrote:
| Maybe one day this will be how operating systems work.
| misterflibble wrote:
| Don't give them ideas lol terrifying stuff if that happens!
| dysoco wrote:
| Ah finally we are starting to see something gaming related. I'm
| curious as to why we haven't seen more of neural networks applied
| to games even in a completely experimental fashion; we used to
| have a lot of little experimental indie games such as Facade
| (2005) and I'm surprised we don't have something similar years
| after the advent of LLMs.
|
| We could have mods for old games that generate voices for the
| characters for example. Maybe it's unfeasible from a computing
| perspective? There are people running local LLMs, no?
| raincole wrote:
| > We could have mods for old games that generate voices for the
| characters for example
|
| You mean in real time? Or just in general?
|
| There are _a lot_ of mods that use AI-generated voices. I'd
| say it's the norm in the modding community now.
| sitkack wrote:
| What most programmers don't understand is that in the very near
| future, the entire application will be delivered by an AI model,
| no source, no text, just connect to the app over RDP. The whole
| app will be created by example, the app developer will train the
| app like a dog trainer trains a dog.
| Grimblewald wrote:
| That might work for some applications, especially recreational
| things, but I think we're a while away from it doing away with
| all things, especially where deterministic behavior, efficiency,
| or reliability are important.
| sitkack wrote:
| Problems for two papers down the line.
| ukuina wrote:
| So... https://websim.ai except over pixels instead of in your
| browser?
| sitkack wrote:
| Yes, and that is super neat.
| Jonovono wrote:
| I think it's possible AI models will generate dynamic UI for
| each client and stream the UI to clients (maybe eventually
| client devices will generate their UI on the fly) similar to
| Google Stadia. Maybe some offset of video that allows the
| remote to control it. Maybe Wasm based - just stream wasm
| bytecode around? The guy behind VLC is building a library for
| ultra low latency: https://www.kyber.video/techology.
|
| I was playing around with the idea in this:
| https://github.com/StreamUI/StreamUI. Thinking is take the
| ideas of Elixir LiveView to the extreme.
| sitkack wrote:
| I am so glad you posted, this is super cool!
|
| I too have been thinking about how to push dynamic wasm to
| the client for super low latency UIs.
|
| LiveView is just the beginning. Your readme is dreamy. I'll
| dive into your project at the end of Sept when I get back
| into deep tech.
| mo_42 wrote:
| An implementation of the game engine in the model itself is
| theoretically the most accurate solution for predicting the next
| frame.
|
| I'm wondering when people will apply this to other areas like the
| real world. Would it learn the game engine of the universe (ie
| physics)?
| radarsat1 wrote:
| There has definitely been research for simulating physics based
| on observation, especially in fluid dynamics but also for rigid
| body motion and collision. It's important for robotics
| applications actually. You can bet people will be applying this
| technique in those contexts.
|
| I think for real world application one challenge is going to be
| the "action" signal which is a necessary component of the
| conditioning signal that makes the simulation reactive. In
| video games you can just record the buttons, but for real world
| scenarios you need difficult and intrusive sensor setups for
| recording force signals.
|
| (Again for robotics though maybe it's enough to record the
| motor commands, just that you can't easily record the "motor
| commands" for humans, for example)
| cubefox wrote:
| A popular theory in neuroscience is that this is what the brain
| does:
|
| https://slatestarcodex.com/2017/09/05/book-review-surfing-un...
|
| It's called predictive coding. By trying to predict sensory
| stimuli, the brain creates a simplified model of the world,
| including common sense physics. Yann LeCun says that this is a
| major key to AGI. Another one is effective planning.
|
| But while current predictive models (autoregressive LLMs) work
| well on text, they don't work well on video data, because of
| the large outcome space. In an LLM, text prediction boils down
| to a probability distribution over a few thousand possible next
| tokens, while there are several orders of magnitude more
| possible "next frames" in a video. Diffusion models work better
| on video data, but they are not inherently predictive like
| causal LLMs. Apparently this new Doom model made some progress
| on that front though.
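|
| Rough numbers, to make the outcome-space gap concrete
| (illustrative sizes, not from any paper):
|
|     import math
|
|     vocab = 50_000                    # typical LLM vocabulary
|     bits_per_token = math.log2(vocab) # ~15.6 bits per prediction
|
|     w, h = 320, 240                   # a small video frame
|     bits_per_frame = w * h * 3 * 8    # ~1.8 million bits per frame
|
|     print(f"{bits_per_token:.1f} vs {bits_per_frame:,} bits")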
| ccozan wrote:
| However, this is due to how we actually digitize video. From a
| human point of view, looking at my room reduces the load to
| the _objects_ in the room, and everything else is just noise
| (the color of the wall could be just a single item to
| remember, while otherwise in the digital world, it needs to
| remember all the pixels).
| helloplanets wrote:
| So, any given sequence of inputs is rebuilt into a corresponding
| image, twenty times per second. I wonder how separate the game
| logic and the generated graphics are in the fully trained model.
|
| Given a sufficient separation between these two, couldn't
| you basically boil the game/input logic down to an abstract game
| template? Meaning, you could just output a hash that corresponds
| to a specific combination of inputs, and then treat the resulting
| mapping as a representation of a specific game's inner workings.
|
| To make it less abstract, you could save some small enough
| snapshot of the game engine's state for all given input
| sequences. This could make it much less dependent to what's
| recorded off of the agents' screens. And you could map the
| objects that appear in the saved states to graphics, in a
| separate step.
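|
| As a toy sketch of that mapping (all names hypothetical):
|
|     # Key small engine-state snapshots by the input sequence that
|     # produced them; graphics get mapped on in a separate step.
|     import hashlib, json
|
|     state_by_inputs = {}
|
|     def key_for(input_seq):
|         return hashlib.sha256(
|             json.dumps(input_seq).encode()).hexdigest()
|
|     def record(input_seq, engine_state):
|         state_by_inputs[key_for(input_seq)] = engine_state
|
|     def lookup(input_seq):
|         return state_by_inputs.get(key_for(input_seq))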
|
| I imagine this whole system would work especially well for games
| that only update when player input is given: Games like Myst,
| Sokoban, etc.
| toppy wrote:
| I think you've just encoded the title of the paper
| richard___ wrote:
| Uhhh... demos would be more convincing with enemies and
| decreasing health
| Kiro wrote:
| I see enemies and decreasing health on hit. But even if it
| lacked those, it seems like a pretty irrelevant nitpick that is
| completely underplaying what we're seeing here. The fact that
| this is even possible at all feels like science fiction.
| troupo wrote:
| Key: "predicts next frame, recreates classic Doom". A game that
| was analyzed and documented to death. And the training included
| uncountable runs of Doom.
|
| A game engine lets you create a _new_ game, not predict the next
| frame of an existing and copiously documented one.
|
| This is not a game engine.
|
| Creating a new _good_ game? Good luck with that.
| nolist_policy wrote:
| Makes me wonder... If you stand still in front of a door so all
| past observations only contain that door, will the model teleport
| you to another level when opening the door?
| zbendefy wrote:
| I think some state is also being given (or if it's not, it could
| be given) to the network, like 3d world position/orientation of
| the player, that could help the neural network anchor the
| player in the world.
| lukol wrote:
| I believe future game engines will be state machines with
| deterministic algorithms that can be reproduced at any time.
| However, rendering said state into visual / auditory / etc.
| experiences will be taken over by AI models.
|
| This will also allow players to easily customize what they
| experience without changing the core game loop.
| jamilton wrote:
| I wonder if the MineRL
| (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io)
| dataset would be sufficient to reproduce this work with
| Minecraft.
|
| Any other similar existing datasets?
|
| A really goofy way I can think of to get a bunch of data would be
| to get videos from youtube and try to detect keyboard sounds to
| determine what keys they're pressing.
| jamilton wrote:
| Although ideally a follow up work would be something where
| there won't be any potential legal trouble with releasing the
| complete model so people can play it.
|
| A similar approach but with a game where the exact input is
| obvious and unambiguous from the graphics alone so that you can
| use unannotated data might work. You'd just have to create a
| model to create the action annotations. I'm not sure what the
| point would be, but it sounds like it'd be interesting.
| qnleigh wrote:
| Could a similar scheme be used to drastically improve the visual
| quality of a video game? You would train the model on gameplay
| rendered at low and high quality (say with and without ray
| tracing, and with low and high density meshing), and try to get
| it to convert a quick render into something photorealistic on the
| fly.
|
| When things like DALL-E first came out, I was expecting something
| like the above to make it into mainstream games within a few
| years. But that was either too optimistic or I'm not up to speed
| on this sort of thing.
| agys wrote:
| Isn't that what Nvidia's Ray Reconstruction and DLSS (frame
| generation and upscaler) are doing, more or less?
| qnleigh wrote:
| At a high level I guess so. I don't know enough about Ray
| Reconstruction (though the results are impressive), but I was
| thinking of something more drastic than DLSS. Diffusion
| models on static images can turn a cartoon into a
| photorealistic image. Doing something similar for a game,
| where a low-quality render is turned into something that
| would otherwise take seconds to render, seems qualitatively
| quite different from DLSS. In principle a model could fill in
| huge amounts of detail, like increasing the number of
| particles in a particle-based effect, adding shading/lighting
| effects...
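|
| Offline, something in this spirit already works with off-the-
| shelf img2img (the model id and settings below are just
| examples, and this is nowhere near real-time):
|
|     # Restyle a cheap render with a diffusion img2img pass.
|     import torch
|     from diffusers import StableDiffusionImg2ImgPipeline
|     from PIL import Image
|
|     pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
|         "runwayml/stable-diffusion-v1-5",
|         torch_dtype=torch.float16).to("cuda")
|
|     frame = Image.open("lowpoly_frame.png").convert("RGB")
|     out = pipe(prompt="photorealistic render of the same scene",
|                image=frame, strength=0.4,
|                guidance_scale=7.5).images[0]
|     out.save("restyled_frame.png")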
| lIl-IIIl wrote:
| How does it know how many times it needs to shoot the zombie
| before it dies?
|
| Most enemies have enough hit points to survive the first shot. If
| the model is only trained on the previous frame, it doesn't know
| how many times the enemy was already shot at.
|
| From the video it seems like it is probability based - they may
| die right away or it might take way longer than it should.
|
| I love how the player's health goes down when he stands in the
| radioactive green water.
|
| In Doom the enemies fight with each other if they accidentally
| incur "friendly fire". It would be interesting to see it play out
| in this version.
| golol wrote:
| It gets a number of previous _frames_ as input, I think.
| meheleventyone wrote:
| > I love how the player's health goes down when he stands in
| the radioactive green water.
|
| This is one of the bits that was weird to me, it doesn't work
| correctly. In the real game you take damage at a consistent
| rate, in the video the player doesn't and whether the player
| takes damage or not seems highly dependent on some factor that
| isn't whether or not the player is in the radioactive slime. My
| thought is that it's learnt something else that correlates
| poorly.
| lupusreal wrote:
| > _In Doom the enemies fight with each other if they
| accidentally incur "friendly fire". It would be interesting to
| see it play out in this version._
|
| They trained this thing on bot gameplay, so I bet it does
| poorly when advanced strategies like deliberately inducing mob
| infighting are employed (the bots probably didn't do that a
| lot, if at all.)
| golol wrote:
| What I understand is the following: if this works so well, why
| didn't we have good video generation much earlier? After
| diffusion models were seen to work, the most obvious thing to do
| was to generate the next frame based on previous frames, but...
| it took 1-2 years for good video models to appear. For example,
| compare Sora generating Minecraft video versus this method
| generating Minecraft video. Say in both cases the player is
| standing in a meadow with few inputs and watching some pigs. In
| the Sora video you'd expect the typical glitches to appear, like
| erratic, sliding movement, overlapping legs, multiplication of
| pigs etc. Would these glitches not appear in the GameNGen video?
| Why?
| Closi wrote:
| Because video is much more difficult than images (it's lots of
| images that have to be consistent across time, with motion
| following laws of physics etc), and this is much more limited
| in terms of scope than pure arbitrary video generation.
| golol wrote:
| This misses the point, I'm comparing two methods of
| generating minecraft videos.
| soulofmischief wrote:
| By simplifying the problem, we are better able to focus on
| researching specific aspects of generation. In this case,
| they synthetically created a large, highly domain-specific
| training set and then used this to train a diffusion model
| which encodes input parameters instead of text.
|
| Sora was trained on a much more diverse dataset, and so has
| to learn more general solutions in order to maintain
| consistency, which is harder. The low resolution and
| simple, highly repetitive textures of doom definitely help
| as well.
|
| In general, this is just an easier problem to approach
| because of the more focused constraints. It's also worth
| mentioning that noise was added during the process in order
| to make the model robust to small perturbations.
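|
| The noise trick looks roughly like this during training (shapes
| and names are illustrative, not the paper's code):
|
|     # Corrupt the context frames the model conditions on, so at
|     # inference it learns to correct its own small errors rather
|     # than amplify them.
|     import torch
|
|     def corrupt_context(ctx, max_sigma=0.7):
|         # ctx: (batch, frames, channels, height, width) latents
|         sigma = torch.rand(ctx.shape[0], 1, 1, 1, 1) * max_sigma
|         noisy = ctx + sigma * torch.randn_like(ctx)
|         return noisy, sigma  # sigma can be fed in as conditioning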
| pantalaimon wrote:
| I would have thought it is much easier to generate huge amounts
| of game footage for training, but as I understand this is not
| what was done here.
| golol wrote:
| Certain categories of youtube videos can also be viewed as some
| sort of game where the actions are the audio/transcript advanced
| a couple of seconds. Add two eggs. Fetch the ball. I'm walking in
| the park.
| thegabriele wrote:
| Wow, I bet Boston Dynamics and such are quite interested
| jumploops wrote:
| This seems similar to how we use LLMs to generate code: generate,
| run, fix, generate.
|
| Instead of working through a game, it's building generic UI
| components and using common abstractions.
| HellDunkel wrote:
| Although impressive, I must disagree. Diffusion models are not
| game engines. A game engine is a component to propel your game
| (along the time axis?). In that sense it is similar to the engine
| of a car, hence the name. It does not need a single working car
| nor a road to drive on to do its job. The above is a dynamic,
| interactive replication of what happens when you put a car on a
| given road, requiring a million test drives with working
| vehicles. An engine would also work offroad.
| MasterScrat wrote:
| Interesting point.
|
| In a way this is a "simulated game engine", trained from actual
| game engine data. But I would argue a working simulated game
| engine becomes a game engine of its own, as it is then able to
| "propell the game" as you say. The way it achieves this becomes
| irrelevant, in one case the content was crafted by humans, in
| the other case it mimics existing game content, the player
| really doesn't care!
|
| > An engine would also work offroad.
|
| Here you could imagine that such a "generative game engine"
| could _also_ go offroad, extrapolating what would happen if you
| go to unseen places. I'd even say the extrapolation capabilities
| of such a model could be better than a traditional game engine's,
| as it can make things up as it goes, while if you accidentally
| cross a wall in a typical game engine the screen goes blank.
| jsheard wrote:
| > Here you could imagine that such a "generative game engine"
| could also go offroad, extrapolating what would happen if you
| go to unseen places.
|
| They easily could have demonstrated this by seeding the model
| with images of Doom maps which weren't in the training set,
| but they chose not to. I'm sure they tried it and the results
| just weren't good, probably morphing the map into one of the
| ones it was trained on at the first opportunity.
| HellDunkel wrote:
| The game Doom is more than a game engine, isn't it? I'd be
| okay with calling the above a "simulated game" or a "game".
| My point is: let's not conflate the idea of a "game engine",
| which is a construct of intellectual concepts put together to
| create a simulation of "things happening in time" and
| deriving output (audio and visual). The engine is fed with
| input and data (levels and other assets) and then
| drives(EDIT) a "game".
|
| Training the model with a final game will never give you an
| engine. Maybe a "simulated game" or even a "game", but
| certainly not an "engine". The latter would mean the model
| would be capable of deriving and extracting the technical and
| intellectual concepts and applying them elsewhere.
| icoder wrote:
| This is impressive. But at the same time, it can't count. We see
| this every time, and I understand why it happens, but it is still
| intriguing. We are so close or in some ways even way beyond, and
| yet at the same time so extremely far away, from 'our'
| intelligence.
|
| (I say it can't count because there are numerous examples where
| the bullet count glitches, it goes right impressively often, but
| still, counting, being up or down, is something computers have
| been able to do flawlessly basically since forever)
|
| (It is the same with chess, where the LLM models are becoming
| really good, yet sometimes make mistakes that even my 8yo niece
| would not make)
| marci wrote:
| 'Our' intelligence may not be the best thing we can make. It
| would be like trying to only make planes that flap wings or
| trucks with legs. A bit like using an LLM to do multiplication:
| not the best tool. Biomimicry is great for inspiration, but
| shouldn't be a 1-to-1 copy, especially at a different scale and
| medium.
| icoder wrote:
| Sure, although I still think a system with less of a contrast
| between how well it performs 'modally' and how badly it
| performs incidentally would be more practical.
|
| What I wonder is whether LLMs will inherently always have
| this dichotomy and we need something 'extra' (reasoning,
| attention or something less biomimicked), or whether this
| will eventually resolve itself (to an acceptable extent)
| when they improve even further.
| panki27 wrote:
| > Human raters are only slightly better than random chance at
| distinguishing short clips of the game from clips of the
| simulation.
|
| I can hardly believe this claim, anyone who has played some
| amount of DOOM before should notice the viewport and textures not
| "feeling right", or the usually static objects moving slightly.
| meheleventyone wrote:
| It's telling IMO that they only want people's opinions based on
| our notoriously faulty memories rather than sitting comparable
| situations next to one another in the game and simulation then
| analyzing them. Several things jump out watching the example
| video.
| GaggiX wrote:
| >rather than sitting comparable situations next to one
| another in the game and simulation then analyzing them.
|
| That's literally how the human rating was set up if you read
| the paper.
| meheleventyone wrote:
| I think you misunderstand me. I don't mean a snap
| evaluation and deciding between two very-short competing
| videos which is what the participants were doing. I mean
| doing an actual analysis of how well the simulation matches
| the ground truth of the game.
|
| What I'd posit is that it's not actually a very good
| replication of the game, but very good at replicating short
| clips that almost look like the game, and the short time
| horizons are deliberately chosen because the authors know
| the model lacks coherence beyond that.
| GaggiX wrote:
| >I mean doing an actual analysis of how well the
| simulation matches the ground truth of the game.
|
| Do you mean the PSNR and LPIPS metrics used in paper?
| meheleventyone wrote:
| No, I think I've been pretty clear that I'm interested in
| how mechanically sound the simulation is. Also those
| measures are over an even shorter duration so even less
| relevant to how coherent it is at real game scales.
| GaggiX wrote:
| How should this be concretely evaluated and measured? A
| vibe check?
| meheleventyone wrote:
| I think the study's evaluation using very short video and
| humans is much more of a vibe check than what I've
| suggested.
|
| Off the top of my head DOOM is open source so it should
| be reasonable to set up repeatable scenarios and use some
| frames from the game to create a starting scenario for
| the simulation that is the same. Then the input from the
| player of the game could be used to drive the simulated
| version. You could go further and instrument events
| occurring in the game for direct comparison to the
| simulation. I'd be interested in setting a baseline for
| playtime of the level in question and using sessions of
| around that length as an ultimate test.
|
| There are some obvious mechanical deficiencies seen in
| the videos they've published. One that really stood out
| to me was the damage taken when in the radioactive slime.
| So I don't think the analysis would need to be particularly
| deep to find differences.
| arc-in-space wrote:
| This, watching the generated clips feels uncomfortable, like a
| nightmare. Geometry is "swimming" with camera movement, objects
| randomly appear and disappear, damage is inconsistent.
|
| The entire thing would probably crash and burn if you did
| something just slightly unusual compared to the training data,
| too. People talking about 'generated' games often seem to
| fantasize about an AI that will make up new outcomes for
| players that go off the beaten path, but a large part of the
| fun of real games is figuring out what you can do within the
| predetermined constraints set by the game's code. (Pen-and-
| paper RPGs are highly open-ended, but even a Game Master needs
| to sometimes protect the players from themselves; whereas the
| current generation of AI is famously incapable of saying no.)
| aithrowaway1987 wrote:
| I also noticed that they played AI DOOM very slowly: in an
| actual game you are running around like a madman, but in the
| video clips the player is moving in a very careful, halting
| manner. In particular the player only moves in straight lines
| or turns while stationary, they almost never turn while
| running. Also didn't see much _strafing._
|
| I suspect there is a reason for this: running while turning
| doesn't work properly and makes it very obvious that the system
| doesn't have a consistent internal 3D view of the world. I'm
| already getting motion sickness from the inconsistencies in
| straight-line movement, I can't imagine turning is any better.
| freestyle24147 wrote:
| It made me laugh. Maybe they pulled random people from the
| hallway who had never seen the original Doom (or any FPS), or
| maybe only selected people who wore glasses and forgot them at
| their desk.
| holoduke wrote:
| I saw a video a while ago where they recreated actual doom
| footage with a diffusion technique so it looked like a jungle or
| anything you liked. Can't find it anymore, but it looked impressive.
| godelski wrote:
| Doom system requirements:
|     - 4 MB RAM
|     - 12 MB disk space
|
| Stable Diffusion v1:
|     - 860M UNet + CLIP ViT-L/14 (540M)
|     - Checkpoint size: 4.27 GB (7.7 GB full EMA)
|
| Running on a TPU-v5e:
|     - Peak compute per chip (bf16): 197 TFLOPs
|     - Peak compute per chip (Int8): 393 TFLOPs
|     - HBM2 capacity and bandwidth: 16 GB, 819 GBps
|     - Interchip Interconnect BW: 1600 Gbps
|
| This is quite impressive, especially considering the speed. But
| there's still a ton of room for improvement. It seems it didn't
| even memorize the game despite having the capacity to do so
| hundreds of times over. So we definitely have lots of room for
| optimization methods. Though who knows how such things would
| affect existing tech since the goal here is to memorize.
|
| What's also interesting about this work is it's basically saying
| you can rip a game if you're willing to "play" (automate) it
| enough times and spend a lot more on storage and compute. I'm
| curious what the comparison in cost and time would be if you
| hired an engineer to reverse engineer Doom (how much prior
| knowledge do they get, considering pretrained models and the
| ViZDoom environment? Was Doom source code in T5? And which ViT
| checkpoint was used? I can't keep track of Google ViT
| checkpoints).
|
| I would love to see the checkpoint of this model. I think people
| would find some really interesting stuff taking it apart.
|
| - https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...
|
| - https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...
|
| - https://cloud.google.com/tpu/docs/v5e
|
| - https://github.com/Farama-Foundation/ViZDoom
|
| - https://zdoom.org/index
| snickmy wrote:
| Those are valid points, but irrelevant for the context of this
| research.
|
| Yes, the computational cost is ridiculous compared to the
| original game, and yes, it lacks basic things like pre-
| computing, storing, etc. That said, you could assume that all
| that can be either done at the margin of this discovery OR over
| time will naturally improve OR will become less important as a
| blocker.
|
| The fact that you can model a sequence of frames with such
| contextual awareness without explicitly having to encode it is
| the real breakthrough here, both from a pure gaming standpoint
| and for simulation in general.
| godelski wrote:
| I'm not sure what you're saying is irrelevant.
|
| 1) the model has enough memory to store not only all game
| assets and engine but even hundreds of "plays".
|
| 2) me mentioning that there's still a lot of room to make
| these things better (seems you think so too so maybe not this
| one?)
|
| 3) an interesting point I was wondering to compare current
| state of things (I mean I'll give you this but it's just a
| random thought and I'm not reviewing this paper in an
| academic setting. This is HN, not NeurIPS. I'm just curious
| ¯\_(ツ)_/¯)
|
| 4) the point that you can rip a game
|
| I'm really not sure what you're contesting, because I said
| several things.
|
| > it lacks basic things like pre-computing, storing, etc.
|
| It does? Last I checked neural nets store information. I
| guess I need to return my PhD, because last I checked there's
| a UNet in SD 1.4 and that contains a decoder.
| snickmy wrote:
| Sorry, probably didn't explain myself well enough
|
| 1) Yes, you are correct. The point I was making is that, in
| the context of the discovery/research, that's outside the
| scope, and 'easier' to do, as it has been done in other
| verticals (i.e. e2e self-driving).
|
| 2) Yep, aligned here.
|
| 3) I'm not fully following here, but agree this is not
| NeurIPS, and no Schmidhuber bickering.
|
| 4) The network does store information, it just doesn't
| store gameplay information. That could be forced, but as
| per point 1 it is, and I think rightly so, beyond the
| scope of this research.
| godelski wrote:
| 1) I'm not sure this is outside scope. It's also not
| something I'd use to reject a paper were I to review this
| in a conference. I mean, you've got to start somewhere, and
| unlike reviewer 2 I don't think any criticism is a
| rejection criterion. That'd be silly, since there are no
| globally optimal solutions. But I'm also unconvinced this
| is proven by self-driving vehicles, though I'm also not an
| RL expert.
|
| 3) It's always hard to evaluate. I was thinking about
| ripping the game, and so a reasonable metric is a
| comparison with a human's ability to perform the task. Of
| course I'm A LOT faster than my dishwasher at cleaning
| dishes but I'm not occupied while it is going, so it
| still has high utility. (Someone tell reviewer 2 lol)
|
| 4) Why should we believe that it doesn't store gameplay?
| The model was fed "user" inputs and frames. So it has
| this information and this information appears useful for
| learning the task.
| pickledoyster wrote:
| >you could assume that all that can be either done at the
| margin of this discovery OR over time will naturally improve
| OR will become less important as a blocker.
|
| OR one can hope it will be thrown to the heap of nonviable
| tech with the rest of spam waste
| tobr wrote:
| I suppose it also doesn't really matter what kinds of
| resources the game originally requires. The diffusion model
| isn't going to require twice as much memory just because the
| game does. Presumably you wouldn't even necessarily need to
| be able to render the original game in real time - I would
| imagine the basic technique would work even if you used a
| state-of-the-art Hollywood-quality offline renderer to render
| each input frame, and that the performance of the diffusion
| model would be similar?
| godelski wrote:
| Well the majority of ML systems are compression machines
| (entropy minimizers), so ideally you'd want to see if you
| can learn the assets and game mechanics through play alone
| (what this paper shows). Better would be to do so more
| efficiently than the devs themselves, finding better
| compression. Certainly the game is not perfectly optimized.
| But still, this is a step in that direction. I mean no one
| has accomplished this before so even with a model with far
| higher capacity it's progress. (I think people are
| interpreting my comment as dismissive. I'm critiquing but
| the key point I was making was about how there's likely
| better architectures, training methods, and all sorts of
| stuff to still research. Personally I'm glad there's still
| more to research. That's the fun part)
| danielmarkbruce wrote:
| Is it a breakthrough? Weather models are miles ahead of this
| as far as I can tell.
| dTal wrote:
| >What's also interesting about this work is it's basically
| saying you can rip a game if you're willing to "play"
| (automate) it enough times and spend a lot more on storage and
| compute
|
| That's the least of it. It means you can _generate_ a game from
| real footage. Want a perfect flight sim? Put a GoPro in the
| cockpit of every airliner for a year.
| isaacfung wrote:
| The possibility seems far beyond gaming(given enough
| computation resources).
|
| You can feed it videos of usage of any software, or real
| world footage recorded by a GoPro mounted on your shoulder
| (with body motion measured by some sensors, though the
| action space would be much larger).
|
| Such a "game engine" can potentially be used as a simulation
| gym environment to train RL agents.
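|
| Hand-wavy sketch of the gym idea (world_model and its predict
| call are stand-ins for a GameNGen-style frame predictor):
|
|     import gymnasium as gym
|     import numpy as np
|
|     class WorldModelEnv(gym.Env):
|         def __init__(self, world_model, context_frames):
|             self.model = world_model
|             self.context = list(context_frames)  # last N frames
|             self.action_space = gym.spaces.Discrete(8)
|             self.observation_space = gym.spaces.Box(
|                 0, 255, (240, 320, 3), np.uint8)
|
|         def reset(self, *, seed=None, options=None):
|             super().reset(seed=seed)
|             return self.context[-1], {}
|
|         def step(self, action):
|             frame = self.model.predict(self.context, action)
|             self.context = self.context[1:] + [frame]
|             # a real setup still needs a learned reward signal
|             return frame, 0.0, False, False, {}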
| camtarn wrote:
| Plus, presumably, either training it on pilot inputs (and
| being able to map those to joystick inputs and mouse clicks)
| or having the user have an identical fake cockpit to play in
| and a camera to pick up their movements.
|
| And, unless you wanted a simulator that only allowed
| perfectly normal flight, you'd have to have those airliners
| go through every possible situation that you wanted to
| reproduce: warnings, malfunctions, emergencies, pilots
| pushing the airliner out of its normal flight envelope, etc.
| phh wrote:
| > Want a perfect flight sim? Put a GoPro in the cockpit of
| every airliner for a year.
|
| I guess that's the occasion to remind everyone that ML is
| splendid at interpolating, but at extrapolating, don't keep
| your hopes too high.
|
| Namely, to have a "perfect flight sim" using GoPros, you'll
| need to record hundreds of stalls and crashes.
| amelius wrote:
| Yes, and you can use an LLM to simulate role playing games.
| amunozo wrote:
| This is amazing and an interesting discovery. It is a pity that I
| don't find it capable of creating anything new.
| itomato wrote:
| The gibs are a dead giveaway
| nuz wrote:
| I wonder how overfit it is though. You could fit a lot of doom
| resolution jpeg frames into 4gb (the size of SD1.4)
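|
| Back-of-envelope (illustrative numbers):
|
|     frame_kb = 20                  # rough JPEG size at 320x200
|     budget_kb = 4 * 1024**2        # ~4 GB of SD 1.4 weights
|     frames = budget_kb / frame_kb  # ~210,000 frames
|     print(frames / 20 / 3600)      # ~2.9 hours at 20 FPS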
| ciroduran wrote:
| Congrats on running Doom on a Diffusion Model :D
|
| I was really entranced by how combat is rendered (the grunt doing
| weird stuff in very much the style that the model generates
| images). Now I'd like to see this implemented in a shader in a
| game.
| LtdJorge wrote:
| So is it taking inputs from a player and simulating the gameplay
| or is it just simulating everything (effectively, a generated
| video)?
| smusamashah wrote:
| Has this model actually learned the 3d space of the game? Is it
| possible to break the camera free and roam around the map freely
| and view it from different angles?
|
| I noticed a few hallucinations, e.g. when it picked a green
| jacket from a corner; walking back, it generated another corner.
| Therefore I don't think it has any clue about the 3D world of the
| game at all.
| kqr wrote:
| > Is it possible to break the camera free and roam around the
| map freely and view it from different angles?
|
| I would assume only if the training data contained this type of
| imagery, which it did not. The training data (from what I
| understand) consisted only of input+video of actual gameplay,
| so that is what the model is trained to mimic.
|
| This is like a dog that has been trained to form English words
| - what's impressive is not that it does it well, but that it
| does it at all.
| Sohcahtoa82 wrote:
| > Therefore I don't think it has any clue about the 3D world of
| the game at all.
|
| AI models don't "know" things at all.
|
| At best, they're just very fuzzy predictors. In this case,
| given the last couple frames of video and a user input, it
| predicts the next frame.
|
| It has zero knowledge of the game world, game rules,
| interactions, etc. It's merely a mapping of [pixels, input] ->
| pixels.
| kqr wrote:
| I have been kind of "meh" about the recent AI hype, but this is
| seriously impressive.
|
| Of course, we're clearly looking at complete nonsense generated
| by something that does not understand what it is doing - yet, it
| is astonishingly sensible nonsense given the type of information
| it is working from. I had no idea the state of the art was
| capable of this.
| acoye wrote:
| Nvidia CEO reckons your GPU will be replaced with AI in "5-10
| years". So this is what the sort of first working game I guess.
| acoye wrote:
| I'd love to see John Carmack come back from his AGI hiatus and
| advance AI-based rendering. This would be super cool.
| seydor wrote:
| I wonder how far it is from this to generating language reasoning
| about the game from the game itself, rather than learning a large
| corpus of language, like LLMs do. That would be a true grounded
| language generator
| t1c wrote:
| They got DOOM running on a diffusion engine before GTA 6
| lackoftactics wrote:
| I think Alan's conservative countdown to AGI will need to be
| updated after this. https://lifearchitect.ai/agi/ This is really
| impressive stuff. I thought about it a couple of months ago, that
| probably this is the next modality worth exploring for data, but
| didn't imagine it would come so fast. On the other side, the
| amount of compute required is crazy.
| joseferben wrote:
| Impressive. Imagine this but photorealistic with VR goggles.
| gwbas1c wrote:
| Am I the only one who thinks this is faked?
|
| It's not that hard to fake something like this: Just make a video
| of DOSBox with DOOM running inside of it, and then compress it
| with settings that will result in compression artifacts.
| GaggiX wrote:
| >Am I the only one who thinks this is faked?
|
| Yes.
| dtagames wrote:
| A diffusion model cannot be a game engine because a game engine
| can be used to create _new_ games and modify the rules of
| existing games in real time -- even rules which are not visible
| on-screen.
|
| These tools are fascinating but, as with all AI hype, they need a
| disclaimer: The tool didn't create the game. It simply generated
| frames and the appearance of play mechanics from a game it
| sampled (which humans created).
| kqr wrote:
| > even rules which are not visible on-screen.
|
| If a rule was changed but it's never visible on the screen, did
| it really change?
|
| > It simply generated frames and the appearance of play
| mechanics from a game it sampled (which humans created).
|
| Simply?! I understand it's mechanically trivial but the fact
| that it's compressed such a rich conditional distribution seems
| far from simple to me.
| darby_nine wrote:
| > Simply?! I understand it's mechanically trivial but the
| fact that it's compressed such a rich conditional
| distribution seems far from simple to me.
|
| It's much simpler than actually creating a game....
| stnmtn wrote:
| If someone told you 10 years ago that they were going to
| create something where you could play a whole new level of
| Doom, without them writing a single line of game
| logic/rendering code, would you say that that is simpler
| than creating a demo by writing the game themselves?
| darby_nine wrote:
| There are two things at play here: the complexity of the
| underlying mechanism, and the complexity of detailed
| creation. This is obviously a complicated mechanism, but
| in another sense it's a trivial result compared to
| actually reproducing the game itself in its original
| intended state.
| znx_0 wrote:
| > If a rule was changed but it's never visible on the screen,
| did it really change?
|
| Well for "some" games it does really change
| sharpshadow wrote:
| So all it did was generate a video of the gameplay which is
| slightly different from the video it used for training?
| TeMPOraL wrote:
| No, it implements a 3D FPS that's interactive, and renders
| each frame based on your input and a lot of memorized
| gameplay.
| sharpshadow wrote:
| But is it playing the actual game or just making an
| interactive video of it?
| Maxatar wrote:
| Making an interactive video of it. It is not playing the
| game, a human does that.
|
| With that said, I wholly disagree that this is not an
| engine. This is absolutely a game engine and while this
| particular demo uses the engine to recreate DOOM, an
| existing game, you could certainly use this engine to
| produce new games in addition to extrapolating existing
| games in novel ways.
| Workaccount2 wrote:
| What is the difference?
| TeMPOraL wrote:
| Yes.
|
| All video games are, by definition, interactive videos.
|
| What I imagine you're asking about is, a typical game
| like Doom is effectively a function:
|
|     f(internal state, player input) -> (new frame, new internal state)
|
| where internal state is the shape and looks of the loaded
| map, positions and behaviors and stats of enemies,
| player, items, etc.
|
| A typical AI that plays Doom, which is _not_ what's
| happening here, is (at runtime):
|
|     f(last frame) -> new player input
|
| and is attached in a loop to the previous case in the
| obvious way.
|
| What we have here, however, is a game you can play but
| implemented in a diffusion model, and it works like this:
|
|     f(player input, N last frames) -> new frame
|
| Of note here is the _lack of game state_ - the state is
| implicit in the contents of the N previous frames, and is
| otherwise not represented or mutated explicitly. The
| diffusion model has seen so much Doom that it, in a way,
| _internalized_ most of the state and its evolution, so it
| can look at what's going on and guess what's about to
| happen. Which is what it does: it renders the next frame
| by predicting it, based on current user input and last N
| frames. And then that frame becomes the input for the
| next prediction, and so on, and so on.
|
| So yes, it's totally an interactive video _and_ a game
| _and_ a third thing - a probabilistic emulation of Doom
| on a generative ML model.
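|
| In sketch form (predict_next_frame and the I/O helpers here are
| hypothetical stand-ins; the point is that no explicit game state
| exists outside the frame history):
|
|     from collections import deque
|
|     N = 64                                    # context length
|     frames = deque(initial_frames, maxlen=N)  # seeded real frames
|
|     while True:
|         action = read_player_input()          # hypothetical helper
|         frame = predict_next_frame(list(frames), action)
|         display(frame)                        # hypothetical helper
|         frames.append(frame)   # state lives only in these frames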
| sharpshadow wrote:
| Thank you for the further explanation, that's what I
| thought in the meantime and intended to find out with my
| question.
|
| That opens up a new branch of possibilities.
| calebh wrote:
| One thing I'd like to see is to take a game rendered with low
| poly assets (or segmented in some way) and use a diffusion
| model to add realistic or stylized art details. This would fix
| the consistency problem while still providing tangible
| benefits.
| momojo wrote:
| The title should be "Diffusion Models can be used to render
| frames given user input"
| throwthrowuknow wrote:
| They only trained it on one game and only embedded the control
| inputs. You could train it on many games and embed a lot more
| information about each of them which could possibly allow you
| to specify a prompt that would describe the game and then play
| it.
| EcommerceFlow wrote:
| Jensen said that this is the future of gaming a few months ago
| fyi.
| weakfish wrote:
| Who is that?
| Fraterkes wrote:
| Thousands of different people have been speculating about this
| kind of thing for years.
| alkonaut wrote:
| The job of the game engine is also to render the world given only
| the worlds properties (textures, geometries, physics rules, ...),
| and not given "training data that had to be supplied from an
| already written engine".
|
| I'm guessing that the "This door requires a blue key" doesn't
| mean that the user can run around, the engine dreams up a blue
| key in some other corner of the map, and the user can then return
| to the door and the engine now opens the door? THAT would be
| impressive. It's interesting to think that all that would be
| required for that task to go from really hard to quite doable,
| would be that the door requiring the blue key is blue, and the UI
| showing some icon indicating the user possesses the blue key.
| Without that, it becomes (old) hidden state.
| dabochen wrote:
| So there is no interactivity, but the generated content is not
| the exact view in the training data, is this the correct
| understanding?
|
| If so, is it more like imagination/hallucination rather than
| rendering?
| og_kalu wrote:
| It's conditioned on previous frames AND player actions so it's
| interactive.
| wantsanagent wrote:
| Anyone have reliable numbers on the file sizes here? Doom.exe
| from my searches was around 715k, and with all assets somewhere
| around 10MB. It looks like the SD 1.4 files are over 2GB, so it's
| likely we're looking at a 200-2000x increase in file size
| depending on if you think of this as an 'engine' or the full
| game.
| YeGoblynQueenne wrote:
| Misleading Titles Are Everywhere These Days.
| jetrink wrote:
| What if instead of a video game, this was trained on video and
| control inputs from people operating equipment like warehouse
| robots? Then an automated system could visualize the result of a
| proposed action or series of actions when operating the equipment
| itself. You would need a different model/algorithm to propose
| control inputs, but this would offer a way for the system to
| validate and refine plans as part of a problem solving feedback
| loop.
| Workaccount2 wrote:
| >Robotic Transformer 2 (RT-2) is a novel vision-language-action
| (VLA) model that learns from both web and robotics data, and
| translates this knowledge into generalised instructions for
| robotic control
|
| https://deepmind.google/discover/blog/rt-2-new-model-transla...
| harha_ wrote:
| This is so sick I don't know what to say. I never expected this,
| aren't the implications of this huge?
| aithrowaway1987 wrote:
| I am struggling to understand a _single_ implication of this!
| How does this generalize to anything other than playing retro
| games in the most expensive way possible? The very intention of
| this project is overfitting to data in a non-generalizable way!
| Maybe it's just pure engineering, that good
| ANNs are getting cheap and fast. But this project still seems
| to have the fundamental weaknesses of all AI projects:
|
| - needs a huge amount of data, which a priori precludes a lot
| of interesting use cases
|
| - flashy-but-misleading demos which hide the actual weaknesses
| of the AI software (note that the player is moving very
| haltingly compared to a real game of DOOM, where you almost
| never stop moving)
|
| - AI nailing something really complicated for humans (98%
| effective raycasting, 98% effective Python codegen) while
| failing to grasp abstract concepts rigorously understood by
| _fish_ (object permanence, quantity)
|
| I am genuinely struggling to see this as a meaningful step
| forward. It seems more like a World's Fair exhibit - a fun and
| impressive diversion, but probably not a vision of the future.
| Putting it another way: unlike AlphaGo, Deep Blue wasn't really
| a technological milestone so much as a _sociological_ milestone
| reflecting the apex of a certain approach to AI. I think this
| DOOM project is in a similar vein.
| KETpXDDzR wrote:
| I think the correct title should be "Diffusion Models Are Fake
| Real-Time Game Engines". I don't think just more training will
| ever be sufficient to create a complete game engine. It would
| need to "understand" what it's doing.
| aghilmort wrote:
| looking forward to &/or wondering about overlap with notion of
| ray tracing LLMs
| TheRealPomax wrote:
| If by "game" you mean "literal hallucination" then yes. But if
| we're not trying to click-bait, then no: it's not really a game
| when there is no permanence or determinism to be found anywhere.
| It might be a "game-flavoured dream simulator", but it's
| absolutely not a game engine.
| rrnechmech wrote:
| > To mitigate auto-regressive drift during inference, we corrupt
| context frames by adding Gaussian noise to encoded frames during
| training. This allows the network to correct information sampled
| in previous frames, and we found it to be critical for preserving
| visual stability over long time periods.
|
| I get this (mostly). But would any kind soul care to elaborate on
| this? What is this "drift" they are trying to avoid and _how_
| does (AFAIU) adding _noise_ help?
| gwern wrote:
| People may recall GameGAN from May 2020:
| https://arxiv.org/abs/2005.12126#nvidia
| https://nv-tlabs.github.io/gameGAN/#nvidia
| https://github.com/nv-tlabs/GameGAN_code
| SeanAnderson wrote:
| After some discussion in this thread, I found it worth pointing
| out that this paper is NOT describing a system which receives
| real-time user input and adjusts its output accordingly, but, to
| me, the way the abstract is worded heavily implied this was
| occurring.
|
| It's trained on a large set of data in which agents played DOOM
| and video samples are given to users for evaluation, but users
| are not feeding inputs into the simulation in real-time in such a
| way as to be "playing DOOM" at ~20FPS.
|
| There are some key phrases within the paper that hint at this
| such as "Key questions remain, such as ... how games would be
| effectively created in the first place, including how to best
| leverage human inputs" and "Our end goal is to have human players
| interact with our simulation.", but mostly it's just the omission
| of a section describing real-time user gameplay.
| bob1029 wrote:
| Were the agents playing at 20 real FPS, or did this occur like
| a Pixar movie offline?
| refibrillator wrote:
| You are incorrect, this is an _interactive_ simulation that is
| playable by humans.
|
| > Figure 1: a human player is playing DOOM on GameNGen at 20
| FPS.
|
| The abstract is ambiguously worded which has caused a lot of
| confusion here, but the paper is unmistakably clear about this
| point.
|
| Kind of disappointing to see this misinformation upvoted so
| highly on a forum full of tech experts.
| FrustratedMonky wrote:
| Yeah. If it isn't doing this, then what could it be doing that
| is worth a paper? "real-time user input and adjusts its
| output accordingly"
| rvnx wrote:
| There is a hint in the paper itself:
|
| It says in a shy way that it is based on: "Ha & Schmidhuber
| (2018) who train a Variational Auto-Encoder (Kingma &
| Welling, 2014) to encode game frames into a latent vector"
|
| So it means they most likely took
| https://worldmodels.github.io/ (that is actually open-
| source) or something similar and swapped the frame
| generation for Stable Diffusion, which was released in 2022.
| psb217 wrote:
| If the generative model/simulator can run at 20FPS, then
| obviously in principle a human could play the game in
| simulation at 20 FPS. However, they do no evaluation of human
| play in the paper. My guess is that they limited human evals
| to watching short clips of play in the real engine vs the
| simulator (which conditions on some number of initial frames
| from the engine when starting each clip...) since the actual
| "playability" is not great.
| pajeets wrote:
| I knew it was too good to be true, but it seems like real time
| video generation can get good enough to reach a point where it
| feels like a truly interactive video/game.
|
| Imagine if text2game were possible. There would be some sort of
| network generating each frame from an image generated by text,
| with some underlying 3d physics simulation to keep all the
| multiplayer screens sync'd.
|
| This paper does not seem to show that possibility, rather some
| clever wording to make you think people were playing real-time
| video. We can't even generate more than 5~10 seconds of video
| without it hallucinating. Something this persistent would
| require an extreme amount of gameplay video training. It can be
| done, but the video shown by this paper is not true to its
| words.
| Chance-Device wrote:
| I also thought this, but refer back to the paper, not the
| abstract:
|
| > A is the set of key presses and mouse movements...
|
| > ...to condition on actions, we simply learn an embedding
| A_emb for each action
|
| So, it's clear that in this model the diffusion process is
| conditioned on an action embedding derived from user key
| presses and mouse movements rather than on words.
|
| Then a noised start frame is encoded into latents and
| concatenated onto the noise latents as a second conditioning.
|
| So we have a diffusion model which is trained solely on images
| of doom, and which is conditioned on current doom frames and
| user actions to produce subsequent frames.
|
| So yes, the users are playing it.
|
| However, it should be unsurprising that this is possible. This
| is effectively just a neural recording of the game. But it's a
| cool tech demo.
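|
| For the curious, a minimal sketch of that conditioning path as
| I understand it (hypothetical shapes and names, not the
| authors' code):
|
|     import torch
|     import torch.nn as nn
|
|     # Assumed shapes: an SD 1.x frame latent is 4x64x64, and we
|     # condition on the last K frames and actions.
|     NUM_ACTIONS, EMB_DIM, K = 18, 768, 4
|
|     action_emb = nn.Embedding(NUM_ACTIONS, EMB_DIM)  # learned A_emb
|
|     def prepare_unet_inputs(noise_latent, past_latents, past_actions):
|         # noise_latent:  (B, 4, 64, 64) latent being denoised
|         # past_latents:  (B, K, 4, 64, 64) noised past-frame latents
|         # past_actions:  (B, K) integer key/mouse action ids
|         B = noise_latent.shape[0]
|         # Channel-wise concat of past frames onto the noise latents
|         x = torch.cat(
|             [noise_latent, past_latents.reshape(B, -1, 64, 64)], dim=1)
|         # Action embeddings replace the usual text-prompt embeddings
|         ctx = action_emb(past_actions)  # (B, K, EMB_DIM)
|         return x, ctx  # fed to the UNet via cross-attention
|
| In other words, the actions take the slot the text prompt would
| normally occupy.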
| foota wrote:
| I wonder if they could somehow feed in a trained Gaussian
| splats model to this to get better images?
|
| Since the splats are specifically designed for rendering, it
| seems like it would be an efficient way for the image model
| to learn the geometry without having to encode it in the
| image model itself.
| Chance-Device wrote:
| I'm not sure how that would help vs just training the model
| with the conditionings described in the paper.
|
| I'm not very familiar with Gaussian splats models, but
| aren't they just a way of constructing images using
| multiple superimposed parameterized Gaussian distributions,
| sort of like the Fourier series does with waveforms using
| sine and cosine waves?
|
| I'm not seeing how that would apply here but I'd be
| interested in hearing how you would do it.
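|
| For my own understanding, here is a toy 2D version of what I
| mean, nothing to do with real 3D splatting pipelines:
|
|     import numpy as np
|
|     # Render an image as a sum of parameterized 2D Gaussians.
|     def splat(gaussians, h=64, w=64):
|         ys, xs = np.mgrid[0:h, 0:w]
|         img = np.zeros((h, w))
|         for cx, cy, sigma, amp in gaussians:
|             img += amp * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2)
|                                 / (2 * sigma ** 2))
|         return img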
| psb217 wrote:
| The agent never interacts with the simulator during training
| or evaluation. There is no user; there is only an agent which
| was trained to play the real game and which produced the
| sequences of game frames and actions that were used to train
| the simulator and to provide ground truth sequences of game
| experience for evaluation. Their evaluation metrics are all
| based on running short simulations in the diffusion model
| which are initiated with some number of conditioning frames
| taken from the real game engine. Statements in the paper
| like: "GameNGen shows that an architecture and model weights
| exist such that a neural model can effectively run a complex
| game (DOOM) interactively on existing hardware." are wildly
| misleading.
| teamonkey wrote:
| I think _someone_ is playing it, but it has a reduced set of
| inputs and they're playing it in a very specific way (slowly,
| avoiding looking back to places they've been) so as not to show
| off the flaws in the system.
|
| The people surveyed in this study are not playing the game,
| they are watching extremely short video clips of the game being
| played and comparing them to equally short videos of the
| original Doom being played, to see if they can spot the
| difference.
|
| I may be wrong about how it works, but I think this is just
| hallucinating in real time. It has no internal state per se; it
| knows what was on screen in the previous few frames and it
| knows what inputs the user is pressing, and so it generates the
| next frame. Like with video compression, it probably doesn't
| need to generate a full frame every time, just "differences".
|
| As with all the previous AI game research, these are not games
| in any real sense. They fall apart when played beyond any
| meaningful length of time (seconds). Crucially, they are not
| playable by anyone other than the developers in very controlled
| settings. A defining attribute of any game is that it can be
| played.
| lewhoo wrote:
| The movement of the player seems a bit jittery, so I inferred
| something similar on that basis.
| 7734128 wrote:
| What you're describing reminded me of this cool project:
|
| https://www.youtube.com/watch?v=udPY5rQVoW0 "Playing a Neural
| Network's version of GTA V: GAN Theft Auto"
| SeanAnderson wrote:
| Ehhh okay, I'm not as convinced as I was earlier. Sorry for
| misleading. There's been a lot of back-and-forth.
|
| I would've really liked to see a section of the paper
| explicitly call out that they used humans in real time. There are
| a lot of sentences that led me to believe otherwise. It's clear
| that they used a bunch of agents to simulate gameplay where
| those agents submitted user inputs to affect the gameplay and
| they captured those inputs in their model. This made it a bit
| murky as to whether humans ever actually got involved.
|
| This statement, "Our end goal is to have human players interact
| with our simulation. To that end, the policy p as in Section 2
| is that of human gameplay. Since we cannot sample from that
| directly at scale, we start by approximating it via teaching an
| automatic agent to play"
|
| led me to believe that while they had an ultimate goal of user
| input (why wouldn't they?) they settled for approximating human
| input.
|
| I was looking to refute that assumption later in the paper by
| hopefully reading some words on the human gameplay experience,
| but instead, under Results, I found:
|
| "Human Evaluation. As another measurement of simulation
| quality, we provided 10 human raters with 130 random short
| clips (of lengths 1.6 seconds and 3.2 seconds) of our
| simulation side by side with the real game. The raters were
| tasked with recognizing the real game (see Figure 14 in
| Appendix A.6). The raters only choose the actual game over the
| simulation in 58% or 60% of the time (for the 1.6 seconds and
| 3.2 seconds clips, respectively)."
|
| and it's like.. okay.. if you have a section in results on
| human evaluation, and your goal is to have humans play, then
| why are you talking just about humans reviewing video rather
| than giving some sort of feedback on the human gameplay
| experience - even if it's not especially positive?
|
| Still, in the Discussion section, it mentions, "The second
| important limitation are the remaining differences between the
| agent's behavior and those of human players. For example, our
| agent, even at the end of training, still does not explore all
| of the game's locations and interactions, leading to erroneous
| behavior in those cases." which makes it more clear that humans
| gave input which went outside the bounds of the automatic
| agents. It doesn't seem like this would occur if it were agents
| simulating more input.
|
| Ultimately, I think that the paper itself could've been more
| clear in this regard, but clearly the publishing website tries
| to be very explicit by saying upfront - "Real-time recordings
| of people playing the game DOOM" and it's pretty hard to argue
| against that.
|
| Anyway. I repent! It was a learning experience going back and
| forth on my belief here. Very cool tech overall.
| psb217 wrote:
| It's funny how academic writing works. Authors rarely produce
| many unclear or ambiguous statements where the most likely
| interpretation undersells their work...
| dewarrn1 wrote:
| The paper should definitely be more clear on this point, but
| there's a sentence in section 5.2.3 that makes me think that
| this was playable and played: "When playing with the model
| manually, we observe that some areas are very easy for both,
| some areas are very hard for both, and in some the agent
| performs much better." It may be a failure of imagination, but
| I can't think of another reasonable way of interpreting
| "playing with the model manually".
| ollin wrote:
| We can't assess the quality of gameplay ourselves of course
| (since the model wasn't released), but one author said "It's
| playable, the videos on our project page are actual game play."
| (https://x.com/shlomifruchter/status/1828850796840268009) and
| the video on top of https://gamengen.github.io/ starts out with
| "these are real-time recordings of people playing the game".
| Based on those claims, it seems likely that they did get a
| playable system in front of humans by the end of the project
| (though perhaps not by the time the draft was uploaded to
| arXiv).
| Sohcahtoa82 wrote:
| It's always fun reading the dead comments on a post like this.
| People love to point out how pointless this is.
|
| Some of ya'll need to learn how to make things _for the fun of
| making things_. Is this useful? No, not really. Is it
| interesting? Absolutely.
|
| Not everything has to be made for profit. Not everything has to
| be made to make the world a better place. Sometimes, people
| create things just for the learning experience, the challenge, or
| they're curious to see if something is possible.
|
| Time spent enjoying yourself is never time wasted. Some of ya'll
| are going to be on your death beds wishing you had allowed
| yourself to have more fun.
| Gooblebrai wrote:
| So true. The hustle culture is a spreading disease that has
| replaced the fun maker culture of the 80s/90s.
|
| It's unavoidable though. The increasingly expensive cost of
| living and the romanticization of entrepreneurs as rock stars
| lead toward this hustle mindset.
| ninetyninenine wrote:
| I don't think this is useless. This is a stepping stone for
| generating entire novel games.
| Sohcahtoa82 wrote:
| > This is a stepping stone for generating entire novel games.
|
| I don't see how.
|
| This game "engine" is purely mapping [pixels, input] -> new
| pixels. It has no notion of game state (so you can kill an
| enemy, turn your back, then turn around again, and the enemy
| could be alive again), not to mention that it requires the
| game to _already exist_ in order to train it.
|
| I suppose, in theory, you could train the network to include
| game state in the input and output, or potentially even
| handle game state outside the network entirely and just make
| it one of the inputs, but the output would be incredibly
| noisy and nigh unplayable.
|
| And like I said, all of it requires the game to already exist
| in order to train the network.
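|
| To make that concrete, here's a minimal sketch of the kind of
| autoregressive loop I'm describing (hypothetical names, not
| the paper's code):
|
|     from collections import deque
|
|     # `model` is any next-frame predictor with the signature
|     # model(frames, actions) -> next_frame; `read_input` and
|     # `render` are supplied by the host program. The ONLY state
|     # is this sliding window of recent frames and actions.
|     CONTEXT = 64
|
|     def play(model, first_frames, read_input, render, steps=1000):
|         frames = deque(first_frames, maxlen=CONTEXT)
|         actions = deque([0] * len(first_frames), maxlen=CONTEXT)
|         for _ in range(steps):
|             actions.append(read_input())  # current key presses
|             frame = model(list(frames), list(actions))
|             frames.append(frame)
|             render(frame)
|
| Anything older than the context window is simply gone, which is
| exactly why the dead enemy can come back.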
| airstrike wrote:
| _> (so you can kill an enemy, turn your back, then turn
| around again, and the enemy could be alive again)_
|
| Sounds like a great game.
|
| _> not to mention that it requires the game to already
| exist in order to train it_
|
| Diffusion models create new images that did not previously
| exist all of the time, so I'm not sure how that follows.
| It's not hard to extrapolate from TFA to a model that
| generically creates games based on some input.
| ninetyninenine wrote:
| >It has no notion of game state (so you can kill an enemy,
| turn your back, then turn around again)
|
| Well, you see a wall, you turn around, then turn back, and the
| wall is still there. With enough training data the model will
| be able to pick up the state of the enemy because it has
| ALREADY learned the state of the wall, thanks to much more
| numerous data on the wall. It's probably impractical to do
| this, but this is only a stepping stone, like I said.
|
| > not to mention that it requires the game to already exist
| in order to train it.
|
| Is this a problem? Do games not exist? Not only do we have
| tons of games, but we also have, in theory, unlimited amounts
| of training data for each game.
| Sohcahtoa82 wrote:
| > Well you see a wall you turn around then turn back the
| wall is still there. With enough training data the model
| will be able to pick up the state of the enemy because it
| has ALREADY learned the state of the wall due to much
| more numerous data on the wall.
|
| It's really important to understand that _ALL THE MODEL
| KNOWS_ is a mapping of [pixels, input] - > new pixels. It
| has zero knowledge of game state. The wall is still there
| after spinning 360 degrees simply because it knows that
| the image of a view facing away from the wall while
| holding the key to turn right eventually becomes an image
| of a view of the wall.
|
| The only "state" that is known is the last few frames of
| the game screen. Because of this, it's simply not
| possible for the game model to know if an enemy should be
| shown as dead or alive once it has been off-screen for
| longer than those few frames. It also means that if you
| keep turning away from and towards an enemy, it could
| teleport around. Once it's off the screen for those few
| frames, the model will have forgotten about it.
|
| > Is this a problem? Do games not exist?
|
| If you're trying to make a _new_ game, then you need
| _new_ frames to train the model on.
| ninetyninenine wrote:
| >It's really important to understand that ALL THE MODEL
| KNOWS is a mapping of [pixels, input] -> new pixels. It
| has zero knowledge of game state.
|
| This is false. What occurs inside the model is
| unknown. It arranges pixel input and produces pixel
| output as if it actually understood game state. As with
| LLMs, we don't actually fully understand what's going on
| internally. You can't assume that models don't
| "understand" things just because the high level training
| methodology only includes pixel input and output.
|
| >The only "state" that is known is the last few frames of
| the game screen. Because of this, it's simply not
| possible for the game model to know if an enemy should be
| shown as dead or alive once it has been off-screen for
| longer than those few frames. It also means that if you
| keep turning away from and towards an enemy, it could
| teleport around. Once it's off the screen for those few
| frames, the model will have forgotten about it.
|
| This is true. But then one could say it knows game state
| for up to a few frames. That's different from saying the
| model ONLY knows pixel input and pixel output. Very
| different.
|
| There are other tricks for long-term memory storage as
| well. Think radar: a radar overlay will capture the state
| of the enemy beyond just the visual frames, so the model
| won't forget an enemy was behind the player.
|
| Game state can also be encoded into some frame pixels on
| the bottom rows. The model can pick up on these
| associations, as in the toy sketch below.
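|
| A toy version of that trick (purely speculative, nothing from
| the paper):
|
|     import numpy as np
|
|     # Speculative sketch: stash game-state bytes (health, ammo,
|     # enemy flags...) in the bottom row of an RGB uint8 frame so
|     # a pixels-in/pixels-out model can learn to propagate them.
|     def encode_state(frame, state_bytes):
|         f = frame.copy()
|         vals = np.frombuffer(state_bytes, dtype=np.uint8)
|         f[-1, :len(vals), 0] = vals  # red channel, bottom row
|         return f
|
|     def decode_state(frame, n_bytes):
|         return frame[-1, :n_bytes, 0].astype(np.uint8).tobytes()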
|
| edit: someone mentioned that the game state lasts past a
| few frames.
|
| >If you're trying to make a new game, then you need new
| frames to train the model on.
|
| Right, so for a generative model, instead of training the
| model on one game, you would train it on a multitude of
| games. The model would then, based on a seed number, output
| a new type of game.
|
| Alternatively you could have a model generate a model.
|
| All of what I'm saying is of course speculative. As I
| said, this model is a stepping stone for the future. Just
| like the LLM which is only trivially helpful now, the LLM
| can be a stepping stone for replacing programmers all
| together.
| throwthrowuknow wrote:
| Read the paper. It is capable of maintaining state for a
| fairly long time, including updating the UI elements.
| msk-lywenn wrote:
| I'd like to know the carbon footprint of that fun.
| ploxiln wrote:
| The skepticism and criticism in this thread is aimed at the
| hype around AI: people saying "this is so amazing" imply that
| in some near future you could create any video game experience
| you can imagine by just replacing all the software with some
| AI models rendering the whole game.
|
| In reality, this is the least efficient and least reliable
| form of Doom yet created, using literally millions of times
| the computation used by the first x86 PCs that were able to
| render and play Doom in real time.
|
| But it's a funny party trick, sure.
| KhoomeiK wrote:
| NVIDIA did something similar with GANs in 2020 [1], except users
| _could_ actually play those games (unlike in this diffusion work
| which just plays back simulated video). Sentdex later adapted
| this to play GTA with a really cool demo [2].
|
| [1] https://research.nvidia.com/labs/toronto-ai/gameGAN/
|
| [2] https://www.youtube.com/watch?v=udPY5rQVoW0
| throwthrowuknow wrote:
| Several thoughts for future work:
|
| 1. Continue training on all of the games that used the Doom
| engine to see if it is capable of creating new graphics, enemies,
| weapons, etc. I think you would need to embed more details for
| this, perhaps information about what is present in the current
| level, so that you could prompt it to produce a new level from
| some combination.
|
| 2. Could embedding information from the map view or a raytrace of
| the player's surroundings help with consistency? (See the sketch
| after this list.) I suppose the model would need to predict this
| information as the neural simulation progressed.
|
| 3. Can this technique be applied to generating videos with
| consistent subjects and environments by training on a camera view
| of a 3D scene and embedding the camera position and the position
| and animation states of objects and avatars within the scene?
|
| 4. What would the result of training on a variety of game engines
| and games with different mechanics and inputs be? The space of
| possible actions is limited by the available keys on a keyboard
| or buttons on a controller but the labelling of the
| characteristics of each game may prove a challenge if you wanted
| to be able to prompt for specific details.
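|
| On point 2, a rough sketch of what that extra conditioning
| might look like (entirely hypothetical, just extending the
| concatenate-extra-latents idea):
|
|     import torch
|     import torch.nn as nn
|
|     # Hypothetical: encode a top-down map crop centered on the
|     # player into extra latent channels, then concatenate them
|     # with the frame latents that condition the denoiser.
|     map_encoder = nn.Conv2d(1, 4, kernel_size=8, stride=8)
|
|     def add_map_conditioning(frame_latents, map_crop):
|         # frame_latents: (B, C, 64, 64), map_crop: (B, 1, 512, 512)
|         map_latents = map_encoder(map_crop)  # -> (B, 4, 64, 64)
|         return torch.cat([frame_latents, map_latents], dim=1)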
| danielmarkbruce wrote:
| What is the point of this? It's hard to see how this is useful.
| Maybe it's just an exercise to show what a diffusion model can
| do?
| Kapura wrote:
| What is useful about this? I am a game programmer, and I cannot
| imagine a world where this improves any part of the development
| process. It seems to me to be a way to copy a game without
| literally copying the assets and code; plagiarism with extra
| steps. What am I missing?
| jasonkstevens wrote:
| AI no longer plays Doom; it _is_ Doom.
___________________________________________________________________
(page generated 2024-08-28 23:01 UTC)