[HN Gopher] Diffusion models are real-time game engines
       ___________________________________________________________________
        
       Diffusion models are real-time game engines
        
       Author : jmorgan
       Score  : 999 points
       Date   : 2024-08-28 02:59 UTC (20 hours ago)
        
 (HTM) web link (gamengen.github.io)
 (TXT) w3m dump (gamengen.github.io)
        
       | vessenes wrote:
       | So, this is surprising. Apparently there's more cause, effect,
       | and sequencing in diffusion models than what I expected, which
        | would be roughly 'none'. Google here uses SD 1.4 as the core of
       | the diffusion model, which is a nice reminder that open models
       | are useful to even giant cloud monopolies.
       | 
       | The two main things of note I took away from the summary were: 1)
       | they got infinite training data using agents playing doom (makes
       | sense), and 2) they added Gaussian noise to source frames and
       | rewarded the agent for 'correcting' sequential frames back, and
       | said this was critical to get long range stable 'rendering' out
       | of the model.
       | 
       | That last is intriguing -- they explain the intuition as teaching
       | the model to do error correction / guide it to be stable.
       | 
       | Finally, I wonder if this model would be easy to fine tune for
       | 'photo realistic' / ray traced restyling -- I'd be super curious
       | to see how hard it would be to get a 'nicer' rendering out of
       | this model, treating it as a doom foundation model of sorts.
       | 
       | Anyway, a fun idea that worked! Love those.
        
         | refibrillator wrote:
         | Just want to clarify a couple possible misconceptions:
         | 
         | The diffusion model doesn't maintain any state itself, though
         | its weights may encode some notion of cause/effect. It just
         | renders one frame at a time (after all it's a text to image
         | model, not text to video). Instead of text, the previous states
         | and frames are provided as inputs to the model to predict the
         | next frame.
         | 
         | Noise is added to the previous frames before being passed into
         | the SD model, so the RL agents were not involved with
         | "correcting" it.
         | 
          | De-noising objectives are widespread in ML; intuitively, they
          | force a predictive model to leverage context, i.e. surrounding
          | frames/words/etc.
         | 
         | In this case it helps prevent auto-regressive drift due to the
         | accumulation of small errors from the randomness inherent in
         | generative diffusion models. Figure 4 shows such drift
         | happening when a player is standing still.
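          | 
          | If it helps, here's roughly how I picture one step of that
          | loop (toy numpy pseudocode; this is my reading of the paper,
          | not their actual code, and every name here is made up):
          | 
          |     import numpy as np
          | 
          |     H, W, CONTEXT = 60, 80, 4  # tiny frames, 4-frame history
          |     rng = np.random.default_rng(0)
          | 
          |     def denoise(noisy_frames, actions):
          |         # stand-in for the conditioned SD 1.4 denoiser:
          |         # predict the next frame from noise-augmented past
          |         # frames plus the player's actions
          |         return noisy_frames[-1]
          | 
          |     frames = [np.zeros((H, W, 3)) for _ in range(CONTEXT)]
          |     actions = []
          |     for step in range(100):
          |         actions.append(rng.integers(0, 8))  # player input
          |         # noise augmentation: corrupt the conditioning
          |         # frames so the model learns to correct its own
          |         # accumulated errors at inference time
          |         noisy = [f + rng.normal(0.0, 0.1, f.shape)
          |                  for f in frames[-CONTEXT:]]
          |         frames.append(denoise(noisy, actions[-CONTEXT:]))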
        
           | rvnx wrote:
            | The concept is that you train a diffusion model by feeding
            | it all the possible frames seen in the game.
            | 
            | The training was over almost 1 billion frames, 20 days of
            | full-time play-time, taking a screenshot of every single
            | inch of the map.
            | 
            | Now you show it N frames as input and ask it "give me frame
            | N+1", and it gives you frame N+1 back based on how it was
            | originally seen during training.
            | 
            | But it is not frame N+1 from a mysterious intelligence, it's
            | simply frame N+1 given back from the past dataset.
            | 
            | The drift you mentioned is actually clear (but sad) proof
            | that the model is not good at inventing new frames, and can
            | only spit out an answer from the past dataset.
            | 
            | It's a bit like training stable diffusion on Simpsons
            | episodes: it outputs the next frame of an existing episode
            | that was in the training set, but a few frames later it
            | goes wild and buggy.
        
             | jetrink wrote:
             | I don't think you've understood the project completely. The
             | model accepts player input, so frame 601 could be quite
             | different if the player decided to turn left rather than
             | right, or chose that moment to fire at an exploding barrel.
        
               | rvnx wrote:
                | 1 billion frames in memory... With such a dataset, you
                | have seen practically all realistic possibilities in the
                | short term.
                | 
                | If it were able to invent actions and maps and let the
                | user play "infinite doom", then it would be very
                | different (and impressive!).
        
               | OskarS wrote:
                | > 1 billion frames in memory... With such a dataset, you
                | have seen practically all realistic possibilities in the
                | short term.
               | 
                | I mean... no? Not even close? Multiplying the number of
                | game states by the number of possible inputs at any given
                | frame gives you a number vastly bigger than 1 billion,
                | not even comparable. Even with 20 days of play time to
                | train on, it's entirely likely that at no point did
                | someone stop at a certain location and look to the left
                | from that angle. They might have done so from similar
                | angles, but the model then has to reconstruct some sense
                | of the geometry of the level to synthesize the frame.
                | They might also not have arrived there from the same
                | direction, which again the model needs some smarts to
                | understand.
               | 
               | I get your point, it's very overtrained on these
               | particular levels of Doom, which means you might as well
               | just play Doom. But this is not a hash table lookup we're
               | talking about, it's pretty impressive work.
        
               | rvnx wrote:
               | This was the basis for the reasoning:
               | 
                | Map 1 has 2'518 walkable map units. There are 65'536
                | possible angles.
                | 
                | 2'518 * 65'536 = 165'019'648
                | 
                | If you capture 165M frames, you already cover all the
                | possibilities in terms of camera / player view, but the
                | diffusion model probably doesn't even need to have all
                | the frames (the same way that LLMs don't).
        
               | znx_0 wrote:
                | I think enemies and effects are probably in there
        
               | bee_rider wrote:
               | Do you have to be exactly on a tile in Doom? I thought
               | the guy walked smoothly around the map.
        
               | commodoreboxer wrote:
                | There's also enemy motion, enemy attacks, shooting, and
                | UI considerations, which make the combinatorics explode.
                | 
                | And Doom movement isn't tile-based. The map may be, but
                | you can be in many, many places on a tile.
        
               | TeMPOraL wrote:
               | Like many people in case of LLMs, you're just
               | demonstrating unawareness of - or disbelief in - the fact
                | that the model doesn't record training data verbatim, but
               | smears it out in high-dimensional space, from which it
               | then samples. The model then doesn't recall past inputs
               | (which are effectively under extreme lossy compression),
               | but samples from that high-dimensional space to produce
               | output. The high-dimensional representation by necessity
               | captures semantic understanding of the training data.
               | 
               | Generating "infinite Doom" is exactly what this model is
               | doing, as it does not capture the larger map layout well
               | enough to stay consistent with it.
        
               | Workaccount2 wrote:
               | Whether or not a judge understands this will probably
               | form the basis of any precedent set about the legality of
               | image models and copyright.
        
               | znx_0 wrote:
               | I like "conditioned brute force" better term.
        
             | mensetmanusman wrote:
             | Research is the acquisition of knowledge that may or may
             | not have practical applications.
             | 
             | They succeeded in the research, gained knowledge, and might
             | be able to do something awesome with it.
             | 
             | It's a success even if they don't sell anything.
        
         | nine_k wrote:
         | But it's not a game. It's a memory of a game video, predicting
         | the next frame based on the few previous frames, like "I can
         | imagine what happened next".
         | 
         | I would call it the world's least efficient video compression.
         | 
         | What I would like to see is the actual _predictive_ strength,
         | aka imagination, which I did not notice mentioned in the
         | abstract. The model is trained on a set of classic maps. What
         | would it do, given a few frames of gameplay on an unfamiliar
         | map as input? How well could it imagine what happens next?
        
           | WithinReason wrote:
           | If it's trained on absolute player coordinates then it would
           | likely just morph into the known map at those coordinates.
        
             | nine_k wrote:
              | But it's trained on the actual screen pixel data, AFAICT.
              | It's literally a visual imagination model, not a gameplay /
              | geometry imagination model. They had to make special
              | provisions for the pixel data on the HUD, which is by its
              | nature different from the pictures of a 3D world.
        
           | PoignardAzur wrote:
            | > _But it's not a game. It's a memory of a game video,
            | predicting the next frame based on the few previous frames,
            | like "I can imagine what happened next"._
            | 
            | It's not super clear from the landing page, but I _think_
            | it's an engine? Like, its input is both previous images
            | _and_ input for the next frame.
            | 
            | So as a player, if you press "shoot", the diffusion engine
            | needs to output an image where the monster in front of you
            | takes damage/dies.
        
             | bergen wrote:
             | How is what you think they say not clear?
             | 
             | We present GameNGen, the first game engine powered entirely
             | by a neural model that enables real-time interaction with a
             | complex environment over long trajectories at high quality.
        
           | taneq wrote:
           | It's more like the Tetris Effect, where the model has seen so
           | much Doom that it confabulates gameplay.
        
           | mensetmanusman wrote:
           | They could down convert the entire model to only utilize the
           | subset of matrix components from stable diffusion. This
           | approach may be able to improve internet bandwidth efficiency
           | assuming consumers in the future have powerful enough
           | computers.
        
           | Sharlin wrote:
           | No, it's predicting the next frame conditioned on past frames
           | _AND player actions!_ This is clear from the article. Mere
           | video generation would be nothing new.
        
           | TeMPOraL wrote:
            | It's a memory of a video looped to controls, so frame 1 is
            | "I wonder how it would look if the player pressed D instead
            | of W", then frame 2 is based on frame 1, etc., and a couple
            | frames in, it's already not remembering but _imagining_ the
            | gameplay on the fly. It's not prerecorded, it responds to
            | inputs during generation. That's what makes it a game engine.
        
         | wavemode wrote:
         | > Apparently there's more cause, effect, and sequencing in
         | diffusion models than what I expected
         | 
         | To temper this a bit, you may want to pay close attention to
         | the demo videos. The player rarely backtracks, and for good
         | reason - the few times the character does turn around and look
         | back at something a second time, it has changed significantly
         | (the most noticeable I think is the room with the grey wall and
         | triangle sign).
         | 
          | This falls in line with how we'd expect a diffusion model to
          | behave - it's trained on nearly a billion frames of gameplay,
          | so it's very good at generating a plausible -next- frame of
          | gameplay based on some previous frames. But it doesn't deeply
          | understand logical gameplay constraints, like remembering level
          | geometry.
        
           | mensetmanusman wrote:
            | That is kind of cool though, I would play that - like being
            | lost in a dream.
           | 
           | If on the backend you could record the level layouts in
           | memory you could have exploration teams that try to find new
           | areas to explore.
        
             | debo_ wrote:
             | It would be cool for dream sequences in games to feel more
             | like dreams. This is probably an expensive way to do it,
             | but it would be neat!
        
           | Groxx wrote:
           | There's an example right at the beginning too - the ammo drop
           | on the right changes to something green (I think that's a
           | body?)
        
           | codeflo wrote:
           | Even purely going forward, specks on wall textures morph into
           | opponents and so on. All the diffusion-generated videos I've
           | seen so far have this kind of unsettling feature.
        
             | bee_rider wrote:
              | It's like some kind of weird dream Doom.
        
           | whiteboardr wrote:
           | But does it need to be frame-based?
           | 
           | What if you combine this with an engine in parallel that
           | provides all geometry including characters and objects with
           | their respective behavior, recording changes made through
           | interactions the other model generates, talking back to it?
           | 
           | A dialogue between two parties with different functionality
           | so to speak.
           | 
           | (Non technical person here - just fantasizing)
        
             | bee_rider wrote:
             | In that case, the title of the article wouldn't be true
             | anymore. It seems like a better plan, though.
        
             | beepbooptheory wrote:
             | What would the model provide if not what we see on the
             | screen?
        
               | whiteboardr wrote:
               | The environment and everything in it.
               | 
               | "Everything" would mean all objects and the elements
               | they're made of, their rules on how they interact and
               | decay.
               | 
               | A modularized ecosystem i guess, comprised of "sub-
               | systems" of sorts.
               | 
               | The other model, that provides all interaction (cause for
               | effect) could either be run artificially or be used
               | interactively by a human - opening up the possibility for
               | being a tree : )
               | 
               | This all would need an interfacing agent that in
               | principle would be an engine simulating the second law of
               | thermodynamics and at the same time recording every state
               | that has changed and diverged off the driving actor's
               | vector in time.
               | 
               | Basically the "effects" model keeping track of everyones
               | history.
               | 
               | In the end a system with an "everything" model (that can
               | grow overtime), a "cause" model messing with it, brought
               | together and documented by the "effect" model.
               | 
               | (Again ... non technical person, just fantasizing) : )
        
               | mplewis wrote:
               | What you're asking for doesn't make sense.
        
               | HappMacDonald wrote:
               | So you're basically just talking about upgrading "enemy
               | AI" to a more complex form of AI :)
        
             | robotresearcher wrote:
             | In that scheme what is the NN providing that a classical
             | renderer would not? DOOM ran great on an Intel 486, which
             | is not a lot of computer.
        
               | whiteboardr wrote:
               | An experience that isn't asset- but rule-based.
        
               | Sohcahtoa82 wrote:
               | > DOOM ran great on an Intel 486
               | 
                | It always blew my mind how well it worked on a 33 MHz
               | 486. I'm fairly sure it ran at 30 fps in 320x200. That
               | gives it just over 17 clock cycles per pixel, and that
               | doesn't even include time for game logic.
               | 
               | My memory could be wrong, though, but even if it required
                | a 66 MHz to reach 30 fps, that's still only 34 clocks per
               | pixel on an architecture that required multiple clocks
               | for a simple integer add instruction.
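                | 
                | Quick sanity check on those numbers (assuming 320x200 at
                | 30 fps, which may not be exactly what the engine did):
                | 
                |     pixels_per_second = 320 * 200 * 30  # 1,920,000
                |     # cycles per pixel at 33 MHz and 66 MHz
                |     print(33_000_000 / pixels_per_second)  # ~17.2
                |     print(66_000_000 / pixels_per_second)  # ~34.4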
        
           | dewarrn1 wrote:
           | Great observation. And not entirely unlike normal human
           | visual perception which is notoriously vulnerable to missing
           | highly salient information; I'm reminded of the "gorillas in
           | our midst" work by Dan Simons and Christopher Chabris [0].
           | 
           | [0]: https://en.wikipedia.org/wiki/Inattentional_blindness#In
           | visi...
        
             | bamboozled wrote:
              | Are you saying if I turn around, I'll be surprised at what
              | I find? I don't feel like this is accurate at all.
        
               | matheusd wrote:
               | If a generic human glances at an unfamiliar
               | screen/wall/room, can they accurately, pixel-perfectly
               | reconstruct every single element of it? Can they do it
               | for every single screen they have seen in their entire
               | lives?
        
               | bamboozled wrote:
                | I never said pixel perfect, but I would be surprised if
                | whole objects, like flaming lanterns, suddenly appeared.
                | 
                | What this demo demonstrates to me is how incredibly
                | willing we are to accept what seems familiar to us as
                | accurate.
                | 
                | I bet if you look closely and objectively you will see
                | even more anomalies. But on first watch, I didn't see
                | most errors because I think accepting something is more
                | efficient for the brain.
        
               | ben_w wrote:
               | You'd likely be surprised by a flaming lantern unless you
               | were in Flaming Lanterns 'R Us, but if you were watching
               | a video of a card trick and the two participants changed
               | clothes while the camera wasn't focused on them, you may
               | well miss that and the other five changes that came with
               | that.
        
               | dewarrn1 wrote:
               | Not exactly, but our representation of what's behind us
               | is a lot more sparse than we would assume. That is, I
               | might not be surprised by what I see when I turn around,
               | but it could have changed pretty radically since I last
               | looked, and I might not notice. In fact, an observer
               | might be quite surprised that I missed the change.
               | 
               | Objectively, Simons and Chabris (and many others) have a
               | lot of data to support these ideas. Subjectively, I can
               | say that these types of tasks (inattentional blindness,
               | change blindness, etc.) are humbling.
        
               | jerf wrote:
               | Well, it's a bit of a spoiler to encounter this video in
               | this context, but this is a very good video:
               | https://www.youtube.com/watch?v=LRFMuGBP15U
               | 
               | Even having a clue why I'm linking this, I virtually
               | guarantee you won't catch everything.
               | 
               | And even if you do catch everything... the _real_ thing
               | to notice is that you had to _look_. Your brain does not
               | flag these things naturally. Dreams are notorious for
               | this sort of thing, but even in the waking world your
               | model of the world is much less rich than you think.
               | Magic tricks like to hide in this space, for instance.
        
               | dewarrn1 wrote:
               | Yup, great example! Simons's lab has done some things
               | along exactly these lines [0], too.
               | 
               | [0]: https://www.youtube.com/watch?v=wBoMjORwA-4
        
               | ajuc wrote:
               | The opposite - if you turn around and there's something
               | that wasn't there the last time - you'll likely not
               | notice if it's not out of place. You'll just assume it
               | was there and you weren't paying attention.
               | 
               | We don't memorize things that the environment remembers
               | for us if they aren't relevant for other reasons.
        
             | lawlessone wrote:
              | It reminds me of dreaming. When you do something and turn
             | back to check it has turned into something completely
             | different.
             | 
             | edit: someone should train it on MyHouse.wad
        
             | robotresearcher wrote:
              | Not noticing a gorilla that 'shouldn't' be there is not
             | the same thing as object permanence. Even quite young
             | babies are surprised by objects that go missing.
        
               | dewarrn1 wrote:
               | That's absolutely true. It's also well-established by
               | Simons et al. and others that healthy normal adults
               | maintain only a very sparse visual representation of
               | their surroundings, anchored but not perfectly predicted
               | by attention, and this drives the unattended gorilla
               | phenomenon (along with many others). I don't work in this
               | domain, but I would suggest that object permanence
               | probably starts with attending and perceiving an object,
               | whereas the inattentional or change blindness phenomena
               | mostly (but not exclusively) occur when an object is not
               | attended (or only briefly attended) _or_ attention is
               | divided by some competing task.
        
             | throwway_278314 wrote:
             | Work which exaggerates the blindness.
             | 
             | The people were told to focus very deeply on a certain
             | aspect of the scene. Maintaining that focus means
             | explicitly blocking things not related to that focus. Also,
              | there is social pressure at the end to have performed well
             | at the task; evaluating them on a task which is
             | intentionally completely different than the one explicitly
             | given is going to bias people away from reporting gorillas.
             | 
             | And also, "notice anything unusual" is a pretty vague
             | prompt. No-one in the video thought the gorillas were
             | unusual, so if the PEOPLE IN THE SCENE thought gorillas
             | were normal, why would I think they were strange? Look at
             | any TV show, they are all full of things which are pretty
             | crazy unusual in normal life, yet not unusual in terms of
             | the plot.
             | 
             | Why would you think the gorillas were unusual?
        
               | dewarrn1 wrote:
               | I understand what you mean. I believe that the authors
               | would contend that what you're describing is a typical
               | attentional state for an awake/aware human: focused
               | mostly on one thing, and with surprisingly little
               | awareness of most other things (until/unless they are in
               | turn attended).
               | 
               | Furthermore, even what we attend to isn't always
               | represented with all that much detail. Simons has a whole
               | series of cool demonstration experiments where they show
               | that they can swap out someone you're speaking with (an
               | unfamiliar conversational partner like a store clerk or
               | someone asking for directions), and you may not even
               | notice [0]. It's rather eerie.
               | 
               | [0]: https://www.youtube.com/watch?v=FWSxSQsspiQ&t=5s
        
           | alickz wrote:
           | is that something that can be solved with more
           | memory/attention/context?
           | 
           | or do we believe it's an inherent limitation in the approach?
        
             | noiv wrote:
             | I think the real question is does the player get shot from
             | behind?
        
               | alickz wrote:
               | great question
               | 
               | tangentially related but Grand Theft Auto speedrunners
               | often point the camera behind them while driving so cars
               | don't spawn "behind" them (aka in front of the car)
        
           | nmstoker wrote:
           | I saw a longer video of this that Ethan Mollick posted and in
           | that one, the sequences are longer and they do appear to
           | demonstrate a fair amount of consistency. The clips don't
           | backtrack in the summary video on the paper's home page
            | because they're showing a number of distinct environments but
           | you only get a few seconds of each.
           | 
           | If I studied the longer one more closely, I'm sure
           | inconsistencies would be seen but it seemed able to recall
           | presence/absence of destroyed items, dead monsters etc on
           | subsequent loops around a central obstruction that completely
           | obscured them for quite a while. This did seem pretty odd to
           | me, as I expected it to match how you'd described it.
        
             | wavemode wrote:
             | Yes it definitely is very good for simulating gameplay
             | footage, don't get me wrong. Its input for predicting the
             | next frame is not just the previous frame, it has access to
             | a whole sequence of prior frames.
             | 
             | But to say the model is simulating actual gameplay (i.e.
             | that a person could actually play Doom in this) is far
             | fetched. It's definitely great that the model was able to
             | remember that the gray wall was still there after we turned
             | around, but it's untenable for actual gameplay that the
             | wall completely changed location and orientation.
        
               | dr_dshiv wrote:
               | It's an empirical question, right? But they didn't do
               | it...
        
               | TeMPOraL wrote:
               | > _it 's untenable for actual gameplay that the wall
               | completely changed location and orientation._
               | 
               | It would in an SCP-themed game. Or dreamscape/Inception
               | themed one.
               | 
               | Hell, "you're trapped in Doom-like dreamscape, escape
               | before you lose your mind" is a very interesting pitch
               | for a game. Basically take this Doom thing and make
                | walking through a specific, unique-looking doorway from
               | the original game to be the victory condition - the
               | player's job would be to coerce the model to generate it,
               | while also not dying in the Doom fever dream game itself.
               | I'd play the hell out of this.
               | 
                | (Implementation-wise, just loop in a simple recognition
                | model to continuously evaluate the victory condition from
                | the last few frames, and some OCR to detect when the
                | player's hit points indicator on the HUD drops to zero.)
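                | 
                | (Sketch of that wrapper loop, in case it's not obvious -
                | every name here (step_model, doorway_score, read_hud_hp)
                | is a made-up stand-in, not a real API:)
                | 
                |     import random
                | 
                |     def step_model(history, action):
                |         # stand-in for the GameNGen-style model:
                |         # next hallucinated frame from recent history
                |         return [[0] * 32 for _ in range(20)]
                | 
                |     def doorway_score(recent_frames):
                |         # stand-in recognizer: how likely the target
                |         # doorway is on screen right now
                |         return random.random() * 0.1
                | 
                |     def read_hud_hp(frame):
                |         # stand-in OCR of the HUD hit-points counter
                |         return 100
                | 
                |     def play_dreamscape(max_steps=10_000):
                |         frames = [step_model([], None)]
                |         for _ in range(max_steps):
                |             action = random.choice("wasd")
                |             # keep only an 8-frame window of history
                |             frames = frames[-7:] + [
                |                 step_model(frames, action)]
                |             if doorway_score(frames) > 0.95:
                |                 return "escaped"
                |             if read_hud_hp(frames[-1]) <= 0:
                |                 return "died in the dream"
                |         return "lost their mind"
                | 
                |     print(play_dreamscape())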
               | 
               | (I'll happily pay $100 this year to the first project
               | that gets this to work. I bet I'm not the only one.
               | Doesn't have to be Doom specifically, just has to be
               | interesting.)
        
               | wavemode wrote:
               | To be honest, I agree! That would be an interesting
               | gameplay concept for sure.
               | 
               | Mainly just wanted to temper expectations I'm seeing
               | throughout this thread that the model is actually
               | simulating Doom. I don't know what will be required to
               | get from here to there, but we're definitely not there
               | yet.
        
               | ValentinA23 wrote:
               | What you're pointing at mirrors the same kind of
               | limitation in using LLMs for role-play/interactive
               | fictions.
        
               | lawlessone wrote:
               | Maybe a hybrid approach would work. Certain things like
               | inventory being stored as variables, lists etc.
               | 
               | Wouldn't be as pure though.
        
               | crooked-v wrote:
               | Give it state by having a rendered-but-offscreen pixel
               | area that's fed back in as byte data for the next frame.
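                | 
                | Something like this, say (toy numpy sketch, sizes
                | arbitrary):
                | 
                |     import numpy as np
                | 
                |     VIS_H, W, SCR_H = 200, 320, 8
                |     frame = np.zeros((VIS_H + SCR_H, W, 3), np.uint8)
                |     visible = frame[:VIS_H]  # what the player sees
                |     # hidden rows the model writes/reads as state
                |     scratch = frame[VIS_H:]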
        
               | KajMagnus wrote:
               | Or if training the model on many FPS games? Surviving in
               | one nightmare that morphs into another, into another,
               | into another ...
        
               | kridsdale1 wrote:
               | Check out the actual modern DOOM WAD MyHouse which
               | implements these ideas. It totally breaks our
               | preconceptions of what the DOOM engine is capable of.
               | 
               | https://en.wikipedia.org/wiki/MyHouse.wad
        
               | jsheard wrote:
               | MyHouse is excellent, but it mostly breaks our perception
               | of what the Doom engine is capable of by not _really_
               | using the Doom engine. It leans heavily on engine
               | features which were embellishments by the GZDoom project,
               | and never existed in the original Doom codebase.
        
           | hoosieree wrote:
           | Small objects like powerups appear and disappear as the
           | player moves (even without backtracking), the ammo count is
           | constantly varying, getting shot doesn't deplete health or
           | armor, etc.
        
           | TeMPOraL wrote:
           | So for the next iteration, they should add a minimap overlay
           | (perhaps on a side channel) - it should help the model give
           | more consistent output in any given location. Right now, the
           | game is very much like a lucid dream - the universe makes
           | sense from moment to moment, but without outside reference,
           | everything that falls out of short-term memory (few frames
           | here) gets reimagined.
        
           | Workaccount2 wrote:
           | I don't see this as something that would be hard to overcome.
           | Sora for instance has already shown the ability for a
           | diffusion model to maintain object permanence. Flux recently
           | too has shown the ability to render the same person in many
           | different poses or images.
        
             | idunnoman1222 wrote:
             | Where does a sora video turn around backwards? I can't
             | maintain such consistency in my own dreams.
        
               | Workaccount2 wrote:
               | I don't know of an example (not to say it doesn't exist)
               | but the problem is fundamentally the same as things
               | moving out of sight/out of frame and coming back again.
        
               | Jensson wrote:
               | > the problem is fundamentally the same as things moving
               | out of sight/out of frame and coming back again
               | 
               | Maybe it is, but doing that with the entire scene instead
               | of just a small part of it makes the problem massively
               | harder, as the model needs to grow exponentially to
               | remember more things. It isn't something that we will
               | manage anytime soon, maybe 10-20 years with current
               | architecture and same compute progress.
               | 
               | Then you make that even harder by remembering a whole
               | game level? No, ain't gonna happen in our lifetimes
               | without massive changes to the architecture. They would
               | need to make a different model keep track of level state
               | etc, not just an image to image model.
        
               | Workaccount2 wrote:
               | 10 to 20 years sounds wildly pessimistic
               | 
                | In this Sora video the dragon covers half the scene, and
                | it's basically identical when it is revealed again ~5
                | seconds later, or about 150 frames later. There is lots
                | of evidence (and some studies) that these models are in
                | fact building internal world models.
               | 
               | https://www.youtube.com/watch?v=LXJ-yLiktDU
               | 
               | Buckle in, the train is moving way faster. I don't think
               | there would be much surprise if this is solved in the
               | next few generations of video generators. The first
               | generation is already doing very well.
        
               | Jensson wrote:
                | Did you watch the video? It is completely different after
                | the dragon goes past. There's still a flag there, but
                | everything else changed. Even the stores in the
                | background changed, and the mass of people is completely
                | different, with no hint of anyone moving there, etc.
                | 
                | You always get this from AI enthusiasts, they come and
                | post "proof" that disproves their own point.
        
               | HappMacDonald wrote:
                | I'm not GP, but going over that video I'm actually having
                | a hard time finding any detail present before the dragon
                | obscures it that doesn't either exit frame right when the
                | camera pans left slightly near the end, or re-appear with
                | reasonably crisp detail after the dragon gets out of the
                | way.
               | 
               | Most of the mob of people are indistinct, but there is a
               | woman in a lime green coat who is visible, and then
               | obstructed by the dragon twice (beard and ribbon) and
               | reappears fine. Unfortunately when dragon fully moves
               | past she has been lost to frame right.
               | 
               | There is another person in black holding a red satchel
               | which is visible both before and after the dragon has
               | passed.
               | 
               | Nothing about the storefronts appear to change. The
               | complex sign full of Chinese text (which might be
               | gibberish text: it's highly stylized and I don't know
               | Chinese) appears to survive the dragon passing without
               | even any changes to the individual ideograms.
               | 
               | There is also a red box shaped like a Chinese paper
               | lantern with a single gold ideogram on it at the store
               | entrance which spends most of the video obscured by the
               | dragon and is still in the same location after it passes
               | (though video artifacting makes it more challenging to
                | verify that the ideogram is unchanged, it certainly does
               | not appear substantially different)
               | 
               | What detail are you seeing that is different before and
               | after the obstruction?
        
           | nielsbot wrote:
           | You can also notice in the first part of the video the ammo
           | numbers fluctuate a bit randomly.
        
         | raghavbali wrote:
          | Nicely summarised. Another important thing that clearly
          | stands out (not to undermine the effort and work that went
          | into this) is the fact that more and more we are now seeing
          | larger and more complex building blocks emerging (first it was
          | embedding models, then encoder-decoder layers, and now whole
          | models are being duct-taped together into even more powerful
          | pipelines). The AI/DL ecosystem is growing on a nice
          | trajectory.
         | 
         | Though I wonder if 10 years down the line folks wouldn't even
         | care about underlying model details (no more than a current day
         | web-developer needs to know about network packets).
         | 
         | PS: Not great examples, but I hope you get the idea ;)
        
         | pradn wrote:
         | > Google here uses SD 1.4, as the core of the diffusion model,
         | which is a nice reminder that open models are useful to even
         | giant cloud monopolies.
         | 
         | A mistake people make all the time is that massive companies
         | will put all their resources toward every project. This paper
         | was written by four co-authors. They probably got a good amount
         | of resources, but they still had to share in the pool allocated
         | to their research department.
         | 
         | Even Google only has one Gemini (in a few versions).
        
       | zzanz wrote:
       | The quest to run doom on everything continues. Technically
       | speaking, isn't this the greatest possible anti-Doom, the Doom
       | with the highest possible hardware requirement? I just find it
       | funny that on a linear scale of hardware specification, Doom now
       | finds itself on both ends.
        
         | fngjdflmdflg wrote:
         | >Technically speaking, isn't this the greatest possible anti-
         | Doom
         | 
         | When I read this part I thought you were going to say because
         | you're technically _not_ running Doom at all. That is, instead
          | of running Doom without Doom's original hardware/software
         | environment (by porting it), you're running Doom without Doom
         | itself.
        
           | bugglebeetle wrote:
           | Pierre Menard, Author of Doom.
        
             | el_memorioso wrote:
             | I applaud your erudition.
        
             | 1attice wrote:
             | that took a moment, thank you
        
             | airstrike wrote:
             | OK, this is the single most perfect comment someone could
             | make on this thread. Diffusion me impressed.
        
             | jl6 wrote:
             | Knee Deep in the Death of the Author.
        
           | ynniv wrote:
           | It's _dreaming_ Doom.
        
             | birracerveza wrote:
             | We made machines dream of Doom. Insane.
        
               | daemin wrote:
               | Time to make a sheep mod for Doom.
        
             | qingcharles wrote:
             | _Do Robots Dream of E1M1?_
        
         | x-complexity wrote:
         | > Technically speaking, isn't this the greatest possible anti-
         | Doom, the Doom with the highest possible hardware requirement?
         | 
         | Not really? The greatest anti-Doom would be an infinite nest of
         | these types of models predicting models predicting Doom at the
         | very end of the chain.
         | 
         | The next step of anti-Doom would be a model generating the
         | model, generating the Doom output.
        
           | nurettin wrote:
           | Isn't this technically a model (training step) generating a
           | model (a neural network) generating Doom output?
        
           | yuchi wrote:
           | "...now it can _implement_ Doom!"
        
         | Vecr wrote:
         | It's the No-Doom.
        
           | WithinReason wrote:
           | Undoom?
        
             | riwsky wrote:
             | It's a mood.
        
             | jeffhuys wrote:
             | Bliss
        
         | Terr_ wrote:
         | > the Doom with the highest possible hardware requirement?
         | 
         | Isn't that possible by setting arbitrarily high goals for ray-
         | cast rendering?
        
       | danjl wrote:
       | So, diffusion models are game engines as long as you already
       | built the game? You need the game to train the model. Chicken.
       | Egg?
        
         | billconan wrote:
         | maybe the next step is adding text guidance and generating non-
         | existing games.
        
         | kragen wrote:
         | here are some ideas:
         | 
         | - you could build a non-real-time version of the game engine
         | and use the neural net as a real-time approximation
         | 
         | - you could edit videos shot in real life to have huds or
         | whatever and train the neural net to simulate reality rather
         | than doom. (this paper used 900 million frames which i think is
         | about a year of video if it's 30fps, but maybe algorithmic
         | improvements can cut the training requirements down) and a year
         | of video isn't actually all that much--like, maybe you could
         | recruit 500 people to play paintball while wearing gopro
         | cameras with accelerometers and gyros on their heads and
         | paintball guns, so that you could get a year of video in a
         | weekend?
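          | 
          | (rough numbers behind that, assuming 30fps throughout:)
          | 
          |     frames = 900_000_000
          |     hours = frames / 30 / 3600  # ~8333 hours of video
          |     print(hours / 24)           # ~347 days, roughly a year
          |     print(hours / 500)          # ~17 hours each, 500 people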
        
           | w_for_wumbo wrote:
           | That feels like the endgame of video game generation. You
           | select an art style, a video and the type of game you'd like
           | to play. The game is then generated in real-time responding
           | to each action with respect to the existing rule engine.
           | 
           | I imagine a game like that could get so convincing in its
           | details and immersiveness that one could forget they're
           | playing a game.
        
             | THBC wrote:
             | Holodeck is just around the corner
        
               | amelius wrote:
               | Except for haptics.
        
             | omegaworks wrote:
             | EXISTENZ IS PAUSED!
        
             | numpad0 wrote:
              | IIRC, both _2001_ (1968) and _Solaris_ (1972) depict that
              | kind of thing as part of an alien euthanasia process, not
              | as a happy ending
        
               | hypertele-Xii wrote:
               | Also The Matrix, Oblivion, etc.
        
               | catanama wrote:
               | Well, 2001 is actually a happy ending, as Dave is reborn
               | as a cosmic being. Solaris, at least in the book, is an
               | attempt by the sentient ocean to communicate with
               | researchers through mimics.
        
             | aithrowaway1987 wrote:
             | Have you ever played a video game? This is unbelievably
             | depressing. This is a future where games like Slay the
             | Spire, with a unique art style and innovative gameplay
             | simply are not being made.
             | 
             | Not to mention this childish nonsense about "forget they're
             | playing a game," as if every game needs to be lifelike VR
             | and there's no room for stylization or imagination. I am
             | worried for the future that people think they want these
             | things.
        
               | idiotsecant wrote:
                | It's a good thing. When the printing press was invented
               | there were probably monks and scribes who thought that
               | this new mechanical monster that took all the individual
               | flourish out of reading was the end of literature.
               | Instead it became a tool to make literature better and
               | just removed a lot of drudgery. Games with individual
               | style and design made by people will of course still
               | exist. They'll just be easier to make.
        
               | Workaccount2 wrote:
                | The problem is quite the opposite: that AI will be able
                | to generate so many games with so many play styles that
                | it will totally dilute the value of all games.
               | 
               | Compare it to music gen algo's that can now produce music
               | that is 100% indiscernible from generic crappy music.
               | Which is insane given that 5 years ago it could maybe
               | create the sound of something that maybe someone would
               | describe as "sort of guitar-like". At this rate of
               | progress it's probably not going to be long before AI is
               | making better music than humans. And it's infinitely
               | available too.
        
             | troupo wrote:
             | There are thousands of games that mimic each other, and
             | only a handful of them are any good.
             | 
             | What makes you think a mechanical "predict next frame based
             | on existing games" will be any good?
        
           | injidup wrote:
            | Why games? I will train it on 1 year's worth of me attending
            | Microsoft Teams meetings. Then I will go surfing.
        
             | akie wrote:
             | Ready to pay for this
        
             | ccozan wrote:
             | most underrated comment here!
        
             | kqr wrote:
             | Even if you spend 40 hours a week in video conferences,
              | you'll have to work for over four years to get one year's
             | worth of footage. Of course, by then the models will be
             | even better and so you might actually have a chance of
             | going surfing.
             | 
             | I guess I should start hoarding video of myself now.
        
               | kragen wrote:
               | the neural net doesn't need a year of video to train to
               | simulate your face; it can do that from a single photo.
               | the year of video is to learn how to play the game, and
               | in most cases lots of people are playing the same game,
               | so you can dump all their video in the same training set
        
           | qznc wrote:
           | The Cloud Gaming platforms could record things for training
           | data.
        
         | modeless wrote:
         | If you train it on multiple games then you could produce new
         | games that have never existed before, in the same way image
         | generation models can produce new images that have never
         | existed before.
        
           | lewhoo wrote:
           | From what I understand that could make the engine much less
           | stable. The key here is repetitiveness.
        
           | jsheard wrote:
           | It's unlikely that such a procedurally generated mashup would
           | be perfectly coherent, stable and most importantly _fun_
           | right out of the gate, so you would need some way to reach
           | into the guts of the generated game and refine it. If
           | properties as simple as  "how much health this enemy type
           | has" are scattered across an enormous inscrutable neural
           | network, and may not even have a single consistent definition
           | in all contexts, that's going to be quite a challenge.
           | Nevermind if the game just catastrophically implodes and you
           | have to "debug" the model.
        
         | slashdave wrote:
         | Well, yeah. Image diffusion models only work because you can
         | provide large amounts of training data. For Doom it is even
         | simpler, since you don't need to deal with compositing.
        
         | attilakun wrote:
         | If only there was a rich 3-dimensional physical environment we
         | could draw training data from.
        
         | passion__desire wrote:
         | Maybe, in future, techniques of Scientific Machine Learning
         | which can encode physics and other known laws into a model
         | would form a base model. And then other models on top could
         | just fine tune aspects to customise a game.
        
       | ravetcofx wrote:
       | There is going to be a flood of these dreamlike "games" in the
        | next few years. This feels like a bit of a breakthrough in the
       | engineering of these systems.
        
       | wkcheng wrote:
        | It's insane that this works, and that it works fast enough
       | to render at 20 fps. It seems like they almost made a cross
       | between a diffusion model and an RNN, since they had to encode
       | the previous frames and actions and feed it into the model at
       | each step.
       | 
       | Abstractly, it's like the model is dreaming of a game that it
       | played a lot of, and real time inputs just change the state of
       | the dream. It makes me wonder if humans are just next moment
       | prediction machines, with just a little bit more memory built in.
        
         | Teever wrote:
         | Also recursion and nested virtualization. We can dream about
         | dreaming and imagine different scenarios, some completely
         | fictional or simply possible future scenarios all while doing
         | day to day stuff.
        
         | lokimedes wrote:
         | It makes good sense for humans to have this ability. If we flip
         | the argument, and see the next frame as a hypothesis for what
         | is expected as the outcome of the current frame, then comparing
         | this "hypothesis" with what is sensed makes it easier to
         | process the differences, rather than the totality of the
         | sensory input.
         | 
         | As Richard Dawkins recently put it in a podcast[1], our genes
         | are great prediction machines, as their continued survival
         | rests on it. Being able to generate a visual prediction fits
         | perfectly with the amount of resources we dedicate to sight.
         | 
         | If that is the case, what does aphantasia tell us?
         | 
         | [1] https://podcasts.apple.com/dk/podcast/into-the-impossible-
         | wi...
        
           | quickestpoint wrote:
           | As Richard Dawkins theorized, would be more accurate and less
           | LLM like :)
        
           | jonplackett wrote:
           | What's the aphantasia link? I've got aphantasia. I'm
           | convinced though that the bit of my brain that should be
           | making images is used for letting me 'see' how things are
           | connected together very easily in my head. Also I still love
           | games like Pictionary and can somehow draw things onto paper
            | that I don't really know what they look like in my head. It's
           | often a surprise when pen meets paper.
        
             | lokimedes wrote:
              | I agree, it is my own experience as well. Craig Venter, in
              | one of his books, also credits this way of representing
              | knowledge as abstractions as his strength in inventing new
              | concepts.
             | 
             | The link may be that we actually see differences between
             | "frames", rather than the frames directly. That in itself
              | would imply that a form of sub-visual representation is
             | being processed by our brain. For aphantasia, it could be
             | that we work directly on this representation instead of
             | recalling imagery through the visual system.
             | 
             | Many people with aphantasia reports being able to visualize
             | in their dreams, meaning that they don't lack the ability
             | to generate visuals. So it may be that the brain has an
             | affinity to rely on the abstract representation when
             | "thinking", while dreaming still uses the "stable diffusion
             | mode".
             | 
              | I'm nowhere near qualified to speak of this with
             | certainty, but it seems plausible to me.
        
           | dbspin wrote:
           | Worth noting that aphantasia doesn't necessarily extend to
           | dreams. Anecdotally - I have pretty severe aphantasia (I can
            | conjure millisecond glimpses of barely tangible imagery that I
           | can't quite perceive before it's gone - but only since
           | learning that visualisation wasn't a linguistic metaphor). I
           | can't really simulate object rotation. I can't really
           | 'picture' how things will look before they're drawn / built
           | etc. However I often have highly vivid dream imagery. I also
           | have excellent recognition of faces and places (e.g.: can't
           | get lost in a new city). So there clearly is a lot of
           | preconscious visualisation and image matching going on in
           | some aphantasia cases, even where the explicit visual screen
           | is all but absent.
        
             | zimpenfish wrote:
             | Pretty much the same for me. My aphantasia is total (no
             | images at all) but still ludicrously vivid dreams and not
             | too bad at recognising people and places.
        
             | lokimedes wrote:
             | I fabulate about this in another comment below:
             | 
             | > Many people with aphantasia reports being able to
             | visualize in their dreams, meaning that they don't lack the
             | ability to generate visuals. So it may be that the
             | [aphantasia] brain has an affinity to rely on the abstract
             | representation when "thinking", while dreaming still uses
             | the "stable diffusion mode".
             | 
             | (I obviously don't know what I'm talking about, just a
             | fellow aphant)
        
               | dbspin wrote:
               | Obviously we're all introspecting here - but my guess is
               | that there's some kind of cross talk in aphantasic brains
               | between the conscious narrating semantic brain and the
               | visual module. Such that default mode visualisation is
               | impaired. It's specifically the loss of reflexive
               | consciousness that allows visuals to emerge. Not sure if
               | this is related, but I have pretty severe chronic
               | insomnia, and I often wonder if this in part relates to
               | the inability to drift off into imagery.
        
               | drowsspa wrote:
               | Yeah. In my head it's like I'm manipulating SVG paths
               | instead of raw pixels
        
         | slashdave wrote:
         | Image is 2D. Video is 3D. The mathematical extension is
         | obvious. In this case, low resolution 2D (pixels), and the
         | third dimension is just frame rate (discrete steps). So rather
         | simple.
        
           | Sharlin wrote:
           | This is not "just" video, however. It's interactive in real
           | time. Sure, you can say that playing is simply video with
           | some extra parameters thrown in to encode player input, but
           | still.
        
             | slashdave wrote:
             | It is just video. There are no external interactions.
             | 
             | Heck, it is far simpler than video, because the point of
             | view and frame is fixed.
        
               | raincole wrote:
               | ?
               | 
               | I highly suggest you at least skim the paper before
               | commenting on the topic. The whole point is that it's not
               | just generating a video.
        
               | slashdave wrote:
               | I did. It is generating a video, using latent information
               | on player actions during the process (which it also
               | predicts). It is not interactive.
        
               | SeanAnderson wrote:
               | I think you're mistaken. The abstract says it's
               | interactive, "We present GameNGen, the first game engine
               | powered entirely by a neural model that enables real-time
               | interaction"
               | 
               | Further - "a diffusion model is trained to produce the
               | next frame, conditioned on the sequence of past frames
               | and actions." specifically "and actions"
               | 
               | User input is being fed into this system and subsequent
               | frames take that into account. The user is "actually"
               | firing a gun.
        
               | nopakos wrote:
               | Maybe it's so advanced, it knows the players' next moves,
               | so it is a video!
        
               | slashdave wrote:
               | I guess you are being sarcastic, except this is precisely
               | what it is doing. And it's not hard: player movement is
               | low information and probably not the hardest part of the
               | model.
        
               | smusamashah wrote:
               | It's interactive, but can it go beyond what it learned
               | from the videos? As in, can the camera break free and
               | roam around the map from different angles? I don't think
               | it will be able to do that at all. There are still a few
               | hallucinations in this rendering; it doesn't look like it
               | understands 3D.
        
               | Sharlin wrote:
               | You might be surprised. Generating views from new
               | angles based on a single image is nothing new, and if
               | anything, this model has more than a single frame as
               | input. I'd wager that it's quite able to extrapolate
               | DOOM-like corridors and rooms even if it hasn't seen the
               | exact place during training. And sure, it's imperfect but
               | on the other hand _it works in real time_ on a single
               | TPU.
        
               | hypertele-Xii wrote:
               | Then why do monsters become blurry smudgy messes when
               | shot? That looks like a video compression artifact of a
               | neural network attempting to replicate a low-structure
               | image (the source material contains guts exploding, a
               | very unstructured visual).
        
               | Sharlin wrote:
               | Uh, maybe because monster death animations make up a
               | small part of the training material (ie. gameplay) so the
               | model has not learned to reproduce them very well?
               | 
               | There cannot be "video compression artifacts" because it
               | hasn't even seen any compressed video during training, as
               | far as I can see.
               | 
               | Seriously, how is this even a discussion? The article is
               | clear that the novel thing is that this is real-time
               | frame generation conditioned on the previous frame(s)
               | _AND player actions._ Just generating video would be
               | nothing new.
        
               | psb217 wrote:
               | In a sense, poorly reproducing rare content is a form of
               | compression artifact. Ie, since this content occurs
               | rarely in the training set, it will have less impact on
               | the gradients and thus less impact on the final form of
               | the model. Roughly speaking, the model is allocating
               | fewer bits to this content, by storing less information
               | about this content in its parameters, compared to content
               | which it sees more often during training. I think this
               | isn't too different from certain aspects of images,
               | videos, music, etc., being distorted in different ways
               | based on how a particular codec allocates its available
               | bits.
        
               | slashdave wrote:
               | No, I am not. The interaction is part of the training,
               | and is used during inference, but it is not included
               | during the process of generation.
        
               | SeanAnderson wrote:
               | Okay, I think you're right. My mistake. I read through
               | the paper more closely and I found the abstract to be a
               | bit misleading compared to the contents. Sorry.
        
               | slashdave wrote:
               | Don't worry. The paper is not very well written.
        
               | psb217 wrote:
               | Academic authors are consistently better at editing away
               | unclear and ambiguous statements which make their work
               | seem less impressive compared to ones which make their
               | work seem more impressive. Maybe it's just a coincidence,
               | lol.
        
           | InDubioProRubio wrote:
           | Video also carries more information: moving through the
           | world makes pixels change, and that change encodes detail
           | beyond any single frame. Swivel your head without glasses
           | and even the blurry world conveys more information in the
           | way the pixels shift.
        
             | slashdave wrote:
             | Correct, for the sprites. However, the walls in Doom are
             | texture mapped, and so have the same issue as videos.
             | Interesting, though, because I assume the antialiasing is
             | something approximate, given the extreme demands on CPUs of
             | the era.
        
         | stevenhuang wrote:
         | > It makes me wonder if humans are just next moment prediction
         | machines, with just a little bit more memory built in.
         | 
         | Yup, see https://en.wikipedia.org/wiki/Predictive_coding
        
           | quickestpoint wrote:
           | Umm, that's a theory.
        
             | mind-blight wrote:
             | So are gravity and friction. I don't know how well tested
             | or accepted it is, but being just a theory doesn't tell you
             | much about how true it is without more info
        
         | richard___ wrote:
         | Did they take in the entire history as context?
        
         | nsbk wrote:
         | We are. At least that's what Lisa Feldman Barrett [1] thinks.
         | It is worth listening to this Lex Fridman podcast:
         | Counterintuitive Ideas About How the Brain Works [2], where she
         | explains among other ideas how constant prediction is the most
         | efficient way of running a brain as opposed to reaction. I
         | never get tired of listening to her, she's such a great science
         | communicator.
         | 
         | [1] https://en.wikipedia.org/wiki/Lisa_Feldman_Barrett
         | 
         | [2] https://www.youtube.com/watch?v=NbdRIVCBqNI&t=1443s
        
           | PunchTornado wrote:
           | Interesting talk about the brain, but the stuff she says
           | about free will is not a very good argument. Basically it is
           | the sort of argument the ancient Greeks made, which leaves
           | the discussion at a point where you can take it in either
           | direction.
        
         | mensetmanusman wrote:
         | Penrose (Nobel Prize in Physics) speculates that quantum
         | effects in the brain may allow a certain amount of time
         | travel and backpropagation to accomplish this.
        
           | wrsh07 wrote:
           | You don't need back propagation to learn
           | 
           | This is an incredibly complex hypothesis that doesn't really
           | seem justified by the evidence
        
         | wrsh07 wrote:
         | Makes me wonder when an update to the world models paper comes
         | out where they drop in diffusion models:
         | https://worldmodels.github.io/
        
         | dartos wrote:
         | > It makes me wonder if humans are just next moment prediction
         | machines, with just a little bit more memory built in.
         | 
         | This, to me, seems extremely reductionist. Like you start with
         | AI and work backwards until you frame all cognition as next
         | something predictors.
         | 
         | It's just the stochastic parrot argument again.
        
         | bangaladore wrote:
         | > It's insane that that this works, and that it works fast
         | enough to render at 20 fps.
         | 
         | It is running on an entire v5 TPU
         | (https://cloud.google.com/blog/products/ai-machine-
         | learning/i...)
         | 
         | It's unclear how that compares to a high-end consumer GPU like
         | a 3090, but they seem to have similar INT8 TFLOPS. The TPU has
         | less memory (16 GB vs. 24 GB), and I'm unsure of the other specs.
         | 
         | Something doesn't add up, in my opinion, though. SD usually
         | takes (at minimum) seconds to produce a high-quality result on
         | a 3090, so I can't comprehend how they are roughly two orders
         | of magnitude faster--indicating that the TPU vastly outperforms a
         | GPU for this task. They seem to be producing low-res (320x240)
         | images, but it still seems too fast.
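         | 
         | A rough back-of-envelope (using assumed numbers, not figures from
         | the paper; the step count and guidance setting are my guesses)
         | suggests where a couple of orders of magnitude could come from:
         | 
         |     # Typical SD 1.4 render: 512x512, ~50 denoising steps, with
         |     # classifier-free guidance (two U-Net passes per step).
         |     baseline_steps = 50
         |     baseline_pixels = 512 * 512
         |     # Assumed game setup: 320x240 output, a handful of steps, no CFG.
         |     game_steps = 4
         |     game_pixels = 320 * 240
         |     cfg_saving = 2
         | 
         |     speedup = (baseline_steps / game_steps) \
         |               * (baseline_pixels / game_pixels) * cfg_saving
         |     print(f"~{speedup:.0f}x fewer U-Net passes' worth of work per frame")
         |     # ~85x, ignoring that attention doesn't scale exactly linearly
         |     # with pixel count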
        
       | masterspy7 wrote:
       | There's been a ton of work to generate assets for games using AI:
       | 3d models, textures, code, etc. None of that may even be
       | necessary with a generative game engine like this! If you could
       | scale this up, train on all games in existence, etc. I bet some
       | interesting things would happen
        
         | rererereferred wrote:
         | But can you grab what this AI has learned and generate the 3d
         | models, maps and code to turn it into an actual game that can
         | run on a user's PC? That would be amazing.
        
           | passion__desire wrote:
           | Jensen Huang's vision that future games will be generated
           | rather than rendered is coming true.
        
           | kleiba wrote:
           | What would be the point? This model has been trained on an
           | existing game, so turning it back into assets, maps, and code
           | would just give you a copy of the original game you started
           | with. I suppose you could create variations of it then...
           | but:
           | 
           | You don't even need to do all of that - this trained model
           | _already is_ the game, i.e., it's interactive, you can play
           | the game.
        
         | whamlastxmas wrote:
         | I would absolutely love if they could take this demo, add a new
         | door that isn't in the original, and see what it generates
         | behind that door
        
       | refibrillator wrote:
       | There is no text conditioning provided to the SD model because
       | they removed it, but one can imagine a near future where text
       | prompts are enough to create a fun new game!
       | 
       | Yes they had to use RL to learn what DOOM looks like and how it
       | works, but this doesn't necessarily pose a chicken vs egg
       | problem. In the same way that LLMs can write a novel story,
       | despite only being trained on existing text.
       | 
       | IMO one of the biggest challenges with this approach will be open
       | world games with essentially an infinite number of possible
       | states. The paper mentions that they had trouble getting RL
       | agents to completely explore every nook and corner of DOOM.
       | Factorio or Dwarf Fortress probably won't be simulated anytime
       | soon...I think.
        
         | mlsu wrote:
         | With enough computation, your neural net weights would converge
         | to some very compressed latent representation of the source
         | code of DOOM. Maybe smaller even than the source code itself?
         | Someone in the field could probably correct me on that.
         | 
         | At which point, you effectively would be interpolating in
         | latent space through the source code to actually "render" the
         | game. You'd have an entire latent space computer, with an
         | engine, assets, textures, a software renderer.
         | 
         | With a sufficiently powerful computer, one could imagine what
         | interpolating in this latent space between, say, Factorio and
         | TF2 (two of my favorites) would look like. And imagine tweaking
         | this latent space to your liking by conditioning it on any
         | number of gameplay aspects.
         | 
         | This future comes very quickly for subsets of the pipeline,
         | like the very end stage of rendering -- DLSS is already in
         | production, for example. Maybe Nvidia's revenue wraps back to
         | gaming once again, as we all become bolted into a neural
         | metaverse.
         | 
         | God I love that they chose DOOM.
        
           | energy123 wrote:
           | The source code lacks information required to render the
           | game. Textures for example.
        
             | TeMPOraL wrote:
             | Obviously assets would get encoded too, in some form. Not
             | necessarily corresponding to the original bitmaps: if the
             | game does some consistent post-processing, the encoded
             | thing would more likely be (equivalent to) the post-
             | processed state.
        
               | hoseja wrote:
               | Finally, the AI superoptimizing compiler.
        
             | mistercheph wrote:
             | That's just an artifact of the language we use to describe
             | an implementation detail; in the sense GP means it, the
             | data payload bits are not essentially distinct from the
             | executable instruction bits
        
           | electrondood wrote:
           | The Holographic Principle is the idea that our universe is a
           | projection of a higher dimensional space, which sounds an
           | awful lot like the total simulation of an interactive
           | environment, encoded in the parameter space of a neural
           | network.
           | 
           | The first thing I thought when I saw this was: couldn't my
           | immediate experience be exactly the same thing? Including the
           | illusion of a separate main character to whom events are
           | occurring?
        
           | Jensson wrote:
           | > With enough computation, your neural net weights would
           | converge to some very compressed latent representation of the
           | source code of DOOM. Maybe smaller even than the source code
           | itself? Someone in the field could probably correct me on
           | that.
           | 
           | Neural nets are not guaranteed to converge to anything even
           | remotely optimal, so no that isn't how it works. Also even
           | though neural nets can approximate any function they usually
           | can't do it in a time or space efficient manner, resulting in
           | much larger programs than the human written code.
        
             | mlsu wrote:
             | Could is certainly a better word, yes. There is no
             | guarantee that it will happen, only that it could. The
             | existence of LLMs is proof of that; imagine how large and
             | inefficient a handwritten computer program to generate the
             | next token would be. On the flip side, the fact that human
             | beings predict the next token very effectively, and do much
             | more, on 5 watts is proof that LLMs in their current form
             | certainly are not the most efficient method for generating
             | the next token.
             | 
             | I don't really know why everyone is piling on me here.
             | Sorry for a bit of fun speculating! This model is on the
             | continuum. There _is_ a latent representation of Doom in
             | weights: _some_ weights, not _these_ weights. Therefore
             | _some_ representation of Doom in a neural net _could_
             | become more efficient over time. That's really the point
             | I'm trying to make.
        
           | godelski wrote:
           | > With enough computation, your neural net weights would
           | converge to some very compressed latent representation of the
           | source code of DOOM.
           | 
           | You and I have very different definitions of compression
           | 
           | https://news.ycombinator.com/item?id=41377398
           | > Someone in the field could probably correct me on that.
           | 
           | ^__^
        
             | _hark wrote:
             | The raw capacity of the network doesn't tell you how
             | complex the weights actually are. The capacity is only an
             | upper bound on the complexity.
             | 
             | It's easy to see this by noting that you can often prune
             | networks quite a bit without any loss in performance. I.e.
             | the effective dimension of the manifold the weights live on
             | can be much, much smaller than the total capacity allows
             | for. In fact, good regularization is exactly that which
             | encourages the model itself to be compressible.
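             | 
             | A tiny illustration of that point (my own toy example using
             | PyTorch's built-in magnitude pruning, nothing from the paper):
             | 
             |     import torch.nn as nn
             |     import torch.nn.utils.prune as prune
             | 
             |     # Prune 90% of the smallest-magnitude weights in a layer.
             |     # If a trained network keeps its accuracy after this, the
             |     # "effective" complexity of its weights is far below the
             |     # raw parameter count.
             |     layer = nn.Linear(1024, 1024)
             |     prune.l1_unstructured(layer, name="weight", amount=0.9)
             | 
             |     kept = int(layer.weight_mask.sum().item())
             |     total = layer.weight_mask.numel()
             |     print(f"{kept}/{total} weights kept ({kept / total:.0%})")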
        
               | godelski wrote:
               | I think you're confusing capacity with the training
               | dynamics.
               | 
               | Capacity is autological: the amount of information it can
               | express.
               | 
               | Training dynamics are the way the model learns, the
               | optimization process, etc. So this is where things like
               | regularization come into play.
               | 
               | There's also architecture which affects the training
               | dynamics as well as model capacity. Which makes no
               | guarantee that you get the most information dense
               | representation.
               | 
               | Fwiw, the authors did also try distillation.
        
         | basch wrote:
         | Similarly, you could run a very very simple game engine that
         | outputs little more than a low resolution wireframe, and
         | upscale it. Put all of the effort into game mechanics and none
         | into visual quality.
         | 
         | I would expect something in this realm to be a little better at
         | not being visually inconsistent when you look away and look
         | back. A red monster turning into a blue friendly etc.
        
         | slashdave wrote:
         | > where text prompts are enough to create a fun new game!
         | 
         | Not really. This is a reproduction of the first level of Doom.
         | Nothing original is being created.
        
         | radarsat1 wrote:
         | Most games are conditioned on text, it's just that we call it
         | "source code" :).
         | 
         | (Jk of course I know what you mean, but you can seriously see
         | text prompts as compressed forms of programming that leverage
         | the model's prior knowledge)
        
         | troupo wrote:
         | > one can imagine a near future where text prompts are enough
         | to create a fun new game
         | 
         | Sit down and write down a text prompt for a "fun new game". You
         | can start with something relatively simple like a Mario-like
         | platformer.
         | 
         | By page 300, when you're about halfway through describing what
         | you mean, you might understand why this is wishful thinking
        
           | reverius42 wrote:
           | If it can be trained on (many) existing games, then it might
           | work similarly to how you don't need to describe every
           | possible detail of a generated image in order to get
           | something that looks like what you're asking for (and looks
           | like a plausible image for the underspecified parts).
        
             | troupo wrote:
             | Things that might look plausible in a static image will not
             | look plausible when things are moving, especially in a
             | game.
             | 
             | Also: https://news.ycombinator.com/item?id=41376722
             | 
             | Also: define "fun" and "new" in a "simple text prompt".
             | Current image generators suck at properly reflecting what
             | you want exactly, because they regurgitate existing things
             | and styles.
        
         | SomewhatLikely wrote:
         | Video games are gonna be wild in the near future. You could
         | have one person talking to a model producing something that's
         | on par with a AAA title from today. Imagine the 2d sidescroller
         | boom on Steam but with immersive photorealistic 3d games with
         | hyper-realistic physics (water flow, fire that spreads,
         | tornados) and full deformability and buildability because the
         | model is pretrained with real world videos. Your game is just a
         | "style" that tweaks some priors on look, settings, and story.
        
           | user432678 wrote:
           | Sorry, no offence, but you sound like those EA execs wearing
           | expensive suits who have never played a single video game in
           | their entire lives. There's a great documentary on how Half-
           | Life was made. Gabe Newell was interviewed by someone asking
           | "why did you do this and that, it's not realistic", to which
           | he answered
           | "because it's more fun this way, you want realism -- just go
           | outside".
        
         | magicalhippo wrote:
         | This got me thinking. Anyone tried using SD or similar to
         | create graphics for the old classic text adventure games?
        
       | throwmeaway222 wrote:
       | You know how when you're dreaming and you walk into a room at
       | your house and you're suddenly naked at school?
       | 
       | I'm convinced this is the code that gives Data (ST TNG) his
       | dreaming capabilities.
        
       | dean2432 wrote:
       | So in the future we can play FPS games given any setting? Pog
        
       | darrinm wrote:
       | So... is it interactive? Playable? Or just generating a video of
       | gameplay?
        
         | vunderba wrote:
         | From the article: _We present GameNGen, the first game engine
         | powered entirely by a neural model that enables real-time
         | interaction with a complex environment over long trajectories
         | at high quality_.
         | 
         | The demo is actual gameplay at ~20 FPS.
        
           | darrinm wrote:
           | It confused me that their stated evaluations by humans are
           | comparing video clips rather than evaluating game play.
        
             | furyofantares wrote:
             | Short clips are the only way a human will make any errors
             | determining which is which.
        
               | darrinm wrote:
               | More relevant is if by _playing_ it they couldn't tell
               | which is which.
        
               | Jensson wrote:
               | They obviously can within seconds, so it wouldn't be a
               | result. Being able to generate gameplay that looks right
               | even if it doesn't play right is one step.
        
       | kcaj wrote:
       | Take a bunch of videos of the real world and calculate the
       | differential camera motion with optical flow or feature tracking.
       | Call this the video's control input. Now we can play SORA.
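       | 
       | A minimal sketch of how those pseudo control inputs could be
       | extracted (my own illustration using OpenCV's dense optical flow;
       | the video filename is hypothetical):
       | 
       |     import cv2
       | 
       |     # Estimate per-frame global motion from ordinary video and treat
       |     # the mean flow vector as that frame's "control input".
       |     cap = cv2.VideoCapture("walkthrough.mp4")
       |     ok, prev = cap.read()
       |     prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
       | 
       |     controls = []
       |     while True:
       |         ok, frame = cap.read()
       |         if not ok:
       |             break
       |         gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
       |         flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
       |                                             0.5, 3, 15, 3, 5, 1.2, 0)
       |         # Crude proxy for camera pan; feature tracking or homography
       |         # estimation would separate camera motion from object motion.
       |         controls.append((flow[..., 0].mean(), flow[..., 1].mean()))
       |         prev_gray = gray
       | 
       |     print(f"extracted {len(controls)} pseudo-actions")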
        
       | piperswe wrote:
       | This is honestly the most impressive ML project I've seen
       | since... probably O.G. DALL-E? Feels like a gem in a sea of AI
       | shit.
        
       | bufferoverflow wrote:
       | That's probably how our reality is rendered.
        
       | arduinomancer wrote:
       | How does the model "remember" the whole state of the world?
       | 
       | Like if I kill an enemy in some room and walk all the way across
       | the map and come back, would the body still be there?
        
         | a_e_k wrote:
         | Watch closely in the videos and you'll see that enemies often
         | respawn when offscreen and sometimes when onscreen. Destroyed
         | barrels come back, ammo count and health fluctuates weirdly,
         | etc. It's still impressive, but it's not perfect in that regard.
        
           | Sharlin wrote:
           | Not unlike in (human) dreams.
        
         | raincole wrote:
         | It doesn't. You need to put the world state in the input (the
         | "prompt", even it doesn't look like prompt in this case).
         | Whatever not in the prompt is lost.
        
         | Jensson wrote:
         | It doesn't even remember the state of the game you're looking
         | at. Doors spawn right in front of you, particle effects turn
         | into enemies mid-flight, etc., so just the regular gen-AI issues.
         | 
         | Edit: You can see this in the first 10 seconds of the first video
         | under "Full Gameplay Videos": stairs turn into a corridor and
         | then into a closed door for no reason, without looking away.
        
           | csmattryder wrote:
           | There's also the case in the video (0:59) where the player
           | jumps into the poison but doesn't take damage for a few
           | seconds then takes two doses back-to-back - they should've
           | taken a hit of damage every ~500-1000ms(?)
           | 
           | Guessing the model hasn't been taught enough about that,
           | because most people don't jump into hazards.
        
       | broast wrote:
       | Maybe one day this will be how operating systems work.
        
         | misterflibble wrote:
         | Don't give them ideas lol terrifying stuff if that happens!
        
       | dysoco wrote:
       | Ah, finally we are starting to see something gaming-related. I'm
       | curious as to why we haven't seen more neural networks applied
       | to games even in a completely experimental fashion; we used to
       | have a lot of little experimental indie games such as Facade
       | (2005) and I'm surprised we don't have something similar years
       | after the advent of LLMs.
       | 
       | We could have mods for old games that generate voices for the
       | characters for example. Maybe it's unfeasible from a computing
       | perspective? There are people running local LLMs, no?
        
         | raincole wrote:
         | > We could have mods for old games that generate voices for the
         | characters for example
         | 
         | You mean in real time? Or just in general?
         | 
         | There are _a lot_ of mods that use AI-generated voices. I'd
         | say it's the norm in the modding community now.
        
       | sitkack wrote:
       | What most programmers don't understand is that in the very near
       | future, the entire application will be delivered by an AI model:
       | no source, no text, just connect to the app over RDP. The whole
       | app will be created by example; the app developer will train the
       | app like a dog trainer trains a dog.
        
         | Grimblewald wrote:
         | That might work for some applications, especially recreational
         | ones, but I think we're a while away from it doing away with
         | everything, especially where deterministic behavior, efficiency,
         | or reliability are important.
        
           | sitkack wrote:
           | Problems for two papers down the line.
        
         | ukuina wrote:
         | So... https://websim.ai except over pixels instead of in your
         | browser?
        
           | sitkack wrote:
           | Yes, and that is super neat.
        
         | Jonovono wrote:
         | I think it's possible AI models will generate dynamic UI for
         | each client and stream the UI to clients (maybe eventually
         | client devices will generate their UI on the fly) similar to
         | Google Stadia. Maybe some offshoot of video streaming that
         | allows the remote end to control it. Maybe Wasm-based - just
         | stream wasm bytecode around? The guy behind VLC is building a
         | library for ultra-low latency: https://www.kyber.video/techology.
         | 
         | I was playing around with the idea in this:
         | https://github.com/StreamUI/StreamUI. The thinking is to take
         | the ideas of Elixir LiveView to the extreme.
        
           | sitkack wrote:
           | I am so glad you posted, this is super cool!
           | 
           | I too have been thinking about how to push dynamic wasm to
           | the client for super low latency UIs.
           | 
           | LiveView is just the beginning. Your readme is dreamy. I'll
           | dive into your project at the end of Sept when I get back
           | into deep tech.
        
       | mo_42 wrote:
       | An implementation of the game engine in the model itself is
       | theoretically the most accurate solution for predicting the next
       | frame.
       | 
       | I'm wondering when people will apply this to other areas like the
       | real world. Would it learn the game engine of the universe (ie
       | physics)?
        
         | radarsat1 wrote:
         | There has definitely been research for simulating physics based
         | on observation, especially in fluid dynamics but also for rigid
         | body motion and collision. It's important for robotics
         | applications actually. You can bet people will be applying this
         | technique in those contexts.
         | 
         | I think for real world application one challenge is going to be
         | the "action" signal which is a necessary component of the
         | conditioning signal that makes the simulation reactive. In
         | video games you can just record the buttons, but for real world
         | scenarios you need difficult and intrusive sensor setups for
         | recording force signals.
         | 
         | (Again for robotics though maybe it's enough to record the
         | motor commands, just that you can't easily record the "motor
         | commands" for humans, for example)
        
         | cubefox wrote:
         | A popular theory in neuroscience is that this is what the brain
         | does:
         | 
         | https://slatestarcodex.com/2017/09/05/book-review-surfing-un...
         | 
         | It's called predictive coding. By trying to predict sensory
         | stimuli, the brain creates a simplified model of the world,
         | including common sense physics. Yann LeCun says that this is a
         | major key to AGI. Another one is effective planning.
         | 
         | But while current predictive models (autoregressive LLMs) work
         | well on text, they don't work well on video data, because of
         | the large outcome space. In an LLM, text prediction boils down
         | to a probability distribution over a few thousand possible next
         | tokens, while there are several orders of magnitude more
         | possible "next frames" in a video. Diffusion models work better
         | on video data, but they are not inherently predictive like
         | causal LLMs. Apparently this new Doom model made some progress
         | on that front though.
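         | 
         | A rough sense of the gap in outcome-space size (my own back-of-
         | envelope; diffusion models of course work in a continuous latent
         | space rather than enumerating frames, so this is only about raw
         | counting):
         | 
         |     import math
         | 
         |     vocab_size = 50_000   # typical LLM vocabulary, give or take
         |     w, h, c, levels = 320, 240, 3, 256   # one low-res RGB frame
         | 
         |     log10_frames = w * h * c * math.log10(levels)
         |     print(f"next-token outcomes: ~{vocab_size:,}")
         |     print(f"next-frame outcomes: ~10^{round(log10_frames):,}")
         |     # vastly more possible raw frames than possible next tokens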
        
           | ccozan wrote:
           | However, this is due to how we actually digitize video. From a
           | human point of view, looking around my room reduces the load to
           | the _objects_ in the room, and everything else is just noise
           | (e.g. the color of the wall could be a single item to
           | remember, while in the digital world it needs to remember all
           | the pixels).
        
       | helloplanets wrote:
       | So, any given sequence of inputs is rebuilt into a corresponding
       | image, twenty times per second. I wonder how separate the game
       | logic and the generated graphics are in the fully trained model.
       | 
       | Given a sufficient separation between these two, couldn't
       | you basically boil the game/input logic down to an abstract game
       | template? Meaning, you could just output a hash that corresponds
       | to a specific combination of inputs, and then treat the resulting
       | mapping as a representation of a specific game's inner workings.
       | 
       | To make it less abstract, you could save some small enough
       | snapshot of the game engine's state for all given input
       | sequences. This could make it much less dependent on what's
       | recorded off of the agents' screens. And you could map the
       | objects that appear in the saved states to graphics, in a
       | separate step.
       | 
       | I imagine this whole system would work especially well for games
       | that only update when player input is given: Games like Myst,
       | Sokoban, etc.
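       | 
       | A toy sketch of that mapping idea (my own illustration; the
       | snapshot format and helper names are made up):
       | 
       |     import hashlib
       | 
       |     # Treat the game as a function from an input sequence to a state
       |     # snapshot and memoize it; a separate step would map snapshots
       |     # to graphics.
       |     state_by_inputs = {}
       | 
       |     def record(inputs, snapshot):
       |         key = hashlib.sha256("|".join(inputs).encode()).hexdigest()
       |         state_by_inputs[key] = snapshot
       |         return key
       | 
       |     def lookup(inputs):
       |         key = hashlib.sha256("|".join(inputs).encode()).hexdigest()
       |         return state_by_inputs.get(key)
       | 
       |     # Works best for games that only advance on input (Myst, Sokoban):
       |     record(["up", "up", "push"], {"player": (2, 3), "box": (2, 4)})
       |     print(lookup(["up", "up", "push"]))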
        
         | toppy wrote:
         | I think you've just encoded the title of the paper
        
       | richard___ wrote:
       | Uhhh... demos would be more convincing with enemies and
       | decreasing health
        
         | Kiro wrote:
         | I see enemies and decreasing health on hit. But even if it
         | lacked those, it seems like a pretty irrelevant nitpick that is
         | completely underplaying what we're seeing here. The fact that
         | this is even possible at all feels like science fiction.
        
       | troupo wrote:
       | Key: "predicts next frame, recreates classic Doom". A game that
       | was analyzed and documented to death. And the training included
       | uncountable runs of Doom.
       | 
       | A game engine lets you create a _new_ game, not predict the next
       | frame of an existing and copiously documented one.
       | 
       | This is not a game engine.
       | 
       | Creating a new _good_ game? Good luck with that.
        
       | nolist_policy wrote:
       | Makes me wonder... If you stand still in front of a door so all
       | past observations only contain that door, will the model teleport
       | you to another level when opening the door?
        
         | zbendefy wrote:
         | I think some state is also being given (or if it's not, it could
         | be given) to the network, like 3d world position/orientation of
         | the player, that could help the neural network anchor the
         | player in the world.
        
       | lukol wrote:
       | I believe future game engines will be state machines with
       | deterministic algorithms that can be reproduced at any time.
       | However, rendering said state into visual / auditory / etc.
       | experiences will be taken over by AI models.
       | 
       | This will also allow players to easily customize what they
       | experience without changing the core game loop.
        
       | jamilton wrote:
       | I wonder if the MineRL
       | (https://www.ijcai.org/proceedings/2019/0339.pdf and minerl.io)
       | dataset would be sufficient to reproduce this work with
       | Minecraft.
       | 
       | Any other similar existing datasets?
       | 
       | A really goofy way I can think of to get a bunch of data would be
       | to get videos from youtube and try to detect keyboard sounds to
       | determine what keys they're pressing.
        
         | jamilton wrote:
         | Although ideally a follow-up work would be something where
         | there won't be any potential legal trouble with releasing the
         | complete model so people can play it.
         | 
         | A similar approach but with a game where the exact input is
         | obvious and unambiguous from the graphics alone so that you can
         | use unannotated data might work. You'd just have to create a
         | model to create the action annotations. I'm not sure what the
         | point would be, but it sounds like it'd be interesting.
        
       | qnleigh wrote:
       | Could a similar scheme be used to drastically improve the visual
       | quality of a video game? You would train the model on gameplay
       | rendered at low and high quality (say with and without ray
       | tracing, and with low and high density meshing), and try to get
       | it to convert a quick render into something photorealistic on the
       | fly.
       | 
       | When things like DALL-E first came out, I was expecting something
       | like the above to make it into mainstream games within a few
       | years. But that was either too optimistic or I'm not up to speed
       | on this sort of thing.
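       | 
       | A minimal sketch of the data setup (my own illustration: a plain
       | conv net stands in for what would more realistically be a
       | conditional diffusion model, and the random tensors stand in for
       | frames from a dual-render pipeline):
       | 
       |     import torch
       |     import torch.nn as nn
       | 
       |     # Render each training scene twice (cheap raster pass, offline
       |     # path-traced pass) and learn the mapping cheap -> nice.
       |     enhancer = nn.Sequential(
       |         nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
       |         nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
       |         nn.Conv2d(64, 3, 3, padding=1),
       |     )
       |     opt = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
       | 
       |     cheap_batch = torch.rand(4, 3, 240, 320)   # placeholder frames
       |     fancy_batch = torch.rand(4, 3, 240, 320)
       | 
       |     loss = nn.functional.l1_loss(enhancer(cheap_batch), fancy_batch)
       |     loss.backward()
       |     opt.step()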
        
         | agys wrote:
         | Isn't that what Nvidia's Ray Reconstruction and DLSS (frame
         | generation and upscaler) are doing, more or less?
        
           | qnleigh wrote:
           | At a high level I guess so. I don't know enough about Ray
           | Reconstruction (though the results are impressive), but I was
           | thinking of something more drastic than DLSS. Diffusion
           | models on static images can turn a cartoon into a
           | photorealistic image. Doing something similar for a game,
           | where a low-quality render is turned into something that
           | would otherwise take seconds to render, seems qualitatively
           | quite different from DLSS. In principle a model could fill in
           | huge amounts of detail, like increasing the number of
           | particles in a particle-based effect, adding shading/lighting
           | effects...
        
       | lIl-IIIl wrote:
       | How does it know how many times it needs to shoot the zombie
       | before it dies?
       | 
       | Most enemies have enough hit points to survive the first shot. If
       | the model is only trained on the previous frame, it doesn't know
       | how many times the enemy was already shot at.
       | 
       | From the video it seems like it is probability based - they may
       | die right away or it might take way longer than it should.
       | 
       | I love how the player's health goes down when he stands in the
       | radioactive green water.
       | 
       | In Doom the enemies fight with each other if they accidentally
       | incur "friendly fire". It would be interesting to see it play out
       | in this version.
        
         | golol wrote:
         | It gets a number of previous frames as input, I think.
        
         | meheleventyone wrote:
         | > I love how the player's health goes down when he stands in
         | the radioactive green water.
         | 
         | This is one of the bits that was weird to me: it doesn't work
         | correctly. In the real game you take damage at a consistent
         | rate; in the video the player doesn't, and whether the player
         | takes damage or not seems highly dependent on some factor that
         | isn't whether or not the player is in the radioactive slime. My
         | thought is that it's learnt something else that correlates
         | poorly.
        
         | lupusreal wrote:
         | > _In Doom the enemies fight with each other if they
         | accidentally incur "friendly fire". It would be interesting to
         | see it play out in this version._
         | 
         | They trained this thing on bot gameplay, so I bet it does
         | poorly when advanced strategies like deliberately inducing mob
         | infighting are employed (the bots probably didn't do that a
         | lot, if at all).
        
       | golol wrote:
       | What I don't understand is the following: if this works so well,
       | why didn't we have good video generation much earlier? After
       | diffusion models were seen to work, the most obvious thing to do
       | was to generate the next frame based on previous frames, but... it
       | took 1-2 years for good video models to appear. For example,
       | compare Sora generating Minecraft video versus this method
       | generating Minecraft video. Say in both cases the player is
       | standing on a meadow with few inputs and watching some pigs. In
       | the Sora video you'd expect the typical glitches to appear, like
       | erratic, sliding movement, overlapping legs, multiplication of
       | pigs, etc. Would these glitches not appear in the GameNGen video?
       | Why?
        
         | Closi wrote:
         | Because video is much more difficult than images (it's lots of
         | images that have to be consistent across time, with motion
         | following laws of physics etc), and this is much more limited
         | in terms of scope than pure arbitrary video generation.
        
           | golol wrote:
           | This misses the point; I'm comparing two methods of
           | generating Minecraft videos.
        
             | soulofmischief wrote:
             | By simplifying the problem, we are better able to focus on
             | researching specific aspects of generation. In this case,
             | they synthetically created a large, highly domain-specific
             | training set and then used this to train a diffusion model
             | which encodes input parameters instead of text.
             | 
             | Sora was trained on a much more diverse dataset, and so has
             | to learn more general solutions in order to maintain
             | consistency, which is harder. The low resolution and
             | simple, highly repetitive textures of doom definitely help
             | as well.
             | 
             | In general, this is just an easier problem to approach
             | because of the more focused constraints. It's also worth
             | mentioning that noise was added during the process in order
             | to make the model robust to small perturbations.
        
         | pantalaimon wrote:
         | I would have thought it is much easier to generate huge amounts
         | of game footage for training, but as I understand this is not
         | what was done here.
        
       | golol wrote:
       | Certain categories of youtube videos can also be viewed as some
       | sort of game where the actions are the audio/transcript, advanced
       | by a couple of seconds. Add two eggs. Fetch the ball. I'm walking in
       | the park.
        
       | thegabriele wrote:
       | Wow, I bet Boston Dynamics and such are quite interested
        
       | jumploops wrote:
       | This seems similar to how we use LLMs to generate code: generate,
       | run, fix, generate.
       | 
       | Instead of working through a game, it's building generic UI
       | components and using common abstractions.
        
       | HellDunkel wrote:
       | Although impressive, I must disagree. Diffusion models are not
       | game engines. A game engine is a component to propel your game
       | (along the time axis?). In that sense it is similar to the engine
       | of a car, hence the name. It does not need a single working car
       | nor a road to drive on to do its job. The above is a dynamic,
       | interactive replication of what happens when you put a car on a
       | given road, requiring a million test drives with working
       | vehicles. An engine would also work offroad.
        
         | MasterScrat wrote:
         | Interesting point.
         | 
         | In a way this is a "simulated game engine", trained from actual
         | game engine data. But I would argue a working simulated game
         | engine becomes a game engine of its own, as it is then able to
         | "propell the game" as you say. The way it achieves this becomes
         | irrelevant, in one case the content was crafted by humans, in
         | the other case it mimics existing game content, the player
         | really doesn't care!
         | 
         | > An engine would also work offroad.
         | 
         | Here you could imagine that such a "generative game engine"
         | could _also_ go offroad, extrapolating what would happen if you
           | go to unseen places. I'd even say the extrapolation capabilities
         | of such a model could be better than a traditional game engine,
         | as it can make things up as it goes, while if you accidentally
         | cross a wall in a typical game engine the screen goes blank.
        
           | jsheard wrote:
           | > Here you could imagine that such a "generative game engine"
           | could also go offroad, extrapolating what would happen if you
           | go to unseen places.
           | 
           | They easily could have demonstrated this by seeding the model
           | with images of Doom maps which weren't in the training set,
           | but they chose not to. I'm sure they tried it and the results
           | just weren't good, probably morphing the map into one of the
           | ones it was trained on at the first opportunity.
        
           | HellDunkel wrote:
           | The game Doom is more than a game engine, isn't it? I'd be
           | okay with calling the above a "simulated game" or a "game".
           | My point is: let's not conflate this with the idea of a "game
           | engine", which is a construct of intellectual concepts put
           | together to create a simulation of "things happening in time"
           | and to derive output (audio and visual). The engine is fed
           | with input and data (levels and other assets) and then
           | drives (EDIT) a "game".
           | 
           | Training the model with a final game will never give you an
           | engine. Maybe a "simulated game" or even a "game", but
           | certainly not an "engine". The latter would mean the model
           | was capable of deriving and extracting the technical and
           | intellectual concepts and applying them elsewhere.
        
       | icoder wrote:
       | This is impressive. But at the same time, it can't count. We see
       | this every time, and I understand why it happens, but it is still
       | intriguing. We are so close or in some ways even way beyond, and
       | yet at the same time so extremely far away, from 'our'
       | intelligence.
       | 
       | (I say it can't count because there are numerous examples where
       | the bullet count glitches, it goes right impressively often, but
       | still, counting, being up or down, is something computers have
       | been able to do flawlessly basically since forever)
       | 
       | (It is the same with chess, where the LLM models are becoming
       | really good, yet sometimes make mistakes that even my 8yo niece
       | would not make)
        
         | marci wrote:
         | 'our' intelligence may not be the best thing we can make. It
         | would be like trying to only make planes that flap wings or
         | trucks with legs. A bit like using an LLM to do multiplication.
         | Not the best tool. Biomimicry is great for inspiration, but
         | shouldn't be a 1-to-1 copy, especially at a different scale and
         | in a different medium.
        
           | icoder wrote:
           | Sure, although I still think a system with less of a contrast
           | between how well it performs 'modally' and how badly it
           | performs incidentally would be more practical.
           | 
           | What I wonder is whether LLMs will inherently always have
           | this dichotomy and we need something 'extra' (reasoning,
           | attention or something less biomimicked), or whether this
           | will eventually resolve itself (to an acceptable extent)
           | when they improve even further.
        
       | panki27 wrote:
       | > Human raters are only slightly better than random chance at
       | distinguishing short clips of the game from clips of the
       | simulation.
       | 
       | I can hardly believe this claim, anyone who has played some
       | amount of DOOM before should notice the viewport and textures not
       | "feeling right", or the usually static objects moving slightly.
        
         | meheleventyone wrote:
         | It's telling IMO that they only want people's opinions based on
         | our notoriously faulty memories rather than sitting comparable
         | situations next to one another in the game and simulation then
         | analyzing them. Several things jump out watching the example
         | video.
        
           | GaggiX wrote:
           | >rather than sitting comparable situations next to one
           | another in the game and simulation then analyzing them.
           | 
           | That's literally how the human rating was setup if you read
           | the paper.
        
             | meheleventyone wrote:
             | I think you misunderstand me. I don't mean a snap
             | evaluation and deciding between two very-short competing
             | videos which is what the participants were doing. I mean
             | doing an actual analysis of how well the simulation matches
             | the ground truth of the game.
             | 
             | What I'd posit is that it's not actually a very good
             | replication of the game but very good at replicating short
             | clips that almost look like the game, and that the short time
             | horizons are deliberately chosen because the authors know
             | the model lacks coherence beyond that.
        
               | GaggiX wrote:
               | >I mean doing an actual analysis of how well the
               | simulation matches the ground truth of the game.
               | 
               | Do you mean the PSNR and LPIPS metrics used in paper?
        
               | meheleventyone wrote:
               | No, I think I've been pretty clear that I'm interested in
               | how mechanically sound the simulation is. Also those
               | measures are over an even shorter duration so even less
               | relevant to how coherent it is at real game scales.
        
               | GaggiX wrote:
               | How should this be concretely evaluated and measured? A
               | vibe check?
        
               | meheleventyone wrote:
               | I think the study's evaluation using very short videos and
               | humans is much more of a vibe check than what I've
               | suggested.
               | 
               | Off the top of my head DOOM is open source so it should
               | be reasonable to set up repeatable scenarios and use some
               | frames from the game to create a starting scenario for
               | the simulation that is the same. Then the input from the
               | player of the game could be used to drive the simulated
               | version. You could go further and instrument events
               | occurring in the game for direct comparison to the
               | simulation. I'd be interested in setting a baseline for
               | playtime of the level in question and using sessions of
               | around that length as an ultimate test.
               | 
               | There are some obvious mechanical deficiencies visible in
               | the videos they've published. One that really stood out
               | to me was the damage taken when in the radioactive slime.
               | So I don't think the analysis would need to be particularly
               | deep to find differences.
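               | 
               | A rough sketch of what I mean (my own illustration using
               | ViZDoom, which is linked elsewhere in the thread; the config
               | path and action trace are hypothetical):
               | 
               |     from vizdoom import DoomGame, GameVariable
               | 
               |     # Replay a fixed, recorded action sequence through the real
               |     # game and log instrumented variables; feeding the same
               |     # actions to the simulator yields a second log to diff
               |     # against this ground truth.
               |     recorded_actions = [[1, 0, 0]] * 35 + [[0, 1, 0]] * 35
               | 
               |     game = DoomGame()
               |     game.load_config("scenarios/basic.cfg")  # hypothetical path
               |     game.init()
               |     game.new_episode()
               | 
               |     ground_truth = []
               |     for action in recorded_actions:
               |         if game.is_episode_finished():
               |             break
               |         game.make_action(action)
               |         ground_truth.append({
               |             "health": game.get_game_variable(GameVariable.HEALTH),
               |             "ammo": game.get_game_variable(GameVariable.AMMO2),
               |         })
               |     game.close()
               | 
               |     print(f"logged {len(ground_truth)} ticks of ground truth")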
        
         | arc-in-space wrote:
         | This, watching the generated clips feels uncomfortable, like a
         | nightmare. Geometry is "swimming" with camera movement, objects
         | randomly appear and disappear, damage is inconsistent.
         | 
         | The entire thing would probably crash and burn if you did
         | something just slightly unusual compared to the training data,
         | too. People talking about 'generated' games often seem to
         | fantasize about an AI that will make up new outcomes for
         | players that go off the beaten path, but a large part of the
         | fun of real games is figuring out what you can do within the
         | predetermined constraints set by the game's code. (Pen-and-
         | paper RPGs are highly open-ended, but even a Game Master needs
         | to sometimes protect the players from themselves; whereas the
         | current generation of AI is famously incapable of saying no.)
        
         | aithrowaway1987 wrote:
         | I also noticed that they played AI DOOM very slowly: in an
         | actual game you are running around like a madman, but in the
         | video clips the player is moving in a very careful, halting
         | manner. In particular the player only moves in straight lines
         | or turns while stationary, they almost never turn while
         | running. Also didn't see much _strafing._
         | 
         | I suspect there is a reason for this: running while turning
         | doesn't work properly and makes it very obvious that the system
         | doesn't have a consistent internal 3D view of the world. I'm
         | already getting motion sickness from the inconsistencies in
         | straight-line movement, I can't imagine turning is any better.
        
         | freestyle24147 wrote:
         | It made me laugh. Maybe they pulled random people from the
         | hallway who had never seen the original Doom (or any FPS), or
         | maybe only selected people who wore glasses and forgot them at
         | their desk.
        
       | holoduke wrote:
       | I saw a video a while ago where they recreated actual doom
       | footage with a diffusion technique so it looked like a jungle or
       | anything you liked. Can't find it anymore, but it looked impressive.
        
       | godelski wrote:
       | Doom system requirements:
       |     - 4 MB RAM
       |     - 12 MB disk space
       | 
       | Stable Diffusion v1:
       |     - 860M UNet and CLIP ViT-L/14 (540M)
       |     - Checkpoint size: 4.27 GB (7.7 GB with full EMA)
       | 
       | Running on a TPU-v5e:
       |     - Peak compute per chip (bf16): 197 TFLOPs
       |     - Peak compute per chip (Int8): 393 TFLOPs
       |     - HBM2 capacity and bandwidth: 16 GB, 819 GBps
       |     - Interchip Interconnect BW: 1600 Gbps
       | 
       | This is quite impressive, especially considering the speed. But
       | there's still a ton of room for improvement. It seems it didn't
       | even memorize the game despite having the capacity to do so
       | hundreds of times over. So we definitely have lots of room for
       | optimization methods. Though who knows how such things would
       | affect existing tech since the goal here is to memorize.
       | 
       | What's also interesting about this work is it's basically saying
       | you can rip a game if you're willing to "play" (automate) it
       | enough times and spend a lot more on storage and compute. I'm
       | curious what the comparison in cost and time would be if you
       | hired an engineer to reverse engineer Doom (how much prior
       | knowledge do they get, considering the pretrained models and the
       | ViZDoom environment? Was the Doom source code in T5? And which
       | ViT checkpoint was used? I can't keep track of Google's ViT
       | checkpoints).
       | 
       | I would love to see the checkpoint of this model. I think people
       | would find some really interesting stuff taking it apart.
       | 
       | - https://www.reddit.com/r/gaming/comments/a4yi5t/original_doo...
       | 
       | - https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...
       | 
       | - https://cloud.google.com/tpu/docs/v5e
       | 
       | - https://github.com/Farama-Foundation/ViZDoom
       | 
       | - https://zdoom.org/index
        
         | snickmy wrote:
         | Those are valid points, but irrelevant for the context of this
         | research.
         | 
         | Yes, the computational cost is ridiculous compared to the
         | original game, and yes, it lacks basic things like pre-
         | computing, storing, etc. That said, you could assume that all
         | that can be either done at the margin of this discovery OR over
         | time will naturally improve OR will become less important as a
         | blocker.
         | 
         | The fact that you can model a sequence of frames with such
         | contextual awareness, without explicitly having to encode it,
         | is the real breakthrough here -- both from a pure gaming
         | standpoint and for simulation in general.
        
           | godelski wrote:
           | I'm not sure what you're saying is irrelevant.
           | 
           | 1) the model has enough memory to store not only all game
           | assets and engine but even hundreds of "plays".
           | 
           | 2) me mentioning that there's still a lot of room to make
           | these things better (seems you think so too so maybe not this
           | one?)
           | 
           | 3) an interesting point I was wondering about, to compare
           | the current state of things (I mean, I'll give you this,
           | but it's just a random thought and I'm not reviewing this
           | paper in an academic setting. This is HN, not NeurIPS. I'm
           | just curious ¯\_(ツ)_/¯)
           | 
           | 4) the point that you can rip a game
           | 
           | I'm really not sure what you're contesting, because I said
           | several things.
           | 
           | > it lacks basic things like pre-computing, storing, etc.
           | 
           | It does? Last I checked neural nets store information. I
           | guess I need to return my PhD because last I checked there's
           | a UNet in SD 1.4 and that contains a decoder.
        
             | snickmy wrote:
             | Sorry, probably didn't explain myself well enough
             | 
             | 1) Yes, you are correct. The point I was making is that,
             | in the context of the discovery/research, that's outside
             | the scope and 'easier' to do, as it has been done in
             | other verticals (i.e. e2e self-driving)
             | 
             | 2) yep, aligned here
             | 
             | 3) I'm not fully following here, but agree this is not
             | NeurIPS, and no Schmidhuber-style bickering.
             | 
             | 4) The network does store information, it just doesn't
             | store gameplay information. That could be forced, but as
             | per point 1 it is, and I think rightly so, beyond the
             | scope of this research
        
               | godelski wrote:
               | 1) I'm not sure this is outside the scope. It's also
               | not something I'd use to reject a paper were I to
               | review this at a conference. I mean, you've got to
               | start somewhere, and unlike reviewer 2 I don't think
               | every criticism is a rejection criterion. That'd be
               | silly, given the lack of globally optimal solutions.
               | But I'm also unconvinced this has been proven by
               | self-driving vehicles; then again, I'm not an RL
               | expert.
               | 
               | 3) It's always hard to evaluate. I was thinking about
               | ripping the game, and so a reasonable metric is a
               | comparison with a human's ability to perform the task.
               | Of course I'm A LOT faster than my dishwasher at
               | cleaning dishes, but I'm not occupied while it is
               | going, so it still has high utility. (Someone tell
               | reviewer 2 lol)
               | 
               | 4) Why should we believe that it doesn't store gameplay?
               | The model was fed "user" inputs and frames. So it has
               | this information and this information appears useful for
               | learning the task.
        
           | pickledoyster wrote:
           | >you could assume that all that can be either done at the
           | margin of this discovery OR over time will naturally improve
           | OR will become less important as a blocker.
           | 
           | OR one can hope it will be thrown to the heap of nonviable
           | tech with the rest of spam waste
        
           | tobr wrote:
           | I suppose it also doesn't really matter what kinds of
           | resources the game originally requires. The diffusion model
           | isn't going to require twice as much memory just because the
           | game does. Presumably you wouldn't even necessarily need to
           | be able to render the original game in real time - I would
           | imagine the basic technique would work even if you used a
           | state-of-the-art Hollywood-quality offline renderer to render
           | each input frame, and that the performance of the diffusion
           | model would be similar?
        
             | godelski wrote:
             | Well the majority of ML systems are compression machines
             | (entropy minimizers), so ideally you'd want to see if you
             | can learn the assets and game mechanics through play alone
             | (what this paper shows). Better would be to do so more
             | efficiently than the devs themselves, finding better
             | compression. Certainly the game is not perfectly optimized.
             | But still, this is a step in that direction. I mean no one
             | has accomplished this before so even with a model with far
             | higher capacity it's progress. (I think people are
             | interpreting my comment as dismissive. I'm critiquing but
             | the key point I was making was about how there's likely
             | better architectures, training methods, and all sorts of
             | stuff to still research. Personally I'm glad there's still
             | more to research. That's the fun part)
        
           | danielmarkbruce wrote:
           | Is it a breakthrough? Weather models are miles ahead of this
           | as far as I can tell.
        
         | dTal wrote:
         | >What's also interesting about this work is it's basically
         | saying you can rip a game if you're willing to "play"
         | (automate) it enough times and spend a lot more on storage and
         | compute
         | 
         | That's the least of it. It means you can _generate_ a game from
         | real footage. Want a perfect flight sim? Put a GoPro in the
         | cockpit of every airliner for a year.
        
           | isaacfung wrote:
           | The possibilities seem to go far beyond gaming (given
           | enough computation resources).
           | 
           | You could feed it videos of the usage of any software, or
           | real-world footage recorded by a GoPro mounted on your
           | shoulder (with body motion measured by some sensors,
           | though the action space would be much larger).
           | 
           | Such a "game engine" can potentially be used as a simulation
           | gym environment to train RL agents.
        
           | camtarn wrote:
           | Plus, presumably, either training it on pilot inputs (and
           | being able to map those to joystick inputs and mouse clicks)
           | or having the user have an identical fake cockpit to play in
           | and a camera to pick up their movements.
           | 
           | And, unless you wanted a simulator that only allowed
           | perfectly normal flight, you'd have to have those airliners
           | go through every possible situation that you wanted to
           | reproduce: warnings, malfunctions, emergencies, pilots
           | pushing the airliner out of its normal flight envelope, etc.
        
           | phh wrote:
           | > Want a perfect flight sim? Put a GoPro in the cockpit of
           | every airliner for a year.
           | 
           | I guess that's the occasion to remind people that ML is
           | splendid at interpolating, but for extrapolating, maybe
           | don't keep your hopes too high.
           | 
           | Namely, to have a "perfect flight sim" using GoPros, you'll
           | need to record hundreds of stalls and crashes.
        
       | amelius wrote:
       | Yes, and you can use an LLM to simulate role playing games.
        
       | amunozo wrote:
       | This is amazing and an interesting discovery. It is a pity that I
       | don't find it capable of creating anything new.
        
       | itomato wrote:
       | The gibs are a dead giveaway
        
       | nuz wrote:
       | I wonder how overfit it is though. You could fit a lot of Doom-
       | resolution JPEG frames into 4 GB (the size of SD 1.4).
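       | 
       | Back-of-envelope (crude numbers, just to size the question): a
       | 320x200 JPEG frame might run ~20-30 KB, so 4 GB is on the
       | order of 150k-200k frames, i.e. a couple of hours of footage
       | at 20 FPS.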
        
       | ciroduran wrote:
       | Congrats on running Doom on a Diffusion Model :D
       | 
       | I was really entranced by how combat is rendered (the grunt doing
       | weird stuff in very much the style that the model generates
       | images). Now I'd like to see this implemented in a shader in a
       | game
        
       | LtdJorge wrote:
       | So is it taking inputs from a player and simulating the gameplay
       | or is it just simulating everything (effectively, a generated
       | video)?
        
       | smusamashah wrote:
       | Has this model actually learned the 3d space of the game? Is it
       | possible to break the camera free and roam around the map freely
       | and view it from different angles?
       | 
       | I noticed a few hallucinations, e.g. when it picked up the
       | green jacket in a corner, then walking back it generated
       | another corner.
       | Therefore I don't think it has any clue about the 3D world of the
       | game at all.
        
         | kqr wrote:
         | > Is it possible to break the camera free and roam around the
         | map freely and view it from different angles?
         | 
         | I would assume only if the training data contained this type of
         | imagery, which it did not. The training data (from what I
         | understand) consisted only of input+video of actual gameplay,
         | so that is what the model is trained to mimic.
         | 
         | This is like a dog that has been trained to form English words
         | - what's impressive is not that it does it well, but that it
         | does it at all.
        
         | Sohcahtoa82 wrote:
         | > Therefore I don't think it has any clue about the 3D world of
         | the game at all.
         | 
         | AI models don't "know" things at all.
         | 
         | At best, they're just very fuzzy predictors. In this case,
         | given the last couple frames of video and a user input, it
         | predicts the next frame.
         | 
         | It has zero knowledge of the game world, game rules,
         | interactions, etc. It's merely a mapping of [pixels, input] ->
         | pixels.
        
       | kqr wrote:
       | I have been kind of "meh" about the recent AI hype, but this is
       | seriously impressive.
       | 
       | Of course, we're clearly looking at complete nonsense generated
       | by something that does not understand what it is doing - yet, it
       | is astonishingly sensible nonsense given the type of information
       | it is working from. I had no idea the state of the art was
       | capable of this.
        
       | acoye wrote:
       | Nvidia CEO reckons your GPU will be replaced with AI in "5-10
       | years". So this is what the sort of first working game I guess.
        
       | acoye wrote:
       | I'd love to see John Carmack come back from his AGI hiatus and
       | advance AI-based rendering. This would be super cool.
        
       | seydor wrote:
       | I wonder how far it is from this to generating language reasoning
       | about the game from the game itself, rather than learning a large
       | corpus of language, like LLMs do. That would be a true grounded
       | language generator
        
       | t1c wrote:
       | They got DOOM running on a diffusion engine before GTA 6
        
       | lackoftactics wrote:
       | I think Alan's conservative countdown to AGI will need to be
       | updated after this. https://lifearchitect.ai/agi/ This is really
       | impressive stuff. I thought about it a couple of months ago, that
       | probably this is the next modality worth exploring for data, but
       | didn't imagine it would come so fast. On the other side, the
       | amount of compute required is crazy.
        
       | joseferben wrote:
       | Impressive. Imagine this but photorealistic, with VR goggles.
        
       | gwbas1c wrote:
       | Am I the only one who thinks this is faked?
       | 
       | It's not that hard to fake something like this: Just make a video
       | of DOSBox with DOOM running inside of it, and then compress it
       | with settings that will result in compression artifacts.
        
         | GaggiX wrote:
         | >Am I the only one who thinks this is faked?
         | 
         | Yes.
        
       | dtagames wrote:
       | A diffusion model cannot be a game engine because a game engine
       | can be used to create _new_ games and modify the rules of
       | existing games in real time -- even rules which are not visible
       | on-screen.
       | 
       | These tools are fascinating but, as with all AI hype, they need a
       | disclaimer: The tool didn't create the game. It simply generated
       | frames and the appearance of play mechanics from a game it
       | sampled (which humans created).
        
         | kqr wrote:
         | > even rules which are not visible on-screen.
         | 
         | If a rule was changed but it's never visible on the screen, did
         | it really change?
         | 
         | > It simply generated frames and the appearance of play
         | mechanics from a game it sampled (which humans created).
         | 
         | Simply?! I understand it's mechanically trivial but the fact
         | that it's compressed such a rich conditional distribution seems
         | far from simple to me.
        
           | darby_nine wrote:
           | > Simply?! I understand it's mechanically trivial but the
           | fact that it's compressed such a rich conditional
           | distribution seems far from simple to me.
           | 
           | It's much simpler than actually creating a game....
        
             | stnmtn wrote:
             | If someone told you 10 years ago that they were going to
             | create something where you could play a whole new level of
             | Doom, without them writing a single line of game
             | logic/rendering code, would you say that that is simpler
             | than creating a demo by writing the game themselves?
        
               | darby_nine wrote:
               | There are two things at play here: the complexity of the
               | underlying mechanism, and the complexity of detailed
               | creation. This is obviously a complicated mechanism, but
               | in another sense it's a trivial result compared to
               | actually reproducing the game itself in its original
               | intended state.
        
           | znx_0 wrote:
           | > If a rule was changed but it's never visible on the screen,
           | did it really change?
           | 
           | Well for "some" games it does really change
        
         | sharpshadow wrote:
         | So all it did is generate a video of the gameplay which is
         | slightly different from the video it used for training?
        
           | TeMPOraL wrote:
           | No, it implements a 3D FPS that's interactive, and renders
           | each frame based on your input and a lot of memorized
           | gameplay.
        
             | sharpshadow wrote:
                | But is it playing the actual game or just making an
                | interactive video of it?
        
               | Maxatar wrote:
               | Making an interactive video of it. It is not playing the
               | game, a human does that.
               | 
               | With that said, I wholly disagree that this is not an
               | engine. This is absolutely a game engine and while this
               | particular demo uses the engine to recreate DOOM, an
               | existing game, you could certainly use this engine to
               | produce new games in addition to extrapolating existing
               | games in novel ways.
        
               | Workaccount2 wrote:
               | What is the difference?
        
               | TeMPOraL wrote:
               | Yes.
               | 
               | All video games are, by definition, interactive videos.
               | 
               | What I imagine you're asking about is, a typical game
               | like Doom is effectively a function:
                | 
                |     f(internal state, player input) ->
                |         (new frame, new internal state)
               | 
               | where internal state is the shape and looks of loaded
               | map, positions and behaviors and stats of enemies,
               | player, items, etc.
               | 
                | A typical AI that plays Doom, which is _not_ what's
                | happening here, is (at runtime):
                | 
                |     f(last frame) -> new player input
               | 
               | and is attached in a loop to the previous case in the
               | obvious way.
               | 
               | What we have here, however, is a game you can play but
               | implemented in a diffusion model, and it works like this:
                | 
                |     f(player input, N last frames) -> new frame
               | 
               | Of note here is the _lack of game state_ - the state is
               | implicit in the contents of the N previous frames, and is
               | otherwise not represented or mutated explicitly. The
               | diffusion model has seen so much Doom that it, in a way,
               | _internalized_ most of the state and its evolution, so it
                | can look at what's going on and guess what's about to
               | happen. Which is what it does: it renders the next frame
               | by predicting it, based on current user input and last N
               | frames. And then that frame becomes the input for the
               | next prediction, and so on, and so on.
               | 
               | So yes, it's totally an interactive video _and_ a game
               | _and_ a third thing - a probabilistic emulation of Doom
               | on a generative ML model.
        
               | sharpshadow wrote:
               | Thank you for the further explanation, that's what I
               | thought in the meantime and intended to find out with my
               | question.
               | 
               | That opens up a new branch of possibilities.
        
         | calebh wrote:
         | One thing I'd like to see is to take a game rendered with low
         | poly assets (or segmented in some way) and use a diffusion
         | model to add realistic or stylized art details. This would fix
         | the consistency problem while still providing tangible
         | benefits.
        
         | momojo wrote:
         | The title should be "Diffusion Models can be used to render
         | frames given user input"
        
         | throwthrowuknow wrote:
         | They only trained it on one game and only embedded the control
         | inputs. You could train it on many games and embed a lot more
         | information about each of them which could possibly allow you
         | to specify a prompt that would describe the game and then play
         | it.
        
       | EcommerceFlow wrote:
       | Jensen said that this is the future of gaming a few months ago
       | fyi.
        
         | weakfish wrote:
         | Who is that?
        
         | Fraterkes wrote:
         | Thousands of different people have been speculating about this
         | kind of thing for years.
        
       | alkonaut wrote:
       | The job of the game engine is also to render the world given only
       | the world's properties (textures, geometries, physics rules, ...),
       | and not given "training data that had to be supplied from an
       | already written engine".
       | 
       | I'm guessing that the "This door requires a blue key" doesn't
       | mean that the user can run around, the engine dreams up a blue
       | key in some other corner of the map, and the user can then return
       | to the door and the engine now opens the door? THAT would be
       | impressive. It's interesting to think that all that would be
       | required for that task to go from really hard to quite doable,
       | would be that the door requiring the blue key is blue, and the UI
       | showing some icon indicating the user possesses the blue key.
       | Without that, it becomes (old) hidden state.
        
       | dabochen wrote:
       | So there is no interactivity, but the generated content is not
       | the exact view in the training data, is this the correct
       | understanding?
       | 
       | If so, is it more like imagination/hallucination rather than
       | rendering?
        
         | og_kalu wrote:
         | It's conditioned on previous frames AND player actions so it's
         | interactive.
        
       | wantsanagent wrote:
       | Anyone have reliable numbers on the file sizes here? Doom.exe
       | from my searches was around 715k, and with all assets somewhere
       | around 10MB. It looks like the SD 1.4 files are over 2GB, so it's
       | likely we're looking at a 200-2000x increase in file size
       | depending on if you think of this as an 'engine' or the full
       | game.
        
       | YeGoblynQueenne wrote:
       | Misleading Titles Are Everywhere These Days.
        
       | jetrink wrote:
       | What if instead of a video game, this was trained on video and
       | control inputs from people operating equipment like warehouse
       | robots? Then an automated system could visualize the result of a
       | proposed action or series of actions when operating the equipment
       | itself. You would need a different model/algorithm to propose
       | control inputs, but this would offer a way for the system to
       | validate and refine plans as part of a problem solving feedback
       | loop.
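       | 
       | Roughly the kind of feedback loop I have in mind (hypothetical
       | sketch with made-up names, not any particular system):
       | 
       |     import random
       | 
       |     ACTIONS = ["forward", "back", "left", "right"]
       | 
       |     def plan(world_model, score, frames,
       |              horizon=10, tries=50):
       |         # score candidate action sequences by "imagining"
       |         # their outcomes with the learned frame predictor
       |         best, best_score = None, float("-inf")
       |         for _ in range(tries):
       |             cand = [random.choice(ACTIONS)
       |                     for _ in range(horizon)]
       |             rollout = list(frames)
       |             for a in cand:
       |                 rollout.append(world_model(rollout, a))
       |             s = score(rollout)  # e.g. distance to target
       |             if s > best_score:
       |                 best, best_score = cand, s
       |         return best
       | 
       |     # toy stand-ins, just to show the call shape
       |     print(plan(lambda f, a: len(f),
       |                lambda f: random.random(), [0]))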
        
         | Workaccount2 wrote:
         | >Robotic Transformer 2 (RT-2) is a novel vision-language-action
         | (VLA) model that learns from both web and robotics data, and
         | translates this knowledge into generalised instructions for
         | robotic control
         | 
         | https://deepmind.google/discover/blog/rt-2-new-model-transla...
        
       | harha_ wrote:
       | This is so sick I don't know what to say. I never expected this,
       | aren't the implications of this huge?
        
         | aithrowaway1987 wrote:
         | I am struggling to understand a _single_ implication of this!
         | How does this generalize to anything other than playing
         | retro games in the most expensive way possible? The very
         | intention of this project is overfitting to data in a non-
         | generalizable way! Maybe it's just pure engineering, that good
         | ANNs are getting cheap and fast. But this project still seems
         | to have the fundamental weaknesses of all AI projects:
         | 
         | - needs a huge amount of data, which a priori precludes a lot
         | of interesting use cases
         | 
         | - flashy-but-misleading demos which hide the actual weaknesses
         | of the AI software (note that the player is moving very
         | haltingly compared to a real game of DOOM, where you almost
         | never stop moving)
         | 
         | - AI nailing something really complicated for humans (98%
         | effective raycasting, 98% effective Python codegen) while
         | failing to grasp abstract concepts rigorously understood by
         | _fish_ (object permanence, quantity)
         | 
         | I am genuinely struggling to see this as a meaningful step
         | forward. It seems more like a World's Fair exhibit - a fun and
         | impressive diversion, but probably not a vision of the future.
         | Putting it another way: unlike AlphaGo, Deep Blue wasn't really
         | a technological milestone so much as a _sociological_ milestone
         | reflecting the apex of a certain approach to AI. I think this
         | DOOM project is in a similar vein.
        
       | KETpXDDzR wrote:
       | I think the correct title should be "Diffusion Models Are Fake
       | Real-Time Game Engines". I don't think just more training will
       | ever be sufficient to create a complete game engine. It would
       | need to "understand" what it's doing.
        
       | aghilmort wrote:
       | looking forward to &/or wondering about overlap with notion of
       | ray tracing LLMs
        
       | TheRealPomax wrote:
       | If by "game" you mean "literal hallucination" then yes. But if
       | we're not trying to click-bait, then no: it's not really a game
       | when there is no permanence or determinism to be found anywhere.
       | It might be a "game-flavoured dream simulator", but it's
       | absolutely not a game engine.
        
       | rrnechmech wrote:
       | > To mitigate auto-regressive drift during inference, we corrupt
       | context frames by adding Gaussian noise to encoded frames during
       | training. This allows the network to correct information sampled
       | in previous frames, and we found it to be critical for preserving
       | visual stability over long time periods.
       | 
       | I get this (mostly). But would any kind soul care to elaborate on
       | this? What is this "drift" they are trying to avoid and _how_
       | does (AFAIU) adding _noise_ help?
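       | 
       | As far as I can tell, the training-time trick is roughly the
       | following (my own illustrative sketch, not their code; the
       | noise scale is a made-up number):
       | 
       |     import torch
       | 
       |     def corrupt_context(ctx_latents, max_sigma=0.7):
       |         # context frames are themselves noised during
       |         # training, so the model learns to correct the
       |         # information sampled in previous frames rather
       |         # than trusting it verbatim
       |         sigma = max_sigma * torch.rand(1).item()
       |         noise = torch.randn_like(ctx_latents)
       |         return ctx_latents + sigma * noise
       | 
       | ...but I'd still love an intuition for why that keeps long
       | rollouts stable.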
        
       | gwern wrote:
       | People may recall GameGAN from May 2020:
       | 
       | https://arxiv.org/abs/2005.12126#nvidia
       | 
       | https://nv-tlabs.github.io/gameGAN/#nvidia
       | 
       | https://github.com/nv-tlabs/GameGAN_code
        
       | SeanAnderson wrote:
       | After some discussion in this thread, I found it worth pointing
       | out that this paper is NOT describing a system which receives
       | real-time user input and adjusts its output accordingly, but, to
       | me, the way the abstract is worded heavily implied this was
       | occurring.
       | 
       | It's trained on a large set of data in which agents played DOOM
       | and video samples are given to users for evaluation, but users
       | are not feeding inputs into the simulation in real-time in such a
       | way as to be "playing DOOM" at ~20FPS.
       | 
       | There are some key phrases within the paper that hint at this
       | such as "Key questions remain, such as ... how games would be
       | effectively created in the first place, including how to best
       | leverage human inputs" and "Our end goal is to have human players
       | interact with our simulation.", but mostly it's just the omission
       | of a section describing real-time user gameplay.
        
         | bob1029 wrote:
         | Were the agents playing at 20 real FPS, or did this occur like
         | a Pixar movie offline?
        
         | refibrillator wrote:
         | You are incorrect, this is an _interactive_ simulation that is
         | playable by humans.
         | 
         | > Figure 1: a human player is playing DOOM on GameNGen at 20
         | FPS.
         | 
         | The abstract is ambiguously worded which has caused a lot of
         | confusion here, but the paper is unmistakably clear about this
         | point.
         | 
         | Kind of disappointing to see this misinformation upvoted so
         | highly on a forum full of tech experts.
        
           | FrustratedMonky wrote:
            | Yeah. If it isn't doing this, then what could it be doing that
           | is worth a paper? "real-time user input and adjusts its
           | output accordingly"
        
             | rvnx wrote:
             | There is a hint in the paper itself:
             | 
             | It says in a shy way that it is based on: "Ha & Schmidhuber
             | (2018) who train a Variational Auto-Encoder (Kingma &
             | Welling, 2014) to encode game frames into a latent vector"
             | 
             | So it means they most likely took
             | https://worldmodels.github.io/ (that is actually open-
              | source) or something similar and swapped the frame
              | generation for Stable Diffusion, which was released in
              | 2022.
        
           | psb217 wrote:
           | If the generative model/simulator can run at 20FPS, then
           | obviously in principle a human could play the game in
           | simulation at 20 FPS. However, they do no evaluation of human
           | play in the paper. My guess is that they limited human evals
           | to watching short clips of play in the real engine vs the
           | simulator (which conditions on some number of initial frames
           | from the engine when starting each clip...) since the actual
           | "playability" is not great.
        
         | pajeets wrote:
         | I knew it was too good to be true, but it seems real-time video
         | generation can be good enough to get to a point where it feels
         | like a truly interactive video/game
         | 
         | Imagine if text2game was possible. there would be some sort of
         | network generating each frame from an image generated by text,
         | with some underlying 3d physics simulation to keep all the
         | multiplayer screens sync'd
         | 
         | this paper does not seem to demonstrate that possibility,
         | rather some cleverly chosen words to make you think people
         | were playing a real-time video. We can't even generate more
         | than 5-10 seconds of video without it hallucinating;
         | something this persistent would require an extreme amount of
         | gameplay video training. It can be done, but the video shown
         | by this paper is not true to its words.
        
         | Chance-Device wrote:
         | I also thought this, but refer back to the paper, not the
         | abstract:
         | 
         | > A is the set of key presses and mouse movements...
         | 
         | > ...to condition on actions, we simply learn an embedding
         | A_emb for each action
         | 
         | So, it's clear that in this model the diffusion process is
         | conditioned by embedding A that is derived from user actions
         | rather than words.
         | 
         | Then a noised start frame is encoded into latents and
         | concatenated onto the noise latents as a second conditioning.
         | 
         | So we have a diffusion model which is trained solely on images
         | of doom, and which is conditioned on current doom frames and
         | user actions to produce subsequent frames.
         | 
         | So yes, the users are playing it.
         | 
         | However, it should be unsurprising that this is possible. This
         | is effectively just a neural recording of the game. But it's a
         | cool tech demo.
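         | 
         | In (very) rough pseudo-PyTorch, the conditioning described
         | above looks something like this -- shapes and names are my
         | guesses, not the paper's code:
         | 
         |     import torch
         |     import torch.nn as nn
         | 
         |     NUM_ACTIONS, EMB_DIM, LAT_C = 32, 64, 4
         |     # A_emb: one learned vector per key/mouse action
         |     action_emb = nn.Embedding(NUM_ACTIONS, EMB_DIM)
         | 
         |     def condition(noisy_latents, ctx_latents, actions):
         |         # past-frame latents are concatenated onto the
         |         # noise latents along the channel dimension;
         |         # action embeddings replace the usual text
         |         # conditioning
         |         x = torch.cat([noisy_latents, ctx_latents], dim=1)
         |         a = action_emb(actions)
         |         return x, a
         | 
         |     noisy = torch.randn(1, LAT_C, 32, 32)
         |     ctx = torch.randn(1, LAT_C * 4, 32, 32)  # 4 past frames
         |     acts = torch.tensor([[3, 7, 7, 0]])      # recent inputs
         |     x, a = condition(noisy, ctx, acts)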
        
           | foota wrote:
           | I wonder if they could somehow feed in a trained Gaussian
           | splats model to this to get better images?
           | 
           | Since the splats are specifically designed for rendering it
           | seems like it would be an efficient way for the image model
           | to learn the geometry without having to encode it on the
           | image model itself.
        
             | Chance-Device wrote:
             | I'm not sure how that would help vs just training the model
             | with the conditionings described in the paper.
             | 
             | I'm not very familiar with Gaussian splats models, but
             | aren't they just a way of constructing images using
             | multiple superimposed parameterized Gaussian distributions,
             | sort of like the Fourier series does with waveforms using
             | sine and cosine waves?
             | 
             | I'm not seeing how that would apply here but I'd be
             | interested in hearing how you would do it.
        
           | psb217 wrote:
           | The agent never interacts with the simulator during training
            | or evaluation. There is no user, only an agent which was
            | trained to play the real game and which produced the
           | sequences of game frames and actions that were used to train
           | the simulator and to provide ground truth sequences of game
           | experience for evaluation. Their evaluation metrics are all
           | based on running short simulations in the diffusion model
           | which are initiated with some number of conditioning frames
           | taken from the real game engine. Statements in the paper
           | like: "GameNGen shows that an architecture and model weights
           | exist such that a neural model can effectively run a complex
           | game (DOOM) interactively on existing hardware." are wildly
           | misleading.
        
         | teamonkey wrote:
         | I think _someone_ is playing it, but it has a reduced set of
          | inputs and they're playing it in a very specific way (slowly,
         | avoiding looking back to places they've been) so as not to show
         | off the flaws in the system.
         | 
         | The people surveyed in this study are not playing the game,
         | they are watching extremely short video clips of the game being
         | played and comparing them to equally short videos of the
         | original Doom being played, to see if they can spot the
         | difference.
         | 
         | I may be wrong with how it works, but I think this is just
         | hallucinating in real time. It has no internal state per se, it
         | knows what was on screen in the previous few frames and it
         | knows what inputs the user is pressing, and so it generates the
         | next frame. Like with video compression, it probably doesn't
         | need to generate a full frame every time, just "differences".
         | 
         | As with all the previous AI game research, these are not games
         | in any real sense. They fall apart when played beyond any
         | meaningful length of time (seconds). Crucially, they are not
         | playable by anyone other than the developers in very controlled
         | settings. A defining attribute of any game is that it can be
         | played.
        
         | lewhoo wrote:
          | The movement of the player seems a bit jittery, so I inferred
         | something similar on that basis.
        
         | 7734128 wrote:
         | What you're describing reminded me of this cool project:
         | 
         | https://www.youtube.com/watch?v=udPY5rQVoW0 "Playing a Neural
         | Network's version of GTA V: GAN Theft Auto"
        
         | SeanAnderson wrote:
         | Ehhh okay, I'm not as convinced as I was earlier. Sorry for
         | misleading. There's been a lot of back-and-forth.
         | 
         | I would've really liked to see a section of the paper
         | explicitly call out that they used humans in real time. There's
         | a lot of sentences that led me to believe otherwise. It's clear
         | that they used a bunch of agents to simulate gameplay where
         | those agents submitted user inputs to affect the gameplay and
         | they captured those inputs in their model. This made it a bit
         | murky as to whether humans ever actually got involved.
         | 
         | This statement, "Our end goal is to have human players interact
         | with our simulation. To that end, the policy p as in Section 2
         | is that of human gameplay. Since we cannot sample from that
         | directly at scale, we start by approximating it via teaching an
         | automatic agent to play"
         | 
         | led me to believe that while they had an ultimate goal of user
         | input (why wouldn't they) they sufficed by approximating human
         | input.
         | 
         | I was looking to refute that assumption later in the paper by
         | hopefully reading some words on the human gameplay experience,
         | but instead, under Results, I found:
         | 
         | "Human Evaluation. As another measurement of simulation
         | quality, we provided 10 human raters with 130 random short
         | clips (of lengths 1.6 seconds and 3.2 seconds) of our
         | simulation side by side with the real game. The raters were
         | tasked with recognizing the real game (see Figure 14 in
         | Appendix A.6). The raters only choose the actual game over the
         | simulation in 58% or 60% of the time (for the 1.6 seconds and
         | 3.2 seconds clips, respectively)."
         | 
         | and it's like.. okay.. if you have a section in results on
         | human evaluation, and your goal is to have humans play, then
         | why are you talking just about humans reviewing video rather
         | than giving some sort of feedback on the human gameplay
         | experience - even if it's not especially positive?
         | 
         | Still, in the Discussion section, it mentions, "The second
         | important limitation are the remaining differences between the
         | agent's behavior and those of human players. For example, our
         | agent, even at the end of training, still does not explore all
         | of the game's locations and interactions, leading to erroneous
         | behavior in those cases." which makes it more clear that humans
         | gave input which went outside the bounds of the automatic
         | agents. It doesn't seem like this would occur if it were agents
         | simulating more input.
         | 
         | Ultimately, I think that the paper itself could've been more
         | clear in this regard, but clearly the publishing website tries
         | to be very explicit by saying upfront - "Real-time recordings
         | of people playing the game DOOM" and it's pretty hard to argue
         | against that.
         | 
         | Anyway. I repent! It was a learning experience going back and
         | forth on my belief here. Very cool tech overall.
        
           | psb217 wrote:
           | It's funny how academic writing works. Authors rarely produce
           | many unclear or ambiguous statements where the most likely
           | interpretation undersells their work...
        
         | dewarrn1 wrote:
         | The paper should definitely be more clear on this point, but
         | there's a sentence in section 5.2.3 that makes me think that
         | this was playable and played: "When playing with the model
         | manually, we observe that some areas are very easy for both,
         | some areas are very hard for both, and in some the agent
         | performs much better." It may be a failure of imagination, but
         | I can't think of another reasonable way of interpreting
         | "playing with the model manually".
        
         | ollin wrote:
         | We can't assess the quality of gameplay ourselves of course
         | (since the model wasn't released), but one author said "It's
         | playable, the videos on our project page are actual game play."
         | (https://x.com/shlomifruchter/status/1828850796840268009) and
         | the video on top of https://gamengen.github.io/ starts out with
         | "these are real-time recordings of people playing the game".
         | Based on those claims, it seems likely that they did get a
         | playable system in front of humans by the end of the project
         | (though perhaps not by the time the draft was uploaded to
         | arXiv).
        
       | Sohcahtoa82 wrote:
       | It's always fun reading the dead comments on a post like this.
        | People love to point out how pointless this is.
       | 
       | Some of ya'll need to learn how to make things _for the fun of
       | making things_. Is this useful? No, not really. Is it
       | interesting? Absolutely.
       | 
       | Not everything has to be made for profit. Not everything has to
       | be made to make the world a better place. Sometimes, people
       | create things just for the learning experience, the challenge, or
       | they're curious to see if something is possible.
       | 
       | Time spent enjoying yourself is never time wasted. Some of ya'll
       | are going to be on your death beds wishing you had allowed
       | yourself to have more fun.
        
         | Gooblebrai wrote:
          | So true. The hustle culture is a spreading disease that has
          | replaced the fun-maker culture of the 80s/90s.
          | 
          | It's unavoidable though. The cost of living being increasingly
          | expensive, and the romanticization of entrepreneurs as if they
          | were rock stars, leads towards this hustle mindset.
        
         | ninetyninenine wrote:
         | I don't think this is not useful. This is a stepping stone for
         | generating entire novel games.
        
           | Sohcahtoa82 wrote:
           | > This is a stepping stone for generating entire novel games.
           | 
           | I don't see how.
           | 
           | This game "engine" is purely mapping [pixels, input] -> new
           | pixels. It has no notion of game state (so you can kill an
           | enemy, turn your back, then turn around again, and the enemy
           | could be alive again), not to mention that it requires the
           | game to _already exist_ in order to train it.
           | 
           | I suppose, in theory, you could train the network to include
           | game state in the input and output, or potentially even
           | handle game state outside the network entirely and just make
           | it one of the inputs, but the output would be incredibly
           | noisy and nigh unplayable.
           | 
           | And like I said, all of it requires the game to already exist
           | in order to train the network.
        
             | airstrike wrote:
             | _> (so you can kill an enemy, turn your back, then turn
             | around again, and the enemy could be alive again)_
             | 
             | Sounds like a great game.
             | 
             |  _> not to mention that it requires the game to already
             | exist in order to train it_
             | 
             | Diffusion models create new images that did not previously
             | exist all of the time, so I'm not sure how that follows.
             | It's not hard to extrapolate from TFA to a model that
             | generically creates games based on some input
        
             | ninetyninenine wrote:
             | >It has no notion of game state (so you can kill an enemy,
             | turn your back, then turn around again)
             | 
              | Well, you see a wall, you turn around, then turn back,
              | and the wall is still there. With enough training data
              | the model will be
             | able to pick up the state of the enemy because it has
             | ALREADY learned the state of the wall due to much more
             | numerous data on the wall. It's probably impractical to do
             | this, but this is only a stepping stone like said.
             | 
             | > not to mention that it requires the game to already exist
             | in order to train it.
             | 
              | Is this a problem? Do games not exist? Not only do we have
             | tons of games, but we also have in theory unlimited amounts
             | of training data for each game.
        
               | Sohcahtoa82 wrote:
               | > Well you see a wall you turn around then turn back the
               | wall is still there. With enough training data the model
               | will be able to pick up the state of the enemy because it
               | has ALREADY learned the state of the wall due to much
               | more numerous data on the wall.
               | 
               | It's really important to understand that _ALL THE MODEL
               | KNOWS_ is a mapping of [pixels, input] - > new pixels. It
               | has zero knowledge of game state. The wall is still there
               | after spinning 360 degrees simply because it knows that
               | the image of a view facing away from the wall while
               | holding the key to turn right eventually becomes an image
               | of a view of the wall.
               | 
               | The only "state" that is known is the last few frames of
               | the game screen. Because of this, it's simply not
               | possible for the game model to know if an enemy should be
               | shown as dead or alive once it has been off-screen for
               | longer than those few frames. It also means that if you
               | keeping turning away and towards an enemy, it could
               | teleport around. Once it's off the screen for those few
               | frames, the model will have forgotten about it.
               | 
               | > Is this a problem? Do games not exist?
               | 
               | If you're trying to make a _new_ game, then you need
               | _new_ frames to train the model on.
        
               | ninetyninenine wrote:
               | >It's really important to understand that ALL THE MODEL
               | KNOWS is a mapping of [pixels, input] -> new pixels. It
               | has zero knowledge of game state.
               | 
                | This is false. What occurs inside the model is
               | unknown. It arranges pixel input and produces pixel
               | output as if it actually understands game state. Like
               | LLMs we don't actually fully understand what's going on
               | internally. You can't assume that models don't
               | "understand" things just because the high level training
               | methodology only includes pixel input and output.
               | 
               | >The only "state" that is known is the last few frames of
               | the game screen. Because of this, it's simply not
               | possible for the game model to know if an enemy should be
               | shown as dead or alive once it has been off-screen for
               | longer than those few frames. It also means that if you
               | keeping turning away and towards an enemy, it could
               | teleport around. Once it's off the screen for those few
               | frames, the model will have forgotten about it.
               | 
               | This is true. But then one could say it knows game state
               | for up to a few frames. That's different from saying the
               | model ONLY knows pixel input and pixel output. Very
               | different.
               | 
               | There are other tricks for long term memory storage as
               | well. Think Radar. Radar will capture the state of the
               | enemy beyond just visual frames so the model won't forget
               | an enemy was behind them.
               | 
               | Game state can also be encoded into some frame pixels at
               | the bottom lines. The Model can pick up on these
               | associations.
               | 
               | edit: someone mentioned that the game state lasts past a
               | few frames.
               | 
               | >If you're trying to make a new game, then you need new
               | frames to train the model on.
               | 
                | Right, so for a generative model, instead of training
                | the model on one game you would train it on multitudes
                | of games. The model would then, based off of a seed
                | number, output a new type of game.
               | 
               | Alternatively you could have a model generate a model.
               | 
               | All of what I'm saying is of course speculative. As I
               | said, this model is a stepping stone for the future. Just
                | like the LLM, which is only trivially helpful now but
                | can be a stepping stone toward replacing programmers
                | altogether.
        
             | throwthrowuknow wrote:
             | Read the paper. It is capable of maintaining state for a
             | fairly long time including updating the UI elements.
        
         | msk-lywenn wrote:
          | I'd like to know the carbon footprint of that fun.
        
         | ploxiln wrote:
         | The skepticism and criticism in this thread is against the hype
         | of AI, it's implied by people saying "this is so amazing" that
         | they think that in some near future you can create any video
         | game experience you can imagine by just replacing all the
         | software with some AI models, rendering the whole game.
         | 
         | When in reality this is the least efficient and reliable form
         | of Doom yet created, using literally millions of times the
         | computation used by the first x86 PCs that were able to render
         | and play doom in real-time.
         | 
         | But it's a funny party trick, sure.
        
       | KhoomeiK wrote:
       | NVIDIA did something similar with GANs in 2020 [1], except users
       | _could_ actually play those games (unlike in this diffusion work
       | which just plays back simulated video). Sentdex later adapted
       | this to play GTA with a really cool demo [2].
       | 
       | [1] https://research.nvidia.com/labs/toronto-ai/gameGAN/
       | 
       | [2] https://www.youtube.com/watch?v=udPY5rQVoW0
        
       | throwthrowuknow wrote:
       | Several thoughts for future work:
       | 
       | 1. Continue training on all of the games that used the Doom
       | engine to see if it is capable of creating new graphics, enemies,
       | weapons, etc. I think you would need to embed more details for
       | this perhaps information about what is present in the current
       | level so that you could prompt it to produce a new level from
       | some combination.
       | 
       | 2. Could embedding information from the map view or a raytrace of
       | the surroundings of the player position help with consistency? I
       | suppose the model would need to predict this information as the
       | neural simulation progressed.
       | 
       | 3. Can this technique be applied to generating videos with
       | consistent subjects and environments by training on a camera view
       | of a 3D scene and embedding the camera position and the position
       | and animation states of objects and avatars within the scene?
       | 
       | 4. What would the result of training on a variety of game engines
       | and games with different mechanics and inputs be? The space of
       | possible actions is limited by the available keys on a keyboard
       | or buttons on a controller but the labelling of the
       | characteristics of each game may prove a challenge if you wanted
       | to be able to prompt for specific details.
        
       | danielmarkbruce wrote:
       | What is the point of this? It's hard to see how this is useful.
       | Maybe it's just an exercise to show what a diffusion model can
       | do?
        
       | Kapura wrote:
       | What is useful about this? I am a game programmer, and I cannot
       | imagine a world where this improves any part of the development
       | process. It seems to me to be a way to copy a game without
       | literally copying the assets and code; plagiarism with extra
       | steps. What am I missing?
        
       | jasonkstevens wrote:
        | AI no longer plays Doom -- it _is_ Doom.
        
       ___________________________________________________________________
       (page generated 2024-08-28 23:01 UTC)