hngopher.com

       [HN Gopher] Show HN: Beating Pokemon Red with RL and <10M Parame...
       ___________________________________________________________________
        
       Show HN: Beating Pokemon Red with RL and <10M Parameters
        
       Hi everyone!  After spending hundreds of hours, we're excited to
       finally share our progress in developing a reinforcement learning
       system to beat Pokemon Red. Our system successfully completes the
       game using a policy under 10M parameters, PPO, and a few novel
       techniques. With the release of Claude Plays Pokemon, now feels
       like the perfect time to showcase our work.  We'd love to get
       feedback!
        
       Author : drubs
       Score  : 172 points
       Date   : 2025-03-05 17:07 UTC (1 days ago)
        
 (HTM) web link (drubinstein.github.io)
 (TXT) w3m dump (drubinstein.github.io)
        
       | jononor wrote:
       | Very nice! Nice to see demonstrations of reinforcement learning
       | being used to solve non-trivial tasks.
        
       | xinpw8 wrote:
       | This is a first-in-world, isn't it?
        
       | worble wrote:
       | Heads up, clicking "Next Page" just takes you to an empty screen,
       | you have to use the navigation links on the left if you want to
       | get read past the first screen.
        
         | drubs wrote:
         | Thanks for the heads up. I just pushed a fix.
        
           | worble wrote:
           | I think you fixed the one below the puffer.ai image, but not
           | the one above Authors.
        
             | drubs wrote:
             | and...fixed!
        
               | xinpw8 wrote:
               | i am sorry for my awful qa on the site :((((((((((((
        
       | bee_rider wrote:
       | Ah, very neat.
       | 
       | Maybe some day the "rival" character in Pokemon can be played by
       | a RL system, haha. That way you can have a "real player
       | (simulated)" for your rival.
        
         | xinpw8 wrote:
         | a cool idea, except that battling actually doesn't even matter
         | to the ai. if you look at what the agent is doing during a
         | battle, it is sort of spamming options + picking damaging
         | attacks. it would be a stretch to say that agents were 'good'
         | at battling...
        
           | wegfawefgawefg wrote:
           | if youve done the work to to make the rival rl based and have
           | the ability to go around youd probably have added basic
           | battle controls
        
             | xinpw8 wrote:
             | as it stands, battling is wholly unimportant to completing
             | the game, as long as the agents can eventually complete the
             | trainer battles mandatory for plot advancement. it's funny
             | because everyone thinks about battling when they think
             | about pokemon. my first fn i wrote, back when we were still
             | bumping around pallet town, was a battle reward function.
             | it was trash and didn't work and was over-complicated. the
             | crux of the problem is exploration over a vast, open-world
             | map, and completion of the sundry storyline tasks at distal
             | parts of said map in the correct sequence without the
             | policy collapsing and without agents overfitting to, say,
             | overworld loops.
        
               | wegfawefgawefg wrote:
               | you missed my point.
               | 
               | I know all about rl. Ive read go-explore 1/2, and I have
               | personally implemented intrinsic curiosity.
               | 
               | I was just commenting on what rhe other person said,
               | which is that it would be cool to have the npcs be agents
               | that battle and train too, to which you said they could
               | not be made to, to which I say, we have the technology.
               | :)
        
               | drubs wrote:
               | Sounds cool to me.
        
       | modeless wrote:
       | Can't Pokemon be beaten by almost random play?
        
         | VertanaNinjai wrote:
         | It can be brute forced if that's what you mean. It has a fairly
         | low difficulty curve and these old games have a grid system for
         | movement and action selections. That's why they're pointing out
         | the lower parameter amount and CPU. The point I took away is
         | doing more with less.
        
           | xinpw8 wrote:
           | It definitely cannot be beaten using random inputs. It
           | doesn't even get out of Pallet Town after billions of steps.
           | We tested...
        
             | fancyswimtime wrote:
             | the game has been beaten by fish
        
               | xinpw8 wrote:
               | dyor we only tested it with a pufferfish, courtesy of
               | puffer.ai / pufferlib RL library. i promise it doesn't
               | work with random inputs.
        
               | gusgus01 wrote:
               | I'm not sure if you're just making a play on words, but I
               | believe the commenter was talking about the streamer who
               | sets up their fishtank to map to inputs and then let's
               | their fish "play games". They beat pokemon sapphire
               | supposedly.
               | https://www.polygon.com/2020/11/9/21556590/fish-pokemon-
               | sapp...
        
               | cjbillington wrote:
               | Based on the other examples of random inputs not being
               | sufficient, I dare say the fish-based attempt may have
               | been fraudulent.
        
         | tehsauce wrote:
         | It's impossible to beat with random actions or brute force, but
         | you can get surprisingly far. It doesn't take too long to get
         | halfway through route 1, but even with insane compute you'll
         | never make it even to viridian forest.
        
         | bloomingkales wrote:
         | The win condition of the game is the entire state of the game
         | configured in a certain way. So there exists a lot of win
         | conditions, you just have to do a search.
        
         | drdeca wrote:
         | Judging by the "pi plays Pokemon Sapphire", uh, not in a
         | reasonable amount of time? It's been at it for over 3 years,
         | hasn't gotten a gym badge yet, mostly stays in the starting
         | town.
        
       | bubblyworld wrote:
       | What an awesome project! I'm curious - I would have thought that
       | rewarding unique coordinates would be enough to get the agent to
       | (eventually) explore all areas, including the key ones. What did
       | the agents end up doing before key areas got an extra reward?
       | 
       | (and how on earth did you port Pokemon red to a RL environment?
       | O.o)
        
         | drubs wrote:
         | The environments wouldn't concentrate enough in the Rocket
         | Hideout beneath Celadon Game Corner. The agent would have the
         | player wander the world reward hacking. With wild battles
         | enabled, the environments would end up in Lavender Tower
         | fighting Gastly.
         | 
         | > (and how on earth did you port Pokemon red to a RL
         | environment? O.o)
         | 
         | Read and find out :)
        
           | bubblyworld wrote:
           | Thanks haha, I kept reading =D I see, so it's not just that
           | you have to _visit_ the key areas, they need to show up in
           | the episodes enough to provide a signal for training.
        
             | drubs wrote:
             | Yup!
        
         | wegfawefgawefg wrote:
         | you dont port it you wrap it. you can put anything in an rl
         | environment. usually emulators are done with bizhawk, and some
         | lua. worst case theres ffi or screen capture.
        
           | drubs wrote:
           | My first version of this project 5 years ago involved a
           | python-lua named pipe using Bizhawk actually. No clue where
           | that code went
        
           | bubblyworld wrote:
           | Right, my thought was that this would be way too slow for
           | episode rollout (versus an accelerated implementation in jax
           | or something), but I guess not!
        
       | rvz wrote:
       | Note: What makes this interesting is that this is a pre-LLM
       | project which shows that in some projects you don't need an "LLM"
       | for this. All you need is just a plain old reinforcement learning
       | algorithm and a deep neural network which is perfect for this.
       | 
       | This is what I want to see more of and goes against the hype of
       | LLMs. What a great RL project.
       | 
       | Meanwhile, "Claude" is still stuck somewhere in the game. Imagine
       | the costs of running that vs this project.
        
         | mclau156 wrote:
         | Claude 3.7 recently failed to finish Pokemon after getting
         | stuck in a corner and deciding it was impossible to get out
        
           | xinpw8 wrote:
           | not our agents a hierarchical approach would be superior. add
           | rl to claude and it's gg
        
       | mclau156 wrote:
       | Could you have used the decompilations of pokemon on github?
       | https://github.com/pret/pokered
        
         | drubs wrote:
         | There's an entire section on how the decompilations were used
         | :)
        
           | mclau156 wrote:
           | Ok sorry I thought maybe there was a chance that the decomp
           | project could edited in a way that would create a ROM that
           | allowed RL to be done easier, but it seems like it just came
           | in handy for looking up values along with the GB ASM
           | tutorial, the alternative of my thought process is re-
           | creating pokemon red in a modern language which you also
           | mentioned
        
         | xinpw8 wrote:
         | if you helped with pret then god bless you
        
       | levocardia wrote:
       | Really cool work. It seems like some critical areas (team rocket,
       | safari zone) rely on encoding game knowledge into the reward
       | function somehow, which "smuggles in" external intelligence about
       | the game. A lot of these are related to planning, which makes me
       | wonder whether you could "bolt on" an LLM to do things like steer
       | the RL agent, dynamically choose what to reward, or even do some
       | of the planning itself. Do you think there's any low-hanging
       | fruit on this front?
        
         | drubs wrote:
         | Wrote about this in the results section. I think there is a way
         | to mix the two and simplify the rewards in the process. A lot
         | of the magic behind getting the agent to teach and use cut
         | probably could have been handled by an LLM.
        
         | Xelynega wrote:
         | For well-known games like "Pokemon Red" I wonder how much of
         | that game knowledge would be "smuggled in" by an LLM in it's
         | training data if you just replaced the external info in the
         | reward function with it/used it to make up for other
         | deficiencies.
         | 
         | I think they allude to this in their conclusion, but it's less
         | about the low-hanging fruit and more about designing a system
         | to feedback game dialogue into the RL decision making process
         | in a way that can be mutated as part of the RL(be it an LLM or
         | something else)
        
       | differintegral wrote:
       | This is very cool, congrats!
       | 
       | I wonder, does anyone have a sense of the approximate raw number
       | of button presses required to beat the game? Mostly curious to
       | see how that compares to the parameter count.
        
         | tarentel wrote:
         | I imagine < 10000.
         | https://github.com/KeeyanGhoreshi/PokemonFireredSingleSequen...
         | and https://www.youtube.com/watch?v=6gjsAA_5Agk. I believe this
         | is something like 200k and is a slightly different game. Quite
         | a bit less than 10m either way.
        
       | benopal64 wrote:
       | Incredible work. I am just learning about PyBoy from your
       | project, and it made me think of many fun ways to use that
       | library to play Pokemon autonomously.
        
         | xinpw8 wrote:
         | Very good to hear. Join the pyboy/pokemon discords!
         | https://discord.gg/UXpjQTgs https://discord.gg/EVS3tAGm
        
       | kerkeslager wrote:
       | Are there any uses for AI yet that _aren 't_ either:
       | 
       | 1. Doing things humans do for fun. 2. Doing things that AI is
       | horribly terrible at.
       | 
       | ?
        
         | sadeshmukh wrote:
         | Medical field, spotting things
         | 
         | Autonomous drones
         | 
         | Financial fraud detection
         | 
         | Scheduling of trains/buses/etc
         | 
         | I personally do like chatbots but you probably don't
        
           | xinpw8 wrote:
           | the only chatbot for me is smarterchild
        
             | bigfishrunning wrote:
             | I feel like that sentence aged me.
        
         | drubs wrote:
         | There's a ton of applications for AI. Back when I was at
         | Spotify, I co-authored Basic Pitch
         | (https://basicpitch.spotify.com/), an audio-to-midi library.
         | There are a ton of uses for AI outside of what's heavily
         | publicized.
        
       | nimish wrote:
       | Considering how many things are less complicated than Pokemon,
       | this is very cool
        
       | novia wrote:
       | Please stream the gameplay to twitch so people can compare.
        
         | tehsauce wrote:
         | We have a shared community map where you can watch hundreds of
         | agents from multiple peoples training runs playing in real
         | time!
         | 
         | https://pwhiddy.github.io/pokerl-map-viz/
        
           | Matthyze wrote:
           | That's amazing. Really awesome work.
        
       | endofreach wrote:
       | > Pokemon Red takes 25 hours on average for a new player to
       | complete.
       | 
       | Seriously? I've never really played video games, but i remember
       | spending so much time on pokemon red when i was young. Not sure
       | if i ever really finished more than once. But i'm pretty sure i
       | must have played for more than 50h or so before even close to
       | finish. My memory might trick me though.
       | 
       | Not sure which pokemon version it was, but i got so hooked trying
       | to get this "secret" pokemon which was just a bunch of pixels.
       | Some kind of bug (of the game, not the type of pokemon). You had
       | to do specific things in a park and other things and then surf up
       | and down x-times on the right shore of an island... or something
       | like that. I had no idea how it worked and got so hooked, i must
       | have spent most of my playing time on things like that.
       | 
       | Oh boy, memories...
        
         | ludicity wrote:
         | It definitely took me way more than 25 hours as a kid to beat
         | Pokemon Blue! But I was so young that I didn't understand that
         | "Oak: Hello!" meant that someone called Oak was talking.
         | 
         | The glitched Pokemon you're talking about is Missingno by the
         | way! I remember surfing up and down Cinnabar Island to do the
         | same thing.
        
           | xinpw8 wrote:
           | i had to look up how to do cut. like, i was hard-stuck.
        
           | endofreach wrote:
           | Awesome! Missingno was what i meant. Thank you!
        
         | Uehreka wrote:
         | There's a guy on Youtube named JRose11 who is on a quest to
         | beat Pokemon Red with all 151 of the original Pokemon
         | individually. He's about 100 Pokemon in at this point. He
         | doesn't use crazy speedrunning tactics (he wants to approximate
         | a normal-ish playthrough) but because he knows exactly where to
         | go, what to do and what's skippable almost all of his runs are
         | under 10 hours (many are under 6 and he did it with Mewtwo in
         | just under 2).
        
         | oreally wrote:
         | The estimates seem to be in today's reported numbers based off
         | howlongtobeat. Back in the day it was intended to last 60hours
         | iirc.
        
       | throwaway314155 wrote:
       | Awesome! Why do you think the reward for reading signs helped?
       | I'm assuming the model doesn't gain the ability to read and
       | understand english just from RL, so what purpose does it serve
       | other than to maybe waste ticks on signs that ultimately don't
       | need to be read?
        
         | drubs wrote:
         | It's silly, but signs were a way to incentivize the agent to
         | explore deeper into the Safari Zone among other areas.
        
       | N_Lens wrote:
       | Wow nice work. 10M is a tiny model and I suspect this might be
       | the future for specialised work. I can also imagine the progress
       | towards AGI/ASI to have smaller models used as submodules.
       | 
       | brains basically have "modules" like this as well - neuronal
       | columns that handle specialised tasks. For example when you're
       | driving on the road, the understanding whether the distance
       | between you and the vehicle in front is increasing or decreasing
       | is a finely tuned function of a specialised part of the brain.
        
       | KeplerBoy wrote:
       | Really missing the arxiv link. The whole page reads like the
       | arxiv link should be in the next paragraph, but it never
       | appeared.
        
       ___________________________________________________________________
       (page generated 2025-03-06 23:01 UTC)