[HN Gopher] Show HN: Beating Pokemon Red with RL and <10M Parameters
___________________________________________________________________
Show HN: Beating Pokemon Red with RL and <10M Parameters
Hi everyone! After spending hundreds of hours, we're excited to
finally share our progress in developing a reinforcement learning
system to beat Pokemon Red. Our system successfully completes the
game using a policy under 10M parameters, PPO, and a few novel
techniques. With the release of Claude Plays Pokemon, now feels
like the perfect time to showcase our work. We'd love to get
feedback!
Author : drubs
Score : 172 points
Date : 2025-03-05 17:07 UTC (1 day ago)
(HTM) web link (drubinstein.github.io)
(TXT) w3m dump (drubinstein.github.io)
| jononor wrote:
| Very nice! Great to see demonstrations of reinforcement learning
| being used to solve non-trivial tasks.
| xinpw8 wrote:
| This is a world first, isn't it?
| worble wrote:
| Heads up, clicking "Next Page" just takes you to an empty screen;
| you have to use the navigation links on the left if you want to
| read past the first screen.
| drubs wrote:
| Thanks for the heads up. I just pushed a fix.
| worble wrote:
| I think you fixed the one below the puffer.ai image, but not
| the one above Authors.
| drubs wrote:
| and...fixed!
| xinpw8 wrote:
| i am sorry for my awful qa on the site :((((((((((((
| bee_rider wrote:
| Ah, very neat.
|
| Maybe some day the "rival" character in Pokemon can be played by
| a RL system, haha. That way you can have a "real player
| (simulated)" for your rival.
| xinpw8 wrote:
| a cool idea, except that battling actually doesn't even matter
| to the ai. if you look at what the agent is doing during a
| battle, it is sort of spamming options + picking damaging
| attacks. it would be a stretch to say the agents are 'good'
| at battling...
| wegfawefgawefg wrote:
| if you've done the work to make the rival rl-based and have
| the ability to go around, you'd probably have added basic
| battle controls
| xinpw8 wrote:
| as it stands, battling is wholly unimportant to completing
| the game, as long as the agents can eventually complete the
| trainer battles mandatory for plot advancement. it's funny
| because everyone thinks about battling when they think
| about pokemon. my first fn i wrote, back when we were still
| bumping around pallet town, was a battle reward function.
| it was trash and didn't work and was over-complicated. the
| crux of the problem is exploration over a vast, open-world
| map, and completion of the sundry storyline tasks at distal
| parts of said map in the correct sequence without the
| policy collapsing and without agents overfitting to, say,
| overworld loops.
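|
| the flavor of that exploration signal is something like this (a
| minimal sketch, not our actual code; it assumes the env can read
| the player's map id and tile coordinates each step):
|
|   class ExplorationReward:
|       """pay 1.0 the first time a (map, x, y) tile is seen in
|       an episode, 0.0 on revisits."""
|
|       def __init__(self):
|           self.seen = set()
|
|       def reset(self):  # call at episode start
|           self.seen.clear()
|
|       def __call__(self, map_id, x, y):
|           tile = (map_id, x, y)
|           if tile in self.seen:
|               return 0.0
|           self.seen.add(tile)
|           return 1.0
|
| the overworld-loop failure mode shows up when revisits keep
| paying out, e.g. if the set is keyed too coarsely or gets
| reset mid-episode.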
| wegfawefgawefg wrote:
| you missed my point.
|
| I know all about rl. I've read go-explore 1/2, and I have
| personally implemented intrinsic curiosity.
|
| I was just commenting on what the other person said,
| which is that it would be cool to have the npcs be agents
| that battle and train too, to which you said they could
| not be made to, to which I say, we have the technology.
| :)
| drubs wrote:
| Sounds cool to me.
| modeless wrote:
| Can't Pokemon be beaten by almost random play?
| VertanaNinjai wrote:
| It can be brute forced if that's what you mean. It has a fairly
| low difficulty curve and these old games have a grid system for
| movement and action selections. That's why they're pointing out
| the lower parameter amount and CPU. The point I took away is
| doing more with less.
| xinpw8 wrote:
| It definitely cannot be beaten using random inputs. It
| doesn't even get out of Pallet Town after billions of steps.
| We tested...
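|
| for reference, "random inputs" means a baseline like this (a
| sketch using PyBoy; window="null" and the button() helper are
| PyBoy 2.x, adjust for older versions):
|
|   import random
|   from pyboy import PyBoy
|
|   BUTTONS = ["a", "b", "start", "select",
|              "up", "down", "left", "right"]
|
|   pyboy = PyBoy("pokemon_red.gb", window="null")  # headless
|   for _ in range(1_000_000):  # press-wise, days of game time
|       pyboy.button(random.choice(BUTTONS))  # press, auto-release
|       for _ in range(24):  # let the game react between presses
|           pyboy.tick()
|   pyboy.stop()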
| fancyswimtime wrote:
| the game has been beaten by fish
| xinpw8 wrote:
| dyor we only tested it with a pufferfish, courtesy of
| puffer.ai / pufferlib RL library. i promise it doesn't
| work with random inputs.
| gusgus01 wrote:
| I'm not sure if you're just making a play on words, but I
| believe the commenter was talking about the streamer who
| sets up their fishtank to map to inputs and then lets
| their fish "play games". They beat pokemon sapphire
| supposedly.
| https://www.polygon.com/2020/11/9/21556590/fish-pokemon-
| sapp...
| cjbillington wrote:
| Based on the other examples of random inputs not being
| sufficient, I dare say the fish-based attempt may have
| been fraudulent.
| tehsauce wrote:
| It's impossible to beat with random actions or brute force, but
| you can get surprisingly far. It doesn't take too long to get
| halfway through route 1, but even with insane compute you'll
| never even make it to viridian forest.
| bloomingkales wrote:
| The win condition of the game is the entire state of the game
| configured in a certain way. So there exist many win
| conditions; you just have to do a search.
| drdeca wrote:
| Judging by the "pi plays Pokemon Sapphire", uh, not in a
| reasonable amount of time? It's been at it for over 3 years,
| hasn't gotten a gym badge yet, mostly stays in the starting
| town.
| bubblyworld wrote:
| What an awesome project! I'm curious - I would have thought that
| rewarding unique coordinates would be enough to get the agent to
| (eventually) explore all areas, including the key ones. What did
| the agents end up doing before key areas got an extra reward?
|
| (and how on earth did you port Pokemon red to a RL environment?
| O.o)
| drubs wrote:
| The environments wouldn't concentrate enough in the Rocket
| Hideout beneath Celadon Game Corner. The agent would have the
| player wander the world, reward hacking. With wild battles
| enabled, the environments would end up in Lavender Tower
| fighting Gastly.
|
| > (and how on earth did you port Pokemon red to a RL
| environment? O.o)
|
| Read and find out :)
| bubblyworld wrote:
| Thanks haha, I kept reading =D I see, so it's not just that
| you have to _visit_ the key areas, they need to show up in
| the episodes enough to provide a signal for training.
| drubs wrote:
| Yup!
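| Concretely, it's enough to scale the first-visit reward in
| story-critical maps so that rollouts reaching them dominate the
| return (a sketch; the names and multipliers here are made up,
| the real map constants come from the decomp):
|
|   # hypothetical multipliers keyed on hypothetical map names
|   KEY_MAPS = {"ROCKET_HIDEOUT_B1F": 5.0, "SAFARI_ZONE_EAST": 3.0}
|
|   def shaped_reward(map_name, novelty_reward):
|       # amplify novelty in maps the storyline requires, so
|       # episodes that reach them carry a stronger signal
|       return novelty_reward * KEY_MAPS.get(map_name, 1.0)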
| wegfawefgawefg wrote:
| you don't port it, you wrap it. you can put anything in an rl
| environment. usually emulators are done with bizhawk, and some
| lua. worst case there's ffi or screen capture.
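|
| the wrapper pattern is roughly this (a sketch with gymnasium +
| PyBoy rather than bizhawk; screen.ndarray and button() are
| PyBoy 2.x APIs, and the reward is left as a stub):
|
|   import gymnasium as gym
|   import numpy as np
|   from pyboy import PyBoy
|
|   class RedEnv(gym.Env):
|       BUTTONS = ["a", "b", "start", "up", "down", "left", "right"]
|
|       def __init__(self, rom="pokemon_red.gb"):
|           self.pyboy = PyBoy(rom, window="null")
|           self.action_space = gym.spaces.Discrete(len(self.BUTTONS))
|           self.observation_space = gym.spaces.Box(
|               0, 255, (144, 160, 4), np.uint8)  # RGBA screen
|
|       def _obs(self):
|           return np.asarray(self.pyboy.screen.ndarray)
|
|       def reset(self, seed=None, options=None):
|           super().reset(seed=seed)
|           # real envs load a saved state here, not a cold boot
|           return self._obs(), {}
|
|       def step(self, action):
|           self.pyboy.button(self.BUTTONS[action])
|           for _ in range(24):  # frame-skip between decisions
|               self.pyboy.tick()
|           reward = 0.0  # reward shaping goes here
|           return self._obs(), reward, False, False, {}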
| drubs wrote:
| My first version of this project 5 years ago involved a
| python-lua named pipe using Bizhawk actually. No clue where
| that code went
| bubblyworld wrote:
| Right, my thought was that this would be way too slow for
| episode rollout (versus an accelerated implementation in jax
| or something), but I guess not!
| rvz wrote:
| Note: What makes this interesting is that this is a pre-LLM
| project which shows that some projects don't need an "LLM".
| All you need is a plain old reinforcement learning algorithm
| and a small deep neural network, which is a perfect fit here.
|
| This is what I want to see more of, and it goes against the
| LLM hype. What a great RL project.
|
| Meanwhile, "Claude" is still stuck somewhere in the game. Imagine
| the costs of running that vs this project.
| mclau156 wrote:
| Claude 3.7 recently failed to finish Pokemon after getting
| stuck in a corner and deciding it was impossible to get out
| xinpw8 wrote:
| not our agents a hierarchical approach would be superior. add
| rl to claude and it's gg
| mclau156 wrote:
| Could you have used the decompilations of pokemon on github?
| https://github.com/pret/pokered
| drubs wrote:
| There's an entire section on how the decompilations were used
| :)
| mclau156 wrote:
| Ok sorry, I thought maybe there was a chance that the decomp
| project could be edited in a way that would create a ROM that
| made RL easier, but it seems like it just came in handy for
| looking up values, along with the GB ASM tutorial. The
| alternative in my thought process was re-creating pokemon red
| in a modern language, which you also mentioned.
| xinpw8 wrote:
| if you helped with pret then god bless you
| levocardia wrote:
| Really cool work. It seems like some critical areas (team rocket,
| safari zone) rely on encoding game knowledge into the reward
| function somehow, which "smuggles in" external intelligence about
| the game. A lot of these are related to planning, which makes me
| wonder whether you could "bolt on" an LLM to do things like steer
| the RL agent, dynamically choose what to reward, or even do some
| of the planning itself. Do you think there's any low-hanging
| fruit on this front?
| drubs wrote:
| Wrote about this in the results section. I think there is a way
| to mix the two and simplify the rewards in the process. A lot
| of the magic behind getting the agent to teach and use cut
| probably could have been handled by an LLM.
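| Something in this spirit (purely hypothetical, llm() standing
| in for any text-completion call; nothing like this is in the
| writeup):
|
|   def propose_subgoal(llm, recent_dialogue):
|       # ask the LLM for the next story objective
|       prompt = ("Recent Pokemon Red game text:\n" + recent_dialogue
|                 + "\nName the one map the player should reach next.")
|       return llm(prompt).strip()
|
|   def subgoal_bonus(current_map, subgoal, scale=10.0):
|       # a one-off bonus could replace hand-written rewards
|       # like the ones for teaching and using cut
|       return scale if current_map == subgoal else 0.0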
| Xelynega wrote:
| For well-known games like "Pokemon Red" I wonder how much of
| that game knowledge would be "smuggled in" by an LLM in its
| training data if you just replaced the external info in the
| reward function with it/used it to make up for other
| deficiencies.
|
| I think they allude to this in their conclusion, but it's less
| about the low-hanging fruit and more about designing a system
| to feed game dialogue back into the RL decision-making process
| in a way that can be mutated as part of the RL (be it an LLM
| or something else).
| differintegral wrote:
| This is very cool, congrats!
|
| I wonder, does anyone have a sense of the approximate raw number
| of button presses required to beat the game? Mostly curious to
| see how that compares to the parameter count.
| tarentel wrote:
| I imagine < 10000.
| https://github.com/KeeyanGhoreshi/PokemonFireredSingleSequen...
| and https://www.youtube.com/watch?v=6gjsAA_5Agk. I believe that
| run is something like 200k inputs, in a slightly different game.
| Quite a bit less than 10M either way.
| benopal64 wrote:
| Incredible work. I am just learning about PyBoy from your
| project, and it made me think of many fun ways to use that
| library to play Pokemon autonomously.
| xinpw8 wrote:
| Very good to hear. Join the pyboy/pokemon discords!
| https://discord.gg/UXpjQTgs https://discord.gg/EVS3tAGm
| kerkeslager wrote:
| Are there any uses for AI yet that _aren't_ either:
|
| 1. Doing things humans do for fun.
|
| 2. Doing things that AI is horribly terrible at.
|
| ?
| sadeshmukh wrote:
| Medical field, spotting things
|
| Autonomous drones
|
| Financial fraud detection
|
| Scheduling of trains/buses/etc
|
| I personally do like chatbots but you probably don't
| xinpw8 wrote:
| the only chatbot for me is smarterchild
| bigfishrunning wrote:
| I feel like that sentence aged me.
| drubs wrote:
| There are tons of applications for AI. Back when I was at
| Spotify, I co-authored Basic Pitch
| (https://basicpitch.spotify.com/), an audio-to-midi library.
| There are many uses for AI outside of what's heavily
| publicized.
| nimish wrote:
| Considering how many things are less complicated than Pokemon,
| this is very cool
| novia wrote:
| Please stream the gameplay to twitch so people can compare.
| tehsauce wrote:
| We have a shared community map where you can watch hundreds of
| agents from multiple people's training runs playing in real
| time!
|
| https://pwhiddy.github.io/pokerl-map-viz/
| Matthyze wrote:
| That's amazing. Really awesome work.
| endofreach wrote:
| > Pokemon Red takes 25 hours on average for a new player to
| complete.
|
| Seriously? I've never really played video games, but I remember
| spending so much time on pokemon red when I was young. Not sure
| if I ever really finished it more than once. But I'm pretty
| sure I must have played for more than 50h or so before even
| getting close to finishing. My memory might be tricking me
| though.
|
| Not sure which pokemon version it was, but I got so hooked
| trying to get this "secret" pokemon which was just a bunch of
| pixels. Some kind of bug (of the game, not the type of
| pokemon). You had to do specific things in a park and other
| things and then surf up and down x times on the right shore of
| an island... or something like that. I had no idea how it
| worked and got so hooked, I must have spent most of my playing
| time on things like that.
|
| Oh boy, memories...
| ludicity wrote:
| It definitely took me way more than 25 hours as a kid to beat
| Pokemon Blue! But I was so young that I didn't understand that
| "Oak: Hello!" meant that someone called Oak was talking.
|
| The glitched Pokemon you're talking about is Missingno by the
| way! I remember surfing up and down Cinnabar Island to do the
| same thing.
| xinpw8 wrote:
| i had to look up how to do cut. like, i was hard-stuck.
| endofreach wrote:
| Awesome! Missingno was what i meant. Thank you!
| Uehreka wrote:
| There's a guy on Youtube named JRose11 who is on a quest to
| beat Pokemon Red with all 151 of the original Pokemon
| individually. He's about 100 Pokemon in at this point. He
| doesn't use crazy speedrunning tactics (he wants to approximate
| a normal-ish playthrough) but because he knows exactly where to
| go, what to do and what's skippable almost all of his runs are
| under 10 hours (many are under 6 and he did it with Mewtwo in
| just under 2).
| oreally wrote:
| The estimates seem to be today's reported numbers based on
| howlongtobeat. Back in the day it was intended to last 60
| hours iirc.
| throwaway314155 wrote:
| Awesome! Why do you think the reward for reading signs helped?
| I'm assuming the model doesn't gain the ability to read and
| understand English just from RL, so what purpose does it serve
| other than to maybe waste ticks on signs that ultimately don't
| need to be read?
| drubs wrote:
| It's silly, but signs were a way to incentivize the agent to
| explore deeper into the Safari Zone among other areas.
| N_Lens wrote:
| Wow nice work. 10M is a tiny model and I suspect this might be
| the future for specialised work. I can also imagine progress
| towards AGI/ASI using smaller models like this as submodules.
|
| Brains basically have "modules" like this as well - neuronal
| columns that handle specialised tasks. For example, when you're
| driving on the road, understanding whether the distance
| between you and the vehicle in front is increasing or
| decreasing is a finely tuned function of a specialised part of
| the brain.
| KeplerBoy wrote:
| Really missing the arxiv link. The whole page reads like the
| arxiv link should be in the next paragraph, but it never
| appeared.
___________________________________________________________________
(page generated 2025-03-06 23:01 UTC)