[HN Gopher] Deep Reinforcement Learning: Zero to Hero
___________________________________________________________________
Deep Reinforcement Learning: Zero to Hero
Author : alessiodm
Score : 482 points
Date : 2024-05-05 23:12 UTC (23 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| alessiodm wrote:
| While trying to learn the latest in Deep Reinforcement
| Learning, I was able to take advantage of many excellent
| resources (see credits [1]), but I couldn't find one that
| struck the right balance between theory and practice for my
| personal learning experience. So I decided to create something
| myself and open-source it for the community, in case it might
| be useful to someone else.
|
| None of that would have been possible without all the resources
| listed in [1], but I rewrote all algorithms in this series of
| Python notebooks from scratch, with a "pedagogical approach" in
| mind. It is a hands-on, step-by-step tutorial about Deep
| Reinforcement Learning techniques (up to ~2018/2019 SoTA),
| guiding you through theory and coding exercises on the most
| widely used algorithms (Q-Learning, DQN, SAC, PPO, etc.)
|
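| To give a flavor of where the notebooks start, here is a
| minimal tabular Q-Learning sketch on a Gymnasium environment
| (illustrative only, not the notebooks' actual code):
|
|   import gymnasium as gym
|   import numpy as np
|
|   env = gym.make("FrozenLake-v1")
|   n_s, n_a = env.observation_space.n, env.action_space.n
|   Q = np.zeros((n_s, n_a))
|   alpha, gamma, eps = 0.1, 0.99, 0.1
|
|   for episode in range(5000):
|       s, _ = env.reset()
|       done = False
|       while not done:
|           # Epsilon-greedy action selection
|           if np.random.rand() < eps:
|               a = env.action_space.sample()
|           else:
|               a = int(np.argmax(Q[s]))
|           s2, r, term, trunc, _ = env.step(a)
|           done = term or trunc
|           # Q-Learning update toward the TD target
|           Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
|           s = s2
|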
| I shamelessly stole the title from a hero of mine, Andrej
| Karpathy, and his "Neural Networks: Zero to Hero" [2] work. I
| also meant to work on a series of YouTube videos, but haven't
| had the time yet. If this post gets any interest, I might go
| back to it. Thank you.
|
| P.S.: A friend of mine suggested I post here, so I followed
| their advice: this is my first post, and I hope it properly
| abides by the rules of the community.
|
| [1] https://github.com/alessiodm/drl-zh/blob/main/00_Intro.ipynb
| [2] https://karpathy.ai/zero-to-hero.html
| verdverm wrote:
| very cool, thanks for putting this together
|
| It would be great to see a page dedicated to SoTA techniques &
| results
| alessiodm wrote:
| Thank you so much! And very good advice: I have an extremely
| brief and not very descriptive list in the "Next" notebook,
| initially intended for that. But it definitely falls short.
|
| I may actually expand it in a second, "more advanced" series
| of notebooks, to explore model-based RL, curiosity, and other
| recent topics: even if not comprehensive, some hands-on basic
| coding exercises on those topics might be of interest
| nonetheless.
| tunnuz wrote:
| Does it rely heavily on python, or could someone use a
| different language to go through the material?
| alessiodm wrote:
| Yes, the material relies heavily on Python. I intentionally
| used Python and popular open-source libraries (such as
| Gymnasium for RL environments and PyTorch for deep learning)
| given their ubiquity in the field, so that the content and
| learnings could be readily applicable to real-world projects.
|
| The theory and algorithms per se are general: they can be re-
| implemented in any language, as long as there are comparable
| libraries to use. But the notebooks are primarily in Python,
| and the (attempted) "frictionless" learning experience would
| suffer a bit with a setup in a different language; it will
| likely take a little more effort to follow along.
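|
| For reference, the basic agent-environment loop everything
| builds on looks like this in Gymnasium (a minimal sketch, with
| a random policy standing in for a learned one):
|
|   import gymnasium as gym
|
|   env = gym.make("CartPole-v1")
|   obs, info = env.reset(seed=42)
|   for _ in range(1000):
|       action = env.action_space.sample()  # random policy
|       obs, reward, terminated, truncated, info = env.step(action)
|       if terminated or truncated:
|           obs, info = env.reset()
|   env.close()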
| viraptor wrote:
| In case you want to expand to more chapters one day: there are
| lots of tutorials on doing the simple things that have been
| verified to work, but when I'm struggling it's normally with
| something people barely ever mention - what to do when things
| go wrong. For example, your actions just consistently get
| stuck at maximum. Or the exploration doesn't kick in,
| regardless of how noisy you make the off-policy training.
| Or ...
|
| I wish there were more practical resources for when you've got
| the basics usually working, but suddenly get issues nobody really
| talks about. (beyond "just tweak some stuff until it works"
| anyway)
| alessiodm wrote:
| Thanks a lot, and another great suggestion for improvement. I
| also found that the common advice is "tweak hyperparameters
| until you find the right combination". That can definitely
| help. But usually issues hide in different "corners": the
| problem space and its formulation, the algorithm itself (e.g.,
| different random seeds alone can produce big variance in
| performance), and more.
|
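| As a tiny illustration of the seed point: never trust a single
| run, and evaluate across several seeds first (a sketch, where
| `train_and_evaluate` is a hypothetical stand-in for your own
| training loop):
|
|   import numpy as np
|
|   def train_and_evaluate(seed):
|       # Hypothetical stand-in: train an agent with this seed
|       # and return its mean episode return.
|       rng = np.random.default_rng(seed)
|       return float(rng.normal(200.0, 50.0))
|
|   rets = np.array([train_and_evaluate(s) for s in range(5)])
|   print(f"{rets.mean():.1f} +/- {rets.std():.1f} over 5 seeds")
|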
| As you mentioned, in real applications of DRL things tend to
| go wrong more often than right: "it doesn't work just yet"
| [1]. And my short tutorial definitely lacks in the areas of
| troubleshooting, tuning, and "productionisation". If I carve
| out time for an expansion, this will likely be at the top of
| the list. Thanks again.
|
| [1] https://www.alexirpan.com/2018/02/14/rl-hard.html
| ubj wrote:
| Thanks for sharing [1], that was a great read. I'd be curious
| to see an updated version of that article, since it's about 6
| years old now. For example, Boston Dynamics has transitioned
| from MPC to RL for controlling its Spot robots [2]. Davide
| Scaramuzza, whose team created autonomous FPV drones that
| beat expert human pilots, has also discussed how his team had
| to transition from MPC to RL [3].
|
| [2]: https://bostondynamics.com/blog/starting-on-the-right-
| foot-w...
|
| [3]: https://www.incontrolpodcast.com/1632769/13775734-ep15-d
| avid...
| alessiodm wrote:
| Thank you for the amazing links as well! You are right that
| the article [1] is 6 years old now, and indeed the field
| has evolved. But the algorithms and techniques I share in
| the GitHub repo are the "classic" ones (dating back then
| too), for which that post is still relevant - at least from
| an historical perspective.
|
| You bring up a very good point though: more recent
| advancements and assessments should be linked and/or
| mentioned in the repo (e.g., in the resources and/or an
| appendix). I will try to do that sometime.
| achandra03 wrote:
| This looks really interesting! I tried exploring deep RL myself
| some time ago but could never get my agents to make any
| meaningful progress, and as someone with very little stats/ML
| background it was difficult to debug what was going wrong. Will
| try following this and seeing what happens!
| alessiodm wrote:
| Thank you very much! I'd be really interested to know if your
| agents will eventually make progress, and if these notebooks
| help - even if a tiny bit!
|
| If you just want to see whether these algorithms can work at
| all, feel free to jump into the `solution` folder, pick any
| algorithm you think could work, and just try it out there. If
| it does, then you can have all the fun of rewriting it from
| scratch :) Thanks again!
| barrenko wrote:
| I mean, resources like these are great, but RL in itself is
| quite dense and topic-heavy, so I'm not sure there is any way
| to reduce the inherent difficulty level; any beginner should
| be made aware of that. That's my primary gripe with ML topics
| (especially RL-related ones).
| alessiodm wrote:
| Thank you. It is true: the material does indeed assume some
| prior knowledge (which I mention in the introduction). In
| particular: proficiency in Python, or at least in one
| high-level programming language; familiarity with deep
| learning and neural networks; and - to get into the optional
| theory and mathematics - basic calculus, algebra, statistics,
| and probability theory.
|
| Nonetheless, especially for RL foundations, I found that a
| practical understanding of the algorithms at a basic level -
| writing them yourself and "playing" with them and their
| results (especially in small toy settings like the grid
| world) - provided the best way to start building a basic
| intuition for the field. Hence, this resource :)
| zaptrem wrote:
| I spent three semesters in college learning RL only to be
| massively disappointed in the end after discovering that the
| latest and greatest RL techniques can't even beat a simple
| heuristic in Tetris.
| alessiodm wrote:
| RL can be massively disappointing, indeed. And I agree with
| you (and with the amazing post I already referenced [1]) that
| it is hard to get it to work at all. Sorry to hear you have
| been so disappointed!
|
| Nonetheless, I would personally recommend learning at least
| the basics and fundamentals of RL. Beyond supervised,
| unsupervised, and the most recent and well-deservedly hyped
| self-supervised learning (generative AI, LLMs, and so on),
| reinforcement learning models the learning problem in a very
| elegant way: an agent interacting with an environment and
| getting feedback. That is, arguably, a very intuitive and
| natural way of modeling it. You could consider backward error
| correction / propagation as an implicit reward signal, but
| that would be a very limited view.
|
| On a positive note, RL has very practical, successful
| applications today - even if in niche fields. For example, LLM
| fine-tuning techniques like RLHF successfully apply RL to
| modern AI systems, companies like Covariant are working on
| large robotics models which definitely use RL, and generally
| as a research field I believe (but I may be proven wrong!)
| there is so much more to explore. For example, check out
| Nvidia Eureka, which combines LLMs with RL [2]: pretty cool
| stuff IMHO!
|
| Far from attempting to convince you of the strength and
| capabilities of DRL, I'm just recommending that folks not
| discard it right away, and at least give it a chance by
| learning the basics, even just as an intellectual exercise :)
| Thanks again!
|
| [1] https://www.alexirpan.com/2018/02/14/rl-hard.html
|
| [2] https://blogs.nvidia.com/blog/eureka-robotics-research/
| jmward01 wrote:
| I modeled part of my company's business problem as a MAB
| problem and saved my company 10% of its biggest cost and,
| just as important, showcased an automated truth signal that
| helped us understand what was, and wasn't, working in several
| of our features. Like all tools, finding the right place to
| use RL concepts is a big deal. I think one thing that is often
| missed in a classroom setting is pushing more real-world
| examples of where powerful ideas can be used. Talking about
| optimal policies is great, but if you don't help people
| understand where those ideas can be applied, then it is just a
| bunch of fun math (which is often a good enough reason on its
| own :)
| smokel wrote:
| For those not in the know, "MAB" is short for Multi-Armed
| Bandit [1], which is a decision-making framework that is
| often discussed in the broader context of reinforcement
| learning.
|
| In my limited understanding, MAB problems are simpler than
| those tackled by Deep Reinforcement Learning (DRL), because
| typically there is no state involved in bandit problems.
| However, I have no idea about their scale in practical
| applications, and would love to know more about said business
| problem.
|
| [1] https://en.wikipedia.org/wiki/Multi-armed_bandit
| jmward01 wrote:
| There are often times when you have n possible providers of
| service y, each with strengths and weaknesses. If you have
| some ultimate truth signal (like follow-on costs, which are
| linked to quality - that's what I used), then you can model
| the providers as bandits and use something like UCB1 to
| choose which one to use. If you then apply this to every
| individual customer, what you end up doing is learning the
| optimal vendor for each customer, which gives you higher
| efficiency than if you had picked just one "best all-around"
| vendor for all customers. So the pattern here is: if you have
| n_service_providers and n_customers and a value signal to
| optimize, then maybe MAB is the place to go for some possible
| quick gains. Of course, if you have a huge state space to
| explore instead of just n_service_providers - for instance,
| if you want to model combinations of choices - using
| something like a NN to learn the state-space value function
| is also a great way to go.
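|
| For concreteness, a minimal UCB1 sketch of that provider-
| selection pattern might look like this (illustrative names
| only, not actual production code):
|
|   import math
|   import random
|
|   def observed_cost(provider):
|       # Hypothetical truth signal, standing in for the real
|       # follow-on costs linked to quality.
|       return random.gauss(10.0 + provider, 2.0)
|
|   def ucb1_pick(counts, reward_sums):
|       # Try every provider at least once, then pick the one
|       # maximizing mean reward plus an exploration bonus.
|       for i, c in enumerate(counts):
|           if c == 0:
|               return i
|       n = sum(counts)
|       return max(range(len(counts)),
|                  key=lambda i: reward_sums[i] / counts[i]
|                  + math.sqrt(2 * math.log(n) / counts[i]))
|
|   # One bandit per customer; a single customer shown here.
|   counts, reward_sums = [0, 0, 0], [0.0, 0.0, 0.0]
|   for _ in range(200):
|       i = ucb1_pick(counts, reward_sums)
|       counts[i] += 1
|       reward_sums[i] += -observed_cost(i)  # lower cost = better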
| vineyardlabs wrote:
| RL seems to be in this weird middle ground right now where
| nobody knows how to make it work all that well but almost
| everybody at the top levels of ML research agrees it's a vital
| component of further advances in AI.
| bluishgreen wrote:
| "Shamelessly stole the title from a hero of mine". Your
| Shamelessness is all fine. But at first I thought this is a post
| from Andrej Karpathy. He has one of the best personal brands out
| there on the internet, while personal brands can't be enforced,
| this confused me at first.
| alessiodm wrote:
| TL;DR: If more folks feel this way, please upvote this
| comment: I'll be happy to take down this post, change the
| title, and either re-post it or not - the GitHub repo is out
| there, and that should be more than enough. Sorry again for
| the confusion (I just upvoted it).
|
| I am deeply sorry about the confusion. The last thing I
| intended was to grab any attention away from Andrej, or to be
| confused with him.
|
| I tried to find a way to edit the post title, but I couldn't
| find one. Is there just a limited time window to do that? If
| you know how to do it, I'd be happy to edit it right away in
| case.
|
| I didn't even think this post would get any attention at all -
| it is indeed my first post here, and I really did it just
| because I was happy to share, in case anybody could use this
| project to learn RL.
| khiner wrote:
| Throwing in my vote - I wasn't confused. I saw your GH link
| and a "Zero to Hero" course name on RL; it seems clear to me,
| and "Zero to Hero" is a classic title for a first course. Nice
| that you gave props to Andrej too! Multiple people can and
| should make ML guides and reference each other. Thanks for
| putting in the time to share your learnings and make a
| fantastic resource out of it!
| alessiodm wrote:
| Thanks a lot. It makes me feel better to hear that the post
| is not completely confusing or appropriating - I really
| didn't mean it that way, or to use it as a trick for
| attention.
| FezzikTheGiant wrote:
| This is a great resource nonetheless. Even if you did use the
| name to get attention, how does it matter? I still see it as
| a net positive. Thanks for sharing this.
| alessiodm wrote:
| Thank you!
| gradascent wrote:
| I didn't find it confusing at all. I think it's totally ok to
| re-use phrasing made famous by someone else - this is how
| language evolves after all.
| alessiodm wrote:
| Thank you, I appreciate it.
| ultra_nick wrote:
| Didn't "Zero to Hero" come from Disney's Hercules movie
| before Karpathy used it?
| alessiodm wrote:
| Didn't know that, but now I have an excuse to go watch a
| movie :D
| levocardia wrote:
| This looks great - maybe add a link to the youtube videos in the
| README?
| alessiodm wrote:
| Thank you so much! Unfortunately, that is a mistake in the
| README that I just noticed (thank you for pointing it out!) :(
| As I mentioned in the first post, I didn't get to make the
| YouTube videos yet. But it seems the community would indeed
| be interested.
|
| I will try to get to them (and in the meantime fix the README,
| sorry about that!)
| chaosprint wrote:
| Great resources! Thank you for making this.
|
| I'm attaching here a DRL framework I made for music generation,
| similar to OpenAI Gym. If anyone wants to test the algorithms OP
| includes, you are welcome to use it. Issues and PRs are also
| welcome.
|
| https://github.com/chaosprint/RaveForce
| malomalsky wrote:
| Is there anything like that, but for NLP?
| barrenko wrote:
| There's the series this material references - "Neural
| networks: zero to hero" - which has GPT-related parts.
| alessiodm wrote:
| I took the Deep Learning course [1] by deeplearning.ai in the
| past, and their resources were incredibly good IMHO. Hence, I
| would suggest taking a look at their NLP specialization [2].
|
| +1000 to "Neural networks: zero to hero" already mentioned as
| well.
|
| [1] https://www.deeplearning.ai/courses/deep-learning-
| specializa... [2] https://www.deeplearning.ai/courses/natural-
| language-process...
| spmurrayzzz wrote:
| There is an NLP section in Jeremy Howard's "Practical Deep
| Learning for Coders" course (free):
| https://course.fast.ai/Lessons/lesson4.html
|
| The whole course is fantastic. I recommend it frequently to
| folks who want to start with DL basics and ramp up quickly to
| more advanced material.
| fancyfredbot wrote:
| This is really nice, great idea. I am going to make a suggestion
| which I hope is helpful - I don't mean to be critical of this
| nice project.
|
| After going through the MDP example, I have one comment on the
| way you introduce the non-deterministic transition function.
| In your example the non-determinism comes from the agent
| making "mistakes": it can mistakenly go left or right when
| trying to go up or down:
|
| 1) You could introduce the mistakes more clearly: it isn't
| really explained in the text that the agent makes mistakes,
| so the comment about mistakes in the transition() function is
| initially a bit confusing.
|
| 2) I think the way this introduces non-determinism would be
| more didactic if the non-determinism came from the
| environment, not the agent. For example, the agent might be
| moving on a rough surface, and moving its tracks/limbs/
| whatever might not always produce the intended outcome. As you
| present it, the transition is a function from an action _to a
| random action_ to a random state, whereas the definition is
| just a function from an action to a random state.
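|
| For example, something like this would put the noise in the
| environment rather than the agent (a sketch with made-up
| names and slip probability):
|
|   import random
|
|   SLIP_PROB = 0.2  # made-up chance of the surface slipping
|
|   def perpendicular(action):
|       # The two sideways directions for a (dx, dy) action.
|       dx, dy = action
|       return [(dy, dx), (-dy, -dx)]
|
|   def transition(state, action, width=5, height=5):
|       # The agent always issues the intended action; the rough
|       # surface occasionally turns it into a sideways slip.
|       if random.random() < SLIP_PROB:
|           action = random.choice(perpendicular(action))
|       x, y = state
|       dx, dy = action
|       # Deterministic kinematics, clamped to the grid.
|       return (min(max(x + dx, 0), width - 1),
|               min(max(y + dy, 0), height - 1))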
| dukeofdoom wrote:
| Maybe I can use this in my pygame game
| wegfawefgawefg wrote:
| A few years ago I made something similar. It doesn't go all
| the way to PPO, and has a different style.
|
| https://learndrl.com/
|
| I won't claim it is better or worse, but if anyone here is trying
| to learn, having the same information presented in multiple forms
| is always nice.
| jezzamon wrote:
| Awesome, I've been sort of stuck in the limbo of doing courses
| that taught me some theory but left me missing the hands-on
| knowledge I need to really use RL. This looks like exactly the
| type of course I'm looking for!
| alessiodm wrote:
| Thank you! I'll be curious to hear if / how these notebooks
| help and how your experience goes! Any feedback is welcome!
| mode80 wrote:
| Thanks for making this!
|
| Note: I was carefully reading along and well into the third
| notebook before I realized that the code sections marked "TODO"
| were actual exercises for the reader to implement! (And the tests
| which follow are for the reader to check their work.)
|
| This is a clever approach. It just wasn't obvious to me from the
| outset.
|
| (I thought the TODOs were just some fiddly details you didn't
| want distracting readers from the big picture. But in fact, those
| are the important parts.)
___________________________________________________________________
(page generated 2024-05-06 23:02 UTC)