[HN Gopher] Deep Reinforcement Learning: Zero to Hero
       ___________________________________________________________________
        
       Deep Reinforcement Learning: Zero to Hero
        
       Author : alessiodm
       Score  : 482 points
       Date   : 2024-05-05 23:12 UTC (23 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | alessiodm wrote:
       | While trying to learn the latest in Deep Reinforcement Learning,
       | I was able to take advantage of many excellent resources (see
       | credits [1]), but I couldn't find one that provided the right
        | balance between theory and practice for my own learning
        | experience.
       | So I decided to create something myself, and open-source it for
       | the community, in case it might be useful to someone else.
       | 
       | None of that would have been possible without all the resources
       | listed in [1], but I rewrote all algorithms in this series of
       | Python notebooks from scratch, with a "pedagogical approach" in
       | mind. It is a hands-on step-by-step tutorial about Deep
        | Reinforcement Learning techniques (up to ~2018/2019 SoTA),
        | guiding you through theory and coding exercises on the most
        | widely used algorithms (Q-Learning, DQN, SAC, PPO, etc.).
       | 
       | I shamelessly stole the title from a hero of mine, Andrej
        | Karpathy, and his "Neural Networks: Zero to Hero" [2] work. I
        | also meant to work on a series of YouTube videos, but haven't
        | had the time yet. If this post gets any interest, I might go
        | back to it. Thank you.
       | 
        | P.S.: A friend of mine suggested I post here, so I followed
        | their advice: this is my first post, and I hope it properly
        | abides by the rules of the community.
       | 
       | [1] https://github.com/alessiodm/drl-zh/blob/main/00_Intro.ipynb
       | [2] https://karpathy.ai/zero-to-hero.html
        
         | verdverm wrote:
         | very cool, thanks for putting this together
         | 
         | It would be great to see a page dedicated to SoTA techniques &
         | results
        
           | alessiodm wrote:
            | Thank you so much! And very good advice: I have an extremely
            | brief and not very descriptive list in the "Next" notebook,
            | initially intended for that. But it definitely falls short.
           | 
            | I may actually expand it in a second, "more advanced" series
            | of notebooks to explore model-based RL, curiosity, and other
            | recent topics: even if not comprehensive, some hands-on basic
            | coding exercises on those topics might be of interest
            | nonetheless.
        
         | tunnuz wrote:
         | Does it rely heavily on python, or could someone use a
         | different language to go through the material?
        
           | alessiodm wrote:
           | Yes, the material relies heavily on Python. I intentionally
           | used popular open-source libraries (such as Gymnasium for RL
           | environments, and PyTorch for deep learning) and Python
           | itself given their popularity in the field, so that the
           | content and learnings could be readily applicable to real-
           | world projects.
           | 
            | The theory and algorithms per se are general: they can be
            | re-implemented in any language, as long as comparable
            | libraries are available. But the notebooks are primarily in
            | Python, and the (attempted) "frictionless" learning
            | experience would lose a bit with a setup in a different
            | language; it will likely take a bit more effort to follow
            | along.
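            | 
            | For a sense of what that Python tooling looks like, here is a
            | minimal Gymnasium interaction loop with a random policy (a
            | generic sketch, not code from the notebooks):
            | 
            |     import gymnasium as gym
            |     
            |     # CartPole is a classic toy environment used in many tutorials.
            |     env = gym.make("CartPole-v1")
            |     obs, info = env.reset(seed=42)
            |     total_reward = 0.0
            |     
            |     for _ in range(500):
            |         # A trained agent would query its policy here;
            |         # we just sample a random action instead.
            |         action = env.action_space.sample()
            |         obs, reward, terminated, truncated, info = env.step(action)
            |         total_reward += reward
            |         if terminated or truncated:
            |             obs, info = env.reset()
            |     
            |     env.close()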
        
       | viraptor wrote:
        | In case you want to expand to more chapters one day: there are
        | lots of tutorials on doing the simple things that have been
        | verified to work, but when I'm struggling it's normally with
        | something people barely ever mention - what to do when things go
        | wrong. For example, your actions just consistently get stuck at
        | the maximum. Or exploration doesn't kick in, regardless of how
        | noisy you make the off-policy training. Or ...
       | 
       | I wish there were more practical resources for when you've got
       | the basics usually working, but suddenly get issues nobody really
       | talks about. (beyond "just tweak some stuff until it works"
       | anyway)
        
         | alessiodm wrote:
         | Thanks a lot, and another great suggestion for improvement. I
         | also found that the common advice is "tweak hyperparameters
         | until you find the right combination". That can definitely
          | help. But usually issues hide in different "corners": the
          | problem space and its formulation, the algorithm itself
          | (e.g., different random seeds alone can cause big variance in
          | performance), and more.
         | 
         | As you mentioned, in real applications of DRL things tend to go
         | wrong more often than right: "it doesn't work just yet" [1].
         | And my short tutorial definitely lacks in the area of
          | troubleshooting, tuning, and "productionisation". If I carve
          | out time for an expansion, this will likely make the top of
          | the list. Thanks again.
         | 
         | [1] https://www.alexirpan.com/2018/02/14/rl-hard.html
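          | 
          | As a concrete (hypothetical) example of the seed issue: before
          | tweaking anything else, it can help to run the exact same
          | training code under a handful of seeds and look at the spread,
          | along these lines (`train` is a placeholder for whatever
          | routine you are debugging):
          | 
          |     import numpy as np
          |     
          |     def seed_sweep(train, seeds=(0, 1, 2, 3, 4)):
          |         # `train(seed)` is assumed to return the mean
          |         # episode return of the finished agent.
          |         returns = np.array([train(seed) for seed in seeds])
          |         print(f"mean={returns.mean():.1f} "
          |               f"std={returns.std():.1f} "
          |               f"min={returns.min():.1f} "
          |               f"max={returns.max():.1f}")
          |         return returns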
        
           | ubj wrote:
           | Thanks for sharing [1], that was a great read. I'd be curious
           | to see an updated version of that article, since it's about 6
           | years old now. For example, Boston Dynamics has transitioned
           | from MPC to RL for controlling its Spot robots [2]. Davide
           | Scaramuzza, whose team created autonomous FPV drones that
           | beat expert human pilots, has also discussed how his team had
           | to transition from MPC to RL [3].
           | 
           | [2]: https://bostondynamics.com/blog/starting-on-the-right-
           | foot-w...
           | 
           | [3]: https://www.incontrolpodcast.com/1632769/13775734-ep15-d
           | avid...
        
             | alessiodm wrote:
             | Thank you for the amazing links as well! You are right that
             | the article [1] is 6 years old now, and indeed the field
             | has evolved. But the algorithms and techniques I share in
              | the GitHub repo are the "classic" ones (dating back to
              | that era), for which that post is still relevant - at
              | least from a historical perspective.
             | 
             | You bring up a very good point though: more recent
             | advancements and assessments should be linked and/or
             | mentioned in the repo (e.g., in the resources and/or an
             | appendix). I will try to do that sometime.
        
       | achandra03 wrote:
       | This looks really interesting! I tried exploring deep RL myself
       | some time ago but could never get my agents to make any
       | meaningful progress, and as someone with very little stats/ML
       | background it was difficult to debug what was going wrong. Will
       | try following this and seeing what happens!
        
         | alessiodm wrote:
         | Thank you very much! I'd be really interested to know if your
         | agents will eventually make progress, and if these notebooks
         | help - even if a tiny bit!
         | 
          | If you just want to see whether these algorithms can work at
          | all, feel free to jump to the `solution` folder, pick any
          | algorithm you think could work, and just try it out there. If
          | it does, then you can have all the fun of rewriting it from
          | scratch :) Thanks again!
        
         | barrenko wrote:
          | I mean, resources like these are great, but RL in itself is
          | quite dense and topic-heavy, so I'm not sure there is any way
          | to reduce the inherent difficulty level; any beginner should
          | be made aware of that. That's my primary gripe with ML topics
          | (especially RL-related ones).
        
           | alessiodm wrote:
            | Thank you. It is true: the material does assume some prior
            | knowledge (which I mention in the introduction). In
            | particular: being proficient in Python, or at least in one
            | high-level programming language; being familiar with deep
            | learning and neural networks; and - to get into the
            | optional theory and mathematics - knowing basic calculus,
            | algebra, statistics, and probability theory.
           | 
           | Nonetheless, especially for RL foundations, I found that a
           | practical understanding of the algorithms at a basic level,
           | writing them yourself, and "playing" with them and their
           | results (especially in small toy settings like the grid
           | world) provided the best way to start getting a basic
           | intuition in the field. Hence, this resource :)
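            | 
            | To give a flavor of that "toy setting" level of coding, a
            | bare-bones tabular Q-learning loop (a generic sketch on
            | Gymnasium's FrozenLake grid world, not the notebooks' code)
            | fits in a few lines:
            | 
            |     import numpy as np
            |     import gymnasium as gym
            |     
            |     env = gym.make("FrozenLake-v1", is_slippery=False)
            |     Q = np.zeros((env.observation_space.n, env.action_space.n))
            |     alpha, gamma, epsilon = 0.1, 0.99, 0.1
            |     
            |     for episode in range(5000):
            |         state, _ = env.reset()
            |         done = False
            |         while not done:
            |             # Epsilon-greedy exploration.
            |             if np.random.rand() < epsilon:
            |                 action = env.action_space.sample()
            |             else:
            |                 action = int(np.argmax(Q[state]))
            |             next_state, reward, term, trunc, _ = env.step(action)
            |             done = term or trunc
            |             # Q-learning update toward the bootstrapped target.
            |             best_next = np.max(Q[next_state]) * (not term)
            |             target = reward + gamma * best_next
            |             Q[state, action] += alpha * (target - Q[state, action])
            |             state = next_state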
        
       | zaptrem wrote:
       | I spent three semesters in college learning RL only to be
       | massively disappointed in the end after discovering that the
       | latest and greatest RL techniques can't even beat a simple
       | heuristic in Tetris.
        
         | alessiodm wrote:
         | RL can be massively disappointing, indeed. And I agree with you
         | (and with the amazing post I already referenced [1]) that it is
         | hard to get it to work at all. Sorry to hear you have been
         | disappointed so much!
         | 
          | Nonetheless, I would personally recommend at least learning
          | the basics and fundamentals of RL. Beyond supervised,
          | unsupervised, and the most recent and well-deservedly hyped
          | self-supervised learning (generative AI, LLMs, and so on),
          | reinforcement learning models the learning problem in a very
          | elegant way: an agent interacting with an environment and
          | getting feedback - which is, arguably, a very intuitive and
          | natural way of modeling it. You could consider backward error
          | correction / propagation an implicit reward signal, but that
          | would be a very limited view.
         | 
          | On a positive note, RL has very practical, successful
          | applications today - even if in niche fields. For example,
          | LLM fine-tuning techniques like RLHF successfully apply RL to
          | modern AI systems, companies like Covariant are working on
          | large robotics models which definitely use RL, and generally,
          | as a research field, I believe (though I may be proven wrong!)
          | there is much more to explore. For example, check out Nvidia
          | Eureka, which combines LLMs with RL [2]: pretty cool stuff
          | IMHO!
         | 
          | Far from attempting to convince you of the strengths and
          | capabilities of DRL, I'm just recommending that folks not
          | discard it right away and at least give learning the basics a
          | chance, even just as an intellectual exercise :) Thanks again!
         | 
         | [1] https://www.alexirpan.com/2018/02/14/rl-hard.html
         | 
         | [2] https://blogs.nvidia.com/blog/eureka-robotics-research/
        
         | jmward01 wrote:
          | I modeled part of my company's business problem as a MAB
          | problem and saved the company 10% on its biggest cost and,
          | just as important, showcased an automated truth signal that
          | helped us understand what was, and wasn't, working in several
          | of our features. As with all tools, finding the right place to
          | use RL concepts is a big deal. I think one thing that is often
         | missed in a classroom setting is pushing more real world
         | examples of where powerful ideas can be used. Talking about
         | optimal policies is great, but if you don't help people
         | understand where those ideas can be applied then it is just a
         | bunch of fun math. (which is often a good enough reason on its
         | own :)
        
           | smokel wrote:
           | For those not in the know, "MAB" is short for Multi-Armed
           | Bandit [1], which is a decision-making framework that is
           | often discussed in the broader context of reinforcement
           | learning.
           | 
           | In my limited understanding, MAB problems are simpler than
           | those tackled by Deep Reinforcement Learning (DRL), because
           | typically there is no state involved in bandit problems.
           | However, I have no idea about their scale in practical
           | applications, and would love to know more about said business
           | problem.
           | 
           | [1] https://en.wikipedia.org/wiki/Multi-armed_bandit
        
             | jmward01 wrote:
             | There are often times when you have n possible providers of
             | service y, each with strengths and weaknesses. If you have
              | some ultimate truth signal (like follow-on costs, which
              | are linked to quality - that is what I used), then you can
              | model the providers as bandits and use something like UCB1
              | to choose which to use. If you then apply this to every
              | individual customer, what you end up doing is learning the
              | optimal vendor for each customer, which gives you higher
              | efficiency than if you had picked just one 'best all-
              | around' vendor for all customers. So the pattern here is:
              | if you have n_service_providers and n_customers and a
              | value signal to optimize, then maybe MAB is the place to
              | go for some possible quick gains. Of course, if you have a
              | huge state space to explore instead of just
              | n_service_providers (for instance, if you want to model
              | combinations of choices), using something like an NN to
              | learn the state-space value function is also a great way
              | to go.
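              | 
              | For anyone curious, the core of UCB1 is tiny - roughly
              | something like this (a generic sketch, not my production
              | code; arms = service providers, reward = the value signal):
              | 
              |     import math
              |     
              |     class UCB1:
              |         def __init__(self, n_arms):
              |             self.counts = [0] * n_arms    # pulls per arm
              |             self.values = [0.0] * n_arms  # running mean reward
              |     
              |         def select(self):
              |             # Try every arm once, then pick the best upper
              |             # confidence bound: mean + sqrt(2 ln t / n_i).
              |             for arm, count in enumerate(self.counts):
              |                 if count == 0:
              |                     return arm
              |             t = sum(self.counts)
              |             scores = [v + math.sqrt(2 * math.log(t) / c)
              |                       for v, c in zip(self.values, self.counts)]
              |             return scores.index(max(scores))
              |     
              |         def update(self, arm, reward):
              |             self.counts[arm] += 1
              |             n = self.counts[arm]
              |             self.values[arm] += (reward - self.values[arm]) / n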
        
         | vineyardlabs wrote:
         | RL seems to be in this weird middle ground right now where
         | nobody knows how to make it work all that well but almost
         | everybody at the top levels of ML research agrees it's a vital
         | component of further advances in AI.
        
       | bluishgreen wrote:
       | "Shamelessly stole the title from a hero of mine". Your
       | Shamelessness is all fine. But at first I thought this is a post
       | from Andrej Karpathy. He has one of the best personal brands out
       | there on the internet, while personal brands can't be enforced,
       | this confused me at first.
        
         | alessiodm wrote:
          | TL;DR: If more folks feel this way, please upvote this comment:
          | I'll be happy to take down this post, change the title, and
          | either re-post it or just not - the GitHub repo is out there,
          | and that should be more than enough. Sorry again for the
          | confusion (I just upvoted it).
         | 
          | I am deeply sorry about the confusion. The last thing I
          | intended was to grab any attention away from Andrej, or to be
          | confused with him.
         | 
          | I tried to find a way to edit the post title, but I couldn't
          | find one. Is there just a limited time window to do that? If
          | you know how to do it, I'd be happy to edit it right away.
         | 
          | I didn't even think this post would get any attention at all -
          | it is indeed my first post here, and I really did it just
          | because I was happy to share in case anybody could use this
          | project to learn RL.
        
           | khiner wrote:
            | Throwing in my vote - I wasn't confused. I saw your GH link
            | and a "Zero to Hero" course name on RL; it seems clear to
            | me, and "Zero to Hero" is a classic title for a first
            | course. Nice that you gave props to Andrej too! Multiple
            | people can and should make ML guides and reference each
            | other. Thanks for putting in the time to share your
            | learnings and make a fantastic resource out of it!
        
             | alessiodm wrote:
              | Thanks a lot. It makes me feel better to hear that the post
              | is not completely confusing or appropriating - I really
              | didn't mean it that way, or to use the title as a trick
              | for attention.
        
           | FezzikTheGiant wrote:
              | This is a great resource nonetheless. Even if you did use
              | the name to get attention, how does it matter? I still see
              | it as a net positive. Thanks for sharing this.
        
             | alessiodm wrote:
             | Thank you!
        
           | gradascent wrote:
           | I didn't find it confusing at all. I think it's totally ok to
           | re-use phrasing made famous by someone else - this is how
           | language evolves after all.
        
             | alessiodm wrote:
             | Thank you, I appreciate it.
        
           | ultra_nick wrote:
           | Didn't "Zero to Hero" come from Disney's Hercules movie
           | before Karparthy used it?
        
             | alessiodm wrote:
             | Didn't know that, but now I have an excuse to go watch a
             | movie :D
        
       | levocardia wrote:
        | This looks great - maybe add a link to the YouTube videos in the
        | README?
        
         | alessiodm wrote:
         | Thank you so much! Unfortunately, that is a mistake in the
         | README that I just noticed (thank you for pointing it out!) :(
         | As I mentioned in the first post, I didn't get to make the
          | YouTube videos yet. But it seems the community would indeed be
          | interested.
         | 
         | I will try to get to them (and in the meantime fix the README,
         | sorry about that!)
        
       | chaosprint wrote:
       | Great resources! Thank you for making this.
       | 
       | I'm attaching here a DRL framework I made for music generation,
       | similar to OpenAI Gym. If anyone wants to test the algorithms OP
       | includes, you are welcome to use it. Issues and PRs are also
       | welcome.
       | 
       | https://github.com/chaosprint/RaveForce
        
       | malomalsky wrote:
        | Is there anything like that, but for NLP?
        
         | barrenko wrote:
          | There's the series this material references - "Neural
          | Networks: Zero to Hero" - which has GPT-related parts.
        
         | alessiodm wrote:
          | I took the Deep Learning course [1] by deeplearning.ai in the
          | past, and their resources were incredibly good IMHO. Hence, I
          | would suggest taking a look at their NLP specialization [2].
          | 
          | +1000 to "Neural Networks: Zero to Hero", already mentioned as
          | well.
         | 
         | [1] https://www.deeplearning.ai/courses/deep-learning-
         | specializa... [2] https://www.deeplearning.ai/courses/natural-
         | language-process...
        
         | spmurrayzzz wrote:
         | There is an NLP section in Jeremy Howard's "Practical Deep
         | Learning for Coders" course (free):
         | https://course.fast.ai/Lessons/lesson4.html
         | 
         | The whole course is fantastic. I recommend it frequently to
         | folks who want to start with DL basics and ramp up quickly to
         | more advanced material.
        
       | fancyfredbot wrote:
       | This is really nice, great idea. I am going to make a suggestion
       | which I hope is helpful - I don't mean to be critical of this
       | nice project.
       | 
        | After going through the MDP example, I have one comment on the
        | way you introduce the non-deterministic transition function. In
        | your example the non-determinism comes from the agent making
        | "mistakes": it can mistakenly go left or right when trying to go
        | up or down:
        | 
        | 1) You could introduce the mistakes more clearly, as it isn't
        | really explained in the text that the agent makes mistakes, so
        | the comment about mistakes in the transition() function is
        | initially a bit confusing.
       | 
        | 2) I think the way this introduces non-determinism could be more
        | didactic if the non-determinism came from the environment, not
        | the agent? For example, the agent might be moving on a rough
        | surface, and moving its tracks/limbs/whatever might not always
        | produce the intended outcome. As you present it, the transition
        | is a function from an action _to a random action_ to a random
        | state, whereas the definition is just a function from an action
        | to a random state.
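        | 
        | To illustrate point 2 with a hypothetical sketch (not your
        | code): the same stochasticity can be framed as the surface
        | occasionally pushing the agent somewhere else, so transition()
        | maps a (state, action) pair directly to a random next state:
        | 
        |     import random
        |     
        |     SLIP_PROB = 0.2
        |     MOVES = {"up": (0, 1), "down": (0, -1),
        |              "left": (-1, 0), "right": (1, 0)}
        |     
        |     def move(state, action):
        |         # Deterministic displacement, clamped to a 4x4 grid.
        |         x, y = state
        |         dx, dy = MOVES[action]
        |         return (min(max(x + dx, 0), 3), min(max(y + dy, 0), 3))
        |     
        |     def transition(state, action):
        |         # The environment, not the agent, is the source of
        |         # noise: a rough surface sometimes sends the agent
        |         # off course.
        |         if random.random() < SLIP_PROB:
        |             return move(state, random.choice(list(MOVES)))
        |         return move(state, action)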
        
       | dukeofdoom wrote:
        | Maybe I can use this in my pygame game.
        
       | wegfawefgawefg wrote:
        | A few years ago I made something similar. It doesn't go all the
        | way to PPO, and has a different style.
       | 
       | https://learndrl.com/
       | 
       | I won't claim it is better or worse, but if anyone here is trying
       | to learn, having the same information presented in multiple forms
       | is always nice.
        
       | jezzamon wrote:
        | Awesome, I've been sort of stuck in the limbo of doing courses
        | that taught me some theory but left me missing the hands-on
        | knowledge I need to really use RL. This looks like exactly the
        | type of course I'm looking for!
        
         | alessiodm wrote:
          | Thank you! I'll be curious to hear if / how these notebooks
          | help and how your experience goes! Any feedback is welcome!
        
       | mode80 wrote:
       | Thanks for making this!
       | 
       | Note: I was carefully reading along and well into the third
       | notebook before I realized that the code sections marked "TODO"
       | were actual exercises for the reader to implement! (And the tests
       | which follow are for the reader to check their work.)
       | 
       | This is a clever approach. It just wasn't obvious to me from the
       | outset.
       | 
       | (I thought the TODOs were just some fiddly details you didn't
       | want distracting readers from the big picture. But in fact, those
       | are the important parts.)
        
       ___________________________________________________________________
       (page generated 2024-05-06 23:02 UTC)