[HN Gopher] CS234: Reinforcement Learning Winter 2025
       ___________________________________________________________________
        
       CS234: Reinforcement Learning Winter 2025
        
       Author : jonbaer
       Score  : 190 points
       Date   : 2025-11-26 00:33 UTC (22 hours ago)
        
 (HTM) web link (web.stanford.edu)
 (TXT) w3m dump (web.stanford.edu)
        
       | zerosizedweasle wrote:
       | Given Ilya's podcast this is an interesting title.
        
         | actionfromafar wrote:
         | So, basically AI Winter? :-)
        
           | airspresso wrote:
           | That's how I read it XD "oh no, RL is dead too"
        
         | TNWin wrote:
         | I didn't get the reference. Please elaborate.
        
           | apwell23 wrote:
            | He said RL sucks because it narrowly optimizes to solve a
            | certain set of problems under a certain set of conditions.
            | 
            | He compared it to students who win math competitions but
            | can't do anything practical.
        
           | egl2020 wrote:
           | Karpathy colorfully described RL as "sucking supervision bits
           | through a straw".
        
       | sillysaurusx wrote:
       | It's been said that RL is the worst way to train a model, except
       | for all the others. Many prominent scientists seem to doubt that
       | this is how we'll be training cutting edge models in a decade. I
       | agree, and I encourage you to try to think of alternative
       | paradigms as you go through this course.
       | 
       | If that seems unlikely, remember that image generation didn't
       | take off till diffusion models, and GPTs didn't take off till
       | RLHF. If you've been around long enough it'll seem obvious that
        | this isn't the final step. The challenge for you is to find
        | the one that's better.
        
         | whatshisface wrote:
          | RL is barely even a training method; it's more of a dataset
          | generation method.
        
           | theOGognf wrote:
            | I feel like both this comment and the parent comment
            | highlight how RL has been going through another cycle of
            | misunderstanding lately, driven by its latest popularity
            | boom from being used to train LLMs.
        
             | mistercheph wrote:
             | care to correct the misunderstanding?
        
               | mountainriver wrote:
                | I mean, for one, DPO, PPO, and GRPO all use losses
                | that are different from the one used in SFT.
               | 
               | They also force exploration as a part of the algorithm.
               | 
               | They can be used for synthetic data generation once the
               | reward model is good enough.
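                | 
                | A rough sketch of the contrast, with illustrative
                | names and assuming per-sequence log-probs are already
                | computed (not any specific library's API):
                | 
                |     import torch.nn.functional as F
                | 
                |     def sft_loss(logits, target_ids):
                |         # SFT: plain token-level cross-entropy
                |         # against the reference completion.
                |         return F.cross_entropy(
                |             logits.view(-1, logits.size(-1)),
                |             target_ids.view(-1))
                | 
                |     def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
                |         # DPO: raise the chosen ("w") response above
                |         # the rejected ("l") one, measured relative
                |         # to a frozen reference model.
                |         margin = (pi_w - ref_w) - (pi_l - ref_l)
                |         return -F.logsigmoid(beta * margin).mean()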
        
             | phyalow wrote:
                | It's reductive, but also roughly correct.
        
               | singularity2001 wrote:
                | While collecting data according to a policy is part of
                | RL, 'reductive' is an understatement. It's like saying
                | algebra is all about scalar products. Well yes, 1%.
        
         | paswut wrote:
          | What about combinatorial optimization? When you have a
          | simulation of the world, what other paradigms are fitting?
        
           | whatever1 wrote:
           | More likely we will develop general super intelligent AI
           | before we (together with our super intelligent friends) solve
           | the problem of combinatorial optimization.
        
             | hyperbovine wrote:
              | There's nothing to solve. The curse of dimensionality
              | kills you no matter what. P=NP or maybe quantum computing
              | is the only hope of making serious progress on large-scale
              | combinatorial optimization.
        
         | charcircuit wrote:
         | GPT wouldn't have even been possible, let alone take off,
         | without self supervised learning.
        
           | mountainriver wrote:
           | RLHF is what gave us the ChatGPT moment. Self supervised
           | learning was the base for this.
           | 
           | SSL creates all the connections and RL learns to walk the
           | paths
        
             | charcircuit wrote:
              | The easy-to-use web interface gave us the ChatGPT moment.
              | Take a look at AI Dungeon for GPT-2. It went viral by
              | making GPT-2 accessible.
        
         | PaulRobinson wrote:
         | You're assuming that people are only interested in image and
         | text generation.
         | 
          | RL excels at learning control problems. Under standard
          | assumptions, it is mathematically guaranteed to converge to
          | an optimal policy for the states and controls you give it,
          | given enough runtime. For some problems (playing computer
          | games), that runtime is surprisingly short.
         | 
         | There is a reason self-driving cars use RL, and don't use GPTs.
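          | 
          | A minimal tabular sketch of that convergence idea, assuming
          | a small environment with a hypothetical reset/actions/step
          | interface (not any particular library's API):
          | 
          |     import random
          | 
          |     def q_learning(env, episodes=5000, alpha=0.1,
          |                    gamma=0.99, eps=0.1):
          |         # Tabular Q-learning: with enough exploration and
          |         # runtime, Q approaches the optimal action values.
          |         Q = {}
          |         for _ in range(episodes):
          |             s, done = env.reset(), False
          |             while not done:
          |                 acts = env.actions(s)
          |                 if random.random() < eps:
          |                     a = random.choice(acts)
          |                 else:
          |                     a = max(acts,
          |                             key=lambda x: Q.get((s, x), 0.0))
          |                 s2, r, done = env.step(s, a)
          |                 nxt = 0.0 if done else max(
          |                     Q.get((s2, b), 0.0)
          |                     for b in env.actions(s2))
          |                 q = Q.get((s, a), 0.0)
          |                 Q[(s, a)] = q + alpha * (r + gamma * nxt - q)
          |                 s = s2
          |         return Q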
        
           | srean wrote:
           | You are exactly right.
           | 
           | Control theory and reinforcement learning are different ways
           | of looking at the same problem. They traditionally and
           | culturally focussed on different aspects.
        
           | noobcoder wrote:
            | I have been using it to train an agent on my game,
            | hotlapdaily.
            | 
            | Apparently the AI sets the best time, even better than the
            | pros. It is really useful when it comes to
            | controlled-environment optimization.
        
           | bchasknga wrote:
           | > self-driving cars use RL
           | 
            | Some parts of it, but I would argue with a lot of
            | guardrails in place, and not as commonly as you think. I
            | don't think the majority of the planner/control stack out
            | there in SDCs is RL-based. I also don't think any
            | production SDCs are RL-based.
        
         | rishabhaiover wrote:
         | I like to think of RLHF as a technique that I, as a student,
          | used to apply to score good marks in my exams. As soon as I
          | started working, I realized that out-of-distribution
          | generalization can't be achieved only by practicing in an
          | environment with verifiable rewards.
        
         | poorman wrote:
         | RL is still widely used in the advertising industry. Don't let
         | anyone tell you otherwise. When you have millions to billions
          | of visits and you are trying to optimize an outcome, RL is
          | very good at that. Add in context with contextual multi-armed
         | bandits and you have something very good at driving people
         | towards purchasing.
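          | 
          | A toy sketch of the contextual-bandit part (illustrative
          | only, not any production ad stack): bucket the context, keep
          | running reward estimates per (context, arm), and mostly pick
          | the best-looking arm.
          | 
          |     import random
          |     from collections import defaultdict
          | 
          |     class ContextualBandit:
          |         def __init__(self, arms, eps=0.1):
          |             self.arms, self.eps = arms, eps
          |             self.n = defaultdict(int)      # pull counts
          |             self.mean = defaultdict(float) # avg reward
          | 
          |         def choose(self, context):
          |             # Explore occasionally, otherwise exploit the
          |             # best estimate for this context.
          |             if random.random() < self.eps:
          |                 return random.choice(self.arms)
          |             return max(self.arms,
          |                        key=lambda a: self.mean[(context, a)])
          | 
          |         def update(self, context, arm, reward):
          |             # Incremental mean update; e.g. reward=1 on a
          |             # purchase, 0 otherwise.
          |             k = (context, arm)
          |             self.n[k] += 1
          |             self.mean[k] += (reward - self.mean[k]) / self.n[k]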
        
       | kgarten wrote:
       | Are the videos available somewhere?
       | 
        | The spring course is on YouTube:
       | https://m.youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpT...
        
       | pedrolins wrote:
       | I was excited to check out lecture videos thinking they were
       | public, but quickly saw that they were closed.
       | 
       | One of the things I miss most about the pandemic was how all of
        | these institutions opened up for the world. Lately they have
        | been not only closing down newer course offerings but also
        | making old videos private. Even MIT OCW falls apart once you
        | get into some
       | advanced graduate courses.
       | 
       | I understand that universities should prioritize their alumni,
       | but there's literally no cost in making the underlying material
       | (especially lectures!) available on the internet. It delivers
       | immense value to the world.
        
         | moosedev wrote:
         | 2024 lecture videos are on YouTube:
         | https://youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpTEb...
        
           | storus wrote:
            | Those don't have DPO/GRPO, which arguably made some parts of
           | RL obsolete.
        
             | nafizh wrote:
              | Check out Stanford CS 336; they cover DPO/GRPO and the
              | relevant parts needed to train LLMs.
        
             | upbeat_general wrote:
              | I can assure you that lacking knowledge of DPO (and
              | especially GRPO, which is just stripped-down PPO) is not
              | a dealbreaker.
        
           | rllearner wrote:
           | One of my favorite parts of the 2024 series on Youtube was
           | when Prof B explained her excitement just before introducing
           | UCB algorithms (Lecture 11): "So now we're going to see one
           | of my favorite ideas in the course, which is optimism under
           | uncertainty... I think it's a lovely principle because it
           | shows why it's provably optimal to be optimistic about
           | things. Which is kind of beautiful."
           | 
           | Those moments are the best part of classroom education. When
           | a super knowledgeable person spends a few weeks helping you
           | get to the point where you can finally understand something
           | cool. And you can sense their excitement to tell you about
           | it. I still remember learning Gauss-Bonnet, Stokes Theorem,
           | and the Central Limit Theorem. I think optimism under
           | uncertainty falls in that group.
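            | 
            | The UCB1 rule she introduces there is compact enough to
            | sketch (a paraphrase, not her exact notation): pick the arm
            | whose empirical mean plus an exploration bonus is largest,
            | so rarely-tried arms look optimistically good.
            | 
            |     import math
            | 
            |     def ucb1_pick(counts, means, t):
            |         # Optimism under uncertainty: the bonus shrinks
            |         # as an arm gets pulled more often.
            |         def score(i):
            |             if counts[i] == 0:
            |                 return float("inf")  # try each arm once
            |             return means[i] + math.sqrt(
            |                 2 * math.log(t) / counts[i])
            |         return max(range(len(counts)), key=score)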
        
         | TomasBM wrote:
         | I've seen arguments that opening up fresh material makes it
         | easy for less honest institutions to plagiarize your work. I've
         | even heard professors say they don't want to share their slides
         | or record their lectures, because it's their copyright.
         | 
         | I personally don't like this, because it makes a place more
         | exclusive with legal moats, not genuine prestige. If you're a
         | professor, this also makes your work less known, not more. IMO
          | the only beneficiaries are those who paid a lot to be
         | there, lecturers who don't want to adapt, and university
         | admins.
        
           | outside1234 wrote:
            | I wish we would speed-run this to where these superstar
            | profs open their classes to 20,000 people at a lower price
            | point (but where this yields them more profit).
        
             | ibrahima wrote:
             | That's basically MOOCs, but those kinda fizzled out. It's
             | tough to actually stay focused for a full-length
             | university-level course outside of a university environment
             | IMO, especially if you're working and have a family, etc.
             | 
             | (I mean, I have no idea how Coursera/edX/etc are doing
             | behind the scenes, but it doesn't seem like people talk
             | about them the way they used to ~10 years ago.)
        
               | TomasBM wrote:
               | They're still around and offering new online courses. I
                | hope they don't have any problems staying afloat, because
               | they do offer useful material at the very least.
               | 
               | I agree it's hard, but I think it's because initially the
                | lecturers were involved in the _online community_, which
               | can be tiring and unrewarding even if you don't have
               | other obligations.
               | 
               | I think the courses should have purely standalone
               | material that lecturers can publish, earn extra money,
               | and refresh the content when it makes sense. _Maybe_
               | platform moderators could help with some questions or
                | grading, but it's even easier to have chatbot support
               | for that nowadays. Also, platforms really need to
               | improve.
               | 
               | So, I think the problem with MOOCs has been the
               | execution, not the concept itself.
        
               | geodel wrote:
                | Most MOOCs are venture-funded companies, not lifestyle
                | businesses, so they are not likely to do sensible,
                | user-friendly things. They just need to somehow show
                | investors that hypergrowth will happen. (It doesn't
                | seem like it did, though.)
        
               | geodel wrote:
                | They are mostly used for _professional_ courses:
                | learning Python, Java, GitLab runners, microservices
                | with NodeJS, project management, and things like that.
        
               | ndriscoll wrote:
               | Most of the MOOCs were also watered down versions of a
               | real course to attempt to make them accessible to a
               | larger audience (e.g. the Stanford Coursera Machine
               | Learning course that didn't want to assume any calculus
               | or linear algebra background), which made them into more
               | of a pointless brand advertisement than an actual
               | learning resource.
        
             | TomasBM wrote:
             | I'd definitely support that.
             | 
             | On the flip side, that'd require many professors and other
             | participants in universities to rethink the role of a
             | university degree, which proves to be much more difficult.
        
             | pkoird wrote:
             | Reminds me of something I wrote a year ago
             | https://praveshkoirala.com/2024/11/21/the-democratization-
             | of...
        
           | levocardia wrote:
           | >I've even heard professors say they don't want to share
           | their slides or record their lectures, because it's their
           | copyright.
           | 
           | No, it's because they don't want people to find out they've
           | been reusing the same slide deck since 2004
        
       | storus wrote:
        | RL is extremely brittle; it's often difficult to make it
       | converge. Even Stanford folks admit that. Are there any solutions
       | for this?
        
         | mountainriver wrote:
          | FlowRL is one; it learns the full distribution of rewards
          | rather than just optimizing toward a single maximum.
        
           | storus wrote:
           | Thanks, that looks very promising!
        
       | _giorgio_ wrote:
       | Kindly suggest some books about RL?
       | 
       | I've already studied a lot of deep learning.
       | 
        | Please confirm if these resources are good, or suggest yours:
       | 
        | Sutton & Barto - Reinforcement Learning: An Introduction
       | 
        | Kevin Patrick Murphy - Reinforcement Learning: An Overview
       | https://arxiv.org/abs/2412.05265
       | 
       | Sebastian Raschka (upcoming book)
       | 
       | ...
        
         | i_don_t_know wrote:
          | I believe Kochenderfer et al.'s book "Algorithms for Decision
          | Making" is also about reinforcement learning and related
         | approaches. Free PDFs are available at
         | https://algorithmsbook.com
        
       | mlmonkey wrote:
       | As a "tradional" ML guy who missed out on learning about RL in
       | school, I'm confused about how to use RL in "traditional"
       | problems.
       | 
       | Take, for example, a typical binary classifier with a BCE loss.
       | Suppose I wanted to shoehorn RL onto this: how would I do that?
       | 
       | Or, for example, the House Value problem (given a set of features
       | about a house for sale, predict its expected sale value). How
       | would I slap RL onto that?
       | 
       | I guess my confusion comes from how the losses are hooked up.
        | Traditional losses (BCE, RMSE, etc.) I know about, but how do
        | you bring an RL loss into these problems?
        
         | robrenaud wrote:
         | I just wouldn't.
         | 
          | RL is nice in that it handles messy cases where you don't
          | have per-example labels.
         | 
         | How do you build a learned chess playing bot? Essentially the
         | state of the art is to find a clever way of turning the problem
         | of playing chess into a sequence of supervised learning
         | problems.
        
           | mlmonkey wrote:
           | So IIUC RL is applicable only when the outcome is not
           | immediately available.
           | 
            | Let's say I do have a problem in that setting; say the chess
            | problem, where I have a chess board with the positions of
            | the chess pieces and some features like turn number, my
            | color, and time left on the clock.
           | 
           | Would I train a DNN with these features? Are there some
           | libraries where I can try out some toy problems?
           | 
           | I guess coming from a classical ML background I am quite
           | clueless about RL but want to learn more. I tried reading the
           | Sutton and Barto book, but got lost in the terminology. I'm a
           | more hands-on person.
        
             | jebarker wrote:
             | OpenAI has an excellent interactive course on Deep RL:
             | https://spinningup.openai.com/en/latest/
        
             | egl2020 wrote:
             | The AlphaGo paper might be what you need. It requires some
             | work to understand, but is clearly written. I read it when
             | it came out and was confident enough to give a talk on it.
             | (I don't have the slides any more; I did this when I was at
             | a FAANG and left them behind.)
        
         | egl2020 wrote:
          | Three considerations come into play when deciding whether to
          | use RL: 1) how informative is the loss on each example, 2)
         | can you see how to adjust the model based on the loss signal,
         | and 3) how complex is the feature space?
         | 
         | For the house value problem, you can quantify how far the
         | prediction is from the true value, there are lots of regression
         | models with proven methods of adjusting the model parameters
         | (e.g. gradient descent), and the feature space comprises mostly
         | monotone, weakly interacting features like quality of
         | neighborhood schools and square footage. It's a "traditional"
         | problem and can be solved as well as possible by the
         | traditional methods we know and love. RL is unnecessary, might
         | require more data than you have, and might produce an inferior
         | result.
         | 
         | In contrast, for a sequential decision problem like playing go,
         | the binary won-lost signal doesn't tell us much about how well
         | or poorly the game was played, it's not clear how to improve
         | the strategy, and there are a large number of moves at each
         | turn with no evident ranking. In this setting RL is a difficult
         | but possible approach.
        
         | nonameiguess wrote:
         | RL is a technique for finding an optimal policy for Markov
         | decision processes. If you can define state spaces and action
         | spaces for a sequential decision problem with uncertain
         | outcomes, then reinforcement learning is typically a pretty
         | good way of finding a function mapping states to actions,
         | assuming it isn't a sufficiently small problem that an exact
         | solution exists.
         | 
         | I don't really see why you would want to use it for binary
         | classification or continuous predictive modeling. It's why it
         | excels in game play and operational control. You need to make
          | decisions now that constrain possible decisions in the future,
         | but you cannot know the outcome until that future comes and you
         | cannot attribute causality to the outcome even when you learn
         | what it is. This isn't "hot dog/not a hot dog" that generally
         | has an unambiguously correct answer and the classification
         | itself is directly either correct or incorrect. In RL, a
         | decision made early in a game _probably_ leads causally to a
         | particular outcome somewhere down the line, but the exact
         | extent to which any single action contributes is unknown and
         | probably unknowable in many cases.
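          | 
          | For a concrete picture of the "sufficiently small" case: if
          | you can enumerate states, actions, and transition
          | probabilities, value iteration solves the MDP exactly, and
          | RL is what you reach for when you can't. A sketch assuming a
          | hypothetical transition(s, a) -> [(prob, next_state,
          | reward)] function:
          | 
          |     def value_iteration(states, actions, transition,
          |                         gamma=0.99, tol=1e-6):
          |         # Exact dynamic programming on a fully known MDP;
          |         # the greedy policy w.r.t. the returned values is
          |         # optimal.
          |         V = {s: 0.0 for s in states}
          |         while True:
          |             delta = 0.0
          |             for s in states:
          |                 best = max(
          |                     sum(p * (r + gamma * V[s2])
          |                         for p, s2, r in transition(s, a))
          |                     for a in actions(s))
          |                 delta = max(delta, abs(best - V[s]))
          |                 V[s] = best
          |             if delta < tol:
          |                 return V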
        
       ___________________________________________________________________
       (page generated 2025-11-26 23:01 UTC)