[HN Gopher] CS234: Reinforcement Learning Winter 2025
___________________________________________________________________
CS234: Reinforcement Learning Winter 2025
Author : jonbaer
Score : 190 points
Date : 2025-11-26 00:33 UTC (22 hours ago)
(HTM) web link (web.stanford.edu)
(TXT) w3m dump (web.stanford.edu)
| zerosizedweasle wrote:
| Given Ilya's podcast this is an interesting title.
| actionfromafar wrote:
| So, basically AI Winter? :-)
| airspresso wrote:
| That's how I read it XD "oh no, RL is dead too"
| TNWin wrote:
| I didn't get the reference. Please elaborate.
| apwell23 wrote:
| he said RL sucks because it narrowly optimizes for a certain
| set of problems under a certain set of conditions.
|
| he compared it to students who win math competitions but
| can't do anything practical.
| egl2020 wrote:
| Karpathy colorfully described RL as "sucking supervision bits
| through a straw".
| sillysaurusx wrote:
| It's been said that RL is the worst way to train a model, except
| for all the others. Many prominent scientists seem to doubt that
| this is how we'll be training cutting edge models in a decade. I
| agree, and I encourage you to try to think of alternative
| paradigms as you go through this course.
|
| If that seems unlikely, remember that image generation didn't
| take off till diffusion models, and GPTs didn't take off till
| RLHF. If you've been around long enough it'll seem obvious that
| this isn't the final step. The challenge for you is to find the
| one that's better.
| whatshisface wrote:
| RL is barely even a training method; it's more of a dataset
| generation method.
| theOGognf wrote:
| I feel like both this comment and the parent comment highlight
| how RL has been going through a cycle of misunderstanding
| recently, driven by another of its popularity booms, this time
| from being used to train LLMs.
| mistercheph wrote:
| care to correct the misunderstanding?
| mountainriver wrote:
| I mean, for one, DPO, PPO, and GRPO all use losses that are
| different from the one used in SFT (rough sketch below).
|
| They also force exploration as a part of the algorithm.
|
| They can be used for synthetic data generation once the
| reward model is good enough.
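|
| A rough sketch of that difference (PyTorch, with made-up tensor
| shapes and function names; not any framework's actual API): SFT
| needs a per-token label, while the PPO-style loss only needs
| log-probs and a scalar advantage derived from a reward. GRPO uses
| a similar clipped objective with advantages normalized within a
| group of samples.
|
|     import torch
|     import torch.nn.functional as F
|
|     def sft_loss(logits, target_ids):
|         # Supervised fine-tuning: cross-entropy against reference tokens.
|         # logits: (batch, seq, vocab), target_ids: (batch, seq)
|         return F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
|
|     def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
|         # PPO-style surrogate: reweight sampled sequences by an advantage
|         # estimate and clip the policy ratio to keep updates conservative.
|         ratio = torch.exp(logp_new - logp_old)
|         unclipped = ratio * advantages
|         clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
|         return -torch.min(unclipped, clipped).mean()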
| phyalow wrote:
| It's reductive, but also roughly correct.
| singularity2001 wrote:
| While collecting data according to the policy is part of RL,
| 'reductive' is an understatement. It's like saying algebra is
| all about scalar products. Well yes, that's maybe 1% of it.
| paswut wrote:
| What about combinatorial optimization? When you have a
| simulation of the world, what other paradigms are fitting?
| whatever1 wrote:
| More likely we will develop general super intelligent AI
| before we (together with our super intelligent friends) solve
| the problem of combinatorial optimization.
| hyperbovine wrote:
| There's nothing to solve. The curse of dimensionality kills you
| no matter what. P=NP or maybe quantum computing is the only
| hope of making serious progress on large-scale combinatorial
| optimization.
| charcircuit wrote:
| GPT wouldn't even have been possible, let alone taken off,
| without self-supervised learning.
| mountainriver wrote:
| RLHF is what gave us the ChatGPT moment. Self-supervised
| learning was the base for this.
|
| SSL creates all the connections and RL learns to walk the
| paths
| charcircuit wrote:
| The easy to use web interface gave us the ChatGPT moment.
| Take a look at AI Dungeon for GPT-2. It went viral because it
| made GPT-2 accessible.
| PaulRobinson wrote:
| You're assuming that people are only interested in image and
| text generation.
|
| RL excels at learning control problems. It is mathematically
| guaranteed to converge to an optimal solution for the states
| and controls you provide it, given enough exploration and
| runtime. For some problems (playing computer games), that
| runtime is surprisingly short.
|
| There is a reason self-driving cars use RL, and don't use GPTs.
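|
| As a toy illustration (the chain environment, rewards, and
| hyperparameters below are invented, not from the course), tabular
| Q-learning on a tiny problem converges to the optimal action
| values with enough exploration:
|
|     import random
|
|     N_STATES, ACTIONS = 5, [0, 1]      # action 0 = left, 1 = right
|     GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1
|
|     def step(s, a):
|         # Reward 1 only for reaching the rightmost state, which resets.
|         s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
|         return (0, 1.0) if s2 == N_STATES - 1 else (s2, 0.0)
|
|     Q = [[0.0, 0.0] for _ in range(N_STATES)]
|     s = 0
|     for _ in range(50_000):
|         if random.random() < EPS:
|             a = random.choice(ACTIONS)          # explore
|         else:
|             a = 0 if Q[s][0] >= Q[s][1] else 1  # exploit
|         s2, r = step(s, a)
|         # Q-learning update toward the bootstrapped target.
|         Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
|         s = s2
|
|     print([[round(q, 2) for q in row] for row in Q])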
| srean wrote:
| You are exactly right.
|
| Control theory and reinforcement learning are different ways
| of looking at the same problem. They have traditionally and
| culturally focused on different aspects.
| noobcoder wrote:
| I have been using it to train an agent on my game, hotlapdaily.
|
| Apparently the AI sets the best time, even better than the
| pros. It is really useful when it comes to controlled-
| environment optimizations.
| bchasknga wrote:
| > self-driving cars use RL
|
| Some parts of it, but I would argue with a lot of guardrails
| in place, and it's not as common as you think. I don't think
| the majority of the planner/control stack out there in SDCs is
| RL-based. I also don't think any production SDCs are RL-based.
| rishabhaiover wrote:
| I like to think of RLHF as a technique that I, as a student,
| used to apply to score good marks in my exams. As soon as I
| started working, I realized that out-of-distribution
| generalization can't be achieved only by practicing in an
| environment with verifiable rewards.
| poorman wrote:
| RL is still widely used in the advertising industry. Don't let
| anyone tell you otherwise. When you have millions to billions
| of visits and you are trying to optimize an outcome, RL is
| very good at that. Add in context with contextual multi-armed
| bandits and you have something very good at driving people
| towards purchasing.
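|
| A toy epsilon-greedy contextual bandit sketch (the contexts,
| arms, and click-through rates are all made up) showing the basic
| loop: pick an ad for the current context, observe a reward, and
| update the estimated value of that (context, arm) pair.
|
|     import random
|
|     CONTEXTS = ["mobile", "desktop"]
|     ARMS = ["ad_a", "ad_b"]
|     TRUE_CTR = {("mobile", "ad_a"): 0.04, ("mobile", "ad_b"): 0.02,
|                 ("desktop", "ad_a"): 0.01, ("desktop", "ad_b"): 0.05}
|
|     counts = {(c, a): 0 for c in CONTEXTS for a in ARMS}
|     values = {(c, a): 0.0 for c in CONTEXTS for a in ARMS}
|
|     for _ in range(100_000):
|         ctx = random.choice(CONTEXTS)
|         if random.random() < 0.1:      # explore
|             arm = random.choice(ARMS)
|         else:                          # exploit the current estimate
|             arm = max(ARMS, key=lambda a: values[(ctx, a)])
|         reward = 1.0 if random.random() < TRUE_CTR[(ctx, arm)] else 0.0
|         counts[(ctx, arm)] += 1
|         # Incremental mean update of the estimated click-through rate.
|         values[(ctx, arm)] += (reward - values[(ctx, arm)]) / counts[(ctx, arm)]
|
|     print({k: round(v, 3) for k, v in values.items()})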
| kgarten wrote:
| Are the videos available somewhere?
|
| The spring course is on YouTube:
| https://m.youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpT...
| pedrolins wrote:
| I was excited to check out lecture videos thinking they were
| public, but quickly saw that they were closed.
|
| One of the things I miss most about the pandemic was how all of
| these institutions opened up for the world. Lately they have been
| not only closing off newer course offerings but also making old
| videos private. Even MIT OCW falls apart once you get into some
| advanced graduate courses.
|
| I understand that universities should prioritize their alumni,
| but there's literally no cost in making the underlying material
| (especially lectures!) available on the internet. It delivers
| immense value to the world.
| moosedev wrote:
| 2024 lecture videos are on YouTube:
| https://youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpTEb...
| storus wrote:
| Those don't cover DPO/GRPO, which arguably made some parts of
| RL obsolete.
| nafizh wrote:
| Check out Stanford CS 336; it covers DPO/GRPO and the parts
| relevant to training LLMs.
| upbeat_general wrote:
| I can assure you that lacking knowledge of DPO (and
| especially GRPO, which is just stripped-down PPO) is not a
| dealbreaker.
| rllearner wrote:
| One of my favorite parts of the 2024 series on Youtube was
| when Prof B explained her excitement just before introducing
| UCB algorithms (Lecture 11): "So now we're going to see one
| of my favorite ideas in the course, which is optimism under
| uncertainty... I think it's a lovely principle because it
| shows why it's provably optimal to be optimistic about
| things. Which is kind of beautiful."
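|
| For reference, a minimal UCB1 sketch on a made-up three-armed
| bandit, showing the "optimism under uncertainty" idea of acting
| as if each arm is as good as its confidence bound allows:
|
|     import math, random
|
|     TRUE_MEANS = [0.3, 0.5, 0.7]       # unknown to the algorithm
|     counts = [0, 0, 0]
|     means = [0.0, 0.0, 0.0]
|
|     for t in range(1, 10_001):
|         if 0 in counts:                # pull each arm once first
|             arm = counts.index(0)
|         else:
|             # Estimated mean plus an exploration bonus that shrinks
|             # as an arm is pulled more often.
|             arm = max(range(3), key=lambda a: means[a]
|                       + math.sqrt(2 * math.log(t) / counts[a]))
|         reward = 1.0 if random.random() < TRUE_MEANS[arm] else 0.0
|         counts[arm] += 1
|         means[arm] += (reward - means[arm]) / counts[arm]
|
|     print(counts)                      # pulls concentrate on the best arm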
|
| Those moments are the best part of classroom education. When
| a super knowledgeable person spends a few weeks helping you
| get to the point where you can finally understand something
| cool. And you can sense their excitement to tell you about
| it. I still remember learning Gauss-Bonnet, Stokes Theorem,
| and the Central Limit Theorem. I think optimism under
| uncertainty falls in that group.
| TomasBM wrote:
| I've seen arguments that opening up fresh material makes it
| easy for less honest institutions to plagiarize your work. I've
| even heard professors say they don't want to share their slides
| or record their lectures, because it's their copyright.
|
| I personally don't like this, because it makes a place more
| exclusive through legal moats rather than genuine prestige. If
| you're a professor, this also makes your work less known, not
| more. IMO the only beneficiaries are those who paid a lot to be
| there, lecturers who don't want to adapt, and university
| admins.
| outside1234 wrote:
| I wish we would speed-run this to the point where these
| superstar profs open their classes to 20,000 people at a lower
| price point (but in a way that yields them more profit).
| ibrahima wrote:
| That's basically MOOCs, but those kinda fizzled out. It's
| tough to actually stay focused for a full-length
| university-level course outside of a university environment
| IMO, especially if you're working and have a family, etc.
|
| (I mean, I have no idea how Coursera/edX/etc are doing
| behind the scenes, but it doesn't seem like people talk
| about them the way they used to ~10 years ago.)
| TomasBM wrote:
| They're still around and offering new online courses. I
| hope they don't have any problems keeping afloat, because
| they do offer useful material at the very least.
|
| I agree it's hard, but I think it's because initially the
| lecturers were involved in the _online community_, which
| can be tiring and unrewarding even if you don't have
| other obligations.
|
| I think the courses should have purely standalone
| material that lecturers can publish, earn extra money,
| and refresh the content when it makes sense. _Maybe_
| platform moderators could help with some questions or
| grading, but it's even easier to have chatbot support
| for that nowadays. Also, platforms really need to
| improve.
|
| So, I think the problem with MOOCs has been the
| execution, not the concept itself.
| geodel wrote:
| Most MOOCs are venture-funded companies, not lifestyle
| businesses, so they are unlikely to do sensible, user-friendly
| things. They just need to somehow show investors that
| hypergrowth will happen. (It doesn't seem like that happened,
| though.)
| geodel wrote:
| They are mostly used for _professional_ courses: learning
| Python, Java, GitLab runners, microservices with NodeJS,
| project management, and things like that.
| ndriscoll wrote:
| Most of the MOOCs were also watered down versions of a
| real course to attempt to make them accessible to a
| larger audience (e.g. the Stanford Coursera Machine
| Learning course that didn't want to assume any calculus
| or linear algebra background), which made them into more
| of a pointless brand advertisement than an actual
| learning resource.
| TomasBM wrote:
| I'd definitely support that.
|
| On the flip side, that'd require many professors and other
| participants in universities to rethink the role of a
| university degree, which proves to be much more difficult.
| pkoird wrote:
| Reminds me of something I wrote a year ago
| https://praveshkoirala.com/2024/11/21/the-democratization-
| of...
| levocardia wrote:
| >I've even heard professors say they don't want to share
| their slides or record their lectures, because it's their
| copyright.
|
| No, it's because they don't want people to find out they've
| been reusing the same slide deck since 2004
| storus wrote:
| RL is extremely brittle; it's often difficult to make it
| converge. Even Stanford folks admit that. Are there any solutions
| for this?
| mountainriver wrote:
| FlowRL is one; it learns the full distribution of rewards
| rather than just optimizing toward a single maximum.
| storus wrote:
| Thanks, that looks very promising!
| _giorgio_ wrote:
| Kindly suggest some books about RL?
|
| I've already studied a lot of deep learning.
|
| Please confirm if these resources are good, or suggest yours:
|
| Sutton & Barto - Reinforcement Learning: An Introduction
|
| Kevin Patrick Murphy - Reinforcement Learning: An Overview
| https://arxiv.org/abs/2412.05265
|
| Sebastian Raschka (upcoming book)
|
| ...
| i_don_t_know wrote:
| I believe Kochenderfer et al.'s book "Algorithms for Decision
| Making" is also about reinforcement learning and related
| approaches. Free PDFs are available at
| https://algorithmsbook.com
| mlmonkey wrote:
| As a "tradional" ML guy who missed out on learning about RL in
| school, I'm confused about how to use RL in "traditional"
| problems.
|
| Take, for example, a typical binary classifier with a BCE loss.
| Suppose I wanted to shoehorn RL onto this: how would I do that?
|
| Or, for example, the House Value problem (given a set of features
| about a house for sale, predict its expected sale value). How
| would I slap RL onto that?
|
| I guess my confusion comes from how the losses are hooked up.
| Traditional losses (BCE, RMSE, etc.) I know about, but how do
| you bring an RL loss into these problems?
| robrenaud wrote:
| I just wouldn't.
|
| RL is nice in that it handles messy cases where you don't
| have per-example labels.
|
| How do you build a learned chess playing bot? Essentially the
| state of the art is to find a clever way of turning the problem
| of playing chess into a sequence of supervised learning
| problems.
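|
| A toy version of that idea (tic-tac-toe standing in for chess;
| everything here is a simplification for the sketch): play games
| with the current policy, label every visited position with the
| final outcome, and fit a value table on those (position, outcome)
| pairs, which then feeds back into move selection.
|
|     import random
|
|     LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
|              (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
|
|     def winner(b):
|         for i, j, k in LINES:
|             if b[i] != " " and b[i] == b[j] == b[k]:
|                 return b[i]
|         return None
|
|     V = {}  # position -> estimated value for "X"
|
|     def play_one_game(eps=0.2):
|         b, player, history = [" "] * 9, "X", []
|         while not winner(b) and " " in b:
|             moves = [i for i, c in enumerate(b) if c == " "]
|             if player == "X" and random.random() > eps:
|                 # Greedy w.r.t. the learned value table.
|                 move = max(moves, key=lambda m: V.get(
|                     tuple(b[:m] + ["X"] + b[m + 1:]), 0.0))
|             else:
|                 move = random.choice(moves)  # explore / random opponent
|             b[move] = player
|             history.append(tuple(b))
|             player = "O" if player == "X" else "X"
|         z = {"X": 1.0, "O": -1.0, None: 0.0}[winner(b)]
|         for s in history:          # supervised target: the final outcome
|             old = V.get(s, 0.0)
|             V[s] = old + 0.1 * (z - old)
|         return z
|
|     results = [play_one_game() for _ in range(20_000)]
|     print(sum(results[-1000:]))    # "X" should win most late games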
| mlmonkey wrote:
| So IIUC RL is applicable only when the outcome is not
| immediately available.
|
| Let's say I do have a problem in that setting; say the chess
| problem, where I have a chess board with the positions of
| chess pieces and some features like turn number, my color,
| time left on the clock, etc. are available.
|
| Would I train a DNN with these features? Are there some
| libraries where I can try out some toy problems?
|
| I guess coming from a classical ML background I am quite
| clueless about RL but want to learn more. I tried reading the
| Sutton and Barto book, but got lost in the terminology. I'm a
| more hands-on person.
| jebarker wrote:
| OpenAI has an excellent interactive course on Deep RL:
| https://spinningup.openai.com/en/latest/
| egl2020 wrote:
| The AlphaGo paper might be what you need. It requires some
| work to understand, but is clearly written. I read it when
| it came out and was confident enough to give a talk on it.
| (I don't have the slides any more; I did this when I was at
| a FAANG and left them behind.)
| egl2020 wrote:
| Three considerations come into play when deciding whether to
| use RL: 1) how informative is the loss on each example, 2) can
| you see how to adjust the model based on the loss signal, and
| 3) how complex is the feature space?
|
| For the house value problem, you can quantify how far the
| prediction is from the true value, there are lots of regression
| models with proven methods of adjusting the model parameters
| (e.g. gradient descent), and the feature space comprises mostly
| monotone, weakly interacting features like quality of
| neighborhood schools and square footage. It's a "traditional"
| problem and can be solved as well as possible by the
| traditional methods we know and love. RL is unnecessary, might
| require more data than you have, and might produce an inferior
| result.
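|
| A tiny sketch of that traditional route (made-up features and
| prices, plain linear regression fit by stochastic gradient
| descent, no RL machinery anywhere):
|
|     import random
|
|     # (square_feet, school_score) -> sale price, all invented numbers
|     data = [((1200, 6), 300_000), ((2000, 8), 520_000),
|             ((1500, 5), 340_000), ((2500, 9), 680_000)]
|
|     w, b, lr = [0.0, 0.0], 0.0, 1e-7
|     for _ in range(200_000):
|         x, y = random.choice(data)
|         pred = w[0] * x[0] + w[1] * x[1] + b
|         err = pred - y
|         # Gradient step on the squared error.
|         w[0] -= lr * err * x[0]
|         w[1] -= lr * err * x[1]
|         b -= lr * err
|
|     print(w, b)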
|
| In contrast, for a sequential decision problem like playing go,
| the binary win/loss signal doesn't tell us much about how well
| or poorly the game was played, it's not clear how to improve
| the strategy, and there are a large number of moves at each
| turn with no evident ranking. In this setting RL is a difficult
| but possible approach.
| nonameiguess wrote:
| RL is a technique for finding an optimal policy for Markov
| decision processes. If you can define state spaces and action
| spaces for a sequential decision problem with uncertain
| outcomes, then reinforcement learning is typically a pretty
| good way of finding a function mapping states to actions,
| assuming it isn't a sufficiently small problem that an exact
| solution exists.
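|
| For the small-problem case, here is a minimal value-iteration
| sketch on a made-up three-state MDP, where the optimal policy can
| be computed exactly without any learning:
|
|     GAMMA = 0.9
|     # P[s][a] = list of (probability, next_state, reward) triples
|     # (invented numbers for the sketch).
|     P = {
|         0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
|         1: {"stay": [(1.0, 1, 0.0)], "go": [(0.9, 2, 1.0), (0.1, 0, 0.0)]},
|         2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 0, 0.0)]},
|     }
|
|     def q(s, a, V):
|         # Expected return of taking action a in state s, then following V.
|         return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
|
|     V = {s: 0.0 for s in P}
|     for _ in range(200):           # Bellman optimality backups
|         V = {s: max(q(s, a, V) for a in P[s]) for s in P}
|
|     policy = {s: max(P[s], key=lambda a: q(s, a, V)) for s in P}
|     print(V, policy)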
|
| I don't really see why you would want to use it for binary
| classification or continuous predictive modeling. The reason it
| excels in game play and operational control is that you need to
| make decisions now that constrain possible decisions in the
| future, but you cannot know the outcome until that future comes,
| and you cannot attribute causality to the outcome even when you
| learn what it is. This isn't "hot dog / not a hot dog," where
| there is generally an unambiguously correct answer and the
| classification itself is directly either correct or incorrect.
| In RL, a
| decision made early in a game _probably_ leads causally to a
| particular outcome somewhere down the line, but the exact
| extent to which any single action contributes is unknown and
| probably unknowable in many cases.
___________________________________________________________________
(page generated 2025-11-26 23:01 UTC)