[HN Gopher] Q-learning is not yet scalable
___________________________________________________________________
Q-learning is not yet scalable
Author : jxmorris12
Score : 205 points
Date : 2025-06-15 00:56 UTC (22 hours ago)
(HTM) web link (seohong.me)
(TXT) w3m dump (seohong.me)
| whatshisface wrote:
| The benefit of off-policy learning is fundamentally limited by
| the fact that data from ineffective early exploration isn't that
| useful for improving on later more refined policies. It's clear
| if you think of a few examples: chess blunders, spasmodic
| movement, or failing to solve a puzzle. This becomes especially
| clear once you realize that data only becomes off-policy when it
| describes something the policy would not do. I think the solution
| to this problem is (unfortunately) related to the need for better
| generalization / sample efficiency.
| getnormality wrote:
| Doesn't this claim prove too much? What about the cited dog that
| learned to walk in 20 minutes with off-policy learning? Or are
| you making a more nuanced point?
| AndrewKemendo wrote:
| Q-learning isn't scalable because of the stream barrier; streaming
| DRL (TD-Lambda), however, is scalable:
|
| https://arxiv.org/abs/2410.14606
|
| Note that this is from Turing Award winner Richard Sutton's lab at
| the University of Alberta.
|
| RL works
| getnormality wrote:
| But does this address scaling to long-horizon tasks, which is
| what the article is about?
| AndrewKemendo wrote:
| Yes, because it manages long-term reward by default.
| s-mon wrote:
| While I like the blog post, I think the use of unexplained
| acronyms undermines its opportunity to be useful to a wider
| audience. Small nit: make sure acronyms and jargon are explained.
| keremk wrote:
| For these kinds of blog posts, where the content is very good but
| not very approachable because it assumes extensive prior
| knowledge, I find an AI tool very useful for explaining and
| simplifying. I just used the new Dia browser for this, and it
| worked really well for me. Or you can use your favorite model
| provider and copy and paste. This way the post stays concise, and
| yet you can use your AI tools to ask questions and clarify.
| anonthrowawy wrote:
| I actually think that's what made it crisp.
| levocardia wrote:
| It's clearly written for an audience of other RL researchers,
| given that the conclusion is "will someone please come up with
| Q-learning methods that scale!"
| andy_xor_andrew wrote:
| The magic thing about off-policy techniques such as Q-learning is
| that they will converge on an optimal result even if they only
| ever see sub-optimal training data.
|
| For example, you can take a dataset of chess games from agents
| that move totally randomly (with no strategy at all), use it as
| input for Q-learning, and it will still converge on an optimal
| policy (albeit more slowly than if you had higher-quality
| inputs).
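|
| A minimal tabular sketch of that property (the toy chain
| environment and hyperparameters below are made up for
| illustration): transitions come from a uniformly random behavior
| policy, yet the max in the target bootstraps from the greedy
| action, so Q still converges to the optimal values.
|
|     import random
|
|     # Toy 5-state chain: actions 0 (left) / 1 (right); reaching
|     # state 4 gives reward 1 and ends the episode.
|     N, GAMMA, ALPHA = 5, 0.9, 0.1
|
|     def step(s, a):
|         s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
|         return s2, (1.0 if s2 == N - 1 else 0.0), s2 == N - 1
|
|     Q = [[0.0, 0.0] for _ in range(N)]
|     for _ in range(20000):
|         s, a = random.randrange(N - 1), random.randrange(2)
|         s2, r, done = step(s, a)        # random, suboptimal data
|         target = r + (0.0 if done else GAMMA * max(Q[s2]))
|         Q[s][a] += ALPHA * (target - Q[s][a])
|
|     # Greedy policy recovered from random play: always move right.
|     print([max((0, 1), key=lambda a: Q[s][a]) for s in range(N - 1)])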
| Ericson2314 wrote:
| I would think this being true is the definition of the task
| being "ergodic" (distorting that term slightly, maybe). But I
| would also expect non-ergodic tasks to exist.
| andy_xor_andrew wrote:
| The article mentions that AlphaGo/AlphaZero/MuZero were not based
| on Q-learning - I'm no expert, but I thought AlphaGo was based on
| DeepMind's "Deep Q-Learning"? Is that not right?
| energy123 wrote:
| DeepMind's earlier success with Atari was based on deep
| Q-learning (DQN), an off-policy method.
| lalaland1125 wrote:
| This blog post is unfortunately missing what I consider the
| bigger reason why Q learning is not scalable:
|
| As horizon increases, the number of possible states (usually)
| increases exponentially. This means you require exponentially
| more data to have any hope of training a Q-function that can
| handle those states.
|
| This is less of an issue for on-policy learning, because only
| near-policy states matter, and on-policy learning explicitly
| samples only those states. So even though there are exponentially
| many possible states, your training data is laser-focused on the
| important ones.
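|
| A rough back-of-the-envelope sketch of that gap (the branching
| factor and data budget below are made-up numbers):
|
|     # With branching factor b, roughly b**H states are reachable
|     # within horizon H; N on-policy episodes only touch about
|     # N * H states, all clustered near the current policy.
|     b, H, N = 10, 50, 1_000_000
|     print(f"reachable: {b**H:.1e}, visited: {N*H:.1e}")
|     # reachable: 1.0e+50, visited: 5.0e+07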
| Ericson2314 wrote:
| https://news.ycombinator.com/item?id=44280505 I think that thread
| might help?
|
| Total layman here, but maybe some tasks are "uniform" despite
| being "deep" in such a way that poor samples still suffice? I
| would call those "ergodic" tasks. But surely there are other
| tasks where this is not the case?
| lalaland1125 wrote:
| Good clarification. I have edited my post accordingly.
|
| There are situations where states increase at much slower
| rates than exponential.
|
| Those situations are a good fit for Q learning.
| arthurcolle wrote:
| Feel the Majorana-1
| elchananHaas wrote:
| I think the article's analysis of overestimation bias is correct.
| The issue is that, due to the max operator in Q-learning, noise
| is amplified over timesteps. Some methods to reduce this bias,
| such as Double Q-learning (https://arxiv.org/abs/1509.06461),
| were successful in improving RL agents' performance. Studies have
| found that this happens even more for states that the network
| hasn't visited many times.
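|
| A quick way to see the effect (synthetic noise only; the true
| value of every action here is 0): the standard max-based target
| is biased upward, while the Double Q-learning trick of selecting
| with one estimate and evaluating with an independent one mostly
| removes it.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     qa = rng.normal(size=(100_000, 10))  # noisy estimates, net A
|     qb = rng.normal(size=(100_000, 10))  # independent copy, net B
|
|     standard = qa.max(axis=1)                  # max over A
|     sel = qa.argmax(axis=1)                    # select with A
|     double = qb[np.arange(len(sel)), sel]      # evaluate with B
|
|     print(standard.mean())  # ~1.54: biased above the true value 0
|     print(double.mean())    # ~0.00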
|
| An exponential number of states only matters if there is no
| pattern to them. If there is some structure that the network can
| learn, then it can perform well. This is a strength of deep
| learning, not a weakness. The trick is getting the right training
| objective, which the article claims Q-learning does not provide.
|
| I do wonder if MuZero and other model-based RL systems are the
| solution to the author's concerns. MuZero can reanalyze prior
| trajectories to improve training efficiency. Monte Carlo Tree
| Search (MCTS) is a principled way to perform horizon reduction by
| unrolling the model multiple steps. The max operator in MCTS
| could cause similar issues, but the search progressing deeper
| counteracts this.
| jhrmnn wrote:
| Is this essentially the same difference as between vanilla
| regular-grid and importance-sampling Monte Carlo integration?
| briandw wrote:
| This paper assumes that you know quite a bit about RL already. If
| you really want to dig into RL, this intro course from David
| Silver (DeepMind) is excellent:
| https://youtu.be/2pWv7GOvuf0?si=CmFJHNnNqraL5i0s
| ArtRichards wrote:
| Thank you for this link.
| Onavo wrote:
| Q learning is great as a hello world RL project for teaching
| undergraduates.
| msgodel wrote:
| I feel like the null TD algorithm is much better if you want a
| "hello world."
| paraschopra wrote:
| Humans actually do both. We learn on-policy by exploring the
| consequences of our own behavior. But we also learn off-policy,
| say from expert demonstrations (the difference being that we can
| tell good behaviors from bad, and learn from a filtered list of
| what we consider good behaviors). In most off-policy RL, a lot of
| behaviors are bad, yet they get into the training set, leading to
| slower training.
| taneq wrote:
| > difference being we can tell good behaviors from bad
|
| Not always! That's what makes some expert demonstrations so
| fascinating, watching someone do something "completely wrong"
| (according to novice level 'best practice') and achieve
| superior results. Of course, sometimes this just means that you
| can get away with using that kind of technique (or making that
| kind of blunder) if you're _just that good_.
| GistNoesis wrote:
| The stated problem is getting off-policy RL to work, i.e.
| discovering a policy smarter than the one it was shown in its
| dataset.
|
| If I understand correctly, they show random play, and expect
| perfect play to emerge from the naive Q-learning training
| objective.
|
| In layman's terms, they expect the algorithm to observe random
| smashing of keys on a piano and produce a full-fledged symphony.
|
| The main reason it doesn't work is that it's fundamentally
| out-of-distribution training.
|
| Neural networks work best in interpolation mode. When you get
| into out-of-distribution mode, aka extrapolation mode, you have
| to rely on some additional regularization.
|
| One such regularization you can add is trying to predict the next
| observations, building an internal model whose features help make
| the decision for the next action. Another regularization may be
| to unroll multiple actions in a row in your head and use the
| prediction as a training signal. But all these strategies are no
| longer in the domain of the "model-free" RL they are trying to
| do.
|
| Another regularization can be making the decision function
| smoother, often by reducing the number of parameters (which goes
| against the idea of scaling).
|
| The adage is "no plan survives first contact with the enemy".
| There needs to be some form of exploration. You must somehow
| learn about the areas of the environment where you need to
| operate. Without interaction with the environment, one way to do
| this is to "grok" a simple model of the environment (fitting all
| observations perfectly, by searching for such a model, so as to
| build a perfect simulator) and learn on-policy from that
| simulation.
|
| Alternatively, if you already have some not-so-bad demonstrations
| in your training dataset, you can get it to work a little better
| than the policy of the dataset. That's why it seems promising,
| but it really isn't, because it's just relying on all the various
| facets of the complexity already present in the dataset.
|
| If you allow iterative phases of gathering information from the
| environment, interleaved with off-policy training, you're in the
| well-known domain of Bayesian methods for efficient exploration
| of the space, like kriging, Gaussian process regression,
| multi-armed bandits, and energy-based modeling, which let you
| trade more compute for sample efficiency.
|
| The principle is that you try to model what you know and don't
| know about the environment. There is a trade-off between the
| uncertainty you have because you have not explored an area of the
| space yet and the uncertainty due to the model not yet fitting
| the observations perfectly. You force yourself to explore unknown
| areas so as not to have regrets (Thompson sampling), but still
| sample promising regions of the space.
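|
| A minimal sketch of that explore/exploit principle on the
| simplest possible case, a Bernoulli bandit (the arm probabilities
| are made up): each arm keeps a Beta posterior and we act on a
| sample from the posteriors, so uncertain arms keep getting tried
| while the best arm gradually dominates.
|
|     import random
|
|     probs = [0.3, 0.5, 0.7]        # true (unknown) success rates
|     alpha, beta = [1, 1, 1], [1, 1, 1]   # Beta(1, 1) priors
|     pulls = [0, 0, 0]
|
|     for _ in range(5000):
|         draws = [random.betavariate(alpha[i], beta[i])
|                  for i in range(3)]
|         arm = max(range(3), key=lambda i: draws[i])
|         reward = 1 if random.random() < probs[arm] else 0
|         alpha[arm] += reward
|         beta[arm] += 1 - reward
|         pulls[arm] += 1
|
|     print(pulls)   # most pulls end up on the 0.7 arm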
|
| In contrast to on-policy learning, this "Bayesian exploration
| learning" learns all possible policies in an off-policy fashion.
| Your robot doesn't only learn to go from A to B in the fastest
| way. Instead, it explicitly tries to learn various locomotion
| policies, like trotting, galloping, and other gaits, and uses
| them to go from A to B, but spends more time perfecting galloping
| since galloping seems to be faster than trotting.
|
| You can possibly also learn adaptive strategies, like they do in
| sim-to-real experiments, where your learned policy depends on
| unknown parameters such as how much weight your robot carries,
| and the policy estimates these unknown parameters on the fly to
| become more robust (i.e., filling in the missing parameters to
| let the optimal "Model Predictive Control" work).
| mrfox321 wrote:
| Read the paper.
|
| They control for the data being in-distribution.
|
| Their dataset also has examples of the problem being solved.
| itkovian_ wrote:
| Completely agree, and I think it's a great summary. To summarize
| very succinctly: you're chasing a moving target, where the target
| changes based on how you move. There's no ground truth to zero in
| on in value-based RL. You minimise a difference in which both
| sides of the equation contain your APPROXIMATION.
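|
| A minimal sketch of that circularity, with Q as a plain table (in
| deep RL the right-hand side usually uses a frozen copy of the
| same network, but it is still the approximation chasing itself):
|
|     def td_loss(Q, transition, gamma=0.99):
|         """Squared TD error for one (s, a, r, s2, done) tuple."""
|         s, a, r, s2, done = transition
|         # The "label" is built from the same Q we are fitting:
|         target = r if done else r + gamma * max(Q[s2])
|         return (Q[s][a] - target) ** 2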
|
| I don't think it's hopeless, though. I actually think RL is very
| close to working, because what it lacked this whole time was a
| reliable world model / forward dynamics function (with one, you
| don't have to explore, you can plan). And now we've got that.
| toootooo wrote:
| How can we eliminate Q-learning's bias in long-horizon, off-
| policy tasks?....
| Straw wrote:
| This post mischaracterizes AlphaZero/MuZero.
|
| AlphaZero/MuZero are not model-based, and aren't on-policy
| either. They train at a significantly higher temperature, and
| thus with intentionally suboptimal play, than they use when
| playing to win. LeelaChessZero has further improvements to reduce
| the bias from training on suboptimal play.
|
| There's a well-known tradeoff in TD learning based on how many
| steps ahead you look: 1-step TD converges off-policy, but can
| give you total nonsense/high bias while your Q function isn't
| trained. Many-step can't give you nonsense because it scores
| based on the real result, but that real result came from
| suboptimal play, so your own off-policyness biases the results,
| plus it's higher variance. It's not hard to adjust between these
| two in AlphaZero training as you progress, to minimize overall
| bias/variance. (As in, AlphaZero can easily do this; I'm not
| saying that tuning the schedule of how to do it is easy!)
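|
| A sketch of the two targets being traded off (r and s_next are
| the rewards and successor states stored along one trajectory;
| the names are illustrative):
|
|     def one_step_target(r, s_next, Q, gamma=0.99):
|         # Bootstraps immediately: fine off-policy, but mostly
|         # noise while Q is still untrained.
|         return r[0] + gamma * max(Q[s_next[0]])
|
|     def n_step_target(r, s_next, Q, n, gamma=0.99):
|         # Leans on rewards that actually happened: less dependent
|         # on Q, but those rewards came from the behavior policy,
|         # so off-policyness biases it, and variance is higher.
|         g = sum(gamma**k * r[k] for k in range(n))
|         return g + gamma**n * max(Q[s_next[n - 1]])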
| bee_rider wrote:
| It is the weekend, so let's anthropomorphize.
|
| The idea of sub-optimal play to increase learning is
| interesting. We can notice the human social phenomenon of being
| annoyed at players who play games "too mechanically" or
| boringly, and admiration of players (even in an overly studied
| game like chess) with an "intuitive" style.
|
| I wonder how AI training strategies would change if the number of
| games they were allowed to play were fairly limited, say to the
| few thousand matches of a game that a person might play over the
| course of their life. And perhaps if their "rank" were evaluated
| over the course of their training, like it is for humans.
| Straw wrote:
| It's simpler than that: if you always play what you believe is
| optimal, you will never explore strategies that may in fact
| perform better.
| pixelpoet wrote:
| Probably a dumb/obvious question: if the bias comes from
| selecting the Q-maximum, can't this simply be replaced by
| sampling from a PDF?
| abeppu wrote:
| I think this means you're no longer optimizing for the right
| thing. My understanding is that max is important because your
| value of taking action A from state S includes the presumption
| that you'll pick the best available action under the policy
| from each successive state you visit. If you swap in some PDF,
| you're asking what's the value of taking action A if I then act
| by random sampling from that distribution thereafter.
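|
| A sketch of the difference (softmax sampling is just one example
| of a PDF you could swap in):
|
|     import math, random
|
|     def max_backup(r, q_next, gamma=0.99):
|         # Q-learning: "act greedily from here on". The max is
|         # what makes this an optimal-policy value, and also the
|         # source of the overestimation bias.
|         return r + gamma * max(q_next)
|
|     def sampled_backup(r, q_next, gamma=0.99, temp=1.0):
|         # Sampling an action from a softmax over q_next instead:
|         # this estimates the value of following that softmax
|         # policy thereafter, a different (generally weaker)
|         # quantity than the optimal value.
|         z = [math.exp(q / temp) for q in q_next]
|         probs = [w / sum(z) for w in z]
|         a = random.choices(range(len(q_next)), weights=probs)[0]
|         return r + gamma * q_next[a]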
| abeppu wrote:
| The thing that worked for them was reducing the horizon. From my
| limited and dated understanding, I thought that's what the gamma
| term was for: you exponentially discount the value of stuff in
| the future to the point of being negligible (or smaller than the
| epsilon of difference you can even represent). So... why/when is
| exponential discounting not enough?
| apophis-ren wrote:
| It's mentioned in the article. But for really, really long-
| horizon tasks, it's reasonable that you wouldn't want a small
| discount factor.
|
| For example, if you have really sparse rewards in a long-horizon
| task (say, a reward appears 1000 timesteps after the action),
| then even a discount factor of 0.99 won't capture that:
| 0.99^1000 ≈ 4e-5.
|
| Essentially, if your discount factor is too small for an
| environment, it will be near impossible to learn certain credit
| assignments.
| isaacimagine wrote:
| No mention of Decision Transformers or Trajectory Transformers?
| Both are offline approaches that tend to do very well at long-
| horizon tasks, as they bypass the credit assignment problem by
| virtue of having an attention mechanism.
|
| Most RL researchers consider these approaches not to be "real
| RL", as they can't assign credit outside the context window, and
| therefore can't learn infinite-horizon tasks. With 1m+ context
| windows, perhaps this is less of an issue in practice? Curious to
| hear thoughts.
|
| DT: https://arxiv.org/abs/2106.01345
|
| TT: https://arxiv.org/abs/2106.02039
| highd wrote:
| TFP cites decision transformers. Just using a transformer does
| not bypass the credit assignment problem. Transformers are an
| architecture for solving sequence modeling problems, e.g. the
| credit assignment problem as arises in RL. There have been many
| other such architectures.
|
| The hardness of the credit assignment problem is a statement
| about data sparsity. Architecture choices do not "bypass" it.
| isaacimagine wrote:
| TFP: https://arxiv.org/abs/2506.04168
|
| The DT citation [10] is used on a single line, in a paragraph
| listing prior work, as an "and more". Another paper that uses
| DTs [53] is also cited in a similar way. The authors do not
| test or discuss DTs.
|
| > hardness of the credit assignment ... data sparsity.
|
| That is true, but not the point I'm making. "Bypassing credit
| assignment", in the context of long-horizon task modeling, is
| a statement about using attention to allocate long-horizon
| reward without horizon-reducing discount, not architecture
| choice.
|
| To expand: if I have an environment with a key that unlocks a
| door thousands of steps later, Q-Learning may not propagate
| the reward signal from opening the door to the moment of
| picking up the key, because of the discount of future reward
| terms over a long horizon. A decision transformer, however,
| can attend to the moment of picking up the key while opening
| the door, which bypasses the problem of establishing this
| long-horizon causal connection.
|
| (Of course, attention cannot assign reward if the moment the
| key was picked up is beyond the extent of the context
| window.)
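|
| A sketch of what that conditioning looks like in practice (a toy
| helper, not from the DT paper's code): each timestep is tagged
| with its undiscounted return-to-go, so a reward earned a thousand
| steps after the key still shows up at full magnitude in the token
| attached to the key-pickup step, provided it fits in the context
| window.
|
|     def to_dt_tokens(states, actions, rewards):
|         rtg, tokens = 0.0, []
|         for s, a, r in zip(states[::-1], actions[::-1],
|                            rewards[::-1]):
|             rtg += r                   # no gamma**1000 shrinkage
|             tokens.append((rtg, s, a))
|         return list(reversed(tokens))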
| highd wrote:
| You can do Q-Learning with a transformer. You simply define
| the state space as the observation sequence. This is in
| fact natural to do in partially observed settings. So your
| distinction does not make sense.
| isaacimagine wrote:
| The distinction is DT's return-to-go conditioning vs. QL's
| discounted Bellman backup, not the choice of architecture for the
| policy. You could also do DTs with RNNs (though those have their
| own problems with memory).
| Apologies if we're talking past one another.
| marcosdumay wrote:
| I guess it's worth pointing out that when humans learn long-
| horizon tasks by repetitive training, we segment them into
| shorter-horizon tasks and compose them hierarchically later.
| BoiledCabbage wrote:
| It does (naively, I'll admit) seem like the problem is one of
| approach more than algorithm.
|
| Yes, the model may not be able to tackle long-horizon tasks from
| scratch, but it could learn some shorter-horizon skills first,
| then tackle a longer horizon by leveraging groupings of those
| smaller skills. Chunking, like we all do.
|
| Nobody learns how to fly a commercial airplane cross-country as a
| sequence of micro hand and arm movements. We learn to pick up a
| ball that way when young, but learning to fly or play a sport
| consists of a hierarchy of learned skills and plans.
| MichaelRazum wrote:
| TL;DR, just use PPO? I always found it kind of confusing that on
| paper SAC and other algorithms seem to be much more sample-
| efficient, but in practice it looks, as the author mentioned,
| like they often do not work.
|
| PS: Not sure where to put algorithms like TD-MPC or DreamerV3,
| since they are kind of in between, right?
___________________________________________________________________
(page generated 2025-06-15 23:01 UTC)