[HN Gopher] Q-learning is not yet scalable
       ___________________________________________________________________
        
       Q-learning is not yet scalable
        
       Author : jxmorris12
       Score  : 205 points
       Date   : 2025-06-15 00:56 UTC (22 hours ago)
        
 (HTM) web link (seohong.me)
 (TXT) w3m dump (seohong.me)
        
       | whatshisface wrote:
       | The benefit of off-policy learning is fundamentally limited by
       | the fact that data from ineffective early exploration isn't that
       | useful for improving on later more refined policies. It's clear
       | if you think of a few examples: chess blunders, spasmodic
       | movement, or failing to solve a puzzle. This becomes especially
       | clear once you realize that data only becomes off-policy when it
       | describes something the policy would not do. I think the solution
       | to this problem is (unfortunately) related to the need for better
       | generalization / sample efficiency.
        
         | getnormality wrote:
         | Doesn't this claim prove too much? What about the cited dog
         | that walked in 20 minutes with off-policy learning? Or are you
         | making a more nuanced point?
        
       | AndrewKemendo wrote:
        | Q-learning isn't scalable because of the stream barrier;
        | however, streaming DRL (TD-Lambda) is scalable:
       | 
       | https://arxiv.org/abs/2410.14606
       | 
        | Note that this is from Turing Award winner Richard Sutton's lab
        | at the University of Alberta.
       | 
       | RL works
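        | 
        | For concreteness, here is a minimal sketch of what "streaming"
        | means in that setting: tabular TD(lambda) with eligibility
        | traces, updating online from a single stream of experience with
        | no replay buffer. The env.reset()/env.step() interface and the
        | hyperparameters are assumptions for illustration, not taken
        | from the paper.
        | 
        |   import collections
        | 
        |   def td_lambda(env, policy, gamma=0.99, lam=0.9,
        |                 alpha=0.1, steps=100_000):
        |       V = collections.defaultdict(float)  # value estimates
        |       z = collections.defaultdict(float)  # eligibility traces
        |       s = env.reset()
        |       for _ in range(steps):
        |           s2, r, done = env.step(policy(s))
        |           # TD error from the single latest transition
        |           delta = r + (0.0 if done else gamma * V[s2]) - V[s]
        |           z[s] += 1.0  # accumulating trace for current state
        |           for x in list(z):  # credit flows back along trace
        |               V[x] += alpha * delta * z[x]
        |               z[x] *= gamma * lam
        |           if done:
        |               z.clear()
        |               s = env.reset()
        |           else:
        |               s = s2
        |       return V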
        
         | getnormality wrote:
         | But does this address scaling to long-horizon tasks, which is
         | what the article is about?
        
           | AndrewKemendo wrote:
            | Yes, because it manages long-term reward by default.
        
       | s-mon wrote:
        | While I like the blog post, I think the use of unexplained
        | acronyms undermines its chance of being useful to a wider
        | audience. Small nit: make sure acronyms and jargon are
        | explained.
        
         | keremk wrote:
          | For these kinds of blog posts, where the content is very good
          | but not very approachable due to the assumption of extensive
          | prior knowledge, I find an AI tool very useful for explaining
          | and simplifying. I just used the new browser Dia for this,
          | and it worked really well for me. Or you can use your
          | favorite model provider and copy and paste. This way the post
          | stays concise, yet you can still use your AI tools to ask
          | questions and get clarification.
        
         | anonthrowawy wrote:
          | I actually think that's what made it crisp.
        
         | levocardia wrote:
          | It's clearly written for an audience of other RL researchers,
          | given that the conclusion is "will someone please come up
          | with Q-learning methods that scale!"
        
       | andy_xor_andrew wrote:
       | the magic thing about off-policy techniques such as Q-Learning is
       | that they will converge on an optimal result even if they only
       | ever see sub-optimal training data.
       | 
       | For example, you can use a dataset of chess games from agents
       | that move totally randomly (with no strategy at all) and use that
       | as an input for Q-Learning, and it will still converge on an
       | optimal policy (albeit more slowly than if you had more high-
       | quality inputs)
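        | 
        | A minimal tabular sketch of that property (the chain
        | environment and hyperparameters are illustrative assumptions):
        | the behavior policy is uniformly random, yet the max backup
        | still recovers the optimal "always move right" policy.
        | 
        |   import random
        | 
        |   N, GAMMA, ALPHA = 6, 0.9, 0.1     # states 0..5, goal at 5
        |   ACTIONS = [-1, +1]                # move left / right
        | 
        |   def step(s, a):
        |       s2 = min(max(s + a, 0), N - 1)
        |       done = s2 == N - 1
        |       return s2, (1.0 if done else 0.0), done
        | 
        |   Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
        |   for _ in range(5000):             # purely random episodes
        |       s, done = 0, False
        |       while not done:
        |           a = random.choice(ACTIONS)   # random behavior
        |           s2, r, done = step(s, a)
        |           best = max(Q[(s2, b)] for b in ACTIONS)
        |           target = r + (0.0 if done else GAMMA * best)
        |           Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        |           s = s2
        | 
        |   # greedy policy learned from random data: always +1
        |   print({s: max(ACTIONS, key=lambda a: Q[(s, a)])
        |          for s in range(N)})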
        
         | Ericson2314 wrote:
         | I would think this being true is the definition of the task
         | being "ergodic" (distorting that term slightly, maybe). But I
         | would also expect non-ergodic tasks to exist.
        
       | andy_xor_andrew wrote:
       | The article mentions AlphaGo/Mu/Zero was not based on Q-Learning
       | - I'm no expert but I thought AlphaGo was based on DeepMind's
       | "Deep Q-Learning"? Is that not right?
        
         | energy123 wrote:
          | DeepMind's earlier success with Atari was based on off-policy
          | deep Q-learning (DQN).
        
       | lalaland1125 wrote:
        | This blog post is unfortunately missing what I consider the
        | bigger reason why Q-learning is not scalable:
        | 
        | As the horizon increases, the number of possible states
        | (usually) increases exponentially. This means you require
        | exponentially increasing data to have any hope of training a Q
        | function that can handle those states.
        | 
        | This is less of an issue for on-policy learning, because only
        | near-policy states are important, and on-policy learning
        | explicitly samples only those states. So even though there are
        | exponentially many possible states, your training data is
        | laser-focused on the important ones.
        
         | Ericson2314 wrote:
          | https://news.ycombinator.com/item?id=44280505 I think that
          | thread might help?
         | 
         | Total layman here, but maybe some tasks are "uniform" despite
         | being "deep" in such a way that poor samples still suffice? I
         | would call those "ergodic" tasks. But surely there are other
         | tasks where this is not the case?
        
           | lalaland1125 wrote:
           | Good clarification. I have edited my post accordingly.
           | 
           | There are situations where states increase at much slower
           | rates than exponential.
           | 
           | Those situations are a good fit for Q learning.
        
         | arthurcolle wrote:
         | Feel the Majorana-1
        
         | elchananHaas wrote:
          | I think the article's analysis of overestimation bias is
          | correct. The issue is that, due to the max operator in
          | Q-learning, noise is amplified over timesteps. Some methods
          | to reduce this bias, such as Double DQN
          | (https://arxiv.org/abs/1509.06461), were successful in
          | improving RL agents' performance. Studies have found that
          | this happens even more for states that the network hasn't
          | visited many times.
          | 
          | An exponential number of states only matters if there is no
          | pattern to them. If there is some structure that the network
          | can learn, then it can perform well. This is a strength of
          | deep learning, not a weakness. The trick is getting the right
          | training objective, which the article claims Q-learning
          | doesn't provide.
         | 
         | I do wonder if MuZero and other model based RL systems are the
         | solution to the author's concerns. MuZero can reanalyze prior
         | trajectories to improve training efficiency. The Monte Carlo
         | Tree Search (MCTS) is a principled way to perform horizon
         | reduction by unrolling the model multiple steps. The max
         | operator in MCTS could cause similar issues but the search
         | progressing deeper counteracts this.
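          | 
          | A sketch of the two targets in question (per-action Q-value
          | arrays; names are assumptions): the standard target both
          | selects and evaluates with the same estimates, so noise is
          | picked up by the max, while the Double DQN target from the
          | linked paper selects the action with the online network and
          | evaluates it with the target network.
          | 
          |   import numpy as np
          | 
          |   def dqn_target(r, q_target_next, gamma):
          |       # max over noisy estimates -> upward bias
          |       return r + gamma * np.max(q_target_next)
          | 
          |   def double_dqn_target(r, q_online_next, q_target_next,
          |                         gamma):
          |       a = int(np.argmax(q_online_next))   # select online
          |       return r + gamma * q_target_next[a]  # evaluate target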
        
         | jhrmnn wrote:
         | Is this essentially the same difference as between vanilla
         | regular-grid and importance-sampling Monte Carlo integration?
        
       | briandw wrote:
        | This paper assumes that you know quite a bit about RL already.
       | If you really want to dig into RL, this intro course from David
       | Silver (Deep Mind) is excellent:
       | https://youtu.be/2pWv7GOvuf0?si=CmFJHNnNqraL5i0s
        
         | ArtRichards wrote:
         | Thank you for this link.
        
       | Onavo wrote:
       | Q learning is great as a hello world RL project for teaching
       | undergraduates.
        
         | msgodel wrote:
         | I feel like the null TD algorithm is much better if you want a
         | "hello world."
        
       | paraschopra wrote:
        | Humans actually do both. We learn on-policy by exploring the
        | consequences of our own behavior. But we also learn off-policy,
        | say from expert demonstrations (the difference being that we
        | can tell good behaviors from bad, and learn from a filtered
        | list of what we consider good behaviors). In most off-policy
        | RL, a lot of behaviors are bad, yet they still end up in the
        | training set, which leads to slower training.
        
         | taneq wrote:
         | > difference being we can tell good behaviors from bad
         | 
         | Not always! That's what makes some expert demonstrations so
         | fascinating, watching someone do something "completely wrong"
         | (according to novice level 'best practice') and achieve
         | superior results. Of course, sometimes this just means that you
         | can get away with using that kind of technique (or making that
         | kind of blunder) if you're _just that good_.
        
       | GistNoesis wrote:
        | The stated problem is getting off-policy RL to work, i.e.
        | discovering a policy smarter than the one it was shown in its
        | dataset.
       | 
       | If I understand correctly, they show random play, and expect
       | perfect play to emerge from the naive Q-learning training
       | objective.
       | 
        | In layman's terms, they expect the algorithm to observe random
        | smashing of keys on a piano, and produce a full-fledged
        | symphony.
       | 
        | The main reason it doesn't work is that it's fundamentally
        | out-of-distribution training.
       | 
        | Neural networks work best in interpolation mode. When you get
        | into out-of-distribution mode, aka extrapolation mode, you have
        | to rely on some additional regularization.
       | 
        | One such regularization is to try to predict the next
        | observations and build an internal model whose features help
        | make the decision for the next action. Another may be to unroll
        | multiple actions in a row in your head and use the prediction
        | as a training signal. But all these strategies are no longer in
        | the domain of the "model-free" RL they are trying to do.
       | 
        | Yet another regularization is making the decision function
        | smoother, often by reducing the number of parameters (which
        | goes against the idea of scaling).
       | 
        | The adage is "no plan survives first contact with the enemy".
        | There needs to be some form of exploration. You must somehow
        | learn about the areas of the environment where you need to
        | operate. Without interaction with the environment, one way to
        | do this is to "grok" a simple model of the environment
        | (searching for a model that fits all observations perfectly, so
        | as to build a perfect simulator) and learn on-policy from this
        | simulation.
       | 
        | Alternatively, if you already have some not-so-bad
        | demonstrations in your training dataset, you can get it to work
        | a little better than the dataset's policy. That's why it seems
        | promising, but it really isn't, because it's just relying on
        | the various facets of the complexity already present in the
        | dataset.
       | 
        | If you allow iterative phases of gathering information from the
        | environment, interleaved with off-policy training, you're in
        | the well-known domain of Bayesian methods for efficient
        | exploration of the space, like kriging, Gaussian process
        | regression, multi-armed bandits and energy-based modeling,
        | which let you trade more compute for sample efficiency.
       | 
        | The principle is that you try to model what you know and don't
        | know about the environment. There is a trade-off between the
        | uncertainty you have because you haven't explored an area of
        | the space yet and the uncertainty you have because the model
        | doesn't fit the observations perfectly yet. You force yourself
        | to explore unknown areas so as not to have regrets (Thompson
        | sampling), but still sample promising regions of the space.
       | 
        | In contrast to on-policy learning, this "Bayesian exploration
        | learning" learns all possible policies in an off-policy
        | fashion. Your robot doesn't only learn to go from A to B in the
        | fastest way. Instead it explicitly tries to learn various
        | locomotion policies, like trotting, galloping and other gaits,
        | and uses them to go from A to B, but spends more time
        | perfecting galloping once galloping seems faster than trotting.
       | 
        | Possibly you can also learn adaptive strategies, as in sim-to-
        | real experiments where the learned policy depends on unknown
        | parameters, like how much weight the robot carries, and
        | estimates those unknown parameters on the fly to become more
        | robust (i.e. filling in the missing parameters to let optimal
        | "Model Predictive Control" work).
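        | 
        | As a toy illustration of that explore/exploit bookkeeping, here
        | is Thompson sampling on a Bernoulli bandit (the true success
        | rates are made up for the example): each arm keeps a Beta
        | posterior, and you play the arm whose sampled success rate is
        | highest, which naturally shifts effort toward the "galloping"
        | arm once it looks better.
        | 
        |   import random
        | 
        |   true_p = [0.3, 0.5, 0.7]        # unknown to the agent
        |   a = [1.0] * len(true_p)         # Beta(alpha, beta) per arm
        |   b = [1.0] * len(true_p)
        | 
        |   for _ in range(2000):
        |       # sample a plausible value for each arm, play the best
        |       draws = [random.betavariate(a[i], b[i])
        |                for i in range(len(true_p))]
        |       arm = draws.index(max(draws))
        |       reward = 1 if random.random() < true_p[arm] else 0
        |       a[arm] += reward
        |       b[arm] += 1 - reward
        | 
        |   # posterior means; the best arm ends up played the most
        |   print([x / (x + y) for x, y in zip(a, b)])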
        
         | mrfox321 wrote:
         | Read the paper.
         | 
          | They control for the data being in-distribution.
         | 
         | Their dataset also has examples of the problem being solved.
        
       | itkovian_ wrote:
        | Completely agree and think it's a great summary. To summarize
        | very succinctly: you're chasing a moving target, where the
        | target changes based on how you move. There's no ground truth
        | to zero in on in value-based RL. You minimise a difference in
        | which both sides of the equation contain your APPROXIMATION.
       | 
        | I don't think it's hopeless, though. I actually think RL is
        | very close to working, because what it lacked this whole time
        | was a reliable world model/forward dynamics function (with one,
        | you don't have to explore, you can plan). And now we've got
        | that.
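        | 
        | A one-function sketch of that point (q stands in for the
        | learned approximator; the names are assumptions): the
        | regression target itself is computed from q, so it moves every
        | time the parameters move.
        | 
        |   def td_error(q, s, a, r, s_next, gamma, actions):
        |       # target uses the same approximation being trained
        |       target = r + gamma * max(q(s_next, b) for b in actions)
        |       return target - q(s, a)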
        
       | toootooo wrote:
       | How can we eliminate Q-learning's bias in long-horizon, off-
       | policy tasks?....
        
       | Straw wrote:
       | This post mischaracterizes AlphaZero/MuZero.
       | 
        | AlphaZero/MuZero are not model-based, and aren't on-policy
        | either. They train at a significantly higher temperature, and
        | thus with intentionally suboptimal play, than they use when
        | playing to win. LeelaChessZero has further improvements to
        | reduce the bias from training on suboptimal play.
       | 
        | There's a well-known tradeoff in TD learning based on how many
        | steps ahead you look: 1-step TD converges off-policy, but can
        | give you total nonsense/high bias when your Q function isn't
        | trained. Many-step can't give you nonsense because it scores
        | based on the real result, but that real result came from
        | suboptimal play, so your own off-policyness biases the results,
        | plus it's higher variance. It's not hard to adjust between
        | these two during AlphaZero training as you progress, to
        | minimize overall bias/variance. (As in, AlphaZero can easily do
        | this; I'm not saying the tuning of the schedule for how to do
        | it is easy!)
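        | 
        | A sketch of that knob (the reward/value lists and indexing are
        | assumptions): the n-step target mixes n real rewards with a
        | bootstrapped tail, so n=1 leans entirely on the (possibly
        | untrained) value estimate, while large n is scored mostly by
        | real, but off-policy and higher-variance, outcomes.
        | 
        |   def n_step_target(rewards, values, t, n, gamma):
        |       # rewards[k]: reward after step k; values[k]: V(s_k)
        |       G = sum((gamma ** k) * rewards[t + k]
        |               for k in range(n))
        |       return G + (gamma ** n) * values[t + n]  # bootstrap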
        
         | bee_rider wrote:
         | It is the weekend, so let's anthropomorphize.
         | 
         | The idea of sub-optimal play to increase learning is
         | interesting. We can notice the human social phenomenon of being
         | annoyed at players who play games "too mechanically" or
         | boringly, and admiration of players (even in an overly studied
         | game like chess) with an "intuitive" style.
         | 
         | I wonder how AI training strategies would change if the number
         | of games they were allowed to play was fairly limited, to the
         | handful of thousands of matches of a game that a person might
         | play over the course of their lives. And perhaps if their
         | "rank" was evaluated over the course of their training, like it
         | is for humans.
        
           | Straw wrote:
            | It's simpler than that: if you always play what you believe
            | is optimal, you will never explore strategies that may in
            | fact perform better.
        
       | pixelpoet wrote:
       | Probably a dumb/obvious question: if the bias comes from
       | selecting the Q-maximum, can't this simply be replaced by
       | sampling from a PDF?
        
         | abeppu wrote:
          | I think this means you're no longer optimizing for the right
         | thing. My understanding is that max is important because your
         | value of taking action A from state S includes the presumption
         | that you'll pick the best available action under the policy
         | from each successive state you visit. If you swap in some PDF,
         | you're asking what's the value of taking action A if I then act
         | by random sampling from that distribution thereafter.
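          | 
          | A small sketch of the difference (per-action Q-value lists;
          | names are assumptions): swapping the max for an expectation
          | under, say, a softmax distribution gives you something like
          | Expected SARSA, which estimates the value of following that
          | distribution rather than of acting greedily afterwards.
          | 
          |   import math
          | 
          |   def greedy_backup(r, q_next, gamma):
          |       return r + gamma * max(q_next)
          | 
          |   def softmax_backup(r, q_next, gamma, temp=1.0):
          |       e = [math.exp(q / temp) for q in q_next]
          |       p = [x / sum(e) for x in e]
          |       return r + gamma * sum(pi * q
          |                              for pi, q in zip(p, q_next))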
        
       | abeppu wrote:
       | The thing that worked for them was reducing the horizon. From my
       | limited and dated understanding, I thought that's what the gamma
        | term was for: you exponentially discount the value of stuff in
       | the future to the point of being negligible (or less than the
       | epsilon of differences you can even represent). So ... why/when
       | is exponential discounting not enough?
        
         | apophis-ren wrote:
          | It's mentioned in the article. But for really, really long-
          | horizon tasks, it's reasonable not to want a small discount
          | factor.
         | 
         | For example, if you have really sparse rewards in a long-
         | horizon task (say, a reward appears 1000 timesteps after the
         | action), then a discount factor of even 0.99 won't help to
          | capture that: 0.99^1000 ≈ 4e-5.
         | 
         | Essentially, if your discount factor is too small for an
         | environment, it will be near impossible to learn certain credit
         | assignments.
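          | 
          | A quick check of those numbers (pure arithmetic, nothing from
          | the article beyond the 0.99 example):
          | 
          |   for gamma in (0.99, 0.999, 0.9999):
          |       print(gamma, gamma ** 1000)
          |   # 0.99 -> ~4.3e-05, 0.999 -> ~0.37, 0.9999 -> ~0.90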
        
       | isaacimagine wrote:
       | No mention of Decision Transformers or Trajectory Transformers?
       | Both are offline approaches that tend to do very well at long-
       | horizon tasks, as they bypass the credit assignment problem by
       | virtue of having an attention mechanism.
       | 
       | Most RL researchers consider these approaches not to be "real
       | RL", as they can't assign credit outside the context window, and
       | therefore can't learn infinite-horizon tasks. With 1m+ context
       | windows, perhaps this is less of an issue in practice? Curious to
       | hear thoughts.
       | 
       | DT: https://arxiv.org/abs/2106.01345
       | 
       | TT: https://arxiv.org/abs/2106.02039
        
         | highd wrote:
         | TFP cites decision transformers. Just using a transformer does
         | not bypass the credit assignment problem. Transformers are an
         | architecture for solving sequence modeling problems, e.g. the
         | credit assignment problem as arises in RL. There have been many
         | other such architectures.
         | 
         | The hardness of the credit assignment problem is a statement
         | about data sparsity. Architecture choices do not "bypass" it.
        
           | isaacimagine wrote:
           | TFP: https://arxiv.org/abs/2506.04168
           | 
           | The DT citation [10] is used on a single line, in a paragraph
           | listing prior work, as an "and more". Another paper that uses
           | DTs [53] is also cited in a similar way. The authors do not
           | test or discuss DTs.
           | 
           | > hardness of the credit assignment ... data sparsity.
           | 
           | That is true, but not the point I'm making. "Bypassing credit
           | assignment", in the context of long-horizon task modeling, is
           | a statement about using attention to allocate long-horizon
           | reward without horizon-reducing discount, not architecture
           | choice.
           | 
           | To expand: if I have an environment with a key that unlocks a
           | door thousands of steps later, Q-Learning may not propagate
           | the reward signal from opening the door to the moment of
           | picking up the key, because of the discount of future reward
           | terms over a long horizon. A decision transformer, however,
           | can attend to the moment of picking up the key while opening
           | the door, which bypasses the problem of establishing this
           | long-horizon causal connection.
           | 
           | (Of course, attention cannot assign reward if the moment the
           | key was picked up is beyond the extent of the context
           | window.)
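            | 
            | A sketch of the return-to-go conditioning DTs train on (the
            | reward list is a made-up example): each timestep is paired
            | with the undiscounted sum of future rewards, so the door
            | reward reaches the key-pickup step at full strength, as
            | long as it fits in the context window.
            | 
            |   def returns_to_go(rewards):
            |       rtg, total = [], 0.0
            |       for r in reversed(rewards):
            |           total += r       # no gamma**k decay
            |           rtg.append(total)
            |       return list(reversed(rtg))
            | 
            |   # key at t=0, door (reward 1) at t=3:
            |   print(returns_to_go([0.0, 0.0, 0.0, 1.0]))
            |   # -> [1.0, 1.0, 1.0, 1.0]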
        
             | highd wrote:
             | You can do Q-Learning with a transformer. You simply define
             | the state space as the observation sequence. This is in
             | fact natural to do in partially observed settings. So your
             | distinction does not make sense.
        
               | isaacimagine wrote:
                | The distinction is DT's return-to-go conditioning vs.
                | Q-learning's discounted Bellman backup, not the choice
                | of architecture for the policy. You could also do DTs
                | with RNNs (though those have their own problems with
                | memory).
                | 
                | Apologies if we're talking past one another.
        
       | marcosdumay wrote:
        | I guess it's worth pointing out that when humans learn long-
        | horizon tasks by repetitive training, we segment them into
        | tasks with a shorter horizon and compose them hierarchically
        | later.
        
         | BoiledCabbage wrote:
          | It does (naively, I'll admit) seem like the problem is one of
          | approach more than algorithm.
          | 
          | Yes, the model may not be able to tackle long-horizon tasks
          | from scratch, but it can learn some shorter-horizon skills
          | first, then handle a longer horizon by leveraging groupings
          | of those smaller skills. Chunking, like we all do.
          | 
          | Nobody learns how to fly a commercial airplane cross-country
          | as a sequence of micro hand and arm movements. We learn to
          | pick up a ball that way when young, but learning to fly or to
          | play a sport consists of a hierarchy of learned skills and
          | plans.
        
       | MichaelRazum wrote:
        | TL;DR: just use PPO? I always found it kind of confusing that
        | on paper SAC and other algorithms seem to be much more sample-
        | efficient, but in practice it looks, as the author mentioned,
        | like they often do not work.
       | 
       | PS: Not sure where to put algorithms like TD-MPC or DreamerV3,
       | since they are kind of in between, right?
        
       ___________________________________________________________________
       (page generated 2025-06-15 23:01 UTC)