[HN Gopher] Bitter Lesson is about AI agents
       ___________________________________________________________________
        
       Bitter Lesson is about AI agents
        
       Author : ankit219
       Score  : 79 points
       Date   : 2025-03-23 09:16 UTC (13 hours ago)
        
 (HTM) web link (ankitmaloo.com)
 (TXT) w3m dump (ankitmaloo.com)
        
       | dtagames wrote:
        | Good stuff but the original "Bitter Lesson" article has the real
        | meat, which is that by applying more compute power we get better
        | results (just more accurate token predictions, really) than with
        | human guardrails.
        
       | gpapilion wrote:
        | "More" generally beats "better". That's the continual lesson
        | from data-intensive workloads: more compute, more data, more
        | bandwidth.
        | 
        | The part I've been scratching my head at is whether we'll see a
        | retreat from aspects of this due to the high costs involved. For
        | CPU-based workloads it was a workable approach, since prices
        | kept falling. GPUs have generally scaled in price with available
        | FLOPS, and the current hardware approach amounts to pouring in
        | power to achieve better results.
        
       | lsy wrote:
       | Going back to the original "Bitter Lesson" article, I think the
       | analogy to chess computers could be instructive here. A lot of
       | institutional resources were spent trying to achieve "superhuman"
       | chess performance, it was achieved, and today almost the entire
       | TAM for computer chess is covered by good-enough Stockfish, while
       | most of the money tied up in chess is in matching human players
       | with each other across the world, and playing against computers
       | is sort of what you do when you're learning, or don't have an
       | internet connection, or you're embarrassed about your skill and
       | don't want to get trash-talked by an Estonian teenager.
       | 
       | The "Second Bitter Lesson" of AI might be that "just because
       | massive amounts of compute make something _possible_ doesn 't
       | mean that there will be a commensurately massive market to
       | justify that compute".
       | 
       | "Bitter Lesson" I think also underplays the amount of energy and
       | structure and design that has to go into compute-intensive
       | systems to make them succeed: Deep Blue and current engines like
       | Stockfish take advantage of tablebases of opening and closing
       | positions that are more like GOFAI than deep tree search. And the
       | current crop of LLMs are not only taking advantage of expanded
       | compute, but of the hard-won ability of companies in the 21st
       | century to not only build and resource massive server farms, but
       | mobilize armies of contractors in low-COL areas to hand-train
       | models into usefulness.
        
         | diego_sandoval wrote:
         | The main useful outcome we get from chess is entertainment.
         | 
          | The entertainment value of a Human vs. Human match is higher
          | than that of a Human vs. AI match, at least for spectators.
          | 
          | But many sectors of the economy don't gain much from the work
          | being done by humans. I don't care if my car was made entirely
          | by humans or entirely by robots, as long as it's the best car
          | I can get for the money.
         | 
         | I think you're extrapolating a bit too much from the specific
         | case of chess.
        
         | ip26 wrote:
         | It's not really _about_ how the compute-intensive resources
         | come to bear. You can draw a parallel to Moore's law. Node
         | advancement is one of the most expensive and cutting edge
         | efforts by humanity today. But it's also simultaneously true
         | that software companies have succeeded or failed by betting for
         | or against computers getting faster. There are famous examples
         | of companies in the 80's that designed software that was simply
         | not usable on the computers on hand when the project began, but
         | was incredible on the (much faster) computers of launch day.
         | 
         | The bitter lesson is very similar. In essence, when building on
         | top of AI models, bet on the AI models getting _much_ faster
         | and more capable.
        
           | immibis wrote:
            | And there is software today that is simply not usable on
            | computers today, but will be incredible on computers in 20
            | years' time if clock speeds continue doubling every 2 years.
            | 
            | Most of it is written in Electron.
        
       | serjester wrote:
        | This misses that if the agent occasionally goes haywire, the
        | user leaves and never comes back. AI deployments are about
        | managing expectations - you're much better off with an agent
        | that's 80 +/- 10% successful than 90 +/- 40%. The more you lean
        | into full automation, the more guardrails you give up and the
        | more variance your system has. This is a real problem.
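        | 
        | A rough way to see why (purely illustrative; treating the per-
        | session success rate as roughly normal is an assumption, not
        | data):
        | 
        |   import random
        | 
        |   def frac_below(mean, sd, floor=0.6, n=100_000):
        |       # Fraction of sessions whose success rate falls below a
        |       # user-tolerance floor.
        |       bad = sum(1 for _ in range(n)
        |                 if random.gauss(mean, sd) < floor)
        |       return bad / n
        | 
        |   print(frac_below(0.80, 0.10))  # ~0.02, few sessions feel broken
        |   print(frac_below(0.90, 0.40))  # ~0.23, many sessions feel broken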
        
         | ed wrote:
         | Do you have a real world example of this? Claude Code for
         | example doesn't fit the pattern of "higher success but more
         | variance." If anything the variance is lower as the model (and
         | tightly coupled agent) gets better.
        
           | TRiG_Ireland wrote:
            | The only AI I've ever dealt with, I dealt with unwillingly:
            | companies using AI chatbots to replace human support. They
            | certainly make me want to leave and not come back.
        
         | fancyfredbot wrote:
          | Sutton might have said you just need a loss function which
          | penalises variance and the model will learn to reduce variance
          | itself. He thinks this will be more effective than hand-coded
          | guardrails. He's probably right.
          | 
          | I don't know how you write that loss function, mind you. Sounds
          | tricky. But I doubt Sutton was saying it's easy, just that if
          | you can do it then it's effective.
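          | 
          | For the shape of it, a minimal sketch (illustrative only; the
          | penalty weight and the reward numbers are made up):
          | 
          |   import numpy as np
          | 
          |   def variance_penalised_objective(rewards, lam=0.5):
          |       # Mean reward minus a penalty on its spread: the agent
          |       # is pushed toward consistency, not just a high average.
          |       r = np.asarray(rewards, dtype=float)
          |       return r.mean() - lam * r.std()
          | 
          |   # A steady ~0.8 policy beats a swingy ~0.9 one here:
          |   print(variance_penalised_objective([0.7, 0.8, 0.9]))  # ~0.76
          |   print(variance_penalised_objective([0.5, 0.9, 1.3]))  # ~0.74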
        
           | nsonha wrote:
            | Penalises during training, not at runtime? That's the risk.
        
         | ankit219 wrote:
          | You don't have to tolerate the agent/AI going haywire. Take a
          | simple example: multiple parallel generations. It's compute
          | intensive, and it reduces the probability of your agent going
          | haywire. You still need mechanisms and evals to pick the best
          | output in this scenario, of course. With more compute, you
          | prevent the final output from going haywire despite the
          | variance.
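          | 
          | A concrete sketch of that best-of-n idea (illustrative only;
          | generate and score stand in for whatever model call and eval
          | you actually use):
          | 
          |   from concurrent.futures import ThreadPoolExecutor
          |   from typing import Callable
          | 
          |   def best_of_n(prompt: str,
          |                 generate: Callable[[str], str],
          |                 score: Callable[[str, str], float],
          |                 n: int = 5) -> str:
          |       # Spend compute: n generations in parallel instead of
          |       # one heavily prompt-engineered attempt.
          |       with ThreadPoolExecutor(max_workers=n) as pool:
          |           candidates = list(pool.map(generate, [prompt] * n))
          |       # Let the eval pick the output least likely to be
          |       # haywire.
          |       return max(candidates, key=lambda c: score(prompt, c))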
        
       | ed wrote:
       | It'd be nice if this post included a high-level cookbook for
       | training the 3rd approach. The hand-waving around RL sounds
       | great, but how do you accurately simulate a customer for learning
       | at scale?
        
       | typon wrote:
        | The counterargument is the bitter lesson Tesla is learning from
        | Waymo, one that might be bitter enough to tank the company.
        | Waymo's approach to self-driving isn't end to end - they combine
        | classical control with tons of deep learning, creating a final
        | product that actually works in the real world, while Tesla's
        | purely data-driven approach has failed to deliver a working
        | product.
        
         | ModernMech wrote:
          | The lesson from Tesla is that AI is not just a magic box where
          | you can put in data and get out intelligence. There is more to
          | working systems than compute, and when they operate in the
          | real world, data isn't enough. The key problem with Tesla
          | cars, the one that keeps them from succeeding, is not that
          | they don't have enough data, but that they have no idea what
          | to do with it. Even if they had infinite compute and all the
          | driving videos in the world, it wouldn't be enough to overcome
          | the limitations of their sensors.
        
           | dangus wrote:
           | Tesla is a poor counterargument because it is no longer a
           | market leader. It has poor management compared to 10 years
           | ago and seems to be unable to attract top talent (poor labor
           | relations).
           | 
           | Tesla is being leapfrogged by competitors across the auto
           | industry. All it has is first mover status (charging
           | network).
           | 
            | Tesla purposefully limits the capabilities of its self-
            | driving system by refusing to implement it with sensors that
            | go beyond smartphone cameras.
           | 
            | My belief is that Tesla doesn't actually want to deliver a
            | car that can drive itself, because the end result of Waymo
            | is that fewer people will need to own a car, and fleets of
            | short-term rental self-driving cars won't spend frivolous
            | money on prestige and luxury the way consumer car buyers do.
            | They won't lease a car and replace it every 2-3 years like
            | some car owners do just because they like having a new car.
            | Fleet vehicle operators purchase cars on razor-thin margins
            | and make decisions based solely on economics, and they have
            | a lot more purchasing leverage over car manufacturers.
           | 
           | I don't think Tesla ever wants self driving to work, they
           | just want to sell the idea of the software.
        
             | immibis wrote:
             | Tesla removed the LIDAR and thought advances in AI would be
             | able to do without one. They were wrong.
        
               | jsight wrote:
               | Tesla didn't remove LIDAR, they never had it. So far,
               | that bet is looking pretty reasonable. It seems evident
               | at the moment that the most formidable competitors in
               | this space could build a solid FSD product with cameras
               | alone, with the biggest variable being time.
        
           | xg15 wrote:
            | > _The key problem with Tesla cars, the one that keeps them
            | from succeeding, is not that they don't have enough data,
            | but that they have no idea what to do with it. Even if they
            | had infinite compute and all the driving videos in the
            | world, it wouldn't be enough to overcome the limitations of
            | their sensors._
           | 
           | Isn't this effectively a refutation of the "bitter lesson"?
        
         | jsight wrote:
          | I'd argue the bitter lesson might cut the other way here.
          | Waymo has been experimenting with more end-to-end approaches
          | and is likely to end up with something that looks more like
          | that than like a "classical control" approach, though maybe
          | not quite the same approach as Tesla's current setup.
         | 
         | IMO, this is the best public description of the current state
         | of the art: https://www.youtube.com/watch?v=92e5zD_-xDw
         | 
         | I expect Waymo to continue to evolve in a similar direction.
        
       | patcon wrote:
       | YES to the nature analogy.
       | 
        | We are not guaranteed a world pliable to our human understanding.
        | The fact that we feel entitled to such a thing is just a product
        | of our current brief moment in the information-theoretic
        | landscape, where humans have created, and dominate, most of the
        | information environment we navigate. This is a rare moment for
        | any actor. Most of our long history has been spent in unmanaged
        | ecologies that blossomed around any one actor.
       | 
        | imho neither we nor any single AI agent will understand the
        | world as fully as we do now. We should retire the idea that we
        | are destined to be privileged to that knowledge.
       | 
       | https://nodescription.net/notes/#2021-05-04
        
       | extr wrote:
        | I bring this up often at work. There is more ROI in assuming
        | models will continue to improve, and planning/engineering with
        | that future in mind, than in using a worse model and spending a
        | lot of dev time shoring up its weaknesses, prompt engineering,
        | etc. The best models today will be cheaper tomorrow. The worst
        | models today will literally cease to exist. You want to lean
        | into this - have the AI handle as much as it possibly can.
       | 
        | E.g.: we were using Flash 1.5 for a while and spent a lot of
        | time prompt engineering to get it to do exactly what we wanted
        | and be more reliable. We probably should have just done multi-
        | shot and said "take the best of 3", because as soon as Flash 2.0
        | came out, all the problems evaporated.
        
         | ankit219 wrote:
          | That's the core of the argument. We are switching from a 100%
          | deterministic and controlled worldview (in software terms) to
          | a probabilistic one, and we haven't updated ourselves
          | accordingly. Best of n (with parallelization) is probably the
          | simplest fix, as opposed to such rigorous prompt engineering.
          | Still, many teams want a deterministic output and spend a lot
          | of time on prompts (rather than on evals to choose the best
          | output).
        
       | TylerLives wrote:
       | It's actually about LLMs. They're fundamentally limited by our
       | preconceptions. Can we go back to games and AlphaZero?
        
       | xg15 wrote:
        | It's not wrong, but I find the underlying corollary pretty
        | creepy: that actually trying to understand those problems and
        | fix errors at edge cases is also a fool's errand, because why
        | try to understand a specific behavior if you can just (try to)
        | finetune it away?
        | 
        | So we'll have to get used, for good, to a future where AI is
        | unpredictable, usually does what you want, but has a 0.1% chance
        | of randomly going haywire and no one will know how to fix it?
       | 
        | Also, the focus on hardware seems to imply that it's strictly a
        | game of capital - whoever has access to the most compute
        | resources wins, and the others can stop trying. Wouldn't this
        | lead to massive centralization?
        
         | latentsea wrote:
          | >So we'll have to get used, for good, to a future where AI is
          | unpredictable, usually does what you want, but has a 0.1%
          | chance of randomly going haywire and no one will know how to
          | fix it?
          | 
          | Just like humans. I don't think this is a solvable problem
          | either.
        
       | moojacob wrote:
       | > For instance, in customer service, an RL agent might discover
       | that sometimes asking a clarifying question early in the
       | conversation, even when seemingly obvious, leads to much better
       | resolution rates. This isn't something we would typically program
       | into a wrapper, but the agent found this pattern through
       | extensive trial and error. The key is having enough computational
       | power to run these experiments and learn from them.
       | 
        | I am working on a GPT wrapper for customer support. I've focused
        | on letting the LMs do what they do best, which is writing
        | responses using context. The human is responsible for managing
        | the context instead. That part is a much harder problem than RL
        | folks expect it to be. How does your AI agent know all the
        | nuances of a business? How does it know you switched your policy
        | on returns? You'd have to have a human sign off on all replies
        | to customer inquiries. But then, why not build an actual UI at
        | that point instead of an "agent" chatbox?
        | 
        | Games are simple; we know all the rules. Like chess, where
        | DeepMind can train on 50 million games. But we don't know all
        | the rules in customer support. Are you going to let an AI agent
        | train itself on 50 million customer interactions and be happy
        | with it sucking for the first 20 million?
        
         | ip26 wrote:
         | The bitter lesson would suggest eventually the LM agent will
         | train itself, brute force, on _something_ and extract the
         | context itself. Perhaps it will scrape all your policy
         | documents and figure out which ones are most recently dated.
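          | 
          | Even a dumb hand-rolled baseline for that is not much code
          | today (illustrative only; the data model here is invented):
          | 
          |   from dataclasses import dataclass
          |   from datetime import date
          | 
          |   @dataclass
          |   class PolicyDoc:
          |       title: str
          |       effective: date
          |       text: str
          | 
          |   def current_policy(docs, topic):
          |       # Keep docs mentioning the topic, take the latest one.
          |       relevant = [d for d in docs
          |                   if topic.lower() in d.title.lower()]
          |       return max(relevant, key=lambda d: d.effective)
          | 
          |   docs = [
          |       PolicyDoc("Returns policy", date(2023, 1, 1), "30 days"),
          |       PolicyDoc("Returns v2", date(2024, 6, 1), "14 days"),
          |   ]
          |   print(current_policy(docs, "returns").text)  # 14 days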
        
       | abstractcontrol wrote:
       | > Investment Strategy: Organizations should invest more in
       | computing infrastructure than in complex algorithmic development.
       | 
       | > Competitive Advantage: The winners in AI won't be those with
       | the cleverest algorithms, but those who can effectively harness
       | the most compute power.
       | 
       | > Career Focus: As AI engineers, our value lies not in crafting
       | perfect algorithms but in building systems that can effectively
       | leverage massive computational resources. That is a fundamental
       | shift in mental models of how to build software.
       | 
        | I think the author has a fundamental misconception about what
        | making the best use of computational resources requires. It's
        | algorithms. His recommendation boils down to not doing the one
        | thing that would allow us to make the best use of computational
        | resources.
       | 
       | His assumptions would only be correct if all the best algorithms
       | were already known, which is clearly not the case at present.
       | 
        | Rich Sutton said something similar, but when he said it, he was
        | thinking of old engineering-intensive approaches, so it made
        | sense in the context in which he said it and for the audience he
        | directed it at. It was hardly groundbreaking either; the people
        | he wrote the article for all thought the same thing already.
       | 
       | People like the author of this article don't understand the
       | context and are taking his words as gospel. There is no reason
       | not to think that there won't be different machine learning
       | methods to supplant the current ones, and it's certain they won't
       | be found by people who are convinced that algorithmic development
       | is useless.
        
         | amarant wrote:
         | >There is no reason not to think that there won't be different
         | machine learning methods to supplant the current ones,
         | 
         | Sorry, is that a triple negative? I'm confused, but I think
         | you're saying there WILL be improved algorithms in the future?
         | That seems to jive better with the rest of your comment, but I
         | just wanted to make sure I understood you correctly!
         | 
         | So.. Did I?
        
         | aDyslecticCrow wrote:
          | I'm of the same mind.
          | 
          | I dare say GPT-3 and GPT-4 are the only recent examples where
          | pure compute produced a significant edge compared to
          | algorithmic improvements, and that edge lasted a solid year
          | before others caught up. Even among the recent improvements:
          | 
          | 1. Gaussian splatting, a hand-crafted method, blew the entire
          | field of NeRF models out of the water.
          | 2. DeepSeek R1 trains reasoning without a reasoning dataset.
          | 3. Inception Labs' 16x speedup comes from using a diffusion
          | model instead of next-token prediction.
          | 4. DeepSeek's distillation compresses a larger model into a
          | smaller one.
          | 
          | That sets aside the introduction of the Transformer and the
          | diffusion model themselves, which triggered the current wave
          | in the first place.
         | 
         | AI is still a vastly immature field. We have not formally
         | explored it carefully but rather randomly tested things. Good
         | ideas are being dismissed for whatever randomly worked
         | elsewhere. I suspect we are still missing a lot of fundamental
         | understanding, even at the activation function level.
         | 
         | We need clever ideas more than compute. But the stock market
         | seems to have mixed them up.
        
       | PollardsRho wrote:
        | The time span on which these developments take place matters a
        | lot for whether the bitter lesson is relevant to a particular AI
        | deployment. The best AI models of the future will not have 100K
        | lines of hand-coded edge cases, and developing those to make the
        | models of today better won't be a long-term way to move towards
        | better AI.
        | 
        | On the other hand, most companies don't have unlimited time to
        | wait for improvements on the core AI side of things, and in the
        | meantime, building competitive advantages like a large existing
        | customer base or really good private data sets to train next-gen
        | AI tools has huge long-term benefits.
       | 
       | There's been an extraordinary amount of labor hours put into
       | developing games that could run, through whatever tricks were
       | necessary, on whatever hardware actually existed for consumers at
       | the time the developers were working. Many of those tricks are no
       | longer necessary, and clearly the way to high-definition real-
       | time graphics was not in stacking 20 years of tricks onto
       | 2000-era hardware. I don't think anyone working on that stuff
        | actually thought that was going to happen, though. Many of the
        | companies dominating the gaming industry now are the ones that
        | built up brands and customers and experience in all the other
        | aspects of the industry, making sure that when better underlying
        | scaling came along, they had the experience, revenue, and know-
        | how to make use of that tooling more effectively.
        
         | spongebobstoes wrote:
          | Why must the best model not have 100k hand-coded edge cases?
          | 
          | Our firsthand experiences as humans can be viewed as exactly
          | that. People constantly over-index on their own anecdata, and
          | humans are still the best "models" so far.
        
       | _wire_ wrote:
       | If only artificial intelligence was intelligent!
       | 
       | Oh, well...
        
       | sgt101 wrote:
       | I don't get how RL can be applied in a domain where there is no
       | simulator.
       | 
       | So for customer service, to do RL on real customers... well this
       | sounds like it's going to be staggeringly slow and very expensive
       | in terms of peeved customers.
        
       | RachelF wrote:
       | I think an even more bitter lesson is coming very soon: AI will
       | run out of human-generated content to train on.
       | 
        | Already, AI companies are probably training AI on AI-generated
        | slop.
        | 
        | Sure, there will be tweaks etc., but can we make it more
        | intelligent than its teachers?
        
       | noosphr wrote:
        | For a blog post of 1,200 words, the bitter lesson has done more
        | damage to AI research and funding than blowing up a nuclear bomb
        | at NeurIPS would.
        | 
        | Every time I try to write a reasonable blog post about why it's
        | wrong, it blows up to tens of thousands of words and no one can
        | be bothered to read it, let alone the supporting citations.
        | 
        | In the spirit of low-effort anecdata pulled from memory:
        | 
        | The raw compute needed to brute force any problem can only be
        | known after the problem is solved. There is no sane upper limit
        | to how much computation, memory, and data any given task will
        | take, and humans are terrible at estimating how hard tasks
        | actually are. We are, after all, only 60 years late on the
        | undergraduate summer project that was supposed to solve computer
        | vision.
       | 
        | Today VLMs are the best brute-force approach to computer vision
        | we have; they look like they will take a PB of state to solve
        | the problem, and the compute needed to train them will be
        | available sometime around 2040.
        | 
        | What do we do with the problems that are too hard to solve with
        | the limited compute we have? Lie down for 80 years and wait for
        | compute to catch up? Or solve a smaller problem using
        | specialized tricks that don't require a $10B supercomputer to
        | build?
       | 
       | The bitter lesson is nothing of the sort, there is plenty of
       | space for thinking hard, and there always will be.
        
         | RobinL wrote:
          | I appreciate this comment. I'm currently working hard on
          | something seemingly straightforward (address matching), and
          | sometimes I feel demotivated because it feels like whatever
          | progress I make, the bitter lesson will get me in the end.
          | Reading your comment made me feel that maybe it's worth the
          | effort after all. I have also taken some comfort in the fact
          | that current LLMs cannot perform this task very well.
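          | 
          | For a sense of the problem, a toy sketch of the kind of hand-
          | crafted signal involved (illustrative only; real address
          | matching needs far more than this):
          | 
          |   import re
          | 
          |   ABBREV = {"st": "street", "rd": "road", "ave": "avenue"}
          | 
          |   def tokens(address):
          |       # Lowercase, strip punctuation, expand a few common
          |       # abbreviations.
          |       words = re.findall(r"[a-z0-9]+", address.lower())
          |       return {ABBREV.get(w, w) for w in words}
          | 
          |   def similarity(a, b):
          |       # Jaccard overlap of normalised tokens.
          |       ta, tb = tokens(a), tokens(b)
          |       return len(ta & tb) / len(ta | tb)
          | 
          |   print(similarity("12 High St.", "12 High Street"))  # 1.0
          |   print(similarity("12 High St.", "21 High Street"))  # 0.5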
        
           | uxcolumbo wrote:
            | Can you say more about the problems in address matching and
            | what you are trying to solve?
            | 
            | Do you mean street address matching? Isn't that already
            | solved? (Excuse the naive question.)
        
             | kridsdale1 wrote:
             | I don't know of any open source solution. (I work in
             | mapping)
        
       ___________________________________________________________________
       (page generated 2025-03-23 23:00 UTC)