[HN Gopher] Bitter Lesson is about AI agents
___________________________________________________________________
Bitter Lesson is about AI agents
Author : ankit219
Score : 79 points
Date : 2025-03-23 09:16 UTC (13 hours ago)
(HTM) web link (ankitmaloo.com)
(TXT) w3m dump (ankitmaloo.com)
| dtagames wrote:
| Good stuff, but the original "Bitter Lesson" article has the real
| meat, which is that by applying more compute power we get better
| results (just more accurate token predictions, really) than with
| human guardrails.
| gpapilion wrote:
| "More" generally beats "better". That's the continual lesson
| from data-intensive workloads: more compute, more data, more
| bandwidth.
|
| The part I've been scratching my head at is whether we'll see a
| retreat from aspects of this due to the high costs associated
| with it. For CPU-based workloads this was a workable solution,
| since prices kept falling. GPUs have generally scaled pricing in
| proportion to available FLOPS, and the current hardware approach
| amounts to pouring in power to achieve better results.
| lsy wrote:
| Going back to the original "Bitter Lesson" article, I think the
| analogy to chess computers could be instructive here. A lot of
| institutional resources were spent trying to achieve "superhuman"
| chess performance, it was achieved, and today almost the entire
| TAM for computer chess is covered by good-enough Stockfish, while
| most of the money tied up in chess is in matching human players
| with each other across the world, and playing against computers
| is sort of what you do when you're learning, or don't have an
| internet connection, or you're embarrassed about your skill and
| don't want to get trash-talked by an Estonian teenager.
|
| The "Second Bitter Lesson" of AI might be that "just because
| massive amounts of compute make something _possible_ doesn't
| mean that there will be a commensurately massive market to
| justify that compute".
|
| "Bitter Lesson" I think also underplays the amount of energy and
| structure and design that has to go into compute-intensive
| systems to make them succeed: Deep Blue and current engines like
| Stockfish take advantage of tablebases of opening and closing
| positions that are more like GOFAI than deep tree search. And the
| current crop of LLMs are not only taking advantage of expanded
| compute, but of the hard-won ability of companies in the 21st
| century to not only build and resource massive server farms, but
| mobilize armies of contractors in low-COL areas to hand-train
| models into usefulness.
| diego_sandoval wrote:
| The main useful outcome we get from chess is entertainment.
|
| The entertainment value of a Human vs. Human match is higher
| than that of Human vs. AI, at least for spectators.
|
| But many sectors of the economy don't gain much from the work
| being done by humans. I don't care whether my car was made
| entirely by humans or entirely by robots, as long as it's the
| best car I can get for the money.
|
| I think you're extrapolating a bit too much from the specific
| case of chess.
| ip26 wrote:
| It's not really _about_ how the compute-intensive resources
| come to bear. You can draw a parallel to Moore's law. Node
| advancement is one of the most expensive and cutting edge
| efforts by humanity today. But it's also simultaneously true
| that software companies have succeeded or failed by betting for
| or against computers getting faster. There are famous examples
| of companies in the 80's that designed software that was simply
| not usable on the computers on hand when the project began, but
| was incredible on the (much faster) computers of launch day.
|
| The bitter lesson is very similar. In essence, when building on
| top of AI models, bet on the AI models getting _much_ faster
| and more capable.
| immibis wrote:
| And there is software today that is simply not usable on
| computers today, but will be incredible on computers in 20
| years' time if clock speeds continue doubling every 2 years.
|
| Most of it is written in Electron.
| serjester wrote:
| This misses that if the agent occasionally goes haywire, the
| user leaves and never comes back. AI deployments are about
| managing expectations: you're much better off with an agent
| that's 80 +/- 10% successful than one that's 90 +/- 40%. The
| more you lean into full automation, the more guardrails you give
| up and the more variance your system has. This is a real
| problem.
| ed wrote:
| Do you have a real world example of this? Claude Code for
| example doesn't fit the pattern of "higher success but more
| variance." If anything the variance is lower as the model (and
| tightly coupled agent) gets better.
| TRiG_Ireland wrote:
| The only AI I've ever dealt with, I've dealt with unwillingly:
| when companies use AI chat bots to replace human support. They
| certainly make me want to leave and not come back.
| fancyfredbot wrote:
| Sutton might have said you just need a loss function which
| penalises variance, and the model will learn to reduce variance
| itself. He thinks this will be more effective than hand-coded
| guardrails. He's probably right.
|
| I don't know how you write that loss function, mind you. Sounds
| tricky. But I doubt Sutton was saying it's easy, just that if
| you can do it then it's effective.
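|
| The rough shape is presumably something like "mean reward minus
| a variance penalty" over sampled rollouts for the same prompt.
| A toy sketch (the `lam` knob is a made-up assumption, not
| anything from Sutton):
|
|     from statistics import mean, pvariance
|
|     def penalised_score(rewards: list[float], lam: float = 1.0):
|         # Reward high average success, but pay for inconsistency
|         # across rollouts; `lam` trades raw success rate against
|         # predictability.
|         return mean(rewards) - lam * pvariance(rewards)
|
| With lam=1, an agent at 80 +/- 10% scores 0.8 - 0.01 = 0.79,
| while one at 90 +/- 40% scores 0.9 - 0.16 = 0.74, matching the
| preference described upthread. The hard part is making something
| like this trainable at scale, not writing it down.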
| nsonha wrote:
| Penalises during training, not at runtime? That's the risk.
| ankit219 wrote:
| You don't have to tolerate the agent/AI going haywire. Take a
| simple example: multiple parallel generations. It's compute
| intensive, and it reduces the probability of your agent going
| haywire. You still need mechanisms and evals to detect the best
| output in this scenario, of course; that remains important.
| With more compute, you keep the final output from going haywire
| despite the variance.
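|
| A minimal sketch of that best-of-n idea (`generate` and `score`
| here are stand-ins for a real model call and a real eval, not
| anything from the post):
|
|     import concurrent.futures
|     import random
|
|     def generate(prompt: str) -> str:
|         # Stand-in for one model call; swap in a real client.
|         return f"candidate-{random.randint(0, 999)}: {prompt}"
|
|     def score(prompt: str, candidate: str) -> float:
|         # Stand-in for an eval / reward model rating a candidate.
|         return random.random()
|
|     def best_of_n(prompt: str, n: int = 5) -> str:
|         # Spend extra compute: n independent generations, run in
|         # parallel so latency stays roughly flat.
|         with concurrent.futures.ThreadPoolExecutor(n) as pool:
|             candidates = list(
|                 pool.map(lambda _: generate(prompt), range(n)))
|         # The eval picks the winner, so a single haywire sample
|         # never reaches the user.
|         return max(candidates, key=lambda c: score(prompt, c))
|
| The quality of the whole thing then hinges on how good that
| eval is, which is the part that stays genuinely hard.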
| ed wrote:
| It'd be nice if this post included a high-level cookbook for
| training the 3rd approach. The hand-waving around RL sounds
| great, but how do you accurately simulate a customer for learning
| at scale?
| typon wrote:
| The counterargument is the bitter lesson Tesla is learning from
| Waymo, and that lesson might be bitter enough to tank the
| company. Waymo's approach to self-driving isn't end-to-end: they
| combine classical control with tons of deep learning, creating a
| final product that actually works in the real world. Meanwhile,
| Tesla's purely data-driven approach has failed to deliver a
| working product.
| ModernMech wrote:
| The lesson from Tesla is that AI is not just a magic box where
| you can put in data and get out intelligence. There is more to
| working systems than compute, and when they operate in the real
| world, data isn't enough. The key problem with Tesla cars that
| keeps them from succeeding is not that they don't have enough
| data, but that they have no idea what to do with it. Even if
| they had infinite compute and all the driving videos in the
| world, it wouldn't be enough to overcome the limitations of
| their sensors.
| dangus wrote:
| Tesla is a poor counterargument because it is no longer a
| market leader. It has poor management compared to 10 years
| ago and seems to be unable to attract top talent (poor labor
| relations).
|
| Tesla is being leapfrogged by competitors across the auto
| industry. All it has is first mover status (charging
| network).
|
| Tesla purposefully limits the capabilities of its self
| driving by refusing to implement it with sensors that go
| beyond smartphone cameras.
|
| My belief is that Tesla doesn't actually want to deliver a car
| that can drive itself, because the end result of the Waymo
| model is that fewer people will need to own a car, and fleets
| of short-term-rental self-driving cars won't spend frivolously
| on prestige and luxury the way consumer car buyers do. They
| won't lease a car and replace it every 2-3 years just because
| they like having a new one, as some owners do. Fleet operators
| purchase cars at razor-thin margins, make decisions based
| solely on economics, and have a lot more purchasing leverage
| over car manufacturers.
|
| I don't think Tesla ever wants self-driving to work; they just
| want to sell the idea of the software.
| immibis wrote:
| Tesla removed the LIDAR and thought advances in AI would be
| able to do without one. They were wrong.
| jsight wrote:
| Tesla didn't remove LIDAR, they never had it. So far,
| that bet is looking pretty reasonable. It seems evident
| at the moment that the most formidable competitors in
| this space could build a solid FSD product with cameras
| alone, with the biggest variable being time.
| xg15 wrote:
| > _The key problem with Tesla cars that keeps them from
| succeeding is not that they don't have enough data, but that
| they have no idea what to do with it. Even if they had infinite
| compute and all the driving videos in the world, it wouldn't be
| enough to overcome the limitations of their sensors._
|
| Isn't this effectively a refutation of the "bitter lesson"?
| jsight wrote:
| I'd argue the bitter lesson might cut the other way here. Waymo
| has been experimenting with more end-to-end approaches and is
| likely to end up with something that looks more like that than
| like a "classical control" approach, though maybe not quite the
| same as Tesla's current setup.
|
| IMO, this is the best public description of the current state
| of the art: https://www.youtube.com/watch?v=92e5zD_-xDw
|
| I expect Waymo to continue to evolve in a similar direction.
| patcon wrote:
| YES to the nature analogy.
|
| We are not guaranteed a world pliable to our human understanding.
| The fact that we feel entitled to such a thing is just a product
| of our current brief moment in the information-theoretic
| landscape, in which humans have created, and dominate, most of
| the information environment we navigate. This is a rare moment
| for any actor. Most of our long history has been spent in
| unmanaged ecologies that blossomed around any one actor.
|
| imho neither we nor any single AI agent will understand the
| world as fully as we think we do. We should retire the idea
| that we are destined to be privileged to that knowledge.
|
| https://nodescription.net/notes/#2021-05-04
| extr wrote:
| I bring this up often at work. There is more ROI in assuming
| models will continue to improve, and planning/engineering with
| that future in mind, than in using a worse model and spending a
| lot of dev time shoring up its weaknesses, prompt engineering,
| etc. The best models today will be cheaper tomorrow. The worst
| models today will literally cease to exist. You want to lean
| into this: have the AI handle as much as it possibly can.
|
| E.g.: we were using Flash 1.5 for a while. We spent a lot of
| time prompt engineering to get it to do exactly what we wanted
| and be more reliable. We probably should have just done multi-
| shot and said "take best of 3", because as soon as Flash 2.0
| came out, all the problems evaporated.
| ankit219 wrote:
| That's the core of the argument. We are switching from a 100%
| deterministic and controlled worldview (in software terms) to a
| scenario where it's probabilistic, and we haven't updated
| ourselves accordingly. Best of n (with parallelization) is
| probably the simplest fix, instead of such rigorous prompt
| engineering. Still, many teams want a deterministic output and
| spend a lot of time on prompts (as opposed to evals to choose
| the best output).
| TylerLives wrote:
| It's actually about LLMs. They're fundamentally limited by our
| preconceptions. Can we go back to games and AlphaZero?
| xg15 wrote:
| It's not wrong, but I find the underlying corollary pretty
| creepy: that actually trying to understand those problems and
| fix errors in edge cases is also a fool's errand, because why
| try to understand a specific behavior if you can just (try to)
| fine-tune it away?
|
| So we'll have to get used, for good, to a future where AI is
| unpredictable, usually does what you want, but has a 0.1% chance
| of randomly going haywire and no one will know how to fix it?
|
| Also, the focus on hardware seems to imply that it's strictly a
| game of capital: whoever has access to the most compute
| resources wins, and the others can stop trying. Wouldn't this
| lead to massive centralization?
| latentsea wrote:
| >So we'll have to get used for good to a future where AI is
| unpredictable, usually does what you want, but has a 0.1%
| chance of randomly going haywire and no one will know how to
| fix it?
|
| Just like humans. I don't think is a solvable problem either.
| moojacob wrote:
| > For instance, in customer service, an RL agent might discover
| that sometimes asking a clarifying question early in the
| conversation, even when seemingly obvious, leads to much better
| resolution rates. This isn't something we would typically program
| into a wrapper, but the agent found this pattern through
| extensive trial and error. The key is having enough computational
| power to run these experiments and learn from them.
|
| I am working on a GPT wrapper in customer support. I've focused
| on letting the LMs do what they do best, which is writing
| responses using context. The human is responsible for managing
| the context instead. That part is a much harder problem than RL
| folks expect it to be. How does your AI agent know all the
| nuances of a business? How does it know you switched your
| returns policy? You'd have to have a human sign off on all
| replies to customer inquiries. But then, why not build an actual
| UI at that point instead of an "agent" chatbox?
|
| Games are simple: we know all the rules. Take chess. DeepMind
| can train on 50 million games. But we don't know all the rules
| in customer support. Are you going to let an AI agent train
| itself on 50 million customer interactions and be happy with it
| sucking for the first 20 million?
| ip26 wrote:
| The bitter lesson would suggest eventually the LM agent will
| train itself, brute force, on _something_ and extract the
| context itself. Perhaps it will scrape all your policy
| documents and figure out which ones are most recently dated.
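|
| A toy version of that last idea (filenames and layout here are
| pure assumption): keep only the newest policy file per topic
| and hand that text to the agent as context.
|
|     from pathlib import Path
|
|     def latest_policies(policy_dir: str) -> dict[str, str]:
|         # Assume files named like "returns_2025-01-15.txt":
|         # topic, underscore, ISO date (ISO dates compare
|         # correctly as plain strings).
|         newest = {}  # topic -> (date, path)
|         for path in Path(policy_dir).glob("*.txt"):
|             topic, _, date = path.stem.partition("_")
|             if date > newest.get(topic, ("",))[0]:
|                 newest[topic] = (date, path)
|         # The agent's context then reflects the current
|         # policy, not last year's.
|         return {topic: path.read_text()
|                 for topic, (_, path) in newest.items()}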
| abstractcontrol wrote:
| > Investment Strategy: Organizations should invest more in
| computing infrastructure than in complex algorithmic development.
|
| > Competitive Advantage: The winners in AI won't be those with
| the cleverest algorithms, but those who can effectively harness
| the most compute power.
|
| > Career Focus: As AI engineers, our value lies not in crafting
| perfect algorithms but in building systems that can effectively
| leverage massive computational resources. That is a fundamental
| shift in mental models of how to build software.
|
| I think the author has a fundamental misconception about what
| making the best use of computational resources requires. It's
| algorithms. His recommendation boils down to not doing the one
| thing that would allow us to make the best use of computational
| resources.
|
| His assumptions would only be correct if all the best algorithms
| were already known, which is clearly not the case at present.
|
| Rich Sutton said something similar, but when he said it, he was
| thinking of old engineering-intensive approaches, so it made
| sense in the context in which he said it and for the audience he
| directed it at. It was hardly groundbreaking either; the people
| he wrote the article for all thought the same thing already.
|
| People like the author of this article don't understand the
| context and are taking his words as gospel. There is no reason
| not to think that there won't be different machine learning
| methods to supplant the current ones, and it's certain they won't
| be found by people who are convinced that algorithmic development
| is useless.
| amarant wrote:
| >There is no reason not to think that there won't be different
| machine learning methods to supplant the current ones,
|
| Sorry, is that a triple negative? I'm confused, but I think
| you're saying there WILL be improved algorithms in the future?
| That seems to jive better with the rest of your comment, but I
| just wanted to make sure I understood you correctly!
|
| So.. Did I?
| aDyslecticCrow wrote:
| I'm by the same mind.
|
| I dare say ChatGPT 3.0 and 4.0 are the only recent examples
| where pure computing produced a significant edge compared to
| algorithmic improvements. And that edge lasted a solid year
| before others caught up. Even among the recent improvements;
|
| 1. Gaussian splashing, a hand-crafted method threw the entire
| field of Nerf models out the water. 2. Deepseek o1 is used for
| training reasoning without a reasoning dataset. 3. Inception-
| labs 16x speedup is done using a diffusion model instead of the
| next token prediction. 4. Deepseek distillation, compressing a
| larger model into a smaller model.
|
| That sets aside the introduction of the Transformer and
| diffusion model themselves, which triggered the current wave in
| the first place.
|
| AI is still a vastly immature field. We have not formally
| explored it carefully but rather randomly tested things. Good
| ideas are being dismissed for whatever randomly worked
| elsewhere. I suspect we are still missing a lot of fundamental
| understanding, even at the activation function level.
|
| We need clever ideas more than compute. But the stock market
| seems to have mixed them up.
| PollardsRho wrote:
| The time span on which these developments take place matters a
| lot for whether the bitter lesson is relevant to a particular AI
| deployment. The best AI models of the future will not have 100K
| lines of hand-coded edge cases, and developing those to make the
| models of today better won't be a long-term way to move towards
| better AI.
|
| On the other hand, most companies don't have unlimited time to
| wait for improvements on the core AI side of things, and even
| so, building competitive advantages like a large existing
| customer base or really good private data sets for training
| next-gen AI tools has huge long-term benefits.
|
| There's been an extraordinary number of labor hours put into
| developing games that could run, through whatever tricks were
| necessary, on whatever hardware actually existed for consumers
| at the time the developers were working. Many of those tricks
| are no longer necessary, and clearly the way to high-definition
| real-time graphics was not in stacking 20 years of tricks onto
| 2000-era hardware. I don't think anyone working on that stuff
| actually thought that was going to happen, though. Many of the
| companies dominating the gaming industry now are the ones that
| built up brands, customers, and experience in all of the other
| aspects of the industry, making sure that when better underlying
| scaling arrived, they had the experience, revenue, and know-how
| to make use of that tooling more effectively.
| spongebobstoes wrote:
| Why must the best model not have 100k edge cases hand coded?
|
| Our firsthand experiences as humans can be viewed as such.
| People constantly over index on their own anecdata, and are the
| best "models" so far.
| _wire_ wrote:
| If only artificial intelligence was intelligent!
|
| Oh, well...
| sgt101 wrote:
| I don't get how RL can be applied in a domain where there is no
| simulator.
|
| So for customer service, to do RL on real customers... well this
| sounds like it's going to be staggeringly slow and very expensive
| in terms of peeved customers.
| RachelF wrote:
| I think an even more bitter lesson is coming very soon: AI will
| run out of human-generated content to train on.
|
| Already, AI companies are probably training AI with AI-generated
| slop.
|
| Sure, there will be tweaks, etc., but can we make it more
| intelligent than its teachers?
| noosphr wrote:
| For a blog post of 1,200 words, the bitter lesson has done more
| damage to AI research and funding than blowing up a nuclear bomb
| at NeurIPS would.
|
| Every time I try to write a reasonable blog post about why it's
| wrong, it blows up to tens of thousands of words and no one can
| be bothered to read it, let alone the supporting citations.
|
| In the spirit of low-effort anecdata pulled from memory:
|
| The raw compute needed to brute-force any problem can only be
| known after the problem is solved. There is no sane upper limit
| to how much computation, memory, and data any given task will
| take, and humans are terrible at estimating how hard tasks
| actually are. We are, after all, only 60 years late on the
| undergraduate summer project that was supposed to solve computer
| vision.
|
| Today VLMs are the best brute-force approach to solving computer
| vision that we have; they look like they will take a PB of state
| to solve it, and the compute needed to train them will be
| available some time around 2040.
|
| What do we do with the problems that are too hard to solve with
| the limited compute that we have? Lie down for 80 years and wait
| for compute to catch up? Or solve a smaller problem using
| specialized tricks that don't require a $10B supercomputer to
| build?
|
| The bitter lesson is nothing of the sort; there is plenty of
| space for thinking hard, and there always will be.
| RobinL wrote:
| I appreciate this comment. I'm currently working hard on
| something seemingly straightforward (address matching) and
| sometimes I feel demotivated because it feels like whatever
| progress I make, the bitter lesson will get me in the end.
| Reading your comment made me feel that maybe it's worth the
| effort after all. I have also taken some comfort in the fact
| that current LLMs cannot perform this task very well.
| uxcolumbo wrote:
| Can you say more about the problems in address matching and
| what you are trying to solve?
|
| Do you mean street address matching? Isn't that already a
| solved problem? (excuse the naive question)
| kridsdale1 wrote:
| I don't know of any open source solution. (I work in
| mapping)
___________________________________________________________________
(page generated 2025-03-23 23:00 UTC)