[HN Gopher] "The Bitter Lesson" is wrong. Well sort of
___________________________________________________________________
"The Bitter Lesson" is wrong. Well sort of
Author : GavCo
Score : 36 points
Date : 2025-07-20 16:33 UTC (6 hours ago)
(HTM) web link (assaf-pinhasi.medium.com)
(TXT) w3m dump (assaf-pinhasi.medium.com)
| rhaps0dy wrote:
| Sutton was talking about progress in AI overall, whereas Pinhasi
| (OP) is talking about building one model for production right
| now. Of course adding some hand-coded knowledge is essential for
| the latter, but it has not provided much long-term progress.
| (Even CNNs and group-convolutional NNs, which seek to encode
| invariants to increase efficiency while still doing almost only
| learning, seem to be on the way out)
| aabhay wrote:
| The main problem with the "Bitter Lesson" is that there's
| something even bitter-er behind it -- the "Harsh Reality": while
| we may scale models on compute and data, simply pouring in tons
| of data without any sort of curation yields essentially garbage
| models.
|
| The "Harsh Reality" is that while you may only need data, the
| current best models and companies behind them spend enormously on
| gathering high quality labeled data with extensive oversight and
| curation. This curation is of course being partially automated as
| well, but ultimately there's billions or even tens of billions of
| dollars flowing into gathering, reviewing, and processing
| subjectively high quality data.
|
| Interestingly, at the time the essay was published, the harsh
| reality was not so harsh. For example, in things like face
| detection, (actual) next-word prediction, and other purely self-
| supervised models (not instruction-tuned or "Chat"-style), data
| was truly all you needed. You didn't need "good" faces. As long
| as it was indeed a face, the data itself was enough. Now, it's
| not. In order to make these machines useful and not just function
| approximators, we need extremely large dataset curation
| industries.
|
| If you learned the bitter lesson, you better accept the harsh
| reality, too.
| bobbiechen wrote:
| So true. I recently wrote about how Merlin achieved magical
| bird identification not through better algorithms, but better
| expertise in creating great datasets:
| https://digitalseams.com/blog/what-birdsong-and-backends-can...
|
| I think "harsh reality" is one way to look at it, but you can
| also take an optimistic perspective: you really can achieve
| great, magical experiences by putting in (what could be
| considered) unreasonable effort.
| mhuffman wrote:
| Thanks for the intro to Merlin! I just went outside of my
| house and used it on 5 different types of birds and it helped
| me identify 100% of them. Relevant (possibly out-of-date) xkcd
| comic
|
| [0]https://xkcd.com/1425/
| Xymist wrote:
| Relevant - and old enough that those five years have been
| successfully granted!
| pphysch wrote:
| Another name for gathering and curating high-quality datasets
| is "science". One would hope "AI pioneer" USA would embrace
| this harsh reality and invest massively in basic science
| education and infrastructure. But we are seeing the opposite,
| and basically no awareness of this "harsh reality" amid the AI
| hype...
| vineyardmike wrote:
| While I agree with you, it's worth noting that current LLM
| training uses a significant percentage of all available written
| data. The transition from GPT-2 era models to now (GPT-3+) was
| the transition from novelty models that could kinda imitate
| speech to models that can converse, write code, and use tools.
| It was only _after_ the readily available data was exhausted
| that further gains came from curation and large amounts of
| synthetic data.
| aabhay wrote:
| Transfer learning isn't about "exhausting" all available
| uncurated data, it's simply that the systems are large enough
| to support it. There's not that much of a reason to train on
| all available data. And it isn't all of it anyway; there's
| still very significant filtering happening. For example, they
| don't train on petabytes of log files; that would just be
| terribly uninteresting data.
| Calavar wrote:
| > The transition from GPT-2 era models to now (GPT-3+) saw
| the transition from novel models that can kinda imitate
| speech to models that can converse, write code, and use
| tools.
|
| Which is fundamentally about data. OpenAI invested an absurd
| amount of money to get the human annotations to drive RLHF.
|
| RLHF itself is a very vanilla reinforcement learning algo +
| some branding/marketing.
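|
| For concreteness, the usual InstructGPT-style RLHF objective is
| just KL-regularized reward maximization (a sketch of the
| standard formulation, not OpenAI's exact training loss):
|
|     maximize over pi:
|         E_{x ~ prompts, y ~ pi(.|x)} [ r(x, y) ]
|             - beta * KL( pi(.|x) || pi_ref(.|x) )
|
| where r is the reward model fit to the human preference
| annotations, pi_ref is the supervised starting policy, and the
| optimization itself is ordinary PPO. The expense is in the
| preference data behind r.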
| v9v wrote:
| I think your comment has some threads in common with Rodney
| Brooks' response: https://rodneybrooks.com/a-better-lesson/
| macawfish wrote:
| In my opinion the useful part of "the bitter lesson" has nothing
| to do with throwing more compute and more data at stuff. It has
| to do with actually using ML instead of trying to manually and
| cleverly tweak stuff, and with effectively leveraging the data
| you have as part of that (again using more ML) rather than
| trying to manually label everything.
| rdw wrote:
| The bitter lesson is becoming misunderstood as the world moves
| on. Unstated yet core to it is that AI researchers were
| historically attempting to build an understanding of human
| intelligence. They intended to, piece-by-piece, assemble a human
| brain and thus be able to explain (and fix) our own biological
| ones. Much as can be done with physical simulations of knee
| joints. Of course, you can also use that knowledge to create
| useful thinking machines, because you understand it well enough
| to be able to control it. Much like how we have many robotic
| joints.
|
| So, the bitter lesson is based on a disappointment that you're
| building intelligence without understanding why it works.
| DoctorOetker wrote:
| Right, like discovering Huygens principle, or interference,
| integrals/sums of all paths in physics.
|
| It is not because a whole lot of physical phenomena can be
| explained by a couple of foundational principles, that
| understanding those core patterns automatically endows one with
| an understanding of how and why materials refract light and a
| plethora of other specific effects... effects worth
| understanding individually, even if still explained in terms of
| those foundational concepts.
|
| Knowing a complicated set of axioms or postulates enables one to
| derive theorems from them, but those implied theorem proofs are
| nonetheless non-trivial, and have a value of their own (even
| though they can be expressed and expanded into a DAG of
| applications of that "bitterly minimal" axiomatization).
|
| Once enough patterns are correctly modeled by machines, and
| given enough time to analyze them, people will eventually
| discover better explanations of how and why things work (beyond
| the merely abstract knowledge that latent parameters were
| fitted against a loss function).
|
| In some sense deeper understanding has already come for the
| simpler models like word2vec, where many papers have analyzed
| and explained relations between word vectors. This too lagged
| behind the creation and utilization of word vector embeddings.
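|
| A quick illustration of the kind of relation those papers study
| (a sketch assuming gensim and its downloadable pretrained GloVe
| vectors; the specific model name is just an example):
|
|     import gensim.downloader as api
|
|     # small pretrained 50-d GloVe embedding (downloads on first use)
|     vecs = api.load("glove-wiki-gigaword-50")
|
|     # the classic analogy: king - man + woman ~= queen
|     print(vecs.most_similar(positive=["king", "woman"],
|                             negative=["man"], topn=3))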
|
| It is not inconceivable that someday someone observes an
| analogy between, say, QKV tensors and triples resulting from
| graph linearization: think subject, object, predicate (even
| though I hate those triples: try modeling a ternary relation
| like 2+5=7 with SOP triples; they're really only meant to
| capture "sky - is - blue" associations. A better type of triple
| would be player-role-act triples; one can then model ternary
| relations, but one needs to reify the relation).
|
| Similarly, without mathematical training, humans display
| awareness of the concepts of sets, membership, existence, ...
| without a formal system. The chatbots display this awareness.
| It's all vague naive set theory. But _how_ are DNNs modeling
| set theory? That's a paper for someday.
| ta8645 wrote:
| > you're building intelligence without understanding why it
| works.
|
| But if we do a good enough job of that, it should then be able
| to explain to us why it works (after it does some
| research/science on itself). Yes?
| samrus wrote:
| A bit fantastical. We are a general intelligence and we don't
| understand ourselves.
| godelski wrote:
| I'm not sure if the Bitter Lesson is wrong; I think we'd need
| clarification from Sutton (does someone have this?).
|
| But I do know "Scale is All You Need" is wrong. And VERY wrong.
|
| Scaling has done a lot. Without a doubt it is very useful. But
| this is a drastic oversimplification of all the work that has
| happened over the last 10-20 years. ConvNeXt and "ResNets Strike
| Back" didn't take off for reasons, despite being very impressive.
| There's been a lot of algorithmic changes, a lot of changes to
| training procedures, a lot of changes to _how_ we collect
| data[0], and more.
|
| We have to be very honest: you can't just buy your way to AGI.
| There's still innovation that needs to be done. This is great for
| anyone still looking to get into the space. The game isn't close
| to being over. I'd argue that this is great for investors too, as
| there are a lot of techniques looking to try themselves at scale.
| Your unicorns are going to be over here. A dark horse isn't a
| horse that just looks like every other horse. It might be a
| "safer" bet, but that's like betting on amateur jockeys and
| horses that just train similarly to professionals. They have to
| do a lot of
| catch-up, even if the results are fairly certain. At that point
| you're not investing in the tech, you're investing in the person
| or the market strategy.
|
| [0] Okay, I'll buy this one as scale if we really want to argue
| that these changes are about scaling data effectively but we also
| look at smaller datasets differently because of these lessons.
| roadside_picnic wrote:
| "The Bitter Lesson" certainly seems correct when applied to
| whatever the limit of the current state of the art is, but in
| practice solving day-to-day ML problems, outside of FAANG-style
| companies and cutting edge research, data is always much more
| constrained.
|
| I have, multiple times in my career, solved a problem using
| simple, intelligible models that have empirically outperformed
| neural models ultimately because there was not enough data for
| the neural approach to learn anything. As a community we tend to
| obsess over architecture and then infrastructure, but data is
| often the real limiting factor.
|
| When I was early in my career I used to always try to apply very
| general, data-hungry models to all my problems... with very mixed
| success. As I became more skilled I started to be a staunch
| advocate of only using simple models you could understand, with
| much more successful results (which is what led to this revised
| opinion). But, at this point in my career, I increasingly see
| that one should basically approach modeling more information-
| theoretically: try to figure out the model whose channel
| capacity best matches your information rate.
|
| As a Bayesian, I also think there's a very reasonable explanation
| for why "The Bitter Lesson" rings true over and over again. In ET
| Jaynes' writing he often talks about Bayes' Theorem in terms of
| P(D|H) (i.e. probability of the _D_ata given the _H_ypothesis, or
| vice versa), but, especially in the earlier chapters,
| purposefully adds an X to that equation: P(D|H,X) where X is a
| stand in for _all of our prior information about the world_.
| Typically we think of prior data as being literal data, but
| Jaynes points out that our entire world of understanding is also
| part of our prior context.
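|
| Spelled out, Jaynes' version just carries that X along
| everywhere (a restatement of the textbook form, nothing new):
|
|     P(H | D, X) = P(D | H, X) * P(H | X) / P(D | X)
|
| Dropping the X doesn't make it go away; it only hides the
| background knowledge the whole calculation is conditioned on.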
|
| In this view, models that "leverage human understanding" (i.e.
| are fully intelligible) are essentially throwing out information
| at the limit. But to my earlier point, if the data falls quite
| short of that limit, then those intelligible models are _adding_
| information in data constrained scenarios. I think the challenge
| in practical application is figuring out where the threshold is
| that you need to adopt a more general approach.
|
| Currently I'm very much in love with Gaussian Processes, which,
| for constrained data environments, offer a powerful combination
| of both of these methods. You can give the model prior hints
| about what things should look like via the relative structure of
| the kernel and its priors (e.g. there should be some roughly
| annual seasonal component, and one roughly weekly seasonal
| component) but otherwise let the data decide.
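|
| A minimal sketch of that kind of kernel (assuming scikit-learn;
| the data and hyperparameters here are made up for illustration):
|
|     import numpy as np
|     from sklearn.gaussian_process import GaussianProcessRegressor
|     from sklearn.gaussian_process.kernels import (
|         RBF, ExpSineSquared, WhiteKernel)
|
|     # two years of noisy daily observations with yearly + weekly cycles
|     rng = np.random.default_rng(0)
|     t = np.arange(730.0).reshape(-1, 1)
|     y = (np.sin(2 * np.pi * t / 365.25)
|          + 0.3 * np.sin(2 * np.pi * t / 7)).ravel()
|     y = y + 0.1 * rng.standard_normal(730)
|
|     kernel = (
|         1.0 * ExpSineSquared(periodicity=365.25)  # "roughly annual"
|         + 1.0 * ExpSineSquared(periodicity=7.0)   # "roughly weekly"
|         + RBF(length_scale=30.0)                  # slow non-periodic drift
|         + WhiteKernel(noise_level=0.1)            # observation noise
|     )
|     gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
|     gp.fit(t, y)  # data refines periods, length scales, and noise level
|     mean, std = gp.predict(t, return_std=True)
|
| The kernel structure encodes the prior hint ("something annual,
| something weekly") while the hyperparameters are still learned
| from the data.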
| littlestymaar wrote:
| The Leela Chess Zero vs Stockfish case also offers an interesting
| perspective on the bitter lesson.
|
| Here's my (maybe a bit loose) recollection of what happened:
|
| Step 1. Stockfish was the typical human-knowledge AI, with tons
| of actual chess knowledge injected in the process of building an
| efficient chess engine.
|
| Step 2. Then came Leela Chess Zero, with its AlphaZero-inspired
| training: a chess engine trained fully with RL, with no prior
| chess knowledge added. And it beat Stockfish. This was a
| "bitter lesson" moment.
|
| Step 3. The Stockfish devs added a neural network trained with RL
| to their chess engine, _in addition to their existing
| heuristics_. And Stockfish easily took back its crown.
|
| Yes, throwing more compute at a problem is an effective way to
| solve it, but if all you have is compute, you'll almost certainly
| lose to somebody who has both compute and knowledge.
| symbolicAGI wrote:
| The Stockfish chess engine example nails it.
|
| For AI researchers, the Bitter Lesson is not to rely on
| supervised learning, manual data labeling, manual ontologies,
| or manual business rules,
|
| nor on *manually coded* AI systems, except as the bootstrap
| code.
|
| Unsupervised methods prevail, even if compute-expensive.
|
| The challenge from Sutton's Bitter Lesson for AI researchers is
| to develop sufficient unsupervised methods for learning and AI
| self-improvement.
| grillitoazul wrote:
| Perhaps the Bitter Lesson (more data and compute triumph over
| intelligent algorithms) is true only within a certain range, and
| a new intelligent algorithm is needed to go beyond it. For
| example, the nearest-neighbors algorithm obeys the Bitter Lesson
| or not depending on the relation between the grid of data points
| and the complexity of the problem.
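|
| A toy sketch of that range effect (plain NumPy; the "complexity"
| knob here is just the frequency of a synthetic decision
| boundary, chosen for illustration):
|
|     import numpy as np
|
|     def one_nn_error(n_train, freq, seed=0):
|         rng = np.random.default_rng(seed)
|         x_tr = rng.uniform(0, 1, n_train)   # the "grid" of data points
|         x_te = rng.uniform(0, 1, 1000)
|         label = lambda x: (np.sin(2 * np.pi * freq * x) > 0).astype(int)
|         # 1-NN: copy the label of the closest training point
|         nearest = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
|         return float(np.mean(label(x_tr)[nearest] != label(x_te)))
|
|     for n in (10, 100, 1000, 10000):
|         print(n, one_nn_error(n, freq=25))
|
| With the frequency fixed, piling on data drives the error down
| (the Bitter Lesson regime); raise the complexity faster than the
| data grows and brute-force neighbors stop being enough.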
| TheDudeMan wrote:
| The Bitter Lesson is saying, if you're going to use human
| knowledge, be aware that your solution is temporary. It's not
| wrong. And it's not wrong to use human knowledge to solve your
| "today" problem.
| grillitoazul wrote:
| It should be noted that "The Bitter Lesson" is not a general
| principle. For example, LLMs are not able to sum 1000-digit
| numbers, but a Python program with a few lines can do it. Also,
| we provide tools (algorithms) to LLMs. These examples show that
| "The Bitter Lesson" does not hold beyond a certain context.
| rhet0rica wrote:
| From TFA:
|
| > No machine learning model was ever built using pure "human
| knowledge" -- because then it wouldn't be a learning model. It
| would be a hard coded algorithm.
|
| I guess the author hasn't heard of expert systems? Systems like
| MYCIN (https://en.wikipedia.org/wiki/Mycin) were heralded as
| incredible leaps forward at the time, and they indeed consisted
| of pure "human knowledge."
|
| I am disturbed whenever a thinkpiece is written by someone who
| obviously didn't do their research.
| suddenlybananas wrote:
| Expert systems aren't a machine learning approach even if
| they're an AI approach.
___________________________________________________________________
(page generated 2025-07-20 23:01 UTC)