[HN Gopher] "The Bitter Lesson" is wrong. Well sort of
___________________________________________________________________
"The Bitter Lesson" is wrong. Well sort of
Author : GavCo
Score : 36 points
Date : 2025-07-20 16:33 UTC (6 hours ago)
(HTM) web link (assaf-pinhasi.medium.com)
(TXT) w3m dump (assaf-pinhasi.medium.com)
| rhaps0dy wrote:
| Sutton was talking about progress in AI overall, whereas Pinhasi
| (OP) is talking about building one model for production right
| now. Of course adding some hand-coded knowledge is essential for
| the latter, but it has not provided much long-term progress.
| (Even CNNs and group-convolutional NNs, which seek to encode
| invariants to increase efficiency while still doing almost only
| learning, seem to be on the way out)
| aabhay wrote:
| The main problem with the "Bitter Lesson" is that there's
| something even bitter-er behind it -- the "Harsh Reality": while
| we may scale models on compute and data, simply pouring in tons
| of data without any sort of curation yields essentially garbage
| models.
|
| The "Harsh Reality" is that while you may only need data, the
| current best models and companies behind them spend enormously on
| gathering high quality labeled data with extensive oversight and
| curation. This curation is of course being partially automated as
| well, but ultimately there's billions or even tens of billions of
| dollars flowing into gathering, reviewing, and processing
| subjectively high quality data.
|
| Interestingly, at the time the essay was published, the harsh
| reality was not so harsh. For example, in things like face
| detection, (actual) next-word prediction, and other purely self-
| supervised models (not instruction-tuned or "Chat"-style), data
| was truly all you needed. You didn't need "good" faces. As long
| as it was indeed a face, the data itself was enough. Now, it's
| not. In order to make these machines useful and not just function
| approximators, we need extremely large dataset curation
| industries.
|
| If you learned the bitter lesson, you better accept the harsh
| reality, too.
| bobbiechen wrote:
| So true. I recently wrote about how Merlin achieved magical
| bird identification not through better algorithms, but better
| expertise in creating great datasets:
| https://digitalseams.com/blog/what-birdsong-and-backends-can...
|
| I think "harsh reality" is one way to look at it, but you can
| also take an optimistic perspective: you really can achieve
| great, magical experiences by putting in (what could be
| considered) unreasonable effort.
| mhuffman wrote:
| Thanks for the intro to Merlin! I just went outside of my
| house and used it on 5 different types of birds and it helped
| me identify 100% of them. Relevant (possibly out-of-date) xkcd
| comic
|
| [0]https://xkcd.com/1425/
| Xymist wrote:
| Relevant - and old enough that those five years have been
| successfully granted!
| pphysch wrote:
| Another name for gathering and curating high-quality datasets
| is "science". One would hope "AI pioneer" USA would embrace
| this harsh reality and invest massively in basic science
| education and infrastructure. But we are seeing the opposite,
| and basically no awareness of this "harsh reality" amid the AI
| hype...
| vineyardmike wrote:
| While I agree with you, it's worth noting that current LLM
| training uses a significant percentage of all available written
| data. The transition from GPT-2 era models to now (GPT-3+) was
| the transition from novelty models that could kinda imitate
| speech to models that can converse, write code, and use tools.
| It was only _after_ the readily available data was exhausted
| that further gains came from curation and large amounts of
| synthetic data.
| aabhay wrote:
| Transfer learning isn't about "exhausting" all available
| uncurated data, it's simply that the systems are large enough
| to support it. There's not that much of a reason to train on
| all available data. And it isn't all of it anyway; there's
| still very significant filtering happening. For example, they
| don't train on petabytes of log files; that would just be
| terribly uninteresting data.
| Calavar wrote:
| > The transition from GPT-2 era models to now (GPT-3+) saw
| the transition from novel models that can kinda imitate
| speech to models that can converse, write code, and use
| tools.
|
| Which is fundamentally about data. OpenAI invested an absurd
| amount of money to get the human annotations to drive RLHF.
|
| RLHF itself is a very vanilla reinforcement learning algo +
| some branding/marketing.
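|
| For concreteness, the usual InstructGPT-style RLHF objective is
| just KL-regularized reward maximization (a sketch of the
| standard formulation, not OpenAI's exact training loss):
|
|     maximize over pi:
|         E_{x ~ prompts, y ~ pi(.|x)} [ r(x, y) ]
|             - beta * KL( pi(.|x) || pi_ref(.|x) )
|
| where r is the reward model fit to the human preference
| annotations, pi_ref is the supervised starting policy, and the
| optimization itself is ordinary PPO. The expense is in the
| preference data behind r.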
| v9v wrote:
| I think your comment has some threads in common with Rodney
| Brooks' response: https://rodneybrooks.com/a-better-lesson/
| macawfish wrote:
| In my opinion the useful part of "the bitter lesson" has nothing
| to do with throwing more compute and more data at stuff. It has
| to do with actually using ML instead of trying to manually and
| cleverly tweak stuff, and with effectively leveraging the data
| you have as part of that (again using more ML) rather than
| trying to manually label everything.
| rdw wrote:
| The bitter lesson is becoming misunderstood as the world moves
| on. Unstated yet core to it is that AI researchers were
| historically attempting to build an understanding of human
| intelligence. They intended to, piece-by-piece, assemble a human
| brain and thus be able to explain (and fix) our own biological
| ones. Much as can be done with physical simulations of knee
| joints. Of course, you can also use that knowledge to create
| useful thinking machines, because you understand it well enough
| to be able to control it. Much like how we have many robotic
| joints.
|
| So, the bitter lesson is based on a disappointment that you're
| building intelligence without understanding why it works.
| DoctorOetker wrote:
| Right, like discovering Huygens principle, or interference,
| integrals/sums of all paths in physics.
|
| It is not because a whole lot of physical phenomena can be
| explained by a couple of foundational principles, that
| understanding those core patterns automatically endows one with
| an understanding of how and why materials refract light and a
| plethora of other specific effects... effects worth
| understanding individually, even if still explained in terms of
| those foundational concepts.
|
| Knowing a complicated set of axioms or postulates enables one to
| derive theorems from them, but those implied theorem proofs are
| nonetheless non-trivial, and have a value of their own (even
| though they can be expressed and expanded into a DAG of
| applications of that "bitterly minimal" axiomatization).
|
| Once enough patterns are correctly modeled by machines, and
| given enough time to analyze them, people will eventually
| discover better explanations of how and why things work (beyond
| the merely abstract knowledge that latent parameters were
| fitted against a loss function).
|
| In some sense deeper understanding has already come for the
| simpler models like word2vec, where many papers have analyzed
| and explained relations between word vectors. This too lagged
| behind the creation and utilization of word vector embeddings.
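|
| A quick illustration of the kind of relation those papers study
| (a sketch assuming gensim and its downloadable pretrained GloVe
| vectors; the specific model name is just an example):
|
|     import gensim.downloader as api
|
|     # small pretrained 50-d GloVe embedding (downloads on first use)
|     vecs = api.load("glove-wiki-gigaword-50")
|
|     # the classic analogy: king - man + woman ~= queen
|     print(vecs.most_similar(positive=["king", "woman"],
|                             negative=["man"], topn=3))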
|
| It is not inconceivable that someday someone observes an
| analogy between, say, QKV tensors and triples resulting from
| graph linearization: think subject, object, predicate (even
| though I hate those triples: try modeling a ternary relation
| like 2+5=7 with SOP triples; they're really only meant to
| capture "sky - is - blue" associations. A better type of triple
| would be player-role-act triples; one can then model ternary
| relations, but one needs to reify the relation).
|
| Similarly, without mathematical training, humans display
| awareness of the concepts of sets, membership, existence, ...
| without a formal system. The chatbots display this awareness.
| It's all vague naive set theory. But _how_ are DNNs modeling
| set theory? That's a paper for someday.
| ta8645 wrote:
| > you're building intelligence without understanding why it
| works.
|
| But if we do a good enough job of that, it should then be able
| to explain to us why it works (after it does some
| research/science on itself). Yes?
| samrus wrote:
| A bit fantastical. We are a general intelligence and we don't
| understand ourselves.
| godelski wrote:
| I'm not sure if the Bitter Lesson is wrong; I think we'd need
| clarification from Sutton (does someone have this?).
|
| But I do know "Scale is All You Need" is wrong. And VERY wrong.
|
| Scaling has done a lot. Without a doubt it is very useful. But
| this is a drastic oversimplification of all the work that has
| happened over the last 10-20 years. ConvNeXt and "ResNets Strike
| Back" didn't take off for reasons, despite being very impressive.
| There's been a lot of algorithmic changes, a lot of changes to
| training procedures, a lot of changes to _how_ we collect
| data[0], and more.
|
| We have to be very honest: you can't just buy your way to AGI.
| There's still innovation that needs to be done. This is great for
| anyone still looking to get into the space. The game isn't close
| to being over. I'd argue that this is great for investors too, as
| there are a lot of techniques looking to try themselves at scale.
| Your unicorns are going to be over here. A dark horse isn't a
| horse that just looks like every other horse. It might be a
| "safer" bet, but that's like betting on amateur jockeys and
| horses that just train similarly to professionals. They have to
| do a lot of
| catch-up, even if the results are fairly certain. At that point
| you're not investing in the tech, you're investing in the person
| or the market strategy.
|
| [0] Okay, I'll buy this one as scale if we really want to argue
| that these changes are about scaling data effectively but we also
| look at smaller datasets differently because of these lessons.
| roadside_picnic wrote:
| "The Bitter Lesson" certainly seems correct when applied to
| whatever the limit of the current state of the art is, but in
| practice solving day-to-day ML problems, outside of FAANG-style
| companies and cutting edge research, data is always much more
| constrained.
|
| I have, multiple times in my career, solved a problem using
| simple, intelligible models that have empirically outperformed
| neural models ultimately because there was not enough data for
| the neural approach to learn anything. As a community we tend to
| obsess over architecture and then infrastructure, but data is
| often the real limiting factor.
|
| When I was early in my career I used to always try to apply very
| general, data-hungry models to all my problems... with very mixed
| success. As I became more skilled I started to be a staunch
| advocate of only using simple models you could understand, with
| much more successful results (which is what led to this revised
| opinion). But, at this point in my career, I increasingly see
| that one should basically approach modeling more information-
| theoretically: try to figure out the model whose channel
| capacity best matches your information rate.
|
| As a Bayesian, I also think there's a very reasonable explanation
| for why "The Bitter Lesson" rings true over and over again. In ET
| Jaynes' writing he often talks about Bayes' Theorem in terms of
| P(D|H) (i.e. probability of the _D_ata given the _H_ypothesis, or
| vice versa), but, especially in the earlier chapters,
| purposefully adds an X to that equation: P(D|H,X) where X is a
| stand in for _all of our prior information about the world_.
| Typically we think of prior data as being literal data, but
| Jaynes points out that our entire world of understanding is also
| part of our prior context.
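|
| Spelled out, Jaynes' version just carries that X along
| everywhere (a restatement of the textbook form, nothing new):
|
|     P(H | D, X) = P(D | H, X) * P(H | X) / P(D | X)
|
| Dropping the X doesn't make it go away; it only hides the
| background knowledge the whole calculation is conditioned on.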
|
| In this view, models that "leverage human understanding" (i.e.
| are fully intelligible) are essentially throwing out information
| at the limit. But to my earlier point, if the data falls quite
| short of that limit, then those intelligible models are _adding_
| information in data constrained scenarios. I think the challenge
| in practical application is figuring out where the threshold is
| that you need to adopt a more general approach.
|
| Currently I'm very much in love with Gaussian Processes, which,
| for constrained data environments, offer a powerful combination
| of both of these methods. You can give the model prior hints
| about what things should look like via the relative structure of
| the kernel and its priors (e.g. there should be some roughly
| annual seasonal component, and one roughly weekly seasonal
| component) but otherwise let the data decide.
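|
| A minimal sketch of that kind of kernel (assuming scikit-learn;
| the data and hyperparameters here are made up for illustration):
|
|     import numpy as np
|     from sklearn.gaussian_process import GaussianProcessRegressor
|     from sklearn.gaussian_process.kernels import (
|         RBF, ExpSineSquared, WhiteKernel)
|
|     # two years of noisy daily observations with yearly + weekly cycles
|     rng = np.random.default_rng(0)
|     t = np.arange(730.0).reshape(-1, 1)
|     y = (np.sin(2 * np.pi * t / 365.25)
|          + 0.3 * np.sin(2 * np.pi * t / 7)).ravel()
|     y = y + 0.1 * rng.standard_normal(730)
|
|     kernel = (
|         1.0 * ExpSineSquared(periodicity=365.25)  # "roughly annual"
|         + 1.0 * ExpSineSquared(periodicity=7.0)   # "roughly weekly"
|         + RBF(length_scale=30.0)                  # slow non-periodic drift
|         + WhiteKernel(noise_level=0.1)            # observation noise
|     )
|     gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
|     gp.fit(t, y)  # data refines periods, length scales, and noise level
|     mean, std = gp.predict(t, return_std=True)
|
| The kernel structure encodes the prior hint ("something annual,
| something weekly") while the hyperparameters are still learned
| from the data.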
| littlestymaar wrote:
| The Leela Chess Zero vs Stockfish case also offers an interesting
| perspective on the bitter lesson.
|
| Here's my (maybe a bit loose) recollection of what happened:
|
| Step 1. Stockfish was the typical human-knowledge AI, with tons
| of actual chess knowledge injected in the process of building an
| efficient chess engine.
|
| Step 2. Then came Leela Chess Zero, with its AlphaZero-inspired
| training: a chess engine trained fully with RL, with no prior
| chess knowledge added. And it beat Stockfish. This was a
| "bitter lesson" moment.
|
| Step 3. The Stockfish devs added a neural network trained with RL
| to their chess engine, _in addition to their existing
| heuristics_. And Stockfish easily took back its crown.
|
| Yes, throwing more compute at a problem is an effective way to
| solve it, but if all you have is compute, you'll almost certainly
| lose to somebody who has both compute and knowledge.
| symbolicAGI wrote:
| The Stockfish chess engine example nails it.
|
| For AI researchers, the Bitter Lesson is not to rely on
| supervised learning, manual data labeling, manual ontologies,
| or manual business rules,
|
| nor on *manually coded* AI systems, except as the bootstrap
| code.
|
| Unsupervised methods prevail, even if compute-expensive.
|
| The challenge from Sutton's Bitter Lesson for AI researchers is
| to develop sufficient unsupervised methods for learning and AI
| self-improvement.
| grillitoazul wrote:
| Perhaps the Bitter Lesson (more data and compute triumph over
| intelligent algorithms) is true only within a certain range, and
| a new intelligent algorithm is needed to go beyond it. For
| example, the nearest-neighbors algorithm obeys the Bitter Lesson
| or not depending on the relation between the grid of data points
| and the complexity of the problem.
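|
| A toy sketch of that range effect (plain NumPy; the "complexity"
| knob here is just the frequency of a synthetic decision
| boundary, chosen for illustration):
|
|     import numpy as np
|
|     def one_nn_error(n_train, freq, seed=0):
|         rng = np.random.default_rng(seed)
|         x_tr = rng.uniform(0, 1, n_train)   # the "grid" of data points
|         x_te = rng.uniform(0, 1, 1000)
|         label = lambda x: (np.sin(2 * np.pi * freq * x) > 0).astype(int)
|         # 1-NN: copy the label of the closest training point
|         nearest = np.abs(x_te[:, None] - x_tr[None, :]).argmin(axis=1)
|         return float(np.mean(label(x_tr)[nearest] != label(x_te)))
|
|     for n in (10, 100, 1000, 10000):
|         print(n, one_nn_error(n, freq=25))
|
| With the frequency fixed, piling on data drives the error down
| (the Bitter Lesson regime); raise the complexity faster than the
| data grows and brute-force neighbors stop being enough.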
| TheDudeMan wrote:
| The Bitter Lesson is saying, if you're going to use human
| knowledge, be aware that your solution is temporary. It's not
| wrong. And it's not wrong to use human knowledge to solve your
| "today" problem.
| grillitoazul wrote:
| It should be noted that "The Bitter Lesson" is not a general
| principle. For example, LLMs are not able to sum 1000-digit
| numbers, but a Python program with a few lines can do it. Also,
| we provide tools (algorithms) to LLMs. These examples show that
| "The Bitter Lesson" does not hold beyond a certain context.
| rhet0rica wrote:
| From TFA:
|
| > No machine learning model was ever built using pure "human
| knowledge" -- because then it wouldn't be a learning model. It
| would be a hard coded algorithm.
|
| I guess the author hasn't heard of expert systems? Systems like
| MYCIN (https://en.wikipedia.org/wiki/Mycin) were heralded as
| incredible leaps forward at the time, and they indeed consisted
| of pure "human knowledge."
|
| I am disturbed whenever a thinkpiece is written by someone who
| obviously didn't do their research.
| suddenlybananas wrote:
| Expert systems aren't a machine learning approach even if
| they're an AI approach.
___________________________________________________________________
(page generated 2025-07-20 23:01 UTC)