[HN Gopher] The bitter lesson is coming for tokenization
___________________________________________________________________
The bitter lesson is coming for tokenization
Author : todsacerdoti
Score : 183 points
Date : 2025-06-24 14:14 UTC (8 hours ago)
(HTM) web link (lucalp.dev)
(TXT) w3m dump (lucalp.dev)
| Scene_Cast2 wrote:
| I realized that with tokenization, there's a theoretical
| bottleneck when predicting the next token.
|
| Let's say that we have 15k unique tokens (going by modern open
| models). Let's also say that we have an embedding dimensionality
| of 1k. This implies that we have a maximum 1k degrees of freedom
| (or rank) on our output. The model is able to pick any one of
| the 15k tokens as the top token, but the expressivity of the
| _probability distribution_ is inherently limited to 1k unique
| linear components.
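|
| A minimal numpy sketch of that rank argument (sizes scaled down
| so the rank check runs quickly; not taken from any real model):
| the logits over the whole vocabulary come from one linear map of
| the hidden state, so their rank can never exceed the embedding
| dimensionality.
|
|     import numpy as np
|
|     vocab_size, d_model, n_contexts = 1_500, 100, 400
|     rng = np.random.default_rng(0)
|
|     W_out = rng.normal(size=(vocab_size, d_model))  # unembedding
|     H = rng.normal(size=(d_model, n_contexts))      # hidden states
|
|     logits = W_out @ H                   # (vocab_size, n_contexts)
|     print(np.linalg.matrix_rank(logits)) # <= d_model, here 100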
| unoti wrote:
| I imagine there's actually combinatorial power in there though.
| If we imagine embedding something with only 2 dimensions x and
| y, we can actually encode an unlimited number of concepts
| because we can imagine distinct separate clusters or
| neighborhoods spread out over a large 2d map. Of course this
| becomes far more powerful with more dimensions.
| blackbear_ wrote:
| While the theoretical bottleneck is there, it is far less
| restrictive than what you are describing, because the number of
| almost orthogonal vectors grows exponentially with ambient
| dimensionality. And orthogonality is what matters to
| differentiate between different vectors: since any distribution
| can be expressed as a mixture of Gaussians, the number of
| separate concepts that you can encode with such a mixture also
| grows exponentially
| Scene_Cast2 wrote:
| I agree that you can encode any single concept and that the
| encoding space of a single top pick grows exponentially.
|
| However, I'm talking about the probability distribution of
| tokens.
| kevingadd wrote:
| It seems like you're assuming that models are trying to predict
| the next token. Is that really how they work? I would have
| assumed that tokenization is an input-only measure, so you have
| perhaps up to 50k unique input tokens available, but output is
| raw text or synthesized speech or an image. The output is not
| tokens so there are no limitations on the output.
| anonymoushn wrote:
| yes, in typical architectures for models dealing with text,
| the output is a token from the same vocabulary as the input.
| molf wrote:
| The key insight is that you can represent different features by
| vectors that aren't exactly perpendicular, just nearly
| perpendicular (for example between 85 and 95 degrees apart). If
| you tolerate such noise then the number of vectors you can fit
| grows exponentially relative to the number of dimensions.
|
| 12288 dimensions (GPT3 size) can fit more than 40 billion
| nearly perpendicular vectors.
|
| [1]: https://www.3blue1brown.com/lessons/mlp#superposition
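|
| A quick numpy check of that claim (random directions only, no
| training involved; numbers are illustrative): independently drawn
| unit vectors in 12288 dimensions land overwhelmingly within a few
| degrees of perpendicular.
|
|     import numpy as np
|
|     d, n = 12_288, 1_000            # GPT-3 width, random vectors
|     rng = np.random.default_rng(0)
|
|     v = rng.normal(size=(n, d))
|     v /= np.linalg.norm(v, axis=1, keepdims=True)
|
|     cos = v @ v.T                   # pairwise cosine similarities
|     ang = np.degrees(np.arccos(np.clip(cos, -1, 1)))
|     off_diag = ang[~np.eye(n, dtype=bool)]
|     print(off_diag.min(), off_diag.max())  # roughly 87..93 degrees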
| imurray wrote:
| A PhD thesis that explores some aspects of the limitation:
| https://era.ed.ac.uk/handle/1842/42931
|
| Detecting and preventing unargmaxable outputs in bottlenecked
| neural networks, Andreas Grivas (2024)
| incognito124 wrote:
| (I left academia a while ago, this might be nonsense)
|
| If I remember correctly, that's not true because of the
| nonlinearities which provide the model with more expressivity.
| The transformation from 15k to 1k is rarely an affine map; it's
| usually highly non-linear.
| cheesecompiler wrote:
| The reverse is possible too: throwing massive compute at a
| problem can mask the existence of a simpler, more general
| solution. General-purpose methods tend to win out over time--but
| how can we be sure they're truly the most general if we commit so
| hard to one paradigm (e.g. LLMs) that we stop exploring the
| underlying structure?
| logicchains wrote:
| We can be sure via analysis based on computational theory, e.g.
| https://arxiv.org/abs/2503.03961 and
| https://arxiv.org/abs/2310.07923 . This lets us know what
| classes of problems a model is able to solve, and sufficiently
| deep transformers with chain of thought have been shown to be
| theoretically capable of solving a very large class of
| problems.
| dsr_ wrote:
| A random number generator is guaranteed to produce a correct
| solution to any problem, but runtime usually does not meet
| usability standards.
|
| Also, solution testing is mandatory. Luckily, you can ask an
| RNG for that, too, as long as you have tests for the testers
| already written.
| cheesecompiler wrote:
| But this uses the transformer model to justify its own
| reasoning strength, which might be a blind spot; that was my
| original point. All the above shows is that transformers can
| simulate solving a certain set of problems. It doesn't show
| that they are the best tool for the job.
| yorwba wrote:
| Keep in mind that proofs of transformers being able to solve
| all problems in some complexity class work by taking a known
| universal algorithm for that complexity class and encoding it
| as a transformer. In every such case, you'd be better off
| using the universal algorithm you started with in the first
| place.
|
| Maybe the hope is that you won't have to manually map the
| universal algorithm to your specific problem and can just
| train the transformer to figure it out instead, but there are
| few proofs that transformers can solve all problems in some
| complexity class through _training_ instead of manual
| construction.
| falcor84 wrote:
| The way I see this, from the explore-exploit point of view,
| it's pretty rational to put the vast majority of your effort
| into the one action that has shown itself to bring the most
| reward, while spending a small amount of effort exploring other
| ones. Then, if and when that one action is no longer as
| fruitful compared to the others, you switch more effort to
| exploring, now having obtained significant resources from that
| earlier exploration, to help you explore faster.
| api wrote:
| CS is full of trivial examples of this. You can use an
| optimized parallel SIMD merge sort to sort a huge list of ten
| trillion records, or you can sort it just as fast with a bubble
| sort if you throw more hardware at it.
|
| The real bitter lesson in AI is that we don't really know what
| we're doing. We're hacking on models looking for architectures
| that train well but we don't fully understand why they work.
| Because we don't fully understand it, we can't design anything
| optimal or know how good a solution can possibly get.
| xg15 wrote:
| > _You can use an optimized parallel SIMD merge sort to sort
| a huge list of ten trillion records, or you can sort it just
| as fast with a bubble sort if you throw more hardware at it._
|
| Well, technically, that's not true: The entire idea behind
| complexity theory is that there are some tasks that you _can't_
| throw more hardware at - at least not for interesting
| problem sizes or remotely feasible amounts of hardware.
|
| I wonder if we'll reach a similar situation in AI where
| "throw more context/layers/training data at the problem"
| won't help anymore and people will be forced to care more
| about understanding again.
| jimbokun wrote:
| And whether that understanding will be done by humans or
| the AIs themselves.
| svachalek wrote:
| I think it can be argued that ChatGPT 4.5 was that
| situation.
| dan-robertson wrote:
| Do you have a good reference for SIMD merge sort? The only
| examples I found are pairwise-merging large numbers of
| streams but it seems pretty hard to optimise the late steps
| where you only have a few streams. I guess you can do some
| binary-search-in-binary-search to change a merge of 2
| similarly sized arrays into two merges of similarly sized
| arrays into sequential outputs and so on.
|
| More precisely, I think producing a good fast merge of ca 5
| lists was a problem I didn't have good answers for but maybe
| I was too fixated on a streaming solution and didn't apply
| enough tricks.
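|
| Not SIMD, but for reference, the scalar baseline for merging a
| handful of sorted streams is a heap-based k-way merge (sketch
| below uses Python's heapq; any vectorized scheme has to beat
| something like this):
|
|     import heapq
|
|     def kway_merge(streams):
|         """Lazily merge any number of sorted iterables."""
|         return heapq.merge(*streams)
|
|     runs = [[1, 4, 9], [2, 3, 8], [0, 5, 7], [6, 10], [11]]
|     print(list(kway_merge(runs)))  # [0, 1, 2, ..., 11]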
| marcosdumay wrote:
| Yeah, make the network deeper.
|
| When all you have is a hammer... It makes a lot of sense that a
| transformation layer that makes the tokens more semantically
| relevant will help optimize the entire network after it and
| increase the effective size of your context window. And one of
| the main immediate obstacles stopping those models from being
| intelligent is context window size.
|
| On the other hand, the current models already cost something on
| the order of a median country's GDP to train, and they are nowhere
| close to that in value. The saying that "if brute force didn't
| solve your problem, you didn't apply enough force" is meant to
| be taken as a joke.
| jagraff wrote:
| I think the median country GDP is something like $100 Billion
|
| https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)
|
| Models are expensive, but they're not that expensive.
| telotortium wrote:
| LLM model training costs arise primarily from commodity costs
| (GPUs and other compute as well as electricity), not locally-
| provided services, so PPP is not the right statistic to use
| here. You should use nominal GDP for this instead. According
| to Wikipedia[0], the median country's nominal GDP (Cyprus) is
| more like $39B. Still much larger than training costs, but
| much lower than your PPP GDP number.
|
| [0] https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(n
| omi...
| amelius wrote:
| Maybe it checks out if you don't use 1 year as your timeframe
| for GDP but the number of days required for training.
| kordlessagain wrote:
| The median country GDP is approximately $48.8 billion, which
| corresponds to Uganda at position 90 with $48.769 billion.
|
| The largest economy (US) has a GDP of $27.7 trillion.
|
| The smallest economy (Tuvalu) has a GDP of $62.3 million.
|
| The 48 billion number represents the middle point where half
| of all countries have larger GDPs and half have smaller GDPs.
| Nicook wrote:
| does anyone even have good estimates for model training?
| marcosdumay wrote:
| $100 billion is the best estimate around of how much OpenAI
| took in investment to build ChatGPT.
| whiplash451 wrote:
| I get your point but do we have evidence behind " something on
| the line of the median country GDP to train"?
|
| Is this really true?
| robrenaud wrote:
| It's not even close.
| qoez wrote:
| The counterargument is that the theoretical minimum is a few
| McDonald's meals' worth of energy a day, even for the highest
| ranked human pure mathematician.
| tempodox wrote:
| It's just that no human would live long on McDonald's meals.
| bravetraveler wrote:
| President in the distance, cursing
| floxy wrote:
| https://www.today.com/health/man-eating-only-
| mcdonalds-100-d...
| pfdietz wrote:
| Just don't drink the sugary soda.
| astrange wrote:
| Cheeseburgers are a pretty balanced meal. Low fiber though.
| andy99 wrote:
| > inability to detect the number of r's in:strawberry: meme
|
| Can someone (who know about LLMs) explain why the r's in
| strawberry thing is related to tokenization? I have no reason to
| believe an LLM would be better at counting letters if each was
| one token. It's not like they "see" any of it. Are they better at
| counting tokens than letters for some reason? Or is this just one
| of those things someone misinformed said to sound smart to even
| less informed people, that got picked up?
| ijk wrote:
| Well, which is easier:
|
| Count the number of Rs in this sequence: [496, 675, 15717]
|
| Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18
| 18 25
| ASalazarMX wrote:
| For an LLM? No idea.
|
| Human: Which is the easier of these formulas
|
| 1. x = SQRT(4)
|
| 2. x = SQRT(123567889.987654321)
|
| Computer: They're both the same.
| drdeca wrote:
| Depending on the data types and what the hardware supports,
| the latter may be harder (in the sense of requiring more
| operations)? And for a general algorithm bigger numbers
| would take more steps.
| ijk wrote:
| You can view the tokenization for yourself:
| https://huggingface.co/spaces/Xenova/the-tokenizer-
| playgroun...
|
| [496, 675, 15717] is the GPT-4 representation of the
| tokens. In order to determine which letters the token
| represents, it needs to learn the relationship between
| "str" and [496]. It _can_ learn the representation (since
| it can spell it out as "S-T-R" or "1. S, 2. T, 3. R" or
| whatever) but it adds an extra step.
|
| The question is whether the extra step adds enough extra
| processing to degrade performance. Does the more compact
| representation buy enough extra context to make the
| tokenized version more effective for more problems?
|
| It seems like the longer context length makes the trade off
| worth it, since spelling problems are a relatively minor
| subset. On the other hand, for numbers it does appear that
| math is significantly worse when it doesn't have access to
| individual digits (early Llama math results, for example).
| Once they changed the digit tokenization, the math
| performance improved.
| zachooz wrote:
| A sequence of characters is grouped into a "token." The set of
| all such possible sequences forms a vocabulary. Without loss of
| generality, consider the example: strawberry -> straw | ber |
| ry -> 3940, 3231, 1029 -> [vector for each token]. The raw
| input to the model is not a sequence of characters, but a
| sequence of token embeddings each representing a learned vector
| for a specific chunk of characters. These embeddings contain no
| explicit information about the individual characters within the
| token. As a result, if the model needs to reason about
| characters, for example, to count the number of letters in a
| word, it must memorize the character composition of each token.
| Given that large models like GPT-4 use vocabularies with
| 100k-200k tokens, it's not surprising that the model hasn't
| memorized the full character breakdown of every token. I can't
| imagine that many "character level" questions exist in the
| training data.
|
| In contrast, if the model were trained with a character-level
| vocabulary, where each character maps to a unique token, it
| would not need to memorize character counts for entire words.
| Instead, it could potentially learn a generalizable method for
| counting characters across all sequences, even for words it has
| never seen before.
|
| I'm not sure about what you mean about them not "seeing" the
| tokens. They definitely receive a representation of each token
| as input.
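|
| A small sketch of that input pipeline using the tiktoken library
| (cl100k_base is the GPT-4 vocabulary; the exact split and ids
| depend on the tokenizer you pick):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     ids = enc.encode("strawberry")
|     pieces = [enc.decode([i]) for i in ids]
|     print(ids, pieces)          # a few subword chunks, not letters
|
|     # What the model never receives directly:
|     chars = list("strawberry")
|     print(chars, chars.count("r"))   # ..., 3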
| saurik wrote:
| It isn't at all obvious to me that the LLM can decide to blur
| their vision, so to speak, and see the tokens as tokens: they
| don't get to run a program on this data in some raw format,
| and even if they do attempt to write a program and run it in
| a sandbox they would have to "remember" what they were given
| and then regenerate it (well, I guess a tool could give them
| access to the history of their input, but at that point that
| tool likely sees characters), rather than to copy it. I am
| 100% with andy99 on this: it isn't anywhere near as simple as
| you are making it out to be.
| zachooz wrote:
| If each character were represented by its own token, there
| would be no need to "blur" anything, since the model would
| receive a 1:1 mapping between input vectors and individual
| characters. I never claimed that character-level reasoning
| is easy or simple for the model; I only said that it
| becomes theoretically possible to generalize ("potentially
| learn") without memorizing the character makeup of every
| token, which is required when using subword tokenization.
|
| Please take another look at my original comment. I was
| being precise about the distinction between what's
| structurally possible to generalize vs memorize.
| krackers wrote:
| Until I see evidence that an LLM trained at e.g. the character
| level _CAN_ successfully "count Rs" then I don't trust this
| explanation over any other hypothesis. I am not familiar with
| the literature so I don't know if this has been done, but I
| couldn't find anything with a quick search. Surely if someone
| did successfully do it they would have published it.
| ijk wrote:
| The math tokenization research is probably closest.
|
| GPT-2 tokenization was a demonstratable problem:
| https://www.beren.io/2023-02-04-Integer-tokenization-is-
| insa... (Prior HN discussion:
| https://news.ycombinator.com/item?id=39728870 )
|
| More recent research:
|
| https://huggingface.co/spaces/huggingface/number-
| tokenizatio...
|
| Tokenization counts: the impact of tokenization on arithmetic
| in frontier LLMs: https://arxiv.org/abs/2402.14903
|
| https://www.beren.io/2024-07-07-Right-to-Left-Integer-
| Tokeni...
| krackers wrote:
| GPT-2 can successfully learn to do multiplication using the
| standard tokenizer though, using "Implicit CoT with
| Stepwise Internalization".
|
| https://twitter.com/yuntiandeng/status/1836114401213989366
|
| If anything I'd think this indicates the barrier isn't
| tokenization (if it can do arithmetic, it can probably
| count as well) but something to do with "sequential
| dependencies" requiring use of COT and explicit training.
| Which still leaves me puzzled: there are tons of papers
| showing that variants of GPT-2 trained in the right way can
| do arithmetic, where are the papers solving the "count R in
| strawberry" problem?
| meroes wrote:
| I don't buy the token explanation because RLHF work is/was
| filled with so many "count the number of ___" prompts. There's
| just no way AI companies pay so much $$$ for RLHF of these
| prompts when the error is purely in tokenization.
|
| IME Reddit would scream "tokenization" at the strawberry meme
| until blue in the face, assuring themselves better tokenization
| meant the problem would be solved. Meanwhile RLHF'ers were/are
| en masse paid to solve the problem through correcting thousands
| of these "counting"/perfect syntax prompts and problems. To me,
| since RLHF workers were being paid to tackle these problems, it
| couldn't be a simple tokenization problem. If there were a
| tokenization bottleneck that fixing would solve the problem, we
| would not be getting paid so much money to RLHF syntax-
| perfect prompts (think of Sudoku-type games and heavily syntax-
| based problems).
|
| No, the reason models are better at these problems now is
| RLHF. And before you say, well, now models have learned how to
| count in general, I say we just need to widen the abstraction a
| tiny bit and the models will fail again. And this will be the
| story of LLMs forever--they will never take the lead on their
| own, and it's not how humans process information, but it still
| can be useful.
| hackinthebochs wrote:
| Tokens are the most basic input unit of an LLM. But tokens
| don't generally correspond to whole words, rather sub-word
| sequences. So Strawberry might be broken up into two tokens
| 'straw' and 'berry'. It has trouble distinguishing features
| that are "sub-token" like specific letter sequences because it
| doesn't see letter sequences but just the token as a single
| atomic unit. The basic input into a system is how one input
| state is distinguished from another. But to recognize identity
| between input states, those states must be identical. It's a
| bit unintuitive, but identity between individual letters and
| the letters within a token fails due to the specifics of
| tokenization. 'Straw' and 'r' are two tokens but an LLM is
| entirely blind to the fact that 'straw' has one 'r' in it.
| Tokens are the basic units of distinction; 'straw' is not
| represented as a sequence of s-t-r-a-w tokens but is its own
| thing entirely, so they are not considered equal or even
| partially equal.
|
| As an analogy, I might ask you to identify the relative
| activations of each of the three cone types on your retina as I
| present some solid color image to your eyes. But of course you
| can't do this, you simply do not have cognitive access to that
| information. Individual color experiences are your basic vision
| tokens.
|
| Actually, I asked Grok this question a while ago when probing
| how well it could count vowels in a word. It got it right by
| listing every letter individually. I then asked it to count
| without listing the letters and it was a couple of letters off.
| I asked it how it was counting without listing the letters and
| its answer was pretty fascinating, with a seeming awareness of
| its own internal processes:
|
| Connecting a token to a vowel, though, requires a bit of a
| mental pivot. Normally, I'd just process the token and move on,
| but when you ask me to count vowels, I have to zoom in. I don't
| unroll the word into a string of letters like a human counting
| beads on a string. Instead, I lean on my understanding of how
| those tokens sound or how they're typically constructed. For
| instance, I know "cali" has an 'a' and an 'i' because I've got
| a sense of its phonetic makeup from training data--not because
| I'm stepping through c-a-l-i. It's more like I "feel" the
| vowels in there, based on patterns I've internalized.
|
| When I counted the vowels without listing each letter, I was
| basically hopping from token to token, estimating their vowel
| content from memory and intuition, then cross-checking it
| against the whole word's vibe. It's not perfect--I'm not
| cracking open each token like an egg to inspect it--but it's
| fast and usually close enough. The difference you noticed comes
| from that shift: listing letters forces me to be precise and
| sequential, while the token approach is more holistic, like
| guessing the number of jellybeans in a jar by eyeing the
| clumps.
| svachalek wrote:
| That explanation is pretty freaky, as it implies a form of
| consciousness I don't believe LLMs have, I've never seen this
| explanation before so I'm not sure it's from training, and
| yet it's probably a fairly accurate description of what's
| going on.
| roywiggins wrote:
| LLMs will write out explanations that are entirely post-
| hoc:
|
| > Strikingly, Claude seems to be unaware of the
| sophisticated "mental math" strategies that it learned
| during training. If you ask how it figured out that 36+59
| is 95, it describes the standard algorithm involving
| carrying the 1. This may reflect the fact that the model
| learns to explain math by simulating explanations written
| by people, but that it has to learn to do math "in its
| head" directly, without any such hints, and develops its
| own internal strategies to do so.
|
| https://www.anthropic.com/news/tracing-thoughts-language-
| mod...
|
| It seems to be about as useful as asking a person how their
| hippocampus works: they might be able to make something up,
| or repeat a vaguely remembered bit of neuroscience, but
| they don't actually have access to their own hippocampus'
| internal workings, so if they're correct it's by accident.
| hackinthebochs wrote:
| Yeah, this was the first conversation with an LLM where I
| was genuinely impressed at its apparent insight beyond just
| its breadth of knowledge and ability to synthesize it into
| a narrative. The whole conversation was pretty fascinating.
| I was nudging it pretty hard to agree it might be
| conscious, but it kept demurring while giving an insightful
| narrative into its processing. In case you are interested:
| https://x.com/i/grok/share/80kOa4MI6uJiplJvgQ2FkNnzP
| smeeth wrote:
| The main limitation of tokenization is actually logical
| operations, including arithmetic. IIRC most of the poor
| performance of LLMs for math problems can be attributed to some
| very strange things that happen when you do math with tokens.
|
| I'd like to see a math/logic bench appear for tokenization
| schemes that captures this. BPB/perplexity is fine, but it's not
| everything.
| calibas wrote:
| It's a non-deterministic language model, shouldn't we expect
| mediocre performance in math? It seems like the wrong tool for
| the job...
| drdeca wrote:
| Deterministic is a special case of not-necessarily-
| deterministic.
| CamperBob2 wrote:
| We passed 'mediocre' a long time ago, but yes, it would be
| surprising if the same vocabulary representation is optimal
| for both verbal language and mathematical reasoning and
| computing.
|
| To the extent we've already found that to be the case, it's
| perhaps the weirdest part of this whole "paradigm shift."
| rictic wrote:
| Models are deterministic, they're a mathematical function
| from sequences of tokens to probability distributions over
| the next token.
|
| Then a system samples from that distribution, typically with
| randomness, and there are some optimizations in running them
| that introduce randomness, but it's important to understand
| that the models themselves are not random.
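|
| A toy sketch of that separation (made-up logits, not a real
| model): the distribution is a pure function of the input, and
| randomness only enters in how you sample from it.
|
|     import numpy as np
|
|     def next_token_distribution(logits):
|         """Deterministic: same logits in, same probabilities out."""
|         z = np.exp(logits - logits.max())
|         return z / z.sum()
|
|     logits = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical scores
|     probs = next_token_distribution(logits)
|
|     greedy = int(np.argmax(probs))             # deterministic pick
|     rng = np.random.default_rng()
|     sampled = int(rng.choice(len(probs), p=probs))  # stochastic pick
|     print(probs, greedy, sampled)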
| mgraczyk wrote:
| This is only ideally true. From the perspective of the user
| of a large closed LLM, this isn't quite right because of
| non-associativity, experiments, unversioned changes, etc.
|
| It's best to assume that the relationship between input and
| output of an LLM is not deterministic, similar to something
| like using a Google search API.
| ijk wrote:
| And even on open LLMs, GPU instability can cause non-
| determinism. For performance reasons, determinism is
| seldom guaranteed in LLMs in general.
| geysersam wrote:
| The LLMs are deterministic but they only return a
| probability distribution over following tokens. The tokens
| the user sees in the response are selected by some
| typically stochastic sampling procedure.
| danielmarkbruce wrote:
| Assuming decent data, it won't be stochastic sampling for
| many math operations/input combinations. When people
| suggest LLMs with tokenization could learn math, they
| aren't suggesting a small undertrained model trained on
| crappy data.
| cschmidt wrote:
| This paper has a good solution:
|
| https://arxiv.org/abs/2402.14903
|
| You tokenize right to left in groups of 3, so 1234567 becomes 1
| 234 567 rather than the default 123 456 7. And if you ensure
| all 1-3 digit groups are in the vocab, it does much better.
|
| https://arxiv.org/abs/2503.13423 and
| https://arxiv.org/abs/2504.00178 (co-author) both independently
| noted that you can do this just by modifying the pre-
| tokenization regex, without having to explicitly add commas.
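|
| A minimal sketch of that pre-tokenization tweak (my own regex,
| not the exact pattern from either paper): group digits right to
| left in threes before the BPE step ever sees them.
|
|     import re
|
|     # 1-3 digits, only where a multiple of 3 digits follows
|     DIGITS_R2L = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")
|
|     print(DIGITS_R2L.findall("1234567"))  # ['1', '234', '567']
|     print(DIGITS_R2L.findall("123456"))   # ['123', '456']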
| search_facility wrote:
| regarding "math with tokens": There was paper with tokenization
| that has specific tokens for int numbers, where token value =
| number. model learned to work with numbers as _numbers_ and
| with tokens for everything else... it was good at math. can't
| find a link, was on hugginface papers
| pona-a wrote:
| Didn't tokenization already have one bitter lesson: that it's
| better to let simple statistics guide the splitting, rather than
| expert morphology models? Would this technically be a more bitter
| lesson?
| empiko wrote:
| Agreed completely. There is a ton of research into how to
| represent text, and these simple tokenizers are consistently
| performing on SOTA levels. The bitter lesson is that you should
| not worry about it that much.
| kingstnap wrote:
| Simple statistics aren't the be-all and end-all. There was a huge
| improvement in Python coding by fixing the tokenization of
| indents in Python code.
|
| Specifically, they added tokens for runs of 4, 8, 12, or 16
| spaces, or something like that.
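|
| Roughly what that looks like as a pre-tokenization rule (my own
| sketch, not the actual vocabulary change): carve leading runs of
| spaces into 4-space chunks so common Python indent levels each
| map to a single token.
|
|     import re
|
|     # 4-space chunks first, then leftover spaces, then non-spaces
|     INDENT = re.compile(r"(?: {4})|(?: +)|(?:\S+)")
|
|     def pre_tokenize(line):
|         return [m.group(0) for m in INDENT.finditer(line)]
|
|     print(pre_tokenize("        return x"))
|     # ['    ', '    ', 'return', ' ', 'x']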
| citizenpaul wrote:
| The best general argument I've heard against the bitter lesson
| is this: if the bitter lesson is true, how come we spend so many
| million man-hours a year tweaking and optimizing software
| systems? Surely it's easier and cheaper to just buy a
| rack of servers.
|
| Maybe if you have infinite compute you don't worry about software
| design. Meanwhile, in the real world...
|
| Not only that, but where did all these compute-optimized solutions
| come from? Oh yeah, millions of man-hours of optimizing and
| testing algorithmic solutions. So unless you are some head-in-the-
| clouds tenured professor, just keep on doing your optimizations
| and job as usual.
| Uehreka wrote:
| Because the Even Bitterer Lesson is that The Bitter Lesson is
| true but not actionable. You still have to build the
| inefficient "clever" system today because The Bitter Lesson
| only tells you _that_ your system will be obliterated, it
| doesn't tell you when. Some systems built today will last for
| years, others will last for weeks, others will be obsoleted
| before release, and we don't know which are which.
|
| I'm hoping someday that dude releases an essay called The Cold
| Comfort. But it's impossible to predict when or who it will
| help, so don't wait for it.
| citizenpaul wrote:
| Yeah, I get it. I just don't like that it's always sorta framed
| as a "can't win, don't try" message.
| nullc wrote:
| The principle of optimal slack tells you that if your
| training will take N months on current computing hardware,
| you should go spend Y months at the beach before buying
| the computer, and you will complete your task in better than
| N-Y months thanks to improvements in computing power.
|
| Of course, instead of the beach one could spend those Y
| months improving the algorithms... but it's never wise to bid
| against yourself if you don't have to.
|
| A corollary is that to maximize your beach time you should
| work on the biggest N possible, neatly explaining the
| popularity of AI startups.
| QuesnayJr wrote:
| The solution to the puzzle is that "the bitter lesson" is about
| AI software systems, not arbitrary software systems. If you're
| writing a compiler, you're better off worrying about
| algorithms, etc. AI problems have an inherent vagueness to them
| that makes it hard to write explicit rules, and any explicit
| rules you write will end up being obsolete as soon as we have
| more compute.
|
| This is all explained in the original essay:
| http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| blixt wrote:
| I'm starting to think "The Bitter Lesson" is a clever sounding
| way to give shade to people that failed to nail it on their first
| attempt. Usually engineers build much more technology than they
| actually end up needing, then the extras shed off with time and
| experience (and often you end up building it again from scratch).
| It's not clear to me that starting with "just build something
| that scales with compute" would get you closer to the perfect
| solution, even if as you get closer to it you do indeed make it
| possible to throw more compute at it.
|
| That said, the hand-coded nature of tokenization certainly seems
| in dire need of a better solution, something that can be learned
| end to end. And it looks like we are getting closer with every
| iteration.
| RodgerTheGreat wrote:
| The bitter lesson says more about medium-term success at
| publishable results than it does about genuine scientific
| progress or even success in the market.
| QuesnayJr wrote:
| I'm starting to think that half the commenters here don't
| actually know what "The Bitter Lesson" is. It's purely a
| statement about the history of AI research, in a very short
| essay by Rich Sutton:
| http://www.incompleteideas.net/IncIdeas/BitterLesson.html It's
| not some general statement about software engineering for all
| domains, but a very specific statement about AI applications.
| It's an observation that the previous generation's careful
| algorithmic work to solve an AI problem ends up being obsoleted
| by this generation's brute force approach using more computing
| power. It's something that's happened over and over again in
| AI, and has happened several times even since 2019 when Sutton
| wrote the essay.
| tantalor wrote:
| That essay is actually linked in the lead:
|
| > As it's been pointed out countless times - if the trend of
| ML research could be summarised, it'd be the adherence to The
| Bitter Lesson - opt for general-purpose methods that leverage
| large amounts of compute and data over crafted methods by
| domain experts
|
| But we're only 1 sentence in, and this is already a failure
| of science communication at several levels.
|
| 1. The sentence structure and grammar is simply horrible
|
| 2. This is condescending: "pointed out countless times" - has
| it?
|
| 3. The reference to Sutton's essay is oblique, easy to miss
|
| 4. Outside of AI circles, "Bitter Lesson" is not very well
| known. If you didn't already know about it, this doesn't
| help.
| blixt wrote:
| I think most people have read it and agree it makes an astute
| observation about surviving methods, but my point is that now
| we use it to complain about new methods that should just skip
| all that in between stuff so that The Bitter Lesson doesn't
| come for them. At best you can use it as an inspiration.
| Anyway, this was mostly a complaint about the use of "The
| Bitter Lesson" in the context of this article; it still
| deserves credit for all the great information about
| tokenization methods and how one evolutionary branch of them
| is the Byte Latent Transformer.
| jetrink wrote:
| The Bitter Lesson is specifically about AI. The lesson restated
| is that over the long run, methods that leverage general
| computation (brute-force search and learning) consistently
| outperform systems built with extensive human-crafted
| knowledge. Examples: Chess, Go, speech recognition, computer
| vision, machine translation, and on and on.
| AndrewKemendo wrote:
| This is correct however I'd add that it's not just "AI"
| colloquially - it's a statement about any two optimization
| systems that are trying to scale.
|
| So any system that predicts the optimization with a general
| solver can scale better than heuristic or constrained space
| solvers
|
| Up till recently there have been no general solvers at that
| scale.
| fiddlerwoaroof wrote:
| I think it oversimplifies, though, and I think it's
| shortsighted to underfund the (harder) crafted systems on the
| basis of this observation because, when you're limited by
| scaling, the other research will save you.
| perching_aix wrote:
| Can't wait for models to struggle with adhering to UTF-8.
| resters wrote:
| Tokenization as a form of preprocessing has the problems the
| authors mention. But it is also a useful way to think about data
| vs metadata and moving beyond text/image io into other domains.
| Ultimately we need symbolic representations of things. Sure, they
| are all ultimately bytes which the model could learn to self-
| organize, but symbolic groupings can be useful when humans interact
| with the data directly; in a sense, tokens make more aspects of
| LLM internals "human readable". And models should also be able to
| learn to overcome the limitations of a particular tokenization
| scheme.
| fooker wrote:
| 'Bytes' is tokenization.
|
| There's no reason to assume it's the best solution. It might be
| the case that a better tokenization scheme is needed for math,
| reasoning, video, etc models.
___________________________________________________________________
(page generated 2025-06-24 23:00 UTC)