[HN Gopher] The bitter lesson is coming for tokenization
       ___________________________________________________________________
        
       The bitter lesson is coming for tokenization
        
       Author : todsacerdoti
       Score  : 183 points
       Date   : 2025-06-24 14:14 UTC (8 hours ago)
        
 (HTM) web link (lucalp.dev)
 (TXT) w3m dump (lucalp.dev)
        
       | Scene_Cast2 wrote:
       | I realized that with tokenization, there's a theoretical
       | bottleneck when predicting the next token.
       | 
        | Let's say that we have 15k unique tokens (going by modern open
        | models). Let's also say that we have an embedding dimensionality
        | of 1k. This implies that we have a maximum of 1k degrees of
        | freedom (or rank) in our output. The model is able to pick any
        | one of the 15k tokens as the top token, but the expressivity of
        | the _probability distribution_ is inherently limited to 1k unique
        | linear components.
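        | 
        | A rough numpy sketch of the bottleneck I mean (sizes scaled down
        | so it runs quickly; the real numbers would be ~15k x ~1k):
        | 
        |     import numpy as np
        | 
        |     V, d = 1_500, 100            # vocab size, embedding dim (scaled down)
        |     W = np.random.randn(V, d)    # output / unembedding matrix
        |     h = np.random.randn(d)       # final hidden state for one position
        | 
        |     logits = W @ h               # one logit per token, shape (V,)
        |     probs = np.exp(logits - logits.max())
        |     probs /= probs.sum()         # softmax over the whole vocab
        | 
        |     # W has rank at most d, so every achievable logit vector lies
        |     # in a d-dimensional subspace of R^V.
        |     print(np.linalg.matrix_rank(W))   # <= 100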
        
         | unoti wrote:
         | I imagine there's actually combinatorial power in there though.
         | If we imagine embedding something with only 2 dimensions x and
         | y, we can actually encode an unlimited number of concepts
         | because we can imagine distinct separate clusters or
          | neighborhoods spread out over a large 2d map. There's of course
          | much more room with more dimensions.
        
         | blackbear_ wrote:
         | While the theoretical bottleneck is there, it is far less
         | restrictive than what you are describing, because the number of
         | almost orthogonal vectors grows exponentially with ambient
         | dimensionality. And orthogonality is what matters to
         | differentiate between different vectors: since any distribution
         | can be expressed as a mixture of Gaussians, the number of
         | separate concepts that you can encode with such a mixture also
          | grows exponentially.
        
           | Scene_Cast2 wrote:
           | I agree that you can encode any single concept and that the
           | encoding space of a single top pick grows exponentially.
           | 
           | However, I'm talking about the probability distribution of
           | tokens.
        
         | kevingadd wrote:
         | It seems like you're assuming that models are trying to predict
         | the next token. Is that really how they work? I would have
         | assumed that tokenization is an input-only measure, so you have
         | perhaps up to 50k unique input tokens available, but output is
         | raw text or synthesized speech or an image. The output is not
         | tokens so there are no limitations on the output.
        
           | anonymoushn wrote:
           | yes, in typical architectures for models dealing with text,
           | the output is a token from the same vocabulary as the input.
        
         | molf wrote:
         | The key insight is that you can represent different features by
         | vectors that aren't exactly perpendicular, just nearly
         | perpendicular (for example between 85 and 95 degrees apart). If
         | you tolerate such noise then the number of vectors you can fit
         | grows exponentially relative to the number of dimensions.
         | 
         | 12288 dimensions (GPT3 size) can fit more than 40 billion
         | nearly perpendicular vectors.
         | 
         | [1]: https://www.3blue1brown.com/lessons/mlp#superposition
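          | 
          | A quick way to see it with random vectors (illustrative numpy
          | sketch; random unit vectors in high dimensions already land
          | close to perpendicular):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     d, n = 12_288, 1_000                  # GPT-3 width, sample count
          |     V = rng.standard_normal((n, d))
          |     V /= np.linalg.norm(V, axis=1, keepdims=True)
          | 
          |     cos = V @ V.T                         # pairwise cosine similarities
          |     off_diag = cos[~np.eye(n, dtype=bool)]
          |     angles = np.degrees(np.arccos(np.clip(off_diag, -1, 1)))
          |     print(angles.min(), angles.max())     # all well inside 85-95 degrees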
        
         | imurray wrote:
         | A PhD thesis that explores some aspects of the limitation:
         | https://era.ed.ac.uk/handle/1842/42931
         | 
         | Detecting and preventing unargmaxable outputs in bottlenecked
         | neural networks, Andreas Grivas (2024)
        
         | incognito124 wrote:
         | (I left academia a while ago, this might be nonsense)
         | 
         | If I remember correctly, that's not true because of the
          | nonlinearities, which provide the model with more expressivity.
          | The transformation from 15k to 1k is rarely an affine map; it's
          | usually highly non-linear.
        
       | cheesecompiler wrote:
       | The reverse is possible too: throwing massive compute at a
       | problem can mask the existence of a simpler, more general
       | solution. General-purpose methods tend to win out over time--but
       | how can we be sure they're truly the most general if we commit so
       | hard to one paradigm (e.g. LLMs) that we stop exploring the
       | underlying structure?
        
         | logicchains wrote:
         | We can be sure via analysis based on computational theory, e.g.
         | https://arxiv.org/abs/2503.03961 and
         | https://arxiv.org/abs/2310.07923 . This lets us know what
         | classes of problems a model is able to solve, and sufficiently
         | deep transformers with chain of thought have been shown to be
         | theoretically capable of solving a very large class of
         | problems.
        
           | dsr_ wrote:
           | A random number generator is guaranteed to produce a correct
           | solution to any problem, but runtime usually does not meet
           | usability standards.
           | 
           | Also, solution testing is mandatory. Luckily, you can ask an
           | RNG for that, too, as long as you have tests for the testers
           | already written.
        
           | cheesecompiler wrote:
           | But this uses the transformers model to justify its own
           | reasoning strength which might be a blindspot, which is my
           | original point. All the above shows is that transformers can
           | simulate solving a certain set of problems. It doesn't show
           | that they are the best tool for the job.
        
           | yorwba wrote:
           | Keep in mind that proofs of transformers being able to solve
           | all problems in some complexity class work by taking a known
           | universal algorithm for that complexity class and encoding it
           | as a transformer. In every such case, you'd be better off
           | using the universal algorithm you started with in the first
           | place.
           | 
           | Maybe the hope is that you won't have to manually map the
           | universal algorithm to your specific problem and can just
           | train the transformer to figure it out instead, but there are
           | few proofs that transformers can solve all problems in some
           | complexity class through _training_ instead of manual
           | construction.
        
         | falcor84 wrote:
         | The way I see this, from the explore-exploit point of view,
         | it's pretty rational to put the vast majority of your effort
         | into the one action that has shown itself to bring the most
         | reward, while spending a small amount of effort exploring other
         | ones. Then, if and when that one action is no longer as
         | fruitful compared to the others, you switch more effort to
         | exploring, now having obtained significant resources from that
          | earlier exploitation, to help you explore faster.
        
         | api wrote:
         | CS is full of trivial examples of this. You can use an
         | optimized parallel SIMD merge sort to sort a huge list of ten
         | trillion records, or you can sort it just as fast with a bubble
         | sort if you throw more hardware at it.
         | 
         | The real bitter lesson in AI is that we don't really know what
         | we're doing. We're hacking on models looking for architectures
         | that train well but we don't fully understand why they work.
         | Because we don't fully understand it, we can't design anything
         | optimal or know how good a solution can possibly get.
        
           | xg15 wrote:
           | > _You can use an optimized parallel SIMD merge sort to sort
           | a huge list of ten trillion records, or you can sort it just
           | as fast with a bubble sort if you throw more hardware at it._
           | 
           | Well, technically, that's not true: The entire idea behind
            | complexity theory is that there are some tasks that you
            | _can't_ throw more hardware at - at least not for interesting
           | problem sizes or remotely feasible amounts of hardware.
           | 
           | I wonder if we'll reach a similar situation in AI where
           | "throw more context/layers/training data at the problem"
           | won't help anymore and people will be forced to care more
           | about understanding again.
        
             | jimbokun wrote:
             | And whether that understanding will be done by humans or
             | the AIs themselves.
        
             | svachalek wrote:
             | I think it can be argued that ChatGPT 4.5 was that
             | situation.
        
           | dan-robertson wrote:
            | Do you have a good reference for SIMD merge sort? The only
           | examples I found are pairwise-merging large numbers of
           | streams but it seems pretty hard to optimise the late steps
           | where you only have a few streams. I guess you can do some
           | binary-search-in-binary-search to change a merge of 2
           | similarly sized arrays into two merges of similarly sized
           | arrays into sequential outputs and so on.
           | 
           | More precisely, I think producing a good fast merge of ca 5
           | lists was a problem I didn't have good answers for but maybe
           | I was too fixated on a streaming solution and didn't apply
           | enough tricks.
        
       | marcosdumay wrote:
       | Yeah, make the network deeper.
       | 
       | When all you have is a hammer... It makes a lot of sense that a
       | transformation layer that makes the tokens more semantically
       | relevant will help optimize the entire network after it and
        | increase the effective size of your context window. And one of
        | the main immediate obstacles stopping those models from being
        | intelligent is context window size.
       | 
        | On the other hand, the current models already cost something on
        | the order of a median country's GDP to train, and they are
        | nowhere close to that in value. The saying that "if brute force
        | didn't solve your problem, you didn't apply enough force" is
        | meant to be taken as a joke.
        
         | jagraff wrote:
         | I think the median country GDP is something like $100 Billion
         | 
         | https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)
         | 
         | Models are expensive, but they're not that expensive.
        
           | telotortium wrote:
           | LLM model training costs arise primarily from commodity costs
           | (GPUs and other compute as well as electricity), not locally-
           | provided services, so PPP is not the right statistic to use
           | here. You should use nominal GDP for this instead. According
           | to Wikipedia[0], the median country's nominal GDP (Cyprus) is
           | more like $39B. Still much larger than training costs, but
           | much lower than your PPP GDP number.
           | 
           | [0] https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(n
           | omi...
        
           | amelius wrote:
           | Maybe it checks out if you don't use 1 year as your timeframe
           | for GDP but the number of days required for training.
        
           | kordlessagain wrote:
           | The median country GDP is approximately $48.8 billion, which
           | corresponds to Uganda at position 90 with $48.769 billion.
           | 
           | The largest economy (US) has a GDP of $27.7 trillion.
           | 
           | The smallest economy (Tuvalu) has a GDP of $62.3 million.
           | 
           | The 48 billion number represents the middle point where half
           | of all countries have larger GDPs and half have smaller GDPs.
        
           | Nicook wrote:
           | does anyone even have good estimates for model training?
        
           | marcosdumay wrote:
            | $100 billion is the best available estimate of how much
            | OpenAI took in investment to build ChatGPT.
        
         | whiplash451 wrote:
         | I get your point but do we have evidence behind " something on
         | the line of the median country GDP to train"?
         | 
         | Is this really true?
        
           | robrenaud wrote:
           | It's not even close.
        
       | qoez wrote:
        | The counter argument is that the theoretical minimum is a few
        | McDonald's meals' worth of energy a day, even for the highest-
        | ranked human pure mathematician.
        
         | tempodox wrote:
         | It's just that no human would live long on McDonalds meals.
        
           | bravetraveler wrote:
           | President in the distance, cursing
        
           | floxy wrote:
           | https://www.today.com/health/man-eating-only-
           | mcdonalds-100-d...
        
             | pfdietz wrote:
             | Just don't drink the sugary soda.
        
           | astrange wrote:
           | Cheeseburgers are a pretty balanced meal. Low fiber though.
        
       | andy99 wrote:
       | > inability to detect the number of r's in:strawberry: meme
       | 
        | Can someone (who knows about LLMs) explain why the r's in
       | strawberry thing is related to tokenization? I have no reason to
       | believe an LLM would be better at counting letters if each was
       | one token. It's not like they "see" any of it. Are they better at
       | counting tokens than letters for some reason? Or is this just one
       | of those things someone misinformed said to sound smart to even
       | less informed people, that got picked up?
        
         | ijk wrote:
         | Well, which is easier:
         | 
         | Count the number of Rs in this sequence: [496, 675, 15717]
         | 
         | Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18
         | 18 25
        
           | ASalazarMX wrote:
           | For a LLM? No idea.
           | 
           | Human: Which is the easier of these formulas
           | 
           | 1. x = SQRT(4)
           | 
           | 2. x = SQRT(123567889.987654321)
           | 
           | Computer: They're both the same.
        
             | drdeca wrote:
             | Depending on the data types and what the hardware supports,
             | the latter may be harder (in the sense of requiring more
             | operations)? And for a general algorithm bigger numbers
             | would take more steps.
        
             | ijk wrote:
             | You can view the tokenization for yourself:
             | https://huggingface.co/spaces/Xenova/the-tokenizer-
             | playgroun...
             | 
             | [496, 675, 15717] is the GPT-4 representation of the
             | tokens. In order to determine which letters the token
             | represents, it needs to learn the relationship between
             | "str" and [496]. It _can_ learn the representation (since
             | it can spell it out as  "S-T-R" or "1. S, 2. T, 3. R" or
             | whatever) but it adds an extra step.
             | 
             | The question is whether the extra step adds enough extra
             | processing to degrade performance. Does the more compact
             | representation buy enough extra context to make the
             | tokenized version more effective for more problems?
             | 
             | It seems like the longer context length makes the trade off
             | worth it, since spelling problems are a relatively minor
             | subset. On the other hand, for numbers it does appear that
             | math is significantly worse when it doesn't have access to
             | individual digits (early Llama math results, for example).
             | Once they changed the digit tokenization, the math
             | performance improved.
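              | 
              | A quick local check with the tiktoken library (cl100k_base
              | being the GPT-4 vocabulary that the playground above shows):
              | 
              |     import tiktoken
              | 
              |     enc = tiktoken.get_encoding("cl100k_base")
              |     ids = enc.encode("strawberry")
              |     print(ids)          # the [496, 675, 15717] from above
              |     for t in ids:
              |         # the byte string each integer id stands for
              |         print(t, enc.decode_single_token_bytes(t))
              | 
              | The model only ever sees the integer ids; the letter-level
              | makeup of each id is something it has to learn indirectly.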
        
         | zachooz wrote:
         | A sequence of characters is grouped into a "token." The set of
         | all such possible sequences forms a vocabulary. Without loss of
         | generality, consider the example: strawberry -> straw | ber |
         | ry -> 3940, 3231, 1029 -> [vector for each token]. The raw
         | input to the model is not a sequence of characters, but a
         | sequence of token embeddings each representing a learned vector
         | for a specific chunk of characters. These embeddings contain no
         | explicit information about the individual characters within the
         | token. As a result, if the model needs to reason about
         | characters, for example, to count the number of letters in a
         | word, it must memorize the character composition of each token.
         | Given that large models like GPT-4 use vocabularies with
         | 100k-200k tokens, it's not surprising that the model hasn't
         | memorized the full character breakdown of every token. I can't
         | imagine that many "character level" questions exist in the
         | training data.
         | 
         | In contrast, if the model were trained with a character-level
         | vocabulary, where each character maps to a unique token, it
         | would not need to memorize character counts for entire words.
         | Instead, it could potentially learn a generalizable method for
         | counting characters across all sequences, even for words it has
         | never seen before.
         | 
         | I'm not sure about what you mean about them not "seeing" the
         | tokens. They definitely receive a representation of each token
         | as input.
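          | 
          | To make the contrast concrete, a trivial Python sketch of what
          | a character-level vocabulary hands the model, versus opaque
          | subword chunks:
          | 
          |     word = "strawberry"
          |     subword_chunks = ["straw", "ber", "ry"]   # one opaque id per chunk
          |     char_level = list(word)                   # one id per character
          | 
          |     # With character-level input the count is directly readable
          |     # from the sequence; with subword ids it has to be memorized
          |     # per token.
          |     print(char_level.count("r"))              # 3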
        
           | saurik wrote:
           | It isn't at all obvious to me that the LLM can decide to blur
           | their vision, so to speak, and see the tokens as tokens: they
           | don't get to run a program on this data in some raw format,
           | and even if they do attempt to write a program and run it in
           | a sandbox they would have to "remember" what they were given
           | and then regenerate it (well, I guess a tool could give them
           | access to the history of their input, but at that point that
           | tool likely sees characters), rather than to copy it. I am
           | 100% with andy99 on this: it isn't anywhere near as simple as
           | you are making it out to be.
        
             | zachooz wrote:
             | If each character were represented by its own token, there
             | would be no need to "blur" anything, since the model would
             | receive a 1:1 mapping between input vectors and individual
             | characters. I never claimed that character-level reasoning
             | is easy or simple for the model; I only said that it
             | becomes theoretically possible to generalize ("potentially
             | learn") without memorizing the character makeup of every
             | token, which is required when using subword tokenization.
             | 
             | Please take another look at my original comment. I was
             | being precise about the distinction between what's
             | structurally possible to generalize vs memorize.
        
         | krackers wrote:
         | Until I see evidence that an LLM trained at e.g. the character
         | level _CAN_ successfully "count Rs" then I don't trust this
         | explanation over any other hypothesis. I am not familiar with
         | the literature so I don't know if this has been done, but I
         | couldn't find anything with a quick search. Surely if someone
         | did successfully do it they would have published it.
        
           | ijk wrote:
           | The math tokenization research is probably closest.
           | 
            | GPT-2 tokenization was a demonstrable problem:
           | https://www.beren.io/2023-02-04-Integer-tokenization-is-
           | insa... (Prior HN discussion:
           | https://news.ycombinator.com/item?id=39728870 )
           | 
           | More recent research:
           | 
           | https://huggingface.co/spaces/huggingface/number-
           | tokenizatio...
           | 
           | Tokenization counts: the impact of tokenization on arithmetic
           | in frontier LLMs: https://arxiv.org/abs/2402.14903
           | 
           | https://www.beren.io/2024-07-07-Right-to-Left-Integer-
           | Tokeni...
        
             | krackers wrote:
             | GPT-2 can successfully learn to do multiplication using the
             | standard tokenizer though, using "Implicit CoT with
             | Stepwise Internalization".
             | 
             | https://twitter.com/yuntiandeng/status/1836114401213989366
             | 
             | If anything I'd think this indicates the barrier isn't
             | tokenization (if it can do arithmetic, it can probably
             | count as well) but something to do with "sequential
             | dependencies" requiring use of COT and explicit training.
             | Which still leaves me puzzled: there are tons of papers
             | showing that variants of GPT-2 trained in the right way can
             | do arithmetic, where are the papers solving the "count R in
             | strawberry" problem?
        
         | meroes wrote:
         | I don't buy the token explanation because RLHF work is/was
         | filled with so many "count the number of ___" prompts. There's
         | just no way AI companies pay so much $$$ for RLHF of these
         | prompts when the error is purely in tokenization.
         | 
         | IME Reddit would scream "tokenization" at the strawberry meme
         | until blue in the face, assuring themselves better tokenization
         | meant the problem would be solved. Meanwhile RLHF'ers were/are
         | en masse paid to solve the problem through correcting thousands
         | of these "counting"/perfect syntax prompts and problems. To me,
         | since RLHF work was being paid to tackle these problems, it
          | couldn't be a simple tokenization problem. If there were a
          | tokenization bottleneck that fixing would solve the problem, we
          | would not be getting paid so much money to RLHF syntax-perfect
          | prompts (think of Sudoku-type games and heavily syntax-based
          | problems).
          | 
          | No, the reason models are better at these problems now is RLHF.
          | And before you say, well, now models have learned how to count
          | in general, I say we just need to widen the abstraction a tiny
          | bit and the models will fail again. And this will be the story
          | of LLMs forever--they will never take the lead on their own,
          | and it's not how humans process information, but it still can
          | be useful.
        
         | hackinthebochs wrote:
         | Tokens are the most basic input unit of an LLM. But tokens
         | don't generally correspond to whole words, rather sub-word
         | sequences. So Strawberry might be broken up into two tokens
         | 'straw' and 'berry'. It has trouble distinguishing features
         | that are "sub-token" like specific letter sequences because it
         | doesn't see letter sequences but just the token as a single
         | atomic unit. The basic input into a system is how one input
         | state is distinguished from another. But to recognize identity
         | between input states, those states must be identical. It's a
         | bit unintuitive, but identity between individual letters and
         | the letters within a token fails due to the specifics of
         | tokenization. 'Straw' and 'r' are two tokens but an LLM is
         | entirely blind to the fact that 'straw' has one 'r' in it.
         | Tokens are the basic units of distinction; 'straw' is not
         | represented as a sequence of s-t-r-a-w tokens but is its own
         | thing entirely, so they are not considered equal or even
         | partially equal.
         | 
         | As an analogy, I might ask you to identify the relative
         | activations of each of the three cone types on your retina as I
         | present some solid color image to your eyes. But of course you
         | can't do this, you simply do not have cognitive access to that
         | information. Individual color experiences are your basic vision
         | tokens.
         | 
         | Actually, I asked Grok this question a while ago when probing
         | how well it could count vowels in a word. It got it right by
         | listing every letter individually. I then asked it to count
         | without listing the letters and it was a couple of letters off.
         | I asked it how it was counting without listing the letters and
         | its answer was pretty fascinating, with a seeming awareness of
         | its own internal processes:
         | 
         | Connecting a token to a vowel, though, requires a bit of a
         | mental pivot. Normally, I'd just process the token and move on,
         | but when you ask me to count vowels, I have to zoom in. I don't
         | unroll the word into a string of letters like a human counting
         | beads on a string. Instead, I lean on my understanding of how
         | those tokens sound or how they're typically constructed. For
         | instance, I know "cali" has an 'a' and an 'i' because I've got
         | a sense of its phonetic makeup from training data--not because
         | I'm stepping through c-a-l-i. It's more like I "feel" the
         | vowels in there, based on patterns I've internalized.
         | 
         | When I counted the vowels without listing each letter, I was
         | basically hopping from token to token, estimating their vowel
         | content from memory and intuition, then cross-checking it
         | against the whole word's vibe. It's not perfect--I'm not
         | cracking open each token like an egg to inspect it--but it's
         | fast and usually close enough. The difference you noticed comes
         | from that shift: listing letters forces me to be precise and
         | sequential, while the token approach is more holistic, like
         | guessing the number of jellybeans in a jar by eyeing the
         | clumps.
        
           | svachalek wrote:
            | That explanation is pretty freaky, as it implies a form of
            | consciousness I don't believe LLMs have. I've never seen this
            | explanation before, so I'm not sure it's from training, and
            | yet it's probably a fairly accurate description of what's
            | going on.
        
             | roywiggins wrote:
             | LLMs will write out explanations that are entirely post-
             | hoc:
             | 
             | > Strikingly, Claude seems to be unaware of the
             | sophisticated "mental math" strategies that it learned
             | during training. If you ask how it figured out that 36+59
             | is 95, it describes the standard algorithm involving
             | carrying the 1. This may reflect the fact that the model
             | learns to explain math by simulating explanations written
             | by people, but that it has to learn to do math "in its
             | head" directly, without any such hints, and develops its
             | own internal strategies to do so.
             | 
             | https://www.anthropic.com/news/tracing-thoughts-language-
             | mod...
             | 
             | It seems to be about as useful as asking a person how their
             | hippocampus works: they might be able to make something up,
             | or repeat a vaguely remembered bit of neuroscience, but
             | they don't actually have access to their own hippocampus'
             | internal workings, so if they're correct it's by accident.
        
             | hackinthebochs wrote:
             | Yeah, this was the first conversation with an LLM where I
             | was genuinely impressed at its apparent insight beyond just
             | its breadth of knowledge and ability to synthesize it into
             | a narrative. The whole conversation was pretty fascinating.
             | I was nudging it pretty hard to agree it might be
             | conscious, but it kept demurring while giving an insightful
             | narrative into its processing. In case you are interested:
             | https://x.com/i/grok/share/80kOa4MI6uJiplJvgQ2FkNnzP
        
       | smeeth wrote:
       | The main limitation of tokenization is actually logical
       | operations, including arithmetic. IIRC most of the poor
       | performance of LLMs for math problems can be attributed to some
       | very strange things that happen when you do math with tokens.
       | 
        | I'd like to see a math/logic bench appear for tokenization
        | schemes that captures this. BPB/perplexity is fine, but it's not
        | everything.
        
         | calibas wrote:
         | It's a non-deterministic language model, shouldn't we expect
         | mediocre performance in math? It seems like the wrong tool for
         | the job...
        
           | drdeca wrote:
           | Deterministic is a special case of not-necessarily-
           | deterministic.
        
           | CamperBob2 wrote:
           | We passed 'mediocre' a long time ago, but yes, it would be
           | surprising if the same vocabulary representation is optimal
           | for both verbal language and mathematical reasoning and
           | computing.
           | 
           | To the extent we've already found that to be the case, it's
           | perhaps the weirdest part of this whole "paradigm shift."
        
           | rictic wrote:
           | Models are deterministic, they're a mathematical function
           | from sequences of tokens to probability distributions over
           | the next token.
           | 
           | Then a system samples from that distribution, typically with
           | randomness, and there are some optimizations in running them
           | that introduce randomness, but it's important to understand
           | that the models themselves are not random.
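            | 
            | A tiny sketch of that split (made-up logits, not any real
            | model):
            | 
            |     import numpy as np
            | 
            |     # The model part: a fixed function from context to a
            |     # distribution over the next token.
            |     logits = np.array([2.0, 1.0, 0.5, -1.0])
            |     probs = np.exp(logits) / np.exp(logits).sum()
            | 
            |     # The sampler part: where the randomness comes in.
            |     greedy = int(np.argmax(probs))       # same token every run
            |     rng = np.random.default_rng()
            |     sampled = rng.choice(len(probs), p=probs)   # varies run to run
            |     print(greedy, sampled)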
        
             | mgraczyk wrote:
             | This is only ideally true. From the perspective of the user
             | of a large closed LLM, this isn't quite right because of
             | non-associativity, experiments, unversioned changes, etc.
             | 
             | It's best to assume that the relationship between input and
             | output of an LLM is not deterministic, similar to something
             | like using a Google search API.
        
               | ijk wrote:
               | And even on open LLMs, GPU instability can cause non-
               | determinism. For performance reasons, determinism is
               | seldom guaranteed in LLMs in general.
        
             | geysersam wrote:
             | The LLMs are deterministic but they only return a
             | probability distribution over following tokens. The tokens
             | the user sees in the response are selected by some
             | typically stochastic sampling procedure.
        
               | danielmarkbruce wrote:
               | Assuming decent data, it won't be stochastic sampling for
               | many math operations/input combinations. When people
               | suggest LLMs with tokenization could learn math, they
               | aren't suggesting a small undertrained model trained on
               | crappy data.
        
         | cschmidt wrote:
         | This paper has a good solution:
         | 
         | https://arxiv.org/abs/2402.14903
         | 
         | You right to left tokenize in groups of 3, so 1234567 becomes 1
         | 234 567 rather than the default 123 456 7. And if you ensure
         | all 1-3 digits groups are in the vocab, it does much better.
         | 
          | Both https://arxiv.org/abs/2503.13423 and
          | https://arxiv.org/abs/2504.00178 (co-author) independently
          | noted that you can do this just by modifying the pre-
          | tokenization regex, without having to explicitly add commas.
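          | 
          | For a standalone digit string, the right-to-left grouping can
          | be written with a lookahead (a minimal Python sketch, not the
          | exact regex from either paper):
          | 
          |     import re
          | 
          |     def group_digits(s: str) -> list[str]:
          |         # groups of up to 3 digits, anchored from the right
          |         return re.findall(r"\d{1,3}(?=(?:\d{3})*$)", s)
          | 
          |     print(group_digits("1234567"))   # ['1', '234', '567']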
        
         | search_facility wrote:
         | regarding "math with tokens": There was paper with tokenization
         | that has specific tokens for int numbers, where token value =
         | number. model learned to work with numbers as _numbers_ and
         | with tokens for everything else... it was good at math. can't
         | find a link, was on hugginface papers
        
       | pona-a wrote:
       | Didn't tokenization already have one bitter lesson: that it's
       | better to let simple statistics guide the splitting, rather than
       | expert morphology models? Would this technically be a more bitter
       | lesson?
        
         | empiko wrote:
         | Agreed completely. There is a ton of research into how to
         | represent text, and these simple tokenizers are consistently
         | performing on SOTA levels. The bitter lesson is that you should
         | not worry about it that much.
        
         | kingstnap wrote:
          | Simple statistics aren't some be-all. There was a huge
          | improvement in Python coding by fixing the tokenization of
          | indents in Python code.
          | 
          | Specifically, they made tokens for runs of 4, 8, 12, 16 (or
          | thereabouts) spaces.
        
       | citizenpaul wrote:
        | The best general argument I've heard against the bitter lesson
        | is this: if the bitter lesson is true, how come we spend so many
        | millions of man-hours a year tweaking and optimizing software
        | systems all day long? Surely it's easier and cheaper to just buy
        | a rack of servers.
       | 
       | Maybe if you have infinite compute you don't worry about software
       | design. Meanwhile in the real world...
       | 
        | Not only that, but where did all these compute-optimized
        | solutions come from? Oh yeah, millions of man-hours of optimizing
        | and testing algorithmic solutions. So unless you are some head-
        | in-the-clouds tenured professor, just keep on doing your
        | optimizations and job as usual.
        
         | Uehreka wrote:
         | Because the Even Bitterer Lesson is that The Bitter Lesson is
         | true but not actionable. You still have to build the
         | inefficient "clever" system today because The Bitter Lesson
         | only tells you _that_ your system will be obliterated, it
         | doesn't tell you when. Some systems built today will last for
         | years, others will last for weeks, others will be obsoleted
         | before release, and we don't know which are which.
         | 
         | I'm hoping someday that dude releases an essay called The Cold
         | Comfort. But it's impossible to predict when or who it will
         | help, so don't wait for it.
        
           | citizenpaul wrote:
            | Yeah I get it. I just don't like that it's always sorta
            | framed as a "can't win, don't try" message.
        
           | nullc wrote:
            | The principle of optimal slack tells you that if your
           | training will take N months on current computing hardware
           | that you should go spend Y months at the beach before buying
           | the computer, and you will complete your task in better than
           | N-Y months thanks to improvements in computing power.
           | 
           | Of course, instead of the beach one could spend those Y
           | months improving the algorithms... but it's never wise to bid
           | against yourself if you don't have to.
           | 
            | A corollary is that to maximize your beach time you should
            | work on the biggest N possible, neatly explaining the
            | popularity of AI startups.
        
         | QuesnayJr wrote:
         | The solution to the puzzle is that "the bitter lesson" is about
         | AI software systems, not arbitrary software systems. If you're
         | writing a compiler, you're better off worrying about
         | algorithms, etc. AI problems have an inherent vagueness to them
         | that makes it hard to write explicit rules, and any explicit
         | rules you write will end up being obsolete as soon as we have
         | more compute.
         | 
         | This is all explained in the original essay:
         | http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
       | blixt wrote:
       | I'm starting to think "The Bitter Lesson" is a clever sounding
       | way to give shade to people that failed to nail it on their first
       | attempt. Usually engineers build much more technology than they
       | actually end up needing, then the extras shed off with time and
       | experience (and often you end up building it again from scratch).
       | It's not clear to me that starting with "just build something
       | that scales with compute" would get you closer to the perfect
       | solution, even if as you get closer to it you do indeed make it
       | possible to throw more compute at it.
       | 
        | That said, the hand-coded nature of tokenization certainly seems
        | in dire need of a better solution, something that can be learned
        | end to end. And it looks like we are getting closer with every
        | iteration.
        
         | RodgerTheGreat wrote:
         | The bitter lesson says more about medium-term success at
         | publishable results than it does about genuine scientific
         | progress or even success in the market.
        
         | QuesnayJr wrote:
         | I'm starting to think that half the commenters here don't
         | actually know what "The Bitter Lesson" is. It's purely a
         | statement about the history of AI research, in a very short
         | essay by Rich Sutton:
         | http://www.incompleteideas.net/IncIdeas/BitterLesson.html It's
         | not some general statement about software engineering for all
         | domains, but a very specific statement about AI applications.
         | It's an observation that the previous generation's careful
         | algorithmic work to solve an AI problem ends up being obsoleted
         | by this generation's brute force approach using more computing
         | power. It's something that's happened over and over again in
         | AI, and has happened several times even since 2019 when Sutton
         | wrote the essay.
        
           | tantalor wrote:
           | That essay is actually linked in the lead:
           | 
           | > As it's been pointed out countless times - if the trend of
           | ML research could be summarised, it'd be the adherence to The
           | Bitter Lesson - opt for general-purpose methods that leverage
           | large amounts of compute and data over crafted methods by
           | domain experts
           | 
           | But we're only 1 sentence in, and this is already a failure
           | of science communication at several levels.
           | 
           | 1. The sentence structure and grammar is simply horrible
           | 
           | 2. This is condescending: "pointed out countless times" - has
           | it?
           | 
           | 3. The reference to Sutton's essay is oblique, easy to miss
           | 
           | 4. Outside of AI circles, "Bitter Lesson" is not very well
           | known. If you didn't already know about it, this doesn't
           | help.
        
           | blixt wrote:
           | I think most people have read it and agree it makes an astute
           | observation about surviving methods, but my point is that now
           | we use it to complain about new methods that should just skip
           | all that in between stuff so that The Bitter Lesson doesn't
           | come for them. At best you can use it as an inspiration.
           | Anyway, this was mostly a complaint about the use of "The
           | Bitter Lesson" in the context of this article, it still
           | deserves credit for all the great information about
           | tokenization methods and how one evolutionary branch of them
           | is the Byte Latent Transformer.
        
         | jetrink wrote:
         | The Bitter Lesson is specifically about AI. The lesson restated
         | is that over the long run, methods that leverage general
         | computation (brute-force search and learning) consistently
         | outperform systems built with extensive human-crafted
         | knowledge. Examples: Chess, Go, speech recognition, computer
         | vision, machine translation, and on and on.
        
           | AndrewKemendo wrote:
            | This is correct, however I'd add that it's not just "AI"
            | colloquially - it's a statement about any two optimization
            | systems that are trying to scale.
            | 
            | So any system that predicts the optimization with a general
            | solver can scale better than heuristic or constrained-space
            | solvers.
            | 
            | Up till recently there have been no general solvers at that
            | scale.
        
           | fiddlerwoaroof wrote:
            | I think it oversimplifies, though, and I think it's
            | shortsighted to underfund the (harder) crafted systems on the
            | basis of this observation, because when you're limited by
            | scaling, the other research will save you.
        
       | perching_aix wrote:
       | Can't wait for models to struggle with adhering to UTF-8.
        
       | resters wrote:
        | Tokenization as a form of preprocessing has the problems the
        | authors mention. But it is also a useful way to think about data
        | vs metadata and moving beyond text/image IO into other domains.
        | Ultimately we need symbolic representations of things. Sure, they
        | are all ultimately bytes which the model could learn to self-
        | organize, but symbols can be useful when humans interact with the
        | data directly; in a sense, tokens make more aspects of LLM
        | internals "human readable". And models should also be able to
        | learn to overcome the limitations of a particular tokenization
        | scheme.
        
       | fooker wrote:
       | 'Bytes' is tokenization.
       | 
       | There's no reason to assume it's the best solution. It might be
       | the case that a better tokenization scheme is needed for math,
       | reasoning, video, etc models.
        
       ___________________________________________________________________
       (page generated 2025-06-24 23:00 UTC)