[HN Gopher] Beyond Semantics: Unreasonable Effectiveness of Reas...
       ___________________________________________________________________
        
       Beyond Semantics: Unreasonable Effectiveness of Reasonless
       Intermediate Tokens
        
       Author : nyrikki
       Score  : 89 points
       Date   : 2025-05-23 16:13 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | nullc wrote:
       | Even when you train AI on human language, the tokens can have
       | "subtext" that is only legible to the AI. And, unfortunately,
        | it's not even legible to the AI in ways that it could ever
        | explain to us.
       | 
       | It's no different than how in English we can signal that a
       | statement is related to a kind of politics or that it's about sex
       | through particular word and phrase choice.
       | 
       | Training for reasoning should be expected to amplify the subtext,
       | since any random noise in the selection that by chance is
       | correlated with the right results will get amplified.
       | 
       | Perhaps you could try to dampen this by training two distinct
       | models for a while, then swap their reasoning for a while before
       | going back-- but sadly distinct models may still end up with
       | similar subtexts due to correlations in their training data.
       | Maybe ones with very distinct tokenization would be less likely
       | to do so.
        
         | nihakue wrote:
         | This is such a bonkers line of thinking, I'm so intrigued. So a
         | particular model will have an entire 'culture' only available
         | or understandable to itself. Seems kind of lonely. Like some
         | symbols might activate together for reasons that are totally
         | incomprehensible to us, but make perfect sense to the model. I
         | wonder if an approach like the one in
         | https://www.anthropic.com/research/tracing-thoughts-language...
         | could ever give us insight into any 'inside jokes' present in
         | the model.
         | 
          | I hope that research into understanding LLM qualia eventually
          | allows us to understand e.g. what it's like to [be a bat](
          | https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F)
        
           | nullc wrote:
           | In some sense it's more human than a model trained with no RL
           | and which has absolutely no exposure to its own output.
           | 
            | We have our own personal 'culture' too-- it's just less
            | obvious because it's tied up with our own hidden state. If
            | you go back and read old essays that you wrote, you might
            | notice some of it-- ideas and feelings (maybe smells?) that
            | are absolutely not explicitly in the text immediately come
            | back to you, stuff that no one, or maybe only a spouse or
            | very close friend, might think of.
           | 
           | I think it may be very hard to explore hidden subtext because
           | the signals may be almost arbitrarily weak and context
           | dependent. The bare model may need only a little nudge to get
            | to the right answer, and then you have this big wall of
           | "reasoning" where each token could carry very small amounts
           | of subtext that cumulatively add up to a lot and push things
           | in the right direction.
        
         | candiddevmike wrote:
         | IMO this is why natural language will always be a terrible
         | _interface_--because English is a terrible _language_ where
         | words can have wildly different meanings that change over time.
          | There's no ambiguity of intention with traditional UX (or
          | even with programming languages).
        
           | nullc wrote:
            | It can happen more or less no matter what language the model
            | uses, so long as it's reinforcement trained. It's just that in
            | English we have the illusion that we understand the meaning.
           | 
           | An example of this is toki pona, a minimalist constructed
           | human language that is designed to only express "positive
           | thinking". Yet it is extremely easy to insult people in toki
           | pona: e.g. sina toki li pona pona pona pona. (you are
           | speaking very very very very well).
           | 
            | To be free of a potential subtext side channel, there would
            | have to be essentially no equivalent outputs.
        
             | pona-a wrote:
             | Can't you just say "sina toki ike suli a." (you are
             | speaking very bad <exclamation>)? Just because it doesn't
             | have official swearwords like most natural languages
             | doesn't mean you can only express "positive thinking".
        
               | nullc wrote:
               | My mistake, in the future I'll refrain from using Toki
               | pona for making a rhetorical point. :)
        
       | modeless wrote:
       | > we then train models on noisy, corrupted traces which have no
       | relation to the specific problem each is paired with, and find
       | that not only does performance remain largely consistent with
       | models trained on correct data, but in some cases can improve
       | upon it
       | 
       | This is the interesting part. We've probably all had the
       | experience where the model is going off the rails during the
       | thinking process but somehow spits out the right answer at the
       | end. Apparently the reasoning doesn't even need to be correct
       | during training?
       | 
       | I guess it suggests to me that the reason CoT helps is that the
       | model gets more compute to think internally, not that the words
       | it produces are meaningful. I'm surprised nobody has come up with
       | a good scheme for adaptive compute per token yet. Maybe we can
       | skip CoT entirely.
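        | 
        | Back-of-the-envelope version of the "more compute" point (a
        | rough rule of thumb of ~2 FLOPs per parameter per generated
        | token, ignoring the attention term; all numbers illustrative):
        | 
        |     def generation_flops(params, cot_tokens, answer_tokens):
        |         # every intermediate token buys one more forward pass
        |         return 2 * params * (cot_tokens + answer_tokens)
        | 
        |     p = 7e9  # a hypothetical 7B-parameter model
        |     print(generation_flops(p, 0, 10))     # ~1.4e11 FLOPs
        |     print(generation_flops(p, 1000, 10))  # ~1.4e13 FLOPs
        | 
        | Under that approximation a 1000-token CoT gives the model about
        | 100x more forward-pass compute before it commits to an answer,
        | whether or not the intermediate tokens mean anything.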
        
         | trehalose wrote:
         | > We've probably all had the experience where the model is
         | going off the rails during the thinking process but somehow
         | spits out the right answer at the end. Apparently the reasoning
         | doesn't even need to be correct during training?
         | 
         | How do we know if the reasoning was correct or not? Do we have
         | more information about what the model was thinking besides just
         | what it _says_ it was thinking?
        
           | rickyhatespeas wrote:
            | It's definitely not explicitly writing out everything it's
            | "thinking": if you consider all the connected dimensions of
            | the latent space, that can't really be expressed in a
            | sentence.
            | 
            | CoT builds on existing prompt engineering techniques by
            | adding them to reinforcement learning, essentially forcing
            | the models to build their own CoT prompt. So it's not what
            | it's thinking, but all indications are that it does guide the
            | reasoning abilities of LLMs through the output distribution.
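            | 
            | A caricature of that training setup (a hypothetical
            | REINFORCE-style sketch, not any lab's actual recipe): the
            | reward only checks the final answer, so whatever
            | intermediate tokens happened to precede correct answers get
            | reinforced along with them.
            | 
            |     import torch
            | 
            |     vocab, hidden = 100, 32
            |     policy = torch.nn.Sequential(
            |         torch.nn.Embedding(vocab, hidden),
            |         torch.nn.Linear(hidden, vocab))
            | 
            |     def sample_trace(prompt_ids, steps=8):
            |         # toy sampler: conditions on the last token only
            |         ids, logps = list(prompt_ids), []
            |         for _ in range(steps):
            |             logits = policy(torch.tensor(ids[-1:])).squeeze(0)
            |             dist = torch.distributions.Categorical(logits=logits)
            |             tok = dist.sample()
            |             logps.append(dist.log_prob(tok))
            |             ids.append(int(tok))
            |         return ids, torch.stack(logps).sum()
            | 
            |     ids, logp = sample_trace([1, 2, 3])
            |     reward = 1.0 if ids[-1] == 42 else 0.0  # only the answer is graded
            |     loss = -reward * logp  # "reasoning" tokens ride along
            |     loss.backward()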
        
         | kelseyfrog wrote:
         | > I'm surprised nobody has come up with a good scheme for
         | adaptive compute per token yet.
         | 
         | I have one, I just don't have the time or money to research it
         | :(
        
           | golol wrote:
           | Post it let's go.
        
         | istjohn wrote:
         | Uh... hmmm... uhhh... ummm...
        
         | AlexCoventry wrote:
         | No, the words are meaningful to it. It's effectively using the
         | CoT text as a "scratch space" for intermediate steps it can't
         | calculate on one iteration through the transformer. These
         | papers give examples of how it works:
         | 
         | - https://physics.allen-zhu.com/part-2-grade-school-
         | math/part-...
         | 
         | - https://physics.allen-zhu.com/part-3-knowledge/part-3-3
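          | 
          | A toy analogy of the scratch-space idea (mine, not from those
          | papers): pretend each pass can only apply one operation, like
          | one trip through a fixed-depth network, so each intermediate
          | result has to be written out for the next pass to pick up.
          | 
          |     # compute ((7 * 6) + 5) * 3, one operation per "pass"
          |     scratchpad, value = [], 7
          |     for op, arg in [("*", 6), ("+", 5), ("*", 3)]:
          |         value = value * arg if op == "*" else value + arg
          |         scratchpad.append(f"{op} {arg} -> {value}")
          | 
          |     print("\n".join(scratchpad))  # * 6 -> 42, + 5 -> 47, * 3 -> 141
          |     print("answer:", value)       # 141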
        
           | modeless wrote:
           | I mean, this theory is directly contradicted by the paper
           | under discussion. If you want to assert this then you need to
           | be arguing why the paper is wrong.
        
         | thomastjeffery wrote:
         | That sounds to me more like evidence that an LLM is never
         | reasoning at all, even when it looks like it is.
         | 
         | The mock conversation that is written between think tags is not
         | a conversation. It's the collection of tokens that are most
         | likely to be written after a prompt to a model that was trained
         | on example conversations.
         | 
         | Why is that different? In a real conversation, participants use
         | logic to choose what is worth saying next. The next statement
         | is already determined in the speaker's mind to be logically
         | sound. In a mock conversation (the LLM's CoT), there is no
         | logic. The next statement is only determined to be
         | statistically familiar, then written immediately.
         | 
         | The end result of a desirable CoT interaction is text that
         | would have been written by a thoughtful/logical
         | conversationalist. Whether or not the mock conversation itself
         | is _logically consistent_ with the mock conclusion is
         | irrelevant, because the LLM is only concerned with how familiar
         | that mock conclusion is to the prompt, its mock conversation,
         | and its training.
         | 
          | The overall vibe of how something is written behaves as a
          | replacement for actual logic. Logical deduction is replaced
          | with measures of confidence, conversational turns, etc. in
          | writing style. It all works out in the end because we are so
          | consistent in the style in which we write real logical
          | deductions that we have ended up providing an invisible
          | semantics for the LLM to follow.
         | 
         | There is something meaningful that we are entirely blind to.
         | Unfortunately, it doesn't follow rules the way logic does, so
         | it's not a trustworthy replacement. Fortunately, it's useful
         | for more general exploration.
        
       | valine wrote:
       | I think it's helpful to remember that language models are not
       | producing tokens, they are producing a distribution of possible
       | next tokens. Just because your sampler picks a sequence of tokens
       | that contain incorrect reasoning doesn't mean a useful reasoning
       | trace isn't also contained within the latent space.
       | 
        | It's a misconception that transformers reason in token space.
        | Tokens don't attend to other tokens. High dimensional latents
        | attend to other high dimensional latents. The final layer of a
        | decoder-only transformer has full access to the entire latent
        | space of all previous positions, the same latents you can
        | project into a distribution of next tokens.
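        | 
        | Roughly, in toy PyTorch terms (sizes made up): the sampler only
        | ever sees the projected distribution, not the latent it came
        | from.
        | 
        |     import torch
        | 
        |     hidden_dim, vocab_size = 64, 1000   # toy sizes
        |     lm_head = torch.nn.Linear(hidden_dim, vocab_size, bias=False)
        | 
        |     h = torch.randn(1, hidden_dim)   # final-layer latent here
        |     probs = torch.softmax(lm_head(h), -1)  # next-token distribution
        |     token = torch.multinomial(probs, 1)    # sampler picks one id
        | 
        |     # the per-layer latents that produced 'h' stay cached for
        |     # later positions to attend to; the sampled id is just what
        |     # gets emitted as text.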
        
         | woadwarrior01 wrote:
         | > Just because your sampler picks a sequence of tokens that
         | contain incorrect reasoning doesn't mean a useful reasoning
         | trace isn't also contained within the latent space.
         | 
         | That's essentially the core idea in Coconut[1][2], to keep the
         | reasoning traces in a continuous space.
         | 
         | [1]: https://arxiv.org/abs/2412.06769
         | 
         | [2]: https://github.com/facebookresearch/coconut
        
         | jacob019 wrote:
          | So you're saying that the reasoning trace represents sequential
          | connections between the full distributions rather than the
          | sampled tokens from those distributions?
        
           | valine wrote:
           | The lower dimensional logits are discarded, the original high
           | dimensional latents are not.
           | 
           | But yeah, the LLM doesn't even know the sampler exists. I
           | used the last layer as an example, but it's likely that
           | reasoning traces exist in the latent space of every layer not
           | just the final one, with the most complex reasoning
           | concentrated in the middle layers.
        
             | jacob019 wrote:
             | I don't think that's accurate. The logits actually have
             | high dimensionality, and they are intermediate outputs used
             | to sample tokens. The latent representations contain
             | contextual information and are also high-dimensional, but
             | they serve a different role--they feed into the logits.
        
               | valine wrote:
               | The dimensionality I suppose depends on the vocab size
               | and your hidden dimension size, but that's not really
               | relevant. It's a single linear projection to go from
               | latents to logits.
               | 
               | Reasoning is definitely not happening in the linear
               | projection to logits if that's what you mean.
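                | 
                | Concretely (toy sizes; in real models the vocab is
                | usually much larger than the hidden dim, e.g. ~128k
                | vs ~4k):
                | 
                |     import torch
                | 
                |     hidden_dim, vocab_size = 64, 1000   # toy sizes
                |     W = torch.randn(vocab_size, hidden_dim)  # lm_head
                |     h = torch.randn(hidden_dim)  # one final latent
                | 
                |     # a single matmul, no nonlinearity
                |     logits = W @ h               # shape: (vocab_size,)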
        
             | bcoates wrote:
             | Either I'm wildly misunderstanding or that can't possibly
             | be true--if you sample at high temperature and it chooses a
             | very-low probability token, it continues consistent with
             | the chosen token, not with the more likely ones
        
               | valine wrote:
               | Attention computes a weighted average of all previous
               | latents. So yes, it's a new token as input to the forward
               | pass, but after it feeds through an attention head it
               | contains a little bit of every previous latent.
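                | 
                | A stripped-down version (one head, no K/Q/V
                | projections, toy sizes) just to show the "weighted
                | average of all previous latents" part:
                | 
                |     import torch
                | 
                |     T, d = 5, 16                # positions, latent dim
                |     latents = torch.randn(T, d) # one per prior position
                |     q = torch.randn(1, d)       # current position's query
                | 
                |     scores = q @ latents.T / d ** 0.5
                |     weights = torch.softmax(scores, dim=-1)  # sums to 1
                |     out = weights @ latents  # weighted avg of all of them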
        
       | timhigins wrote:
       | This paper seems to focus on highly algorithmic/puzzle-like
       | problems, which are not the typical application domain of LLMs,
       | using a <500M parameter model. So my hunch is "reasoning" works
       | much better for math, coding, factual recall, and writing tasks
       | that most LLMs actually deal with.
        
       | throwawaymaths wrote:
        | Why is it unreasonable that giving the LLM a spot to think,
        | collate long-range attention, and summarize, without the
        | pressure of building a meaningful next token so quickly, would
        | result in higher effectiveness?
        
       | naasking wrote:
       | I wonder if this finding would hold for something like Meta's
       | Large Concept Models.
        
       | ngruhn wrote:
       | Man that "Unreasonable Effectiveness of ..." pattern is getting a
       | bit overused. With the original paper [1] you could still say
       | that there really is some deeply philosophical mystery. But they
       | now slap that on everything.
       | 
       | [1]
       | https://en.m.wikipedia.org/wiki/The_Unreasonable_Effectivene...
        
         | MeteorMarc wrote:
         | What is not unreasonable about intermediate tokens without
         | reason? See the abstract.
        
           | godelski wrote:
           | It's not "unreasonable" if you weren't anthropomorphizing
            | CoT, equating it to thinking or "internal dialogue." The
           | results aren't surprising to people in this camp, but I also
           | wouldn't say that makes the work less impactful.
           | 
           | But it would also be more unreasonable to dismiss the fact
           | that a significant portion of the research community (and
           | even greater portion of the public) was operating under these
            | beliefs: that CoT was akin to thinking (it's literally there
           | in the name...). It is possible to disagree with something
           | but also not believe someone is being unreasonable by coming
           | to different conclusions.
        
         | jvanderbot wrote:
         | In this case it's probably more a pun (intentional or not I
         | guess) about "reasonless" or "unreason"
        
         | godelski wrote:
          | It's also worth pointing out that Wigner's (position) paper[0]
         | is really about something that would sound silly today. He's
         | arguing that we should use math to drive physics. Today, many
         | people think these are indistinguishable things and you'll get
         | into arguments about math being invented or discovered. But
          | Wigner is talking about how mathematics provides us with a
         | framework where we can drive physics forward through theory
         | instead of relying purely upon experimentation to poke and prod
         | the universe.
         | 
         | It is rather "unreasonable" to think we can explore the world
         | simply through pen and paper, from the comfort of a chair.
         | You'd think you'd need to go out and touch grass, but
          | incredibly this is not necessary.
          | 
          |     The first point is that the enormous usefulness of
          |     mathematics in the natural sciences is something bordering
          |     on the mysterious and that there is no rational
          |     explanation for it. Second, it is just this uncanny
          |     usefulness of mathematical concepts that raises the
          |     question of the uniqueness of our physical theories.
         | 
         | Which is exactly why a lot of these other things are overused.
         | Hamming's seems like an extension or corollary[1] and I even
         | think Norvig's (Halevy's) is highly appropriate[2]. It is
         | "unreasonable" to think these things would be effective.
         | -------------------------------------
         | 
         | With this paper?
         | 
          | I think it is fine. It is being used in a similar way to
          | Wigner's, with similar context.
         | 
         | I can see two camps. One has always interpreted the COT as
         | analogous to a model's internal dialogue. While the other has
         | always thought there's a much larger gap between the
         | manipulations within latent representations and what has been
          | decoded, not necessarily needing to be strongly aligned.[3] To the
         | former, the results here would be shocking, while to the latter
         | it is "yes, and?" Clearly they're addressing the former camp.
          | There were plenty of people that Wigner did not need to
         | convince.
         | 
          | I'm of the latter camp[4], and I'm happy people are not just
          | asserting but demonstrating. Honestly, I'm even frequently
          | upset when works get dismissed because they "demonstrate
          | something we already knew" but no one had ever actually
          | demonstrated. _The proofs and the evidence are more important
          | than the answer_. Quite often we're highly certain about
          | results, but they are difficult to even evidence (let alone
          | prove). I mean, it would be quite silly to dismiss a proof that
          | P != NP, even though the vast majority of us have long been
          | convinced that this is the relationship we'll end up with. Yet
          | no one's done it.
          | 
          | -------------------------------------
         | 
         | [0]
         | https://web.archive.org/web/20210212111540/http://www.dartmo...
         | 
         | [1]
         | https://math.dartmouth.edu/~matc/MathDrama/reading/Hamming.h...
         | 
         | [2]
         | https://static.googleusercontent.com/media/research.google.c...
         | 
         | [3] Both camps can be further broken down too. Lots of nuances
         | and opinions here and the lines really get fuzzy as we try to
         | make it more accurate. I don't want to pretend there's a hard
         | defining line, but the distinction helps the discussion and I
         | think is reasonably accurate enough. Let me know if you think
         | it is a gross mischaracterization.
         | 
         | [4] I can expand more why this side seems "obvious" to me. But
         | a warning, you can probably guess I'm not good at being terse.
         | 
          | [Note]: I'd even go so far as to say we should revisit Wigner's
          | argument around AI. I'm certain mathematics can be and will be
          | "unreasonably effective." But not enough time has been
          | dedicated to formulating the right type of math to use. We
          | really do have to invent a new kind here. This may sound weird
          | to non-mathematicians, but even physics uses multiple kinds of
          | mathematics. The operations, fields, and algebras you use in
          | one part may not be appropriate in another part. That's okay.
          | But we don't have a TOE yet either, and bringing all this
          | together is a critical part of finding one.
        
           | tim333 wrote:
            | >It's also worth pointing out that Wigner's (position)
            | paper[0] is really about something that would sound silly
            | today. He's arguing that we should use math to drive physics.
            | 
            | I think you misinterpret what it's about. He's pointing out
            | how remarkable it is that the universe obeys laws like E=mc^2
            | exactly, as far as we can tell, which is not something you
            | would probably expect just from looking at the world. The
            | pre-scientific understanding of the world was that it was
            | driven by gods and spirits. The mathematical laws were only
            | discovered by scientific investigation.
           | 
           | Or as he puts it:
           | 
           | >The miracle of the appropriateness of the language of
           | mathematics for the formulation of the laws of physics is a
           | wonderful gift which we neither understand nor deserve.
           | 
            | If he were just saying "use maths" it would be boring and
            | not a famous paper 65 years on.
        
         | gipp wrote:
          | Engineering bloggers' love of parroting the titles of famous
          | papers/articles (unreasonable effectiveness..., all you need
          | is..., ...considered harmful, etc.) has always been lightly
          | annoying to me.
        
           | airza wrote:
           | It's just not that common for the same person to have serious
           | engineering chops and writing abilities.
        
           | jorvi wrote:
           | With software engineering, every single thing in the 2010s
           | had "syntactic sugar" and "sane defaults". I still get a
           | slight blood pressure spike whenever someone uses either of
           | those terms.
        
             | joe_the_user wrote:
             | I guess those are overused but at least they have some
             | meaning. "Unreasonable Effectiveness..." is essentially
             | pure meaninglessness.
        
               | tim333 wrote:
               | It was meaningful in the original paper.
        
             | ruuda wrote:
             | "modern"
        
           | layer8 wrote:
           | All you need is for the unreasonable effectiveness of
           | snowclones to be considered harmful.
        
         | EGreg wrote:
         | Would you go further, and say that Unreasonable
         | Effectiveness... is considered harmful?
        
           | kevindamm wrote:
           | Indeed, considering unreasonable effectiveness harmful is all
           | you need
        
         | dkga wrote:
         | TIL. I am not from an engineering/physics background so for me
         | the original Unreasonable Effectiveness paper was Karpathy's
         | blog post about RNNs.
        
           | godelski wrote:
           | (Karpathy's might be more a call back to Halevy, Norvig, and
           | Pereira's "The Unreasonable Effectiveness of Data"[0].)
           | 
            | But I think it is a good example that fits the OP's critique
            | (I don't think the critique fits the arXiv paper, even though
            | I expected the main results; see my main comment).
           | 
           | The "unreasonableness" in Karpathy's post[1] is using
           | sequencing to process non-sequential data. But the reason
           | this isn't unreasonable is that we explicitly expect non-
           | sequential processes to be able to be reformulated as
           | sequential ones.
           | 
            | The SVHN (house numbers) example he shows is actually a great
            | example of this. We humans don't process that all at once.
            | Our eyes similarly dart around, even if very fast. Or we
            | might think about how to draw a picture. We don't do
            | everything at once, but we work in sections, building up, and
            | have layers that end up being ordered even though this
            | technically isn't a requirement. I'm actually struggling to
            | think of things that cannot be broken down into sequences. He
            | says as much here:
            | 
            |     an important point to realize is that even if your
            |     inputs/outputs are fixed vectors, it is still possible to
            |     use this powerful formalism to process them in a
            |     sequential manner.
           | 
           | So really the question is: what part of this was
           | unreasonable? Or what part was unexpected? Honestly, we
           | should be expecting this as the nature of neural nets is
           | itself sequential, data being processed layer by layer. Hell,
           | every computer program has a trace, which is sequential. I
           | can give tons of examples. So it is quite reasonable that
           | sequential processing should work.
           | 
           | [0] https://static.googleusercontent.com/media/research.googl
           | e.c...
           | 
           | [1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/
        
       | theptip wrote:
       | So is the interpretation here something like "CoT tokens are
        | actually neuralese"? They do boost performance, so the model
       | must be stashing some intermediate reasoning outputs there. But
       | perhaps not using the literal human meaning of those tokens?
        
       ___________________________________________________________________
       (page generated 2025-05-23 23:00 UTC)