[HN Gopher] Beyond Semantics: Unreasonable Effectiveness of Reas...
___________________________________________________________________
Beyond Semantics: Unreasonable Effectiveness of Reasonless
Intermediate Tokens
Author : nyrikki
Score : 89 points
Date : 2025-05-23 16:13 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| nullc wrote:
| Even when you train AI on human language, the tokens can carry
| "subtext" that is only legible to the AI. And, unfortunately,
| it's not even legible to the AI in ways that it could ever
| explain to us.
|
| It's no different than how in English we can signal that a
| statement is related to a kind of politics or that it's about sex
| through particular word and phrase choice.
|
| Training for reasoning should be expected to amplify the subtext,
| since any random noise in the selection that by chance is
| correlated with the right results will get amplified.
|
| Perhaps you could try to dampen this by training two distinct
| models for a while, then swap their reasoning for a while before
| going back-- but sadly distinct models may still end up with
| similar subtexts due to correlations in their training data.
| Maybe ones with very distinct tokenization would be less likely
| to do so.
| nihakue wrote:
| This is such a bonkers line of thinking, I'm so intrigued. So a
| particular model will have an entire 'culture' only available
| or understandable to itself. Seems kind of lonely. Like some
| symbols might activate together for reasons that are totally
| incomprehensible to us, but make perfect sense to the model. I
| wonder if an approach like the one in
| https://www.anthropic.com/research/tracing-thoughts-language...
| could ever give us insight into any 'inside jokes' present in
| the model.
|
| I hope that research into understanding LLM qualia eventually
| allows us to understand e.g. what it's like to
| [be a bat](https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F)
| nullc wrote:
| In some sense it's more human than a model trained with no RL
| and which has absolutely no exposure to its own output.
|
| We have our own personal 'culture' too-- it's just less
| obvious because it's tied up with our own hidden state. If you
| go back and read old essays that you wrote you might notice
| some of it-- ideas and feelings (maybe smells?) that are
| absolutely not explicitly in the text immediately come back
| to you, stuff that no one, or maybe only a spouse or very
| close friend, might think of.
|
| I think it may be very hard to explore hidden subtext because
| the signals may be almost arbitrarily weak and context
| dependent. The bare model may need only a little nudge to get
| to the right answer, and then you have this big wall of
| "reasoning" where each token could carry a very small amount
| of subtext that cumulatively adds up to a lot and pushes
| things in the right direction.
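|
| As a toy illustration (all numbers invented, not measured from
| any real model): suppose each reasoning token nudges the final
| answer's logit by an amount far too small for a reader to ever
| notice.
|
|     # Toy sketch of cumulative subtext: each token adds a tiny,
|     # individually negligible bias toward one answer's logit.
|     import math
|
|     per_token_bias = 0.01   # hypothetical nudge per token
|     trace_length = 1500     # tokens of "reasoning"
|
|     total_bias = per_token_bias * trace_length   # 15.0
|     odds_ratio = math.exp(total_bias)            # ~3.3e6 : 1
|     print(total_bias, odds_ratio)
|
| No single 0.01 nudge is legible, but the sum can completely
| determine which answer wins.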
| candiddevmike wrote:
| IMO this is why natural language will always be a terrible
| _interface_--because English is a terrible _language_ where
| words can have wildly different meanings that change over time.
| There's no ambiguity about intent with traditional UX (or
| even programming languages).
| nullc wrote:
| It can happen more or less no matter what language the model
| uses, so long as it's reinforcement trained. It's just that in
| English we have the illusion of understanding the meaning.
|
| An example of this is toki pona, a minimalist constructed
| human language that is designed to only express "positive
| thinking". Yet it is extremely easy to insult people in toki
| pona: e.g. sina toki li pona pona pona pona. (you are
| speaking very very very very well).
|
| To be free of a potential subtext side channel, a language
| would have to have essentially no equivalent outputs.
| pona-a wrote:
| Can't you just say "sina toki ike suli a." (you are
| speaking very bad <exclamation>)? Just because it doesn't
| have official swearwords like most natural languages
| doesn't mean you can only express "positive thinking".
| nullc wrote:
| My mistake, in the future I'll refrain from using Toki
| pona for making a rhetorical point. :)
| modeless wrote:
| > we then train models on noisy, corrupted traces which have no
| relation to the specific problem each is paired with, and find
| that not only does performance remain largely consistent with
| models trained on correct data, but in some cases can improve
| upon it
|
| This is the interesting part. We've probably all had the
| experience where the model is going off the rails during the
| thinking process but somehow spits out the right answer at the
| end. Apparently the reasoning doesn't even need to be correct
| during training?
|
| I guess it suggests to me that the reason CoT helps is that the
| model gets more compute to think internally, not that the words
| it produces are meaningful. I'm surprised nobody has come up with
| a good scheme for adaptive compute per token yet. Maybe we can
| skip CoT entirely.
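|
| Back-of-the-envelope version of the "more compute" idea (my
| numbers; it also ignores that later tokens get to attend to
| everything computed for earlier ones):
|
|     # Each generated token costs one more forward pass through
|     # every layer before the model commits to an answer.
|     layers = 32            # hypothetical decoder depth
|     answer_tokens = 20
|     cot_tokens = 1000      # "thinking" tokens before the answer
|
|     direct = layers * answer_tokens
|     with_cot = layers * (cot_tokens + answer_tokens)
|     print(direct, with_cot, with_cot / direct)   # ~51x more
|
| Whether or not the intermediate tokens "mean" anything, the
| model gets roughly 50x more sequential computation before it
| commits to an answer.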
| trehalose wrote:
| > We've probably all had the experience where the model is
| going off the rails during the thinking process but somehow
| spits out the right answer at the end. Apparently the reasoning
| doesn't even need to be correct during training?
|
| How do we know if the reasoning was correct or not? Do we have
| more information about what the model was thinking besides just
| what it _says_ it was thinking?
| rickyhatespeas wrote:
| It's definitely not explicitly writing out everything it's
| "thinking": if you consider all the dimensions of the latent
| space that are connected, that can't really be exhibited in a
| sentence.
|
| CoT builds on existing prompt engineering techniques by adding
| reinforcement learning, essentially forcing the models to
| build their own CoT prompt. So it's not what the model is
| "thinking", but all indications are that it does guide the
| reasoning abilities of LLMs through the output distribution.
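|
| Roughly what that looks like at inference time. This is a
| hand-wavy sketch: the tag names, stop argument, and
| model.generate API are made up for illustration, not from any
| particular library.
|
|     # The RL-trained model writes its own "CoT prompt" between
|     # think tags before producing the visible answer.
|     def generate_with_cot(model, question):
|         # model.generate is a hypothetical completion API
|         prompt = f"User: {question}\nAssistant: <think>\n"
|         trace = model.generate(prompt, stop="</think>")
|         answer = model.generate(prompt + trace + "</think>\n")
|         return trace, answer
|
| The trace is just more sampled text that the answer conditions
| on; there's no separate "thinking" machinery behind it.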
| kelseyfrog wrote:
| > I'm surprised nobody has come up with a good scheme for
| adaptive compute per token yet.
|
| I have one, I just don't have the time or money to research it
| :(
| golol wrote:
| Post it let's go.
| istjohn wrote:
| Uh... hmmm... uhhh... ummm...
| AlexCoventry wrote:
| No, the words are meaningful to it. It's effectively using the
| CoT text as a "scratch space" for intermediate steps it can't
| calculate on one iteration through the transformer. These
| papers give examples of how it works:
|
| - https://physics.allen-zhu.com/part-2-grade-school-
| math/part-...
|
| - https://physics.allen-zhu.com/part-3-knowledge/part-3-3
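|
| A toy version of the "scratch space" idea (my example, not
| from those papers): a computation that needs several dependent
| steps becomes easy when the intermediate state is written out
| as tokens that later steps can condition on.
|
|     # Emitting partial results plays the role of the CoT text:
|     # each step only needs a small amount of local work.
|     def scratchpad_sum(numbers):
|         trace, running = [], 0
|         for n in numbers:
|             running += n
|             trace.append(f"after {n}: total is {running}")
|         return trace, running
|
|     trace, answer = scratchpad_sum([17, 38, 5, 91, 44])
|     print("\n".join(trace))
|     print("answer:", answer)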
| modeless wrote:
| I mean, this theory is directly contradicted by the paper
| under discussion. If you want to assert this then you need to
| be arguing why the paper is wrong.
| thomastjeffery wrote:
| That sounds to me more like evidence that an LLM is never
| reasoning at all, even when it looks like it is.
|
| The mock conversation that is written between think tags is not
| a conversation. It's the collection of tokens that are most
| likely to be written after a prompt to a model that was trained
| on example conversations.
|
| Why is that different? In a real conversation, participants use
| logic to choose what is worth saying next. The next statement
| is already determined in the speaker's mind to be logically
| sound. In a mock conversation (the LLM's CoT), there is no
| logic. The next statement is only determined to be
| statistically familiar, then written immediately.
|
| The end result of a desirable CoT interaction is text that
| would have been written by a thoughtful/logical
| conversationalist. Whether or not the mock conversation itself
| is _logically consistent_ with the mock conclusion is
| irrelevant, because the LLM is only concerned with how familiar
| that mock conclusion is to the prompt, its mock conversation,
| and its training.
|
| The overall vibe of how something is written behaves as a
| replacement for actual logic. Logical deduction is replaced
| with measures of confidence, conversation turns, etc. in
| writing style. It all works out in the end because we are so
| consistent in the style in which we write real logical
| deductions that we have ended up providing an invisible
| semantics for the LLM to follow.
|
| There is something meaningful that we are entirely blind to.
| Unfortunately, it doesn't follow rules the way logic does, so
| it's not a trustworthy replacement. Fortunately, it's useful
| for more general exploration.
| valine wrote:
| I think it's helpful to remember that language models are not
| producing tokens, they are producing a distribution of possible
| next tokens. Just because your sampler picks a sequence of tokens
| that contain incorrect reasoning doesn't mean a useful reasoning
| trace isn't also contained within the latent space.
|
| It's a misconception that transformers reason in token space.
| Tokens don't attend to other tokens. High dimensional latents
| attend to other high dimensional latents. The final layer of a
| decoder-only transformer has full access to the latents of all
| previous positions, the same latents you can project into a
| distribution of next tokens.
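|
| A rough numpy caricature of one decoding step (single head,
| one layer, no learned projections, so don't read too much into
| the details): only a token id is fed back in, but what
| attention reads is the cache of contextual latents from every
| previous position.
|
|     import numpy as np
|
|     d, vocab = 64, 1000
|     rng = np.random.default_rng(0)
|     W_embed = rng.normal(size=(vocab, d))
|     W_unembed = rng.normal(size=(d, vocab))
|     cache = []   # latents of all previous positions
|
|     def step(token_id):
|         x = W_embed[token_id]      # sampled token id comes in
|         if cache:
|             K = np.stack(cache)    # attention reads latents...
|             s = K @ x / np.sqrt(d)
|             w = np.exp(s - s.max())
|             x = x + (w / w.sum()) @ K   # ...not sampled tokens
|         cache.append(x)            # contextual latent is kept
|         return int(np.argmax(x @ W_unembed))   # greedy "sampler"
|
|     tok = 1
|     for _ in range(5):
|         tok = step(tok)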
| woadwarrior01 wrote:
| > Just because your sampler picks a sequence of tokens that
| contain incorrect reasoning doesn't mean a useful reasoning
| trace isn't also contained within the latent space.
|
| That's essentially the core idea in Coconut[1][2], to keep the
| reasoning traces in a continuous space.
|
| [1]: https://arxiv.org/abs/2412.06769
|
| [2]: https://github.com/facebookresearch/coconut
| jacob019 wrote:
| So you're saying that the reasoning trace represents sequential
| connections between the full distribution rather than the
| sampled tokens from that distribution?
| valine wrote:
| The lower dimensional logits are discarded; the original high
| dimensional latents are not.
|
| But yeah, the LLM doesn't even know the sampler exists. I
| used the last layer as an example, but it's likely that
| reasoning traces exist in the latent space of every layer not
| just the final one, with the most complex reasoning
| concentrated in the middle layers.
| jacob019 wrote:
| I don't think that's accurate. The logits actually have
| high dimensionality, and they are intermediate outputs used
| to sample tokens. The latent representations contain
| contextual information and are also high-dimensional, but
| they serve a different role--they feed into the logits.
| valine wrote:
| The dimensionality I suppose depends on the vocab size
| and your hidden dimension size, but that's not really
| relevant. It's a single linear projection to go from
| latents to logits.
|
| Reasoning is definitely not happening in the linear
| projection to logits if that's what you mean.
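|
| Concretely, that projection is just a single matmul (shapes
| shrunk here so it actually runs; real models are more like
| 4096 x 100k+):
|
|     import numpy as np
|     d_model, vocab = 64, 1000
|     hidden = np.zeros(d_model)              # final-layer latent
|     W_unembed = np.zeros((d_model, vocab))
|     logits = hidden @ W_unembed             # -> (vocab,) scores
|
| Everything interesting happens before this readout; the
| sampler only ever sees the result of it.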
| bcoates wrote:
| Either I'm wildly misunderstanding or that can't possibly be
| true-- if you sample at high temperature and it chooses a
| very low-probability token, it continues consistent with the
| chosen token, not with the more likely ones.
| valine wrote:
| Attention computes a weighted average of all previous
| latents. So yes, it's a new token as input to the forward
| pass, but after it feeds through an attention head it
| contains a little bit of every previous latent.
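|
| In code form (one head, ignoring the learned projections):
|
|     import numpy as np
|
|     def attend(query, prev_latents):   # prev_latents: (T, d)
|         scores = prev_latents @ query / np.sqrt(len(query))
|         w = np.exp(scores - scores.max())
|         return (w / w.sum()) @ prev_latents   # convex mix
|
|     out = attend(np.ones(8), np.eye(8))   # toy call
|
| So each new position's representation is literally a weighted
| average of the latents at every previous position, which carry
| more than whatever token the sampler happened to emit there.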
| timhigins wrote:
| This paper seems to focus on highly algorithmic/puzzle-like
| problems, which are not the typical application domain of LLMs,
| using a <500M parameter model. So my hunch is "reasoning" works
| much better for math, coding, factual recall, and writing tasks
| that most LLMs actually deal with.
| throwawaymaths wrote:
| Why is it unreasonable that giving the LLM a spot to think,
| collate long-range attention, and summarize, without the
| pressure of building a meaningful next token so quickly, would
| result in higher effectiveness?
| naasking wrote:
| I wonder if this finding would hold for something like Meta's
| Large Concept Models.
| ngruhn wrote:
| Man that "Unreasonable Effectiveness of ..." pattern is getting a
| bit overused. With the original paper [1] you could still say
| that there really is some deeply philosophical mystery. But they
| now slap that on everything.
|
| [1]
| https://en.m.wikipedia.org/wiki/The_Unreasonable_Effectivene...
| MeteorMarc wrote:
| What is not unreasonable about intermediate tokens without
| reason? See the abstract.
| godelski wrote:
| It's not "unreasonable" if you weren't anthropomorphizing
| COT, equating it to thinking or "internal dialogue." The
| results aren't surprising to people in this camp, but I also
| wouldn't say that makes the work less impactful.
|
| But it would also be more unreasonable to dismiss the fact
| that a significant portion of the research community (and
| even greater portion of the public) was operating under these
| beliefs: that COT was akin to thinking (it's literally there
| in the name...). It is possible to disagree with something
| but also not believe someone is being unreasonable by coming
| to different conclusions.
| jvanderbot wrote:
| In this case it's probably more a pun (intentional or not I
| guess) about "reasonless" or "unreason"
| godelski wrote:
| It's also worth pointing out that Wigner's (position) paper[0]
| is really about something that would sound silly today. He's
| arguing that we should use math to drive physics. Today, many
| people think these are indistinguishable things and you'll get
| into arguments about math being invented or discovered. But
| Wigner is talking about how mathematics provides us with a
| framework where we can drive physics forward through theory
| instead of relying purely upon experimentation to poke and prod
| the universe.
|
| It is rather "unreasonable" to think we can explore the world
| simply through pen and paper, from the comfort of a chair.
| You'd think you'd need to go out and touch grass, but
| incredibly this is not necessary.
|
|   | The first point is that the enormous usefulness of
|   | mathematics in the natural sciences is something bordering
|   | on the mysterious and that there is no rational explanation
|   | for it. Second, it is just this uncanny usefulness of
|   | mathematical concepts that raises the question of the
|   | uniqueness of our physical theories.
|
| Which is exactly why a lot of these other things are overused.
| Hamming's seems like an extension or corollary[1] and I even
| think Norvig's (Halevy's) is highly appropriate[2]. It is
| "unreasonable" to think these things would be effective.
| -------------------------------------
|
| With this paper?
|
| I think it is fine. It is being used in a similar way to
| Wigner, with similar context.
|
| I can see two camps. One has always interpreted the COT as
| analogous to a model's internal dialogue. While the other has
| always thought there's a much larger gap between the
| manipulations within latent representations and what has been
| decoded, not necessarily needing to be strongly aligned.[3] To
| the former, the results here would be shocking, while to the
| latter it is "yes, and?" Clearly they're addressing the former
| camp. There were plenty of people that Wigner did not need to
| convince.
|
| I'm of the latter camp[4], and I'm happy people are not just
| asserting but demonstrating. Honestly, I'm even frequently
| upset when works get dismissed because they "demonstrate
| something we already knew" but no one had ever actually
| demonstrated. _The proofs and evidence are more important than
| the answer_. Quite often we're highly certain about results
| but they are difficult to even evidence (let alone prove). I
| mean, it would be quite silly to dismiss a proof that P != NP,
| even though the vast majority of us have long been convinced
| that this is the relationship we'll end up with. Yet, no one's
| done it.
| -------------------------------------
|
| [0]
| https://web.archive.org/web/20210212111540/http://www.dartmo...
|
| [1]
| https://math.dartmouth.edu/~matc/MathDrama/reading/Hamming.h...
|
| [2]
| https://static.googleusercontent.com/media/research.google.c...
|
| [3] Both camps can be further broken down too. Lots of nuances
| and opinions here and the lines really get fuzzy as we try to
| make it more accurate. I don't want to pretend there's a hard
| defining line, but the distinction helps the discussion and I
| think is reasonably accurate enough. Let me know if you think
| it is a gross mischaracterization.
|
| [4] I can expand more why this side seems "obvious" to me. But
| a warning, you can probably guess I'm not good at being terse.
|
| [Note]: I'd even go so far as to say we should revisit
| Wigner's argument around AI. I'm certain mathematics can be
| and will be "unreasonably effective." But not enough time has
| been dedicated to formulating the right type of math to use.
| We really do have to invent a new kind here. This may sound
| weird to non-mathematicians, but even physics uses multiple
| kinds of mathematics. The operations, fields, and algebras you
| use in one part may not be appropriate in another part. That's
| okay. But we don't have a TOE (theory of everything) yet
| either, and a critical part of finding one is bringing all
| this together.
| tim333 wrote:
| >It's also worth pointing out that Wigner's (position)
| paper[0] is really about something that would sound silly
| today. He's arguing that we should use math to drive physics.
|
| I think you misinterpret what it's about. He's pointing out
| how remarkable it is that the universe obeys laws like E=mc^2
| exactly, as far as we can tell, which is probably not
| something you would expect just from looking at the world. The
| pre-scientific understanding of the world was that it was
| driven by gods and spirits. The mathematical laws were only
| discovered by scientific investigation.
|
| Or as he puts it:
|
| >The miracle of the appropriateness of the language of
| mathematics for the formulation of the laws of physics is a
| wonderful gift which we neither understand nor deserve.
|
| If he were just saying "use maths" it would be boring, and
| not a famous paper 65 years on.
| gipp wrote:
| Engineering bloggers' love of parroting the titles of famous
| papers/articles (unreasonable effectiveness..., all you need
| is..., ...considered harmful, etc.) has always been lightly
| annoying to me.
| airza wrote:
| It's just not that common for the same person to have serious
| engineering chops and writing abilities.
| jorvi wrote:
| With software engineering, every single thing in the 2010s
| had "syntactic sugar" and "sane defaults". I still get a
| slight blood pressure spike whenever someone uses either of
| those terms.
| joe_the_user wrote:
| I guess those are overused but at least they have some
| meaning. "Unreasonable Effectiveness..." is essentially
| pure meaninglessness.
| tim333 wrote:
| It was meaningful in the original paper.
| ruuda wrote:
| "modern"
| layer8 wrote:
| All you need is for the unreasonable effectiveness of
| snowclones to be considered harmful.
| EGreg wrote:
| Would you go further, and say that Unreasonable
| Effectiveness... is considered harmful?
| kevindamm wrote:
| Indeed, considering unreasonable effectiveness harmful is all
| you need
| dkga wrote:
| TIL. I am not from an engineering/physics background so for me
| the original Unreasonable Effectiveness paper was Karpathy's
| blog post about RNNs.
| godelski wrote:
| (Karpathy's might be more a call back to Halevy, Norvig, and
| Pereira's "The Unreasonable Effectiveness of Data"[0].)
|
| But I think it is a good example that fits the OP's critique
| (I don't think the critique fits the arXiv paper, even though
| I expected the main results; see my main comment).
|
| The "unreasonableness" in Karpathy's post[1] is using
| sequencing to process non-sequential data. But the reason
| this isn't unreasonable is that we explicitly expect non-
| sequential processes to be able to be reformulated as
| sequential ones.
|
| The SVHN (house numbers) example he shows is actually a great
| example of this. We humans don't process that all at once. Our
| eyes similarly dart around, even if very fast. Or we might
| think about how to draw a picture. We don't do everything at
| once, but we work in sections, building up, and have layers
| that end up being ordered even though this technically isn't a
| requirement. I'm actually struggling to think of things that
| cannot be broken down into sequences. He says as much here:
|
|   | an important point to realize is that even if your
|   | inputs/outputs are fixed vectors, it is still possible to
|   | use this powerful formalism to process them in a
|   | sequential manner.
|
| So really the question is: what part of this was
| unreasonable? Or what part was unexpected? Honestly, we
| should be expecting this as the nature of neural nets is
| itself sequential, data being processed layer by layer. Hell,
| every computer program has a trace, which is sequential. I
| can give tons of examples. So it is quite reasonable that
| sequential processing should work.
|
| [0] https://static.googleusercontent.com/media/research.googl
| e.c...
|
| [1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/
| theptip wrote:
| So is the interpretation here something like "CoT tokens are
| actually neuralese"? They do boost performance, so the model
| must be stashing some intermediate reasoning outputs there.
| But perhaps not using the literal human meaning of those
| tokens?
___________________________________________________________________
(page generated 2025-05-23 23:00 UTC)