[HN Gopher] Beyond Semantics: Unreasonable Effectiveness of Reas...
       ___________________________________________________________________
        
       Beyond Semantics: Unreasonable Effectiveness of Reasonless
       Intermediate Tokens
        
       Author : nyrikki
       Score  : 127 points
       Date   : 2025-05-23 16:13 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | nullc wrote:
       | Even when you train AI on human language, the tokens can have
       | "subtext" that is only legible to the AI. And, unfortunately,
       | it's not legible to the AI in any way that it could ever
       | explain to us.
       | 
       | It's no different than how in English we can signal that a
       | statement is related to a kind of politics or that it's about sex
       | through particular word and phrase choice.
       | 
       | Training for reasoning should be expected to amplify the subtext,
       | since any random noise in the selection that by chance is
       | correlated with the right results will get amplified.
       | 
       | Perhaps you could try to dampen this by training two distinct
       | models for a while, then swap their reasoning for a while before
       | going back-- but sadly distinct models may still end up with
       | similar subtexts due to correlations in their training data.
       | Maybe ones with very distinct tokenization would be less likely
       | to do so.
        
         | nihakue wrote:
         | This is such a bonkers line of thinking, I'm so intrigued. So a
         | particular model will have an entire 'culture' only available
         | or understandable to itself. Seems kind of lonely. Like some
         | symbols might activate together for reasons that are totally
         | incomprehensible to us, but make perfect sense to the model. I
         | wonder if an approach like the one in
         | https://www.anthropic.com/research/tracing-thoughts-language...
         | could ever give us insight into any 'inside jokes' present in
         | the model.
         | 
         | I hope that research into understanding LLM qualia eventually
         | allows us to understand e.g. what it's like to [be a bat](
         | https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F)
        
           | nullc wrote:
           | In some sense it's more human than a model trained with no RL
           | and which has absolutely no exposure to its own output.
           | 
           | We have our own personal 'culture' too-- it's just less
           | obvious because it's tied up with our own hidden state. If
           | you go back and read old essays that you wrote you might
           | notice some of it-- ideas and feelings (maybe smells?) that
           | are absolutely not explicit in the text immediately come
           | back to you, stuff that no one, or maybe only a spouse or
           | very close friend, would think of.
           | 
           | I think it may be very hard to explore hidden subtext
           | because the signals may be almost arbitrarily weak and
           | context dependent. The bare model may need only a little
           | nudge to get to the right answer, and then you have this big
           | wall of "reasoning" where each token could carry a very
           | small amount of subtext that cumulatively adds up to a lot
           | and pushes things in the right direction.
        
         | candiddevmike wrote:
         | IMO this is why natural language will always be a terrible
         | _interface_--because English is a terrible _language_ where
         | words can have wildly different meanings that change over time.
         | There's no ambiguity of intention with traditional UX (or
         | even programming languages).
        
           | nullc wrote:
           | It can happen more or less no matter what language the model
           | uses, so long as it's reinforcement trained. It's just that
           | in English we have the illusion of understanding the
           | meaning.
           | 
           | An example of this is toki pona, a minimalist constructed
           | human language that is designed to only express "positive
           | thinking". Yet it is extremely easy to insult people in toki
           | pona: e.g. sina toki li pona pona pona pona. (you are
           | speaking very very very very well).
           | 
           | To be free of a potential subtext side channel, there would
           | have to be essentially no equivalent outputs.
        
             | pona-a wrote:
             | Can't you just say "sina toki ike suli a." (you are
             | speaking very bad <exclamation>)? Just because it doesn't
             | have official swearwords like most natural languages
             | doesn't mean you can only express "positive thinking".
        
               | nullc wrote:
               | My mistake, in the future I'll refrain from using Toki
               | pona for making a rhetorical point. :)
        
       | modeless wrote:
       | > we then train models on noisy, corrupted traces which have no
       | relation to the specific problem each is paired with, and find
       | that not only does performance remain largely consistent with
       | models trained on correct data, but in some cases can improve
       | upon it
       | 
       | This is the interesting part. We've probably all had the
       | experience where the model is going off the rails during the
       | thinking process but somehow spits out the right answer at the
       | end. Apparently the reasoning doesn't even need to be correct
       | during training?
       | 
       | I guess it suggests to me that the reason CoT helps is that the
       | model gets more compute to think internally, not that the words
       | it produces are meaningful. I'm surprised nobody has come up with
       | a good scheme for adaptive compute per token yet. Maybe we can
       | skip CoT entirely.
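        | 
        | To be concrete about what I mean by "adaptive compute per
        | token", here is a naive early-exit toy in Python (purely
        | illustrative; `layers`, `lm_head`, and the threshold are
        | placeholders of mine, not an existing API):
        | 
        |     import torch
        | 
        |     def early_exit_forward(layers, lm_head, x, thresh=0.9):
        |         # x: (batch, seq, d_model) hidden states
        |         for i, layer in enumerate(layers):
        |             x = layer(x)
        |             probs = lm_head(x[:, -1]).softmax(-1)
        |             if probs.max() > thresh:
        |                 return probs, i + 1   # stop early, save compute
        |         return probs, len(layers)     # used every layer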
        
         | trehalose wrote:
         | > We've probably all had the experience where the model is
         | going off the rails during the thinking process but somehow
         | spits out the right answer at the end. Apparently the reasoning
         | doesn't even need to be correct during training?
         | 
         | How do we know if the reasoning was correct or not? Do we have
         | more information about what the model was thinking besides just
         | what it _says_ it was thinking?
        
           | rickyhatespeas wrote:
           | It's definitely not explicitly writing out everything it's
           | "thinking": if you consider all the dimensions of the latent
           | space that are involved, that can't really be expressed in a
           | sentence.
           | 
           | CoT builds on existing prompt engineering techniques,
           | essentially adding reinforcement learning to force the
           | models to build their own CoT prompt. So it's not what the
           | model is thinking, but all indications are that it does
           | guide the reasoning abilities of LLMs through the output
           | distribution.
        
           | modeless wrote:
           | During fine tuning the model does not produce reasoning
           | traces, it consumes them. And the researchers presented it
           | with traces deliberately constructed to be wrong except for
           | the answer at the end. That's easy enough to do.
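            | 
            | Something along these lines (my own toy sketch of the
            | general idea, not the authors' exact procedure, and the
            | field names are made up): keep each problem and its correct
            | final answer, but swap in a trace from a different problem
            | before fine-tuning.
            | 
            |     def corrupt_traces(examples):
            |         # examples: dicts with "problem", "trace", "answer"
            |         rotated = examples[1:] + examples[:1]
            |         return [
            |             {"problem": ex["problem"],
            |              "trace": other["trace"],   # unrelated trace
            |              "answer": ex["answer"]}    # correct answer kept
            |             for ex, other in zip(examples, rotated)
            |         ]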
        
         | kelseyfrog wrote:
         | > I'm surprised nobody has come up with a good scheme for
         | adaptive compute per token yet.
         | 
         | I have one, I just don't have the time or money to research it
         | :(
        
           | golol wrote:
           | Post it let's go.
        
         | istjohn wrote:
         | Uh... hmmm... uhhh... ummm...
        
         | AlexCoventry wrote:
         | No, the words are meaningful to it. It's effectively using the
         | CoT text as a "scratch space" for intermediate steps it can't
         | calculate on one iteration through the transformer. These
         | papers give examples of how it works:
         | 
         | - https://physics.allen-zhu.com/part-2-grade-school-
         | math/part-...
         | 
         | - https://physics.allen-zhu.com/part-3-knowledge/part-3-3
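          | 
          | A toy illustration of the scratch-space idea (mine, not taken
          | from those papers): each written-out intermediate step stores
          | a partial result the model doesn't have to produce in a
          | single pass.
          | 
          |     prompt = (
          |         "Q: What is 23 * 47?\n"
          |         "A: 23 * 47 = 23 * 40 + 23 * 7\n"
          |         "   = 920 + 161\n"
          |         "   = 1081\n"
          |     )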
        
           | modeless wrote:
           | I mean, this theory is directly contradicted by the paper
           | under discussion. If you want to assert this then you need to
           | be arguing why the paper is wrong.
        
         | thomastjeffery wrote:
         | That sounds to me more like evidence that an LLM is never
         | reasoning at all, even when it looks like it is.
         | 
         | The mock conversation that is written between think tags is not
         | a conversation. It's the collection of tokens that are most
         | likely to be written after a prompt to a model that was trained
         | on example conversations.
         | 
         | Why is that different? In a real conversation, participants use
         | logic to choose what is worth saying next. The next statement
         | is already determined in the speaker's mind to be logically
         | sound. In a mock conversation (the LLM's CoT), there is no
         | logic. The next statement is only determined to be
         | statistically familiar, then written immediately.
         | 
         | The end result of a desirable CoT interaction is text that
         | would have been written by a thoughtful/logical
         | conversationalist. Whether or not the mock conversation itself
         | is _logically consistent_ with the mock conclusion is
         | irrelevant, because the LLM is only concerned with how familiar
         | that mock conclusion is to the prompt, its mock conversation,
         | and its training.
         | 
          | The overall vibe of how something is written behaves as a
          | replacement for actual logic. Logical deduction is replaced
          | with measures of confidence, conversational turns, etc. in
          | writing style. It all works out in the end because we are so
          | consistent in the style in which we write real logical
          | deductions that we have ended up providing an invisible
          | semantics for the LLM to follow.
         | 
         | There is something meaningful that we are entirely blind to.
         | Unfortunately, it doesn't follow rules the way logic does, so
         | it's not a trustworthy replacement. Fortunately, it's useful
         | for more general exploration.
        
         | x_flynn wrote:
         | I like to think of the intermediate tokens as low-dimensional
         | hidden states. Also see the Coconut paper/citation
        
       | valine wrote:
       | I think it's helpful to remember that language models are not
       | producing tokens, they are producing a distribution of possible
       | next tokens. Just because your sampler picks a sequence of tokens
       | that contain incorrect reasoning doesn't mean a useful reasoning
       | trace isn't also contained within the latent space.
       | 
        | It's a misconception that transformers reason in token space.
        | Tokens don't attend to other tokens. High dimensional latents
        | attend to other high dimensional latents. The final layer of a
        | decoder-only transformer has full access to the entire latent
        | space of all previous positions, the same latents you can
        | project into a distribution of next tokens.
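        | 
        | A minimal sketch of that last point, using GPT-2 via the
        | HuggingFace API as a stand-in (the variable names here are just
        | mine): the sampler only ever sees a single linear projection of
        | the final-layer latent, never the latent itself.
        | 
        |     import torch
        |     from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |     tok = AutoTokenizer.from_pretrained("gpt2")
        |     model = AutoModelForCausalLM.from_pretrained("gpt2")
        | 
        |     ids = tok("The cat sat", return_tensors="pt").input_ids
        |     out = model(ids, output_hidden_states=True)
        | 
        |     h = out.hidden_states[-1][:, -1, :]    # final-layer latent
        |     logits = model.lm_head(h)              # one linear map to vocab
        |     probs = logits.softmax(-1)
        |     next_id = torch.multinomial(probs, 1)  # all the sampler sees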
        
         | woadwarrior01 wrote:
         | > Just because your sampler picks a sequence of tokens that
         | contain incorrect reasoning doesn't mean a useful reasoning
         | trace isn't also contained within the latent space.
         | 
         | That's essentially the core idea in Coconut[1][2], to keep the
         | reasoning traces in a continuous space.
         | 
         | [1]: https://arxiv.org/abs/2412.06769
         | 
         | [2]: https://github.com/facebookresearch/coconut
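          | 
          | Roughly the core idea as I understand it (my own toy sketch
          | assuming a HuggingFace-style causal LM, not the code from
          | [2]): instead of sampling a token and re-embedding it, feed
          | the last hidden state straight back in as the next input
          | embedding.
          | 
          |     import torch
          | 
          |     def continuous_thought(model, embeds, n_latent_steps):
          |         # embeds: (batch, seq, d_model) input embeddings
          |         for _ in range(n_latent_steps):
          |             out = model(inputs_embeds=embeds,
          |                         output_hidden_states=True)
          |             last = out.hidden_states[-1][:, -1:, :]
          |             # append the raw latent as the next "token";
          |             # no projection to the vocabulary at all
          |             embeds = torch.cat([embeds, last], dim=1)
          |         return embeds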
        
         | jacob019 wrote:
         | So you're saying that the reasoning trace represents sequential
         | connections between the full distribution rather than the
         | sampled tokens from that distribution?
        
           | valine wrote:
           | The lower dimensional logits are discarded, the original high
           | dimensional latents are not.
           | 
           | But yeah, the LLM doesn't even know the sampler exists. I
           | used the last layer as an example, but it's likely that
           | reasoning traces exist in the latent space of every layer not
           | just the final one, with the most complex reasoning
           | concentrated in the middle layers.
        
             | jacob019 wrote:
             | I don't think that's accurate. The logits actually have
             | high dimensionality, and they are intermediate outputs used
             | to sample tokens. The latent representations contain
             | contextual information and are also high-dimensional, but
             | they serve a different role--they feed into the logits.
        
               | valine wrote:
               | The dimensionality I suppose depends on the vocab size
               | and your hidden dimension size, but that's not really
               | relevant. It's a single linear projection to go from
               | latents to logits.
               | 
               | Reasoning is definitely not happening in the linear
               | projection to logits if that's what you mean.
        
               | pyinstallwoes wrote:
               | Where does it happen ?
        
               | valine wrote:
               | My personal theory is that it's an emergent property of
               | many attention heads working together. If each attention
               | head is a bird, reasoning would be the movement of the
               | flock.
        
             | bcoates wrote:
             | Either I'm wildly misunderstanding or that can't possibly
             | be true--if you sample at high temperature and it chooses a
             | very-low probability token, it continues consistent with
             | the chosen token, not with the more likely ones
        
               | valine wrote:
               | Attention computes a weighted average of all previous
               | latents. So yes, it's a new token as input to the forward
               | pass, but after it feeds through an attention head it
               | contains a little bit of every previous latent.
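                | 
                | In toy form (single head, my own notation): each row of
                | the output is a softmax-weighted average of all earlier
                | value vectors, so every new latent mixes in the old
                | ones.
                | 
                |     import math, torch
                | 
                |     def causal_attn(X, Wq, Wk, Wv):
                |         Q, K, V = X @ Wq, X @ Wk, X @ Wv
                |         d = Q.shape[-1]
                |         s = Q @ K.transpose(-2, -1) / math.sqrt(d)
                |         mask = torch.triu(torch.ones_like(s), 1).bool()
                |         s = s.masked_fill(mask, float("-inf"))
                |         w = s.softmax(-1)   # each row sums to 1
                |         return w @ V        # weighted avg of prior latents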
        
         | x_flynn wrote:
          | What the model is doing in latent space is auxiliary to
          | anthropomorphic interpretations of the tokens, though. And if
         | the latent reasoning matches a ground-truth procedure (A*),
         | then we'd expect it to be projectable to semantic tokens, but
         | it isn't. So it seems the model has learned an alternative
         | method for solving these problems.
        
           | refulgentis wrote:
           | It is worth pointing out that "latent space" is meaningless.
           | 
           | There's a lot of stuff that makes this hard to discuss, ex.
           | "projectable to semantic tokens" you mean "able to be written
           | down"...right?
           | 
           | Something I do to make an idea really stretch its legs is
           | reword it in Fat Tony, the Taleb character.
           | 
           | Setting that aside, why do we think this path finding can't
           | be written down?
           | 
           | Is Claude/Gemini Plays Pokemon just an iterated A* search?
        
           | valine wrote:
           | You're thinking about this like the final layer of the model
           | is all that exists. It's highly likely reasoning is happening
           | at a lower layer, in a different latent space that can't
           | natively be projected into logits.
        
         | aiiizzz wrote:
          | Is that really true? E.g. Anthropic said that the model can
          | make decisions about all the tokens before a single token is
          | produced.
        
           | valine wrote:
           | That's true yeah. The model can do that because calculating
           | latents is independent of next token prediction. You do a
           | forward pass for each token in your sequence without the
           | final projection to logits.
        
       | timhigins wrote:
       | This paper seems to focus on highly algorithmic/puzzle-like
       | problems, which are not the typical application domain of LLMs,
       | using a <500M parameter model. So my hunch is "reasoning" works
       | much better for math, coding, factual recall, and writing tasks
       | that most LLMs actually deal with.
        
       | throwawaymaths wrote:
        | Why is it unreasonable that giving the LLM a spot to think,
        | collate long range attention, and summarize, without the
        | pressure of building a meaningful next token so quickly, would
        | result in higher effectiveness?
        
         | x_flynn wrote:
         | It's more about the lack of semantic meaning in the
         | intermediate tokens, not that they aren't effective (even when
         | the intermediates are wrong)
        
       | naasking wrote:
       | I wonder if this finding would hold for something like Meta's
       | Large Concept Models.
        
       | ngruhn wrote:
       | Man that "Unreasonable Effectiveness of ..." pattern is getting a
       | bit overused. With the original paper [1] you could still say
       | that there really is some deeply philosophical mystery. But they
       | now slap that on everything.
       | 
       | [1]
       | https://en.m.wikipedia.org/wiki/The_Unreasonable_Effectivene...
        
         | MeteorMarc wrote:
         | What is not unreasonable about intermediate tokens without
         | reason? See the abstract.
        
           | godelski wrote:
           | It's not "unreasonable" if you weren't anthropomorphizing
            | CoT, equating it to thinking or "internal dialogue." The
           | results aren't surprising to people in this camp, but I also
           | wouldn't say that makes the work less impactful.
           | 
           | But it would also be more unreasonable to dismiss the fact
           | that a significant portion of the research community (and
           | even greater portion of the public) was operating under these
            | beliefs: that CoT was akin to thinking (it's literally there
           | in the name...). It is possible to disagree with something
           | but also not believe someone is being unreasonable by coming
           | to different conclusions.
        
         | jvanderbot wrote:
         | In this case it's probably more a pun (intentional or not I
         | guess) about "reasonless" or "unreason"
        
         | godelski wrote:
          | It's also worth pointing out that Wigner's (position) paper[0]
          | is really about something that would sound silly today. He's
          | arguing that we should use math to drive physics. Today, many
          | people think these are indistinguishable things and you'll get
          | into arguments about math being invented or discovered. But
          | Wigner is talking about how mathematics provides us with a
          | framework where we can drive physics forward through theory
          | instead of relying purely upon experimentation to poke and
          | prod the universe.
         | 
          | It is rather "unreasonable" to think we can explore the world
          | simply through pen and paper, from the comfort of a chair.
          | You'd think you'd need to go out and touch grass, but
          | incredibly this is not necessary.
          | 
          |   | The first point is that the enormous usefulness of
          |   | mathematics in the natural sciences is something bordering
          |   | on the mysterious and that there is no rational
          |   | explanation for it. Second, it is just this uncanny
          |   | usefulness of mathematical concepts that raises the
          |   | question of the uniqueness of our physical theories.
         | 
         | Which is exactly why a lot of these other things are overused.
         | Hamming's seems like an extension or corollary[1] and I even
         | think Norvig's (Halevy's) is highly appropriate[2]. It is
         | "unreasonable" to think these things would be effective.
         | -------------------------------------
         | 
         | With this paper?
         | 
          | I think it is fine. It is being used in a similar way to
          | Wigner's, with similar context.
         | 
          | I can see two camps. One has always interpreted CoT as
          | analogous to a model's internal dialogue, while the other has
          | always thought there's a much larger gap between the
          | manipulations within latent representations and what has been
          | decoded, not necessarily needing to be strongly aligned.[3]
          | To the former, the results here would be shocking, while to
          | the latter it is "yes, and?" Clearly they're addressing the
          | former camp. There were plenty of people that Wigner did not
          | need to convince.
         | 
          | I'm of the latter camp[4], and I'm happy people are not just
          | asserting but demonstrating. Honestly, I'm even frequently
          | upset when works get dismissed because they "demonstrate
          | something we already knew" but no one had ever actually
          | demonstrated. _The proofs and the evidencing are more
          | important than the answer_. Quite often we're highly certain
          | about results but they are difficult to even evidence (let
          | alone prove). I mean it would be quite silly to dismiss a
          | proof that P != NP, even though the vast majority of us have
          | long been convinced that this is the relationship we'll end
          | up with. Yet, no one's done it.
          | 
          | -------------------------------------
         | 
         | [0]
         | https://web.archive.org/web/20210212111540/http://www.dartmo...
         | 
         | [1]
         | https://math.dartmouth.edu/~matc/MathDrama/reading/Hamming.h...
         | 
         | [2]
         | https://static.googleusercontent.com/media/research.google.c...
         | 
         | [3] Both camps can be further broken down too. Lots of nuances
         | and opinions here and the lines really get fuzzy as we try to
         | make it more accurate. I don't want to pretend there's a hard
         | defining line, but the distinction helps the discussion and I
         | think is reasonably accurate enough. Let me know if you think
         | it is a gross mischaracterization.
         | 
         | [4] I can expand more why this side seems "obvious" to me. But
         | a warning, you can probably guess I'm not good at being terse.
         | 
          | [Note]: I'd even go so far as to say we should revisit
          | Wigner's argument around AI. I'm certain mathematics can be
          | and will be "unreasonably effective." But not enough time has
          | been dedicated to formulating the right type of math to use.
          | We really do have to invent a new kind here. This may sound
          | weird to non-mathematicians, but even physics uses multiple
          | kinds of mathematics. The operations, fields, and algebras
          | you use in one part may not be appropriate in another part.
          | That's okay. But we don't have a TOE yet either, and bringing
          | all this together is a critical part of finding one.
        
           | tim333 wrote:
           | >It's also worth pointing out that Winger's (position)
           | paper[0] is really about something that would sound silly
           | today. He's arguing that we should use math to drive physics.
           | 
            | I think you misinterpret what it's about. He's pointing out
            | how remarkable it is that the universe obeys laws like
            | E=mc^2 exactly, as far as we can tell, which is not
            | something you would probably expect just from looking at
            | the world. The pre-scientific understanding of the world
            | was that it was driven by gods and spirits. The
            | mathematical laws were only discovered by scientific
            | investigation.
           | 
           | Or as he puts it:
           | 
           | >The miracle of the appropriateness of the language of
           | mathematics for the formulation of the laws of physics is a
           | wonderful gift which we neither understand nor deserve.
           | 
           | If he was just saying use maths it would be boring and not a
           | famous paper 65 years on.
        
             | godelski wrote:
             | > I think you misinterpret what it's about. He's pointing
             | out how remarkable it is that the universe obeys laws
             | like...
             | 
              | I apologize for not being clear. But we are talking about
              | the same thing.
              | 
              |   > The pre-scientific understanding of the world was
              |   > that it was driven by gods and spirits.
              | 
              | Wigner's paper was written in 1960. I do not think such
              | claims needed to be made. Those arguments were prolific
              | and had been made for centuries. He did not need to
              | convince anyone in the scientific community that the laws
              | of nature were not driven by gods and spirits. By the
              | 1960s the scientific age was already mature and it was
              | well established in the community that the laws of nature
              | are not the domain of gospel.
              | 
              |   | "Well, now you are pushing your joke too far," said
              |   | the classmate, "surely the population has nothing to
              |   | do with the circumference of the circle."
             | 
             | The point is made here. It is surprising that math
             | describes reality. It is surprising that a circle has
             | anything to do with a population.
             | 
             | I really did mean it when I said "about something that
             | would sound silly today". We take this for granted now,
             | with 60 years of working under this framework, but this
             | wasn't always so. It seems silly now because much of the
             | math we learn is in science classes and even outside we
             | have a particular focus of teaching math most relating to
             | science, but this is a small portion of a much larger
             | field. Fwiw, I am not saying this as a complete outsider, I
             | have a degree in physics.
             | 
             | It is also worth paying attention to the fact that Wigner
             | helped create Mathematical Physics[0]. "Mathematical
             | Physics" is not a pleonasm.
             | 
              | Don't take it just on my word! The Wiki page says
              | something extremely similar!
              | 
              |   | In it, Wigner observes that a theoretical physics's
              |   | mathematical structure often points the way to
              |   | further advances in that theory and to empirical
              |   | predictions. Mathematical theories often have
              |   | predictive power in describing nature. [1]
              | 
              |   | Wigner argues that mathematical concepts have
              |   | applicability far beyond the context in which they
              |   | were originally developed[1]
              | 
              |   > The mathematical laws were only discovered by
              |   > scientific investigation.
             | 
             | I should make sure this is clear though (unsure which
             | interpretation you intend). Math and science aren't
             | interchangeable. Physics uses the language of math as its
             | main method for developing theories and logic. But it is
             | also important to stress that it doesn't use the same
             | mathematical language throughout. The frameworks that those
             | working in relativity use are not useful for those that
             | work in quantum mechanics. If the math was uniform, we
             | would not be dedicating so much time to bridge these. Nor
             | is math absolute here, as it is a map, and we still rely
             | heavily on the language of experimental evidence.
             | 
              | Yes, he was saying "use maths". Yes, it sounds silly
              | today, but so do a lot of things that needed to be said
              | in the past. I see no reason that the (now) obvious claim
              | by Copernicus would make him any less famous.
             | 
             | [0] https://en.wikipedia.org/wiki/Mathematical_physics
             | 
             | [1] https://en.wikipedia.org/wiki/The_Unreasonable_Effectiv
             | eness...
        
               | tim333 wrote:
               | I think we agree on all the facts but I disagree on what
               | his message was and whether it sounds silly these days,
               | which is a matter of opinion I guess.
               | 
               | You have "use maths" which sounds silly, I take it as
               | "the appropriateness of maths is a miracle we don't
               | understand" which is deeper and still largely true.
        
               | godelski wrote:
               | To change your mind, what would I need to demonstrate?
               | I'm providing third party sources, I have personal
               | experience, I can quote from the original source. What is
               | missing that results in being unconvincing? I want to
               | make sure, there are ways to change your opinion, right?
               | 
               | I really encourage you to read that wiki page.
               | | The quantum theory of the Lamb shift, as conceived by
               | Bethe and established by Schwinger, is a purely
               | mathematical theory and the only direct contribution of
               | experiment was to show the existence of a measurable
               | effect. The agreement with calculation is better than one
               | part in a thousand."
               | 
                | I think you're missing a lot of context in that physics
                | was highly non-mathematical in the past. Physicists
                | called Einstein a mathematician. It isn't too hard to
                | see why, when he asserted that his theories were
                | correct and didn't need experimental confirmation.
                | 
                |   | Hamming argues that Albert Einstein's pioneering
                |   | work on special relativity was largely "scholastic"
                |   | in its approach. He knew from the outset what the
                |   | theory should look like (although he only knew this
                |   | because of the Michelson-Morley experiment), and
                |   | explored candidate theories with mathematical
                |   | tools, not actual experiments. Hamming alleges that
                |   | Einstein was so confident that his relativity
                |   | theories were correct that the outcomes of
                |   | observations designed to test them did not much
                |   | interest him. If the observations were inconsistent
                |   | with his theories, it would be the observations
                |   | that were at fault.
               | 
               | Hell, go read Ian Hacking, any metaphysics, or ask
               | ChatGPT. They will confirm what I'm saying. Even some of
               | this is discussed in An Opinionated History of
               | Mathematics[0], though much more focused on math. I'm
               | more mentioning it because it is good and helps provide
               | some of that historical context.
               | 
               | It is kinda crazy that a thing we created, without the
               | specific intent of modeling the world, ended up being so
               | great at modeling the world. That's the unreasonable
               | effectiveness.
               | 
               | In fairness, to change my opinion, you would need to show
               | me some chain of reasoning or a conversation Wigner is
               | clearly responding to that involves religion. Because
               | this is what I see, but around math not being physics,
               | and is what drives my interpretation.
               | 
               | [0] https://intellectualmathematics.com/opinionated-
               | history-of-m...
        
         | gipp wrote:
          | Engineering bloggers' love of parroting the titles of famous
          | papers/articles (unreasonable effectiveness..., all you need
          | is..., ... considered harmful, etc) has always been lightly
          | annoying to me
        
           | airza wrote:
           | It's just not that common for the same person to have serious
           | engineering chops and writing abilities.
        
           | jorvi wrote:
           | With software engineering, every single thing in the 2010s
           | had "syntactic sugar" and "sane defaults". I still get a
           | slight blood pressure spike whenever someone uses either of
           | those terms.
        
             | joe_the_user wrote:
             | I guess those are overused but at least they have some
             | meaning. "Unreasonable Effectiveness..." is essentially
             | pure meaninglessness.
        
               | tim333 wrote:
               | It was meaningful in the original paper.
        
             | ruuda wrote:
             | "modern"
        
             | nine_k wrote:
             | Fine. "The unreasonable effectiveness of syntactic sugar
             | considered harmful: all you need is sane defaults."
             | 
             | Now, in comparison, nothing in this thread is going to
             | annoy you!
        
           | layer8 wrote:
           | All you need is for the unreasonable effectiveness of
           | snowclones to be considered harmful.
        
         | EGreg wrote:
         | Would you go further, and say that Unreasonable
         | Effectiveness... is considered harmful?
        
           | kevindamm wrote:
           | Indeed, considering unreasonable effectiveness harmful is all
           | you need
        
             | lareter77 wrote:
             | Thank you, I have had a few beers but that really made me
             | laugh good.
        
         | dkga wrote:
         | TIL. I am not from an engineering/physics background so for me
         | the original Unreasonable Effectiveness paper was Karpathy's
         | blog post about RNNs.
        
           | godelski wrote:
           | (Karpathy's might be more a call back to Halevy, Norvig, and
           | Pereira's "The Unreasonable Effectiveness of Data"[0].)
           | 
            | But I think it is a good example that fits the OP's
            | critique (I don't think the critique fits the arXiv paper,
            | even though I expected the main results; see my main
            | comment).
           | 
           | The "unreasonableness" in Karpathy's post[1] is using
           | sequencing to process non-sequential data. But the reason
           | this isn't unreasonable is that we explicitly expect non-
           | sequential processes to be able to be reformulated as
           | sequential ones.
           | 
            | The SVHN (house numbers) example he shows is actually a
            | great illustration of this. We humans don't process that
            | all at once. Our eyes similarly dart around, even if very
            | fast. Or we might think about how to draw a picture. We
            | don't do everything at once, but we work in sections,
            | building up, and have layers that end up being ordered even
            | though this technically isn't a requirement. I'm actually
            | struggling to think of things that cannot be broken down
            | into sequences. He says as much here:
            | 
            |   | an important point to realize is that even if your
            |   | inputs/outputs are fixed vectors, it is still possible
            |   | to use this powerful formalism to process them in a
            |   | sequential manner.
           | 
           | So really the question is: what part of this was
           | unreasonable? Or what part was unexpected? Honestly, we
           | should be expecting this as the nature of neural nets is
           | itself sequential, data being processed layer by layer. Hell,
           | every computer program has a trace, which is sequential. I
           | can give tons of examples. So it is quite reasonable that
           | sequential processing should work.
           | 
           | [0] https://static.googleusercontent.com/media/research.googl
           | e.c...
           | 
           | [1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/
        
         | mensetmanusman wrote:
         | All you need is unreasonable effectiveness.
        
       | theptip wrote:
        | So is the interpretation here something like "CoT tokens are
        | actually neuralese"? They do boost performance, so the model
        | must be stashing some intermediate reasoning outputs there. But
        | perhaps not using the literal human meaning of those tokens?
        
         | x_flynn wrote:
         | Exactly, the traces lack semantics and shouldn't be
         | anthropomorphized. (I'm one of the students in the lab that
         | wrote this, but not one of the authors)
        
           | theptip wrote:
           | Thanks! So, how does this impact Deliberative Alignment[1],
           | where IIUC the intermediate tokens are assessed (eg for
           | referencing the appropriate policy fragment)?
           | 
            | Do you see your result as putting that paradigm in
            | question, or does the explicit reasoning assessment perhaps
            | ameliorate the issue?
           | 
           | [1]: https://arxiv.org/html/2412.16339v2
        
       | meltyness wrote:
       | Brought to you by lightspeed briefs
        
       | r0ze-at-hn wrote:
       | Very much related to this is "Chain-of-draft"
       | 
       | https://arxiv.org/abs/2502.18600
       | 
        | A similar level of results in a fraction of the tokens,
        | resulting in similar quality at lower cost for longer runs.
        | 
        | Also, when I'm interacting and need to read the responses
        | myself, shorter responses are much faster to read, so my own
        | speed is higher too.
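        | 
        | The prompt is roughly of this form (paraphrased from memory;
        | see the paper for the exact wording):
        | 
        |     COD_PROMPT = (
        |         "Think step by step, but keep only a minimal draft of "
        |         "each step, five words at most. Return the final "
        |         "answer after '####'."
        |     )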
        
       ___________________________________________________________________
       (page generated 2025-05-24 23:02 UTC)