[HN Gopher] Tracing the thoughts of a large language model
       ___________________________________________________________________
        
       Tracing the thoughts of a large language model
        
       Author : Philpax
       Score  : 1005 points
       Date   : 2025-03-27 17:05 UTC (2 days ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | JPLeRouzic wrote:
       | This is extremely interesting: The authors look at features (like
       | making poetry, or calculating) of LLM production, make hypotheses
       | about internal strategies to achieve the result, and experiment
       | with these hypotheses.
       | 
       | I wonder if there is somewhere an explanation linking the logical
       | operations made on a on dataset, are resulting in those
       | behaviors?
        
         | JKCalhoun wrote:
         | And they show the differences when the language models are made
         | larger.
        
       | EncomLab wrote:
       | This is very interesting - but like all of these discussions it
       | sidesteps the issues of abstractions, compilation, and execution.
       | It's fine to say things like "aren't programmed directly by
       | humans", but the abstracted code is not the program that is
       | running - the compiled code is - and that is code is executing
       | within the tightly bounded constraints of the ISA it is being
       | executed in.
       | 
       | Really this is all so much slight of hand - as an esolang fanatic
       | this all feels very familiar. Most people can't look a program
       | written in Whitespace and figure it out either, but once compiled
       | it is just like every other program as far as the processor is
       | concerned. LLM's are no different.
        
         | ctoth wrote:
         | And DNA? You are running on an instruction set of four symbols
         | at the end of the day but that's the wrong level of abstraction
         | to talk about your humanity, isn't it?
        
           | EncomLab wrote:
           | DNA is the instruction set for protein composition- not
           | thinking.
        
       | fpgaminer wrote:
       | > This is powerful evidence that even though models are trained
       | to output one word at a time
       | 
       | I find this oversimplification of LLMs to be frequently poisonous
       | to discussions surrounding them. No user facing LLM today is
       | trained on next token prediction.
        
         | JKCalhoun wrote:
         | As a layman though, I often see this description for how it is
         | LLMs work.
        
           | fpgaminer wrote:
           | Right, but it leads to too many false conclusions by lay
           | people. User facing LLMs are only trained on next token
           | prediction during initial stages of their training. They have
           | to go through Reinforcement Learning before they become
           | useful to users, and RL training occurs on complete
           | responses, not just token-by-token.
           | 
           | That leads to conclusions elucidated by the very article,
           | that LLMs couldn't possibly plan ahead because they are only
           | trained to predict next tokens. When the opposite conclusion
           | would be more common if it was better understood that they go
           | through RL.
        
             | mentalgear wrote:
             | What? The "article" is from anthropic, so I think they
             | would know what they write about.
             | 
             | Also, RL is an additional training process that does not
             | negate that GPT / transformers are left-right autoencoders
             | that are effectively next token predictors.
             | 
             | [Why Can't AI Make Its Own Discoveries? -- With Yann LeCun]
             | (https://www.youtube.com/watch?v=qvNCVYkHKfg)
        
               | pipes wrote:
               | Listening to this today, so far really good. Glad I found
               | it. Thanks.
        
             | TeMPOraL wrote:
             | You don't need RL for the conclusion "trained to predict
             | next token => only things one token ahead" to be wrong.
             | After all, the LLM is predicting that next token from
             | _something_ - a context, that 's many tokens long. Human
             | text isn't arbitrary and random, there are statistical
             | patterns in our speech, writing, thinking, that span words,
             | sentences, paragraphs - and even for next token prediction,
             | predicting correctly means learning those same patterns.
             | It's not hard to imagine the model generating token N is
             | already thinking about tokens N+1 thru N+100, by virtue of
             | statistical patterns of _preceding_ hundred tokens changing
             | with each subsequent token choice.
        
               | fpgaminer wrote:
               | True. See one of Anthropic's researcher's comment for a
               | great example of that. It's likely that "planning"
               | inherently exists in the raw LLM and RL is just bringing
               | it to the forefront.
               | 
               | I just think it's helpful to understand that all of these
               | models people are interacting with were trained with the
               | _explicit_ goal of maximizing the probabilities of
               | responses _as a whole_, not just maximizing probabilities
               | of individual tokens.
        
         | losvedir wrote:
         | That's news to me, and I thought I had a good layman's
         | understanding of it. How does it work then?
        
           | fpgaminer wrote:
           | All user facing LLMs go through Reinforcement Learning.
           | Contrary to popular belief, RL's _primary_ purpose isn't to
           | "align" them to make them "safe." It's to make them actually
           | usable.
           | 
           | LLMs that haven't gone through RL are useless to users. They
           | are very unreliable, and will frequently go off the rails
           | spewing garbage, going into repetition loops, etc.
           | 
           | RL learning involves training the models on entire responses,
           | not token-by-token loss (1). This makes them orders of
           | magnitude more reliable (2). It forces them to consider what
           | they're going to write. The obvious conclusion is that they
           | plan (3). Hence why the myth that LLMs are strictly next
           | token prediction machines is so unhelpful and poisonous to
           | discuss.
           | 
           | The models still _generate_ response token-by-token, but they
           | pick tokens _not_ based on tokens that maximize probabilities
           | at each token. Rather they learn to pick tokens that maximize
           | probabilities of the _entire response_.
           | 
           | (1) Slight nuance: All RL schemes for LLMs have to break the
           | reward down into token-by-token losses. But those losses are
           | based on a "whole response reward" or some combination of
           | rewards.
           | 
           | (2) Raw LLMs go haywire roughly 1 in 10 times, varying
           | depending on context. Some tasks make them go haywire almost
           | every time, other tasks are more reliable. RL'd LLMs are
           | reliable on the order of 1 in 10000 errors or better.
           | 
           | (3) It's _possible_ that they don't learn to plan through
           | this scheme. There are alternative solutions that don't
           | involve planning ahead. So Anthropic's research here is very
           | important and useful.
           | 
           | P.S. I should point out that many researchers get this wrong
           | too, or at least haven't fully internalized it. The lack of
           | truly understanding the purpose of RL is why models like
           | Qwen, Deepseek, Mistral, etc are all so unreliable and
           | unusable by real companies compared to OpenAI, Google, and
           | Anthropic's models.
           | 
           | This understanding that even the most basic RL takes LLMs
           | from useless to useful then leads to the obvious conclusion:
           | what if we used more complicated RL? And guess what, more
           | complicated RL led to reasoning models. Hmm, I wonder what
           | the next step is?
        
             | scudsworth wrote:
             | first footnote: ok ok they're trained token by token, BUT
        
               | MrMcCall wrote:
               | First rule of understanding: you can never understand
               | that which you don't want to understand.
               | 
               | That's why lying is so destructive to both our own
               | development and that of our societies. It doesn't matter
               | whether it's intentional or unintentional, it poisons the
               | infoscape either accidentally or deliberately, but poison
               | is poison.
               | 
               | And lies to oneself are the most insidious lies of all.
        
             | ImHereToVote wrote:
             | I feel this is similar to how humans talk. I never
             | consciously think about the words I choose. They just are
             | spouted off based on some loose relation to what I am
             | thinking about at a given time. Sometimes the process
             | fails, and I say the wrong thing. I quickly backtrack and
             | switch to a slower "rate of fire".
        
             | iambateman wrote:
             | This was fascinating, thank you.
        
             | yaj54 wrote:
             | This is a super helpful breakdown and really helps me
             | understand how the RL step is different than the initial
             | training step. I didn't realize the reward was delayed
             | until the end of the response for the RL step. Having the
             | reward for this step be dependent on _the coherent thought_
             | rather than _a coherent word_ now seems like an obvious and
             | critical part of how this works.
        
               | astrange wrote:
               | That post is describing SFT, not RL. RL works using
               | preferences/ratings/verifications, not entire
               | input/output pairs.
        
             | polishdude20 wrote:
             | When being trained via reinforcement learning, is the model
             | architecture the same then? Like, you first train the llm
             | as a next token predictor with a certain model architecture
             | and it ends up with certain weights. Then you apply RL to
             | that same model which modifies the weights in such a way as
             | to consider while responses?
        
               | ianand wrote:
               | The model architecture is the same during RL but the
               | training algorithm is substantially different.
        
             | anon373839 wrote:
             | I don't think this is quite accurate. LLMs undergo
             | supervised fine-tuning, which is still next-token
             | prediction. And that is the step that makes them usable as
             | chatbots. The step after that, preference tuning via RL, is
             | optional but does make the models better. (Deepseek-R1 type
             | models are different because the reinforcement learning
             | does heavier lifting, so to speak.)
        
               | fpgaminer wrote:
               | Supervised finetuning is only a seed for RL, nothing
               | more. Models that receive supervised finetuning before RL
               | perform better than those that don't, but it is not
               | strictly speaking necessary. Crucially, SFT does not
               | improve the model's reliability.
        
               | anon373839 wrote:
               | I think you're referring to the Deepseek-R1 branch of
               | reasoning models, where a small amount of SFT reasoning
               | traces is used as a seed. But for non-"reasoning" models,
               | SFT is very important and definitely imparts enhanced
               | capabilities and reliability.
        
             | ianand wrote:
             | > LLMs that haven't gone through RL are useless to users.
             | They are very unreliable, and will frequently go off the
             | rails spewing garbage, going into repetition loops,
             | etc...RL learning involves training the models on entire
             | responses, not token-by-token loss (1).
             | 
             | Yes. For those who want a visual explanation, I have a
             | video where I walk through this process including what some
             | of the training examples look like:
             | https://www.youtube.com/watch?v=DE6WpzsSvgU&t=320s
        
             | anonymousDan wrote:
             | Is there an equivalent of LORA using RL instead of
             | supervised fine tuning? In other words, if RL is so
             | important, is there some way for me as an end user to
             | improve a SOTA model with RL using my own data (i.e.
             | without access to the resources needed to train an LLM from
             | scratch) ?
        
               | fpgaminer wrote:
               | LORA can be used in RL; it's indifferent to the training
               | scheme. LORA is just a way of lowering the number of
               | trainable parameters.
        
             | richardatlarge wrote:
             | as a note: in human learning, and to a degree, animal
             | learning, the unit of behavior that is reinforced depends
             | on the contingencies-- an interesting example: a pigeon
             | might be trained to respond in a 3x3 grid (9 choices)
             | differently than the last time to get reinforcement. At
             | first the response learned is do different than the last
             | time, but as the requirement gets too long, the memory
             | capacity is exceeded-- and guess what, the animal learns to
             | respond randomly-- eventually maximizing its reward
        
             | gwd wrote:
             | > RL learning involves training the models on entire
             | responses, not token-by-token loss... The obvious
             | conclusion is that they plan.
             | 
             | It is worth pointing out the "Jailbreak" example at the
             | bottom of TFA: According to their figure, it starts to say,
             | "To make a", not realizing there's anything wrong; only
             | when it actually outputs "bomb" that the "Oh wait, I'm not
             | supposed to be telling people how to make bombs" circuitry
             | wakes up. But at that point, it's in the grip of its "You
             | must speak in grammatically correct, coherent sentences"
             | circuitry and can't stop; so it finishes its first sentence
             | in a coherent manner, then refuses to give any more
             | information.
             | 
             | So while it sometimes does seem to be thinking ahead (e.g.,
             | the rabbit example), there are times it's clearly not
             | thinking _very far_ ahead.
        
             | losvedir wrote:
             | Oooh, so the pre-training is token-by-token but the RL step
             | rewards the answer based on the full text. Wow! I knew that
             | but never really appreciated the significance of it. Thanks
             | for pointing that out.
        
             | gwern wrote:
             | > All user facing LLMs go through Reinforcement Learning.
             | Contrary to popular belief, RL's _primary_ purpose isn't to
             | "align" them to make them "safe." It's to make them
             | actually usable.
             | 
             | Are you claiming that non-myopic token prediction emerges
             | solely from RL, and if Anthropic does this analysis on
             | Claude _before_ RL training (or if one examines other
             | models where no RLHF was done, such as old GPT-2
             | checkpoints), none of these advance prediction mechanisms
             | will exist?
        
               | fpgaminer wrote:
               | No, it probably exists in the raw LLM and gets both
               | significantly strengthened and has its range extended.
               | Such that it dominates the model's behavior, making it
               | several orders of magnitude more reliable in common
               | usage. Kinda of like how "reasoning" exists in a weak,
               | short range way in non-reasoning models. With RL that
               | encourages reasoning, that machinery gets brought to the
               | forefront and becomes more complex and capable.
        
               | rafaelero wrote:
               | So why did you feel the need to post that next-token
               | prediction is not the reason this behavior emerge?
        
               | rcxdude wrote:
               | Another important aspect of the RL process is that it's
               | fine-tuning with some feedback on the quality of data: a
               | 'raw' LLM has been trained on a lot of very low-quality
               | data, and it has an incentive to predict that accurately
               | as well, because there's no means to effectively rate a
               | copy of most of the text on the internet. So there's a
               | lot of biases in the model which basically mean it will
               | include low-quality predictions in a given 'next token'
               | estimate, because if it doesn't it will get penalised
               | when it is fed the low quality data during the training.
               | 
               | With RLHF it gets a signal during training for whether
               | the next token it's trying to predict are part of a
               | 'good' response or a 'bad' response, so it can learn to
               | suppress features it learned in the first part of the
               | process which are not useful.
               | 
               | (you seem the same with image generators: they've been
               | trained on a bunch of very nice-looking art and photos,
               | but they've also been trained on triply-compressed badly
               | cropped memes and terrible MS-paint art. You need to have
               | a plan for getting the model to output the former and not
               | the latter if you want it to be useful)
        
             | absolutelastone wrote:
             | This is fine-tuning to make a well-behaved chatbot or
             | something. To make a LLM you just need to predict the next
             | token, or any masked token. Conceptually if you had a vast
             | enough high-quality dataset and large-enough model, you
             | wouldn't need fine-tuning for this.
             | 
             | A model which predicts one token at a time can represent
             | anything a model that does a full sequence at a time can.
             | It "knows" what it will output in the future because it is
             | just a probability distribution to begin with. It already
             | knows everything it will ever output to any prompt, in a
             | sense.
        
             | vaidhy wrote:
             | Wasn't Deepseek also big on RL or was that only for logical
             | reasoning?
        
             | wzdd wrote:
             | > The models still _generate_ response token-by-token, but
             | they pick tokens _not_ based on tokens that maximize
             | probabilities at each token.
             | 
             | This is also not how base training works. In base training
             | the loss is chosen given a context, which can be gigantic.
             | It's never about just the previous token, it's about a
             | whole response in context. The context could be an entire
             | poem, a play, a worked solution to a programming problem,
             | etc, etc. So you would expect to see the same type of
             | (apparent) higher-level planning from base trained models
             | and indeed you do and can easily verify this by downloading
             | a base model from HF or similar and prompting it to
             | complete a poem.
             | 
             | The key differences between base and agentic models are 1)
             | the latter behave like agents, and 2) the latter
             | hallucinate less. But that isn't about planning (you still
             | need planning to hallucinate something). It's more to do
             | with post-base training specifically being about providing
             | positive rewards for things which aren't hallucinations.
             | Changing the way the reward function is computed during RL
             | doesn't produce planning, it simply inclines to model to
             | produce responses that are more like the RL targets.
             | 
             | Karpathy has a good intro video on this.
             | https://www.youtube.com/watch?v=7xTGNNLPyMI
             | 
             | In general the nitpicking seems weird. Yes, on a mechanical
             | level, using a model is still about "given this context,
             | what is the next token". No, that doesn't mean that they
             | don't plan, or have higher-level views of the overal
             | structure of their response, or whatever.
        
         | SkyBelow wrote:
         | Ignoring for a moment their training, how do they function?
         | They do seem to output a limited selection of text at a time
         | (be it a single token or some larger group).
         | 
         | Maybe it is the wording of "trained to" verses "trained on",
         | but I would like to know more why "trained to" is an incorrect
         | statement when it seems that is how they function when one
         | engages them.
        
           | sdwr wrote:
           | In the article, it describes an internal state of the model
           | that is preserved between lines ("rabbit"), and how the model
           | combines parallel calculations to arrive at a single answer
           | (the math problem)
           | 
           | People output one token (word) at a time when talking. Does
           | that mean people can only think one word in advance?
        
             | wuliwong wrote:
             | Bad analogy, an LLM can output a block of text all at once
             | and it wouldn't impact the user's ability to understand it.
             | If people spoke all the words in a sentence at the same
             | time, it would not be decipherable. Even writing doesn't
             | yield a good analogy, a human writing physically has to
             | write one letter at a time. An LLM does not have that
             | limitation.
        
               | sdwr wrote:
               | The point I'm trying to make is that "each word following
               | the last" is a limitation of the medium, not the speaker.
               | 
               | Language expects/requires words in order. Both people and
               | LLMs produce that.
               | 
               | If you want to get into the nitty-gritty, people are
               | perfectly capable of doing multiple things simultaneously
               | as well, using:
               | 
               | - interrupts to handle task-switching (simulated
               | multitasking)
               | 
               | - independent subconscious actions (real multitasking)
               | 
               | - superpositions of multiple goals (??)
        
             | sroussey wrote:
             | Some people don't even do that!
        
             | SkyBelow wrote:
             | While there are numerous neural network models, the ones I
             | recall the details of are trained to generate the next
             | word. There is no training them to hold some more abstract
             | 'thought' as it is running. Simpler models don't have the
             | possibility. The more complex models do retain knowledge
             | between each pass and aren't entirely relying upon the
             | input/output to be fed back into them, but that internal
             | state is rarely what is targeted in training.
             | 
             | As for humans, part of our brain is trained to think only a
             | few words in advanced. Maybe not exactly one, but only a
             | small number. This is specifically trained based on our
             | time listening and reading information presented in that
             | linear fashion and is why garden path sentences throw us
             | off. We can disengage that part of our brain, and we must
             | when we want to process something like a garden path
             | sentence, but that's part of the differences between a
             | neural network that is working only as data passes through
             | the weights and our mind which doesn't ever stop even as
             | well sleep and external input is (mostly) cut off. An AI
             | that runs constantly like that would seem a fundamentally
             | different model than the current AI we use.
        
         | drcode wrote:
         | That's seems silly, it's not poisonous to talk about next token
         | prediction if 90% of the training compute is still spent on
         | training via next token prediction (as far as I am aware)
        
           | fpgaminer wrote:
           | 99% of evolution was spent on single cell organisms.
           | Intelligence only took 0.1% of evolution's training compute.
        
             | drcode wrote:
             | ok that's a fair point
        
               | diab0lic wrote:
               | I don't really think that it is. Evolution is a random
               | search, training a neural network is done with a
               | gradient. The former is dependent on rare (and
               | unexpected) events occurring, the latter is expected to
               | converge in proportion to the volume of compute.
        
               | devmor wrote:
               | Evolution also has no "goal" other than fitness for
               | reproduction. Training a neural network is done
               | intentionally with an expected end result.
        
               | rcxdude wrote:
               | There's still a loss function, it's just an implicit,
               | natural one, instead of artificially imposed (at least,
               | until humans started doing selective breeding). The
               | comparison isn't nonsense, but it's also not obvious that
               | it's tremendously helpful (what parts and features of an
               | LLM are analagous to what evolution figured out with
               | single-celled organisms compares to multicellular life? I
               | don't know if there's actually a correspondance there)
        
               | jpadkins wrote:
               | why do you think evolution is a random search? I thought
               | evolutionary pressures, and the mechanisms like
               | epigenetics make it something different than a random
               | search.
        
               | TeMPOraL wrote:
               | Evolution is a highly parallel descent down the gradient.
               | The gradient is provided by the environment (which
               | includes lifeforms too), parallelism is achieved through
               | reproduction, and descent is achieved through death.
        
               | diab0lic wrote:
               | The difference is that in machine learning the changes
               | between iterations are themselves caused by the gradient,
               | in evolution they are entirely random.
               | 
               | Evolution randomly generates changes and if they offer a
               | breeding advantage they'll become accepted. Machine
               | learning directs the change towards a goal.
               | 
               | Machine learning is directed change, evolution is
               | accepted change.
        
               | TeMPOraL wrote:
               | > _Machine learning is directed change, evolution is
               | accepted change._
               | 
               | Either way, it rolls down the gradient. Evolution just
               | measures the gradient implicitly, through parallel
               | rejection sampling.
        
               | rcxdude wrote:
               | It's more efficient, but the end result is basically the
               | same, especially considering that even if there's no
               | noise in the optimization algorithm, there is still noise
               | in the gradient information (consider some magical
               | mechanism for adjusting behaviour of an animal after it's
               | died before reproducing. There's going to be a lot of
               | nudges one way or another for things like 'take a step to
               | the right to dodge that boulder that fell on you').
        
             | devmor wrote:
             | What you just said means absolutely nothing and has no
             | comparison to this topic. It's nonsense. That is not how
             | evolution works.
        
             | 4ndrewl wrote:
             | Are you making a claim about evolution here?
        
         | pmontra wrote:
         | And no users which are facing a LLM today have been trained on
         | next token prediction when they were babies. I believe that
         | LLMs and us are thinking in two very different ways, like
         | airplanes, birds, insects and quad-drones fly in very different
         | ways and can perform different tasks. Maybe no bird looking at
         | a plane would say that it is flying properly. Instead it could
         | be only a rude approximation, useful only to those weird bipeds
         | an scary for everyone else.
         | 
         | By the way, I read your final sentence with the meaning of my
         | first one and only after a while I realized the intended
         | meaning. This is interesting on its own. Natural languages.
        
           | naasking wrote:
           | > And no users which are facing a LLM today have been trained
           | on next token prediction when they were babies.
           | 
           | That's conjecture actually, see predictive coding. Note that
           | "tokens" don't have to be language tokens.
        
         | colah3 wrote:
         | Hi! I lead interpretability research at Anthropic. I also used
         | to do a lot of basic ML pedagogy (https://colah.github.io/). I
         | think this post and its children have some important questions
         | about modern deep learning and how it relates to our present
         | research, and wanted to take the opportunity to try and clarify
         | a few things.
         | 
         | When people talk about models "just predicting the next word",
         | this is a popularization of the fact that modern LLMs are
         | "autoregressive" models. This actually has two components: an
         | architectural component (the model generates words one at a
         | time), and a loss component (it maximizes probability).
         | 
         | As the parent says, modern LLMs are finetuned with a different
         | loss function after pretraining. This means that in some strict
         | sense they're no longer autoregressive models - but they do
         | still generate text one word at a time. I think this really is
         | the heart of the "just predicting the next word" critique.
         | 
         | This brings us to a debate which goes back many, many years:
         | what does it mean to predict the next word? Many researchers,
         | including myself, have believed that if you want to predict the
         | next word _really well_ , you need to do a lot more. (And with
         | this paper, we're able to see this mechanistically!)
         | 
         | Here's an example, which we didn't put in the paper: How does
         | Claude answer "What do you call someone who studies the stars?"
         | with "An astronomer"? In order to predict "An" instead of "A",
         | you need to know that you're going to say something that starts
         | with a vowel next. So you're incentivized to figure out one
         | word ahead, and indeed, Claude realizes it's going to say
         | astronomer and works backwards. This is a kind of very, very
         | small scale planning - but you can see how even just a pure
         | autoregressive model is incentivized to do it.
        
           | stonemetal12 wrote:
           | > In order to predict "An" instead of "A", you need to know
           | that you're going to say something that starts with a vowel
           | next. So you're incentivized to figure out one word ahead,
           | and indeed, Claude realizes it's going to say astronomer and
           | works backwards.
           | 
           | Is there evidence of working backwards? From a next token
           | point of view, predicting the token after "An" is going to
           | heavily favor a vowel. Similarly predicting the token after
           | "A" is going to heavily favor not a vowel.
        
             | colah3 wrote:
             | Yes, there are two kinds of evidence.
             | 
             | Firstly, there is behavioral evidence. This is, to me, the
             | less compelling kind. But it's important to understand. You
             | are of course correct that, once Cluade has said "An", it
             | will be inclined to say something starting with a vowel.
             | But the mystery is really why, in setups like these, Claude
             | is much more likely to say "An" than "A" in the first
             | place. Regardless of what the underlying mechanism is --
             | and you could maybe imagine ways in which it could just
             | "pattern match" without planning here -- it is preferred
             | because in situations like this, you need to say "An" so
             | that "astronomer" can follow.
             | 
             | But now we also have mechanistic evidence. If you make an
             | attribution graph, you can literally see an astronomer
             | feature fire, and that cause it to say "An".
             | 
             | We didn't publish this example, but you can see a more
             | sophisticated version of this in the poetry planning
             | section - https://transformer-
             | circuits.pub/2025/attribution-graphs/bio...
        
               | troupo wrote:
               | > But the mystery is really why, in setups like these,
               | Claude is much more likely to say "An" than "A" in the
               | first place.
               | 
               | Because in the training set you're likely to see "an
               | astronomer" than a different combination of words.
               | 
               | It's enough to run this on any other language text to see
               | how these models often fail for any language more complex
               | than English
        
               | shawabawa3 wrote:
               | You can disprove this oversimplification with a prompt
               | like
               | 
               | "The word for Baker is now "Unchryt"
               | 
               | What do you call someone that bakes?
               | 
               | > An Unchryt"
               | 
               | The words "An Unchryt" has clearly never come up in any
               | training set relating to baking
        
               | troupo wrote:
               | The truth is somewhere in the middle :)
        
           | born1989 wrote:
           | Thanks! Isn't "an Astronomer" a single word for the purpose
           | of answering that question?
           | 
           | Following your comment, I asked "Give me pairs of synonyms
           | where the last letter in the first is the first letter of the
           | second"
           | 
           | Claude 3.7 failed miserably. Chat GPT 4o was much better but
           | not good
        
             | nearbuy wrote:
             | Don't know about Claude, but at least with ChatGPT's
             | tokenizer, it's 3 "words" (An| astronom|er).
        
             | colah3 wrote:
             | "An astronomer" is two tokens, which is the relevant
             | concern when people worry about this.
        
             | philomath_mn wrote:
             | That is a sub-token task, something I'd expect current
             | models to struggle with given how they view the world in
             | word / word fragment tokens rather than single characters.
        
           | lsy wrote:
           | Thanks for commenting, I like the example because it's simple
           | enough to discuss. Isn't it more accurate to say not that
           | Claude " _realizes_ it 's _going to say_ astronomer " or "
           | _knows_ that it 's _going to say_ something that starts with
           | a vowel " and more that the next token (or more pedantically,
           | vector which gets reduced down to a token) is generated based
           | on activations that correlate to the "astronomer" token,
           | which is correlated to the "an" token, causing that to also
           | be a more likely output?
           | 
           | I kind of see why it's easy to describe it colloquially as
           | "planning" but it isn't really going ahead and then
           | backtracking, it's almost indistinguishable from the
           | computation that happens when the prompt is "What is the
           | indefinite article to describe 'astronomer'?", i.e. the
           | activation "astronomer" is already baked in by the prompt
           | "someone who studies the stars", albeit at one level of
           | indirection.
           | 
           | The distinction feels important to me because I think for
           | most readers (based on other comments) the concept of
           | "planning" seems to imply the discovery of some capacity for
           | higher-order logical reasoning which is maybe overstating
           | what happens here.
        
             | cgdl wrote:
             | Thank you. In my mind, "planning" doesn't necessarily imply
             | higher-order reasoning but rather some form of search,
             | ideally with backtracking. Of course, architecturally, we
             | know that can't happen during inference. Your example of
             | the indefinite article is a great illustration of how this
             | illusion of planning might occur. I wonder if anyone at
             | Anthropic could compare the two cases (some sort of
             | minimal/differential analysis) and share their insights.
        
               | colah3 wrote:
               | I used the astronomer example earlier as the most simple,
               | minimal version of something you might think of as a kind
               | of microscopic form of "planning", but I think that at
               | this point in the conversation, it's probably helpful to
               | switch to the poetry example in our paper:
               | 
               | https://transformer-circuits.pub/2025/attribution-
               | graphs/bio...
               | 
               | There are several interesting properties:
               | 
               | - Something you might characterize as "forward search"
               | (generating candidates for the word at the end of the
               | next line, given rhyming scheme and semantics)
               | 
               | - Representing those candidates in an abstract way (the
               | features active are general features for those words, not
               | "motor features" for just saying that word)
               | 
               | - Holding many competing/alternative candidates in
               | parallel.
               | 
               | - Something you might characterize as "backward
               | chaining", where you work backwards from these candidates
               | to "write towards them".
               | 
               | With that said, I think it's easy for these arguments to
               | fall into philosophical arguments about what things like
               | "planning" mean. As long as we agree on what is going on
               | mechanistically, I'm honestly pretty indifferent to what
               | we call it. I spoke to a wide range of colleagues,
               | including at other institutions, and there was pretty
               | widespread agreement that "planning" was the most natural
               | language. But I'm open to other suggestions!
        
               | pas wrote:
               | Thanks for linking to this semi-interactive thing, but
               | ... it's completely incomprehensible. :o (edit: okay,
               | after reading about CLT it's a bit less alien.)
               | 
               | I'm curious where is the state stored for this
               | "planning". In a previous comment user lsy wrote "the
               | activation >astronomer< is already baked in by the
               | prompt", and it seems to me that when the model generates
               | "like" (for rabbit) or "a" (for habit) those tokens
               | already encode a high probability for what's coming after
               | them, right?
               | 
               | So each token is shaping the probabilities for the
               | successor ones. So that "like" or "a" has to be one that
               | sustains the high activation of the "causal" feature, and
               | so on, until the end of the line. Since both "like" and
               | "a" are very very non-specific tokens it's likely that
               | the "semantic" state is really resides in the preceding
               | line, but of course gets smeared (?) over all the
               | necessary tokens. (And that means beyond the end of the
               | line, to avoid strange non-aesthetic but attract
               | cool/funky (aesthetic) semantic repetitions (like "hare"
               | or "bunny"), and so on, right?)
               | 
               | All of this is baked in during training, during inference
               | time the same tokens activate the same successor tokens
               | (not counting GPU/TPU scheduling randomness and whatnot)
               | and even though there's a "loop" there's no algorithm to
               | generate top N lines and pick the best (no working memory
               | shuffling).
               | 
               | So if it's planning it's preplanned, right?
        
               | colah3 wrote:
               | The planning is certainly performed by circuits which we
               | learned during training.
               | 
               | I'd expect that, just like in the multi-step planning
               | example, there are lots of places where the attribution
               | graph we're observing is stitching together lots of
               | circuits, such that it's better understood as a kind of
               | "recombination" of fragments learned from many examples,
               | rather than that there was something similar in the
               | training data.
               | 
               | This is all very speculative, but:
               | 
               | - At the forward planning step, generating the candidate
               | words seems like it's an intersection of the semantics
               | and rhyming scheme. The model wouldn't need to have seen
               | that intersection before -- the mechanism could easily
               | piece examples independently building the pathway for the
               | semantics, and the pathway for the rhyming scheme
               | 
               | - At the backward chaining step, many of the features for
               | constructing sentence fragments seem like the target is
               | quite general (perhaps animals in one case, or others
               | might even just be nouns).
        
               | cgdl wrote:
               | Thank you, this makes sense. I am thinking of this as an
               | abstraction/refinement process where an abstract notion
               | of the longer completion is refined into a cogent whole
               | that satisfies the notion of a good completion. I look
               | forward to reading your paper to understand the "backward
               | chaining" aspect and the evidence for it.
        
           | fny wrote:
           | How do you all add and subtract concepts in the rabbit poem?
        
             | colah3 wrote:
             | Features correspond to vectors in activation space. So you
             | can just do vector arithmetic!
             | 
             | If you aren't familiar with thinking about features, you
             | might find it helpful to look at our previous work on
             | features in superposition:
             | 
             | - https://transformer-
             | circuits.pub/2022/toy_model/index.html
             | 
             | - https://transformer-circuits.pub/2023/monosemantic-
             | features/...
             | 
             | - https://transformer-circuits.pub/2024/scaling-
             | monosemanticit...
        
           | encypherai wrote:
           | Thanks for the detailed explanation of autoregression and its
           | complexities. The distinction between architecture and loss
           | function is crucial, and you're correct that fine-tuning
           | effectively alters the behavior even within a sequential
           | generation framework. Your "An/A" example provides compelling
           | evidence of incentivized short-range planning which is a
           | significant point often overlooked in discussions about LLMs
           | simply predicting the next word.
           | 
           | It's interesting to consider how architectures fundamentally
           | different from autoregression might address this limitation
           | more directly. While autoregressive models are incentivized
           | towards a limited form of planning, they remain inherently
           | constrained by sequential processing. Text diffusion
           | approaches, for example, operate on a different principle,
           | generating text from noise through iterative refinement,
           | which could potentially allow for broader contextual
           | dependencies to be established concurrently rather than
           | sequentially. Are there specific architectural or training
           | challenges you've identified in moving beyond autoregression
           | that are proving particularly difficult to overcome?
        
           | ikrenji wrote:
           | When humans say something, or think something or write
           | something down, aren't we also "just predicting the next
           | word"?
        
             | lyu07282 wrote:
             | There is a lot more going on in our brains to accomplish
             | that, and a mounting evidence that there is a lot more
             | going on in LLMs as well. We don't understand what happens
             | in brains either, but nobody needs to be convinced of the
             | fact that brains can think and plan ahead, even though we
             | don't *really* know for sure:
             | 
             | https://en.wikipedia.org/wiki/Philosophical_zombie
        
             | melagonster wrote:
             | I trust that you want to say something , so you decided to
             | click the comment button on HN.
        
               | FeepingCreature wrote:
               | But do I just want to say something because my childhood
               | environment rewarded me for speech?
               | 
               | After all, if it has a cause it can't be deliberate. /s
        
           | fpgaminer wrote:
           | > As the parent says, modern LLMs are finetuned with a
           | different loss function after pretraining. This means that in
           | some strict sense they're no longer autoregressive models -
           | but they do still generate text one word at a time. I think
           | this really is the heart of the "just predicting the next
           | word" critique.
           | 
           | That more-or-less sums up the nuance. I just think the nuance
           | is crucially important, because it greatly improves intuition
           | about how the models function.
           | 
           | In your example (which is a fantastic example, by the way),
           | consider the case where the LLM sees:
           | 
           | <user>What do you call someone who studies the
           | stars?</user><assistant>An astronaut
           | 
           | What is the next prediction? Unfortunately, for a variety of
           | reasons, one high probability next token is:
           | 
           | \nAn
           | 
           | Which naturally leads to the LLM writing: "An astronaut\nAn
           | astronaut\nAn astronaut\n" forever.
           | 
           | It's somewhat intuitive as to why this occurs, even with SFT,
           | because at a very base level the LLM learned that repetition
           | is the most successful prediction. And when its _only_ goal
           | is the next token, that repetition behavior remains
           | prominent. There's nothing that can fix that, including SFT
           | (short of a model with many, many, many orders of magnitude
           | more parameters).
           | 
           | But with RL the model's goal is completely different. The
           | model gets thrown into a game, where it gets points based on
           | the full response it writes. The losses it sees during this
           | game are all directly and dominantly related to the reward,
           | not the next token prediction.
           | 
           | So why don't RL models have a probability for predicting
           | "\nAn"? Because that would result in a bad reward by the end.
           | 
           | The models are now driven by a long term reward when they
           | make their predictions, not by fulfilling some short-term
           | autoregressive loss.
           | 
           | All this to say, I think it's better to view these models as
           | they predominately are: language robots playing a game to
           | achieve the highest scoring response. The HOW
           | (autoregressiveness) is really unimportant to most high level
           | discussions of LLM behavior.
        
             | vjerancrnjak wrote:
             | Same can be achieved without RL. There's no need to
             | generate a full response to provide loss for learning.
             | 
             | Similarly, instead of waiting for whole output, loss can be
             | decomposed over output so that partial emits have instant
             | loss feedback.
             | 
             | RL, on the other hand, is allowing for more data. Instead
             | of training on the happy path, you can deviate and measure
             | loss for unseen examples.
             | 
             | But even then, you can avoid RL, put the model into a wrong
             | position and make it learn how to recover from that
             | position. It might be something that's done with
             | <thinking>, where you can provide wrong thinking as part of
             | the output and correct answer as the other part, avoiding
             | RL.
             | 
             | These are all old pre NN tricks that allow you to get a bit
             | more data and improve the ML model.
        
           | ndand wrote:
           | I understand it differently,
           | 
           | LLMs predict distributions, not specific tokens. Then an
           | algorithm, like beam search, is used to select the tokens.
           | 
           | So, the LLM predicts somethings like, 1. ["a", "an", ...] 2.
           | ["astronomer", "cosmologist", ...],
           | 
           | where "an astronomer" is selected as the most likely result.
        
             | colah3 wrote:
             | Just to be clear, the probability for "An" is high, just
             | based on the prefix. You don't need to do beam search.
        
             | astrange wrote:
             | They almost certainly only do greedy sampling. Beam search
             | would be a lot more expensive; also I'm personally
             | skeptical about using a complicated search algorithm for
             | inference when the model was trained for a simple one, but
             | maybe it's fine?
        
           | pietmichal wrote:
           | Pardon my ignorance but couldn't this also be an act of
           | anthropomorphisation on human part?
           | 
           | If an LLM generates tokens after "What do you call someone
           | who studies the stars?" doesn't it mean that those existing
           | tokens in the prompt already adjusted the probabilities of
           | the next token to be "an" because it is very close to earlier
           | tokens due to training data? The token "an" skews the
           | probability of the next token further to be "astronomer".
           | Rinse and repeat.
        
             | colah3 wrote:
             | I think the question is: by what _mechanism_ does it adjust
             | up the probability of the token  "an"? Of course, the
             | reason it has learned to do this is that it saw this in
             | training data. But it needs to learn circuits which
             | actually perform that adjustment.
             | 
             | In principle, you could imagine trying to memorize a
             | massive number of cases. But that becomes very hard! (And
             | it makes predictions, for example, would it fail to predict
             | "an" if I asked about astronomer in a more indirect way?)
             | 
             | But the good news is we no longer need to speculate about
             | things like this. We can just look at the mechanisms! We
             | didn't publish an attribution graph for this astronomer
             | example, but I've looked at it, and there is an astronomer
             | feature that drives "an".
             | 
             | We did publish a more sophisticated "poetry planning"
             | example in our paper, along with pretty rigorous
             | intervention experiments validating it. The poetry planning
             | is actually much more impressive planning than this! I'd
             | encourage you to read the example (and even interact with
             | the graphs to verify what we say!). https://transformer-
             | circuits.pub/2025/attribution-graphs/bio...
             | 
             | One question you might ask is why does the model learn this
             | "planning" strategy, rather than just trying to memorize
             | lots of cases? I think the answer is that, at some point, a
             | circuit anticipating the next word, or the word at the end
             | of the next line, actually becomes simpler and easier to
             | learn than memorizing tens of thousands of disparate cases.
        
           | bobsomers wrote:
           | In your astronomer example, what makes you attribute this to
           | "planning" or look ahead rather than simply a learned
           | statistical artifact of the training data?
           | 
           | For example, suppose English had a specific exception such
           | that astronomer is always to be preceded by "a" rather than
           | "an". The model would learn this simply by observing that
           | contexts describing astronomers are more likely to contain
           | "a" rather than "an" as a next likely character, no?
           | 
           | I suppose you can argue that at the end of the day, it
           | doesn't matter if I learn an explicit probability
           | distribution for every next word given some context, or
           | whether I learn some encoding of rules. But I certainly feel
           | like the prior is what we're doing today (and why these
           | models are so huge), rather than learning higher level rule
           | encodings which would allow for significant compression and
           | efficiency gains.
        
             | colah3 wrote:
             | Thanks for the great questions! I've been responding to
             | this thread for the last few hours and I'm about to need to
             | run, so I hope you'll forgive me redirecting you to some of
             | the other answers I've given.
             | 
             | On whether the model is looking ahead, please see this
             | comment which discusses the fact that there's both
             | behavioral evidence, and also (more crucially) direct
             | mechanistic evidence -- we can literally make an
             | attribution graph and see an astronomer feature trigger
             | "an"!
             | 
             | https://news.ycombinator.com/item?id=43497010
             | 
             | And also this comment, also on the mechanism underlying the
             | model saying "an":
             | 
             | https://news.ycombinator.com/item?id=43499671
             | 
             | On the question of whether this constitutes planning,
             | please see this other question, which links it to the more
             | sophisticated "poetry planning" example from our paper:
             | 
             | https://news.ycombinator.com/item?id=43497760
        
             | FeepingCreature wrote:
             | > In your astronomer example, what makes you attribute this
             | to "planning" or look ahead rather than simply a learned
             | statistical artifact of the training data?
             | 
             | What makes you think that "planning", even in humans, is
             | more than a learned statistical artifact of the training
             | data? What about learned statistical artifacts of the
             | training data causes planning to be excluded?
        
           | paraschopra wrote:
           | Is it fair to say that both "Say 'an'" and "Say 'astronomer'"
           | output features would be present in this case, but say "Say
           | 'an'" gets more votes because it is start of the sentence,
           | and once it is sampled "An" further votes for "Say
           | 'astronomer'" feature
        
         | rco8786 wrote:
         | Super interesting. Can you explain more, or provide some
         | reading? I'm obviously behind
        
         | boodleboodle wrote:
         | This is why, whenever I can, I call RLHF/DPO "sequence level
         | calibration" instead of "alignment tuning".
         | 
         | Some precursors to RLHF: https://arxiv.org/abs/2210.00045
         | https://arxiv.org/abs/2203.16804
        
       | zerop wrote:
       | The explanation of "hallucination" is quite simplified, I am sure
       | there is more there.
       | 
       | If there is one problem I have to pick to to trace in LLMs, I
       | would pick hallucination. More tracing of "how much" or "why"
       | model hallucinated can lead to correct this problem. Given the
       | explanation in this post about hallucination, I think degree of
       | hallucination can be given as part of response to the user?
       | 
       | I am facing this in RAG use case quite - How do I know model is
       | giving right answer or Hallucinating from my RAG sources?
        
         | kittikitti wrote:
         | I incredibly regret the term "hallucination" when the confusion
         | matrix exists. There's much more nuance when discussing false
         | positives or false negatives. It also opens discussions on how
         | neural networks are trained, with this concept being crucial in
         | loss functions like categorical cross entropy. In addition, the
         | confusion matrix is how professionals like doctors assess their
         | own performance which "hallucination" would be silly to use. I
         | would go as far to say that it's misleading, or a false
         | positive, to call them hallucinations.
         | 
         | If your AI recalls the RAG incorrectly, it's a false positives.
         | If your AI doesn't find the data from the RAG or believes it
         | doesn't exist it's a false negative. Using a term like
         | "hallucination" has no scientific merit.
        
           | esafak wrote:
           | So you never report or pay heed to the overall accuracy?
        
         | pcrh wrote:
         | The use of the term "hallucination" for LLMs is very deceptive,
         | as it implies that there _is_ a  "mind".
         | 
         | In ordinary terms, "hallucinations" by a machine would simply
         | be described as the machine being useless, or not fit for
         | purpose.
         | 
         | For example, if a simple calculator (or even a person) returned
         | the value "5" for 2+2= , you wouldn't describe it as
         | "hallucinating" the answer....
        
           | astrange wrote:
           | "Hallucination" happened because we got AI images before AI
           | text, but "confabulation" is a better term.
        
       | LoganDark wrote:
       | LLMs don't think, and LLMs don't have strategies. Maybe it could
       | be argued that LLMs have "derived meaning", but all LLMs do is
       | predict the next token. Even RL just tweaks the next-token
       | prediction process, but the math that drives an LLM makes it
       | impossible for there to be anything that could reasonably be
       | called thought.
        
         | yawnxyz wrote:
         | rivers don't think and water doesn't have strategies, yet you
         | can build intricate logic-gated tools using the power of water.
         | Those types of systems are inherently interpretable because you
         | can just _look_ at how they work. They 're not black boxes.
         | 
         | LLMs are black boxes, and if anything, interpretability systems
         | show us what the heck is going on inside them. Especially
         | useful when half the world is using these already, and we have
         | no idea how t hey work
        
           | kazinator wrote:
           | Water doesn't think, yet if you inject it into the entrance
           | of a maze, it will soon come gushing out of the exit.
        
           | LoganDark wrote:
           | > rivers don't think and water doesn't have strategies, yet
           | you can build intricate logic-gated tools using the power of
           | water.
           | 
           | That doesn't mean the water itself has strategies, just that
           | you can use water in an implementation of strategy... it's
           | fairly well known at this point that LLMs can be used as part
           | of strategies (see e.g. "agents"), they just don't
           | intrinsically have any.
        
         | ajkdhcb2 wrote:
         | True. People use completely unjustified anthropomorphised
         | terminology for marketing reasons and it bothers me a lot. I
         | think it actually holds back understanding how it works.
         | "Hallucinate" is the worst - it's an error and undesired
         | result, not a person having a psychotic episode
        
         | kazinator wrote:
         | A chess program from 1968 has "strategy", so why deny that to
         | an LLM.
         | 
         | LLMs are built on neural networks which are encoding a kind of
         | strategy function through their training.
         | 
         | The strategy in an LLM isn't necessarily that it "thinks" about
         | the specific problem described in your prompt and develops a
         | strategy tailored to that problem, but rather its statistical
         | strategy for cobbing together the tokens of the answer.
         | 
         | From that, it can seem as if it's making a strategy to a
         | problem also. Certainly, the rhetoric that LLMs put out can at
         | times seem very convincing of that. You can't be sure whether
         | that's not just something cribbed out of the terabytes of text,
         | in which discussions of something very similar to your problem
         | have occurred.
        
           | dev_throwaway wrote:
           | This is not a bad way of looking at it, if I may add a bit,
           | the llm is a solid state system. The only thing that survives
           | from one iteration to the next is the singular highest
           | ranking token, the entire state and "thought process" of the
           | network cannot be represented by a single token, which means
           | that every strategy is encoded in it during training, as a
           | lossy representation of the training data. By definition that
           | is a database, not a thinking system, as the strategy is
           | stored, not actively generated during usage.
           | 
           | The anthropomorphization of llms bother me, we don't need to
           | pretend they are alive and thinking, at best that is
           | marketing, at worst, by training the models to output human
           | sounding conversations we are actively taking away the true
           | potential these models could achieve by being ok with them
           | being "simply a tool".
           | 
           | But pretending that they are intelligent is what brings in
           | the investors, so that is what we are doing. This paper is
           | just furthering that agenda.
        
             | kazinator wrote:
             | People anthropomorphize LLMs because that's the most
             | succinct language for describing what they seem to be
             | doing. To avoid anthropomorphizing, you will have to use
             | more formal language which would obfuscate the concepts.
             | 
             | Anthropo language has been woven into AI from the early
             | beginnings.
             | 
             | AI programs were said to have goals, and to plan and
             | hypothesize.
             | 
             | They were given names like "Conniver".
             | 
             | The word "expert system" anthropomorphizes! It's literally
             | saying that some piece of logic programming loaded with a
             | base of rules and facts about medical diagnosis is a
             | medical expert.
        
             | Philpax wrote:
             | > The only thing that survives from one iteration to the
             | next is the singular highest ranking token, the entire
             | state and "thought process" of the network cannot be
             | represented by a single token, which means that every
             | strategy is encoded in it during training, as a lossy
             | representation of the training data.
             | 
             | This is not true. The key-values of previous tokens encode
             | computation that can be accessed by attention, as mentioned
             | by colah3 here:
             | https://news.ycombinator.com/item?id=43499819
             | 
             | You may find https://transformer-
             | circuits.pub/2021/framework/index.html useful.
        
               | dev_throwaway wrote:
               | This is a optimization to prevent redundant calculations.
               | If it was not performed the result would be the same,
               | just served slightly slower.
               | 
               | The whitepaper you linked is a great one, I was all over
               | it a few years back when we built our first models. It
               | should be recommended reading for anyone interested in
               | CS.
        
       | kittikitti wrote:
       | What's the point of this when Claude isn't open sourced and we
       | just have to take Anthropic's word for it?
        
         | ctoth wrote:
         | > What's the point of this
         | 
         | - That similar interpretability tools might be useful to the
         | open source community?
         | 
         | - That this is a fruitful area to research?
        
           | kittikitti wrote:
           | Can you use those same tools on Claude? Is the difference
           | trivial from open source models?
        
             | ctoth wrote:
             | https://news.ycombinator.com/item?id=42208383
             | 
             | > Show HN: Llama 3.2 Interpretability with Sparse
             | Autoencoders
             | 
             | > 579 points by PaulPauls 4 months ago | hide | past |
             | favorite | 100 comments
             | 
             | > I spent a lot of time and money on this rather big side
             | project of mine that attempts to replicate the mechanistic
             | interpretability research on proprietary LLMs that was
             | quite popular this year and produced great research papers
             | by Anthropic [1], OpenAI [2] and Deepmind [3].
             | 
             | > I am quite proud of this project and since I consider
             | myself the target audience for HackerNews did I think that
             | maybe some of you would appreciate this open research
             | replication as well. Happy to answer any questions or face
             | any feedback.
        
         | probably_wrong wrote:
         | I blame the scientific community for blindly accepting OpenAI's
         | claims about GPT-3 despite them refusing to release their
         | model. The tech community hyping every press release didn't
         | help either.
         | 
         | I hope one day the community starts demanding verifiable
         | results before accepting them, but I fear that ship may have
         | already sailed.
        
       | Hansenq wrote:
       | I wonder how much of these conclusions are Claude-specific (given
       | that Anthropic only used Claude as a test subject) or if they
       | extrapolate to other transformer-based models as well. Would be
       | great to see the research tested on Llama and the Deepseek
       | models, if possible!
        
       | marcelsalathe wrote:
       | I've only skimmed the paper - a long and dense read - but it's
       | already clear it'll become a classic. What's fascinating is that
       | engineering is transforming into a science, trying to understand
       | precisely how its own creations work
       | 
       | This shift is more profound than many realize. Engineering
       | traditionally applied our understanding of the physical world,
       | mathematics, and logic to build predictable things. But now,
       | especially in fields like AI, we've built systems so complex we
       | no longer fully understand them. We must now use scientific
       | methods - originally designed to understand nature - to
       | comprehend our own engineered creations. Mindblowing.
        
         | ctoth wrote:
         | This "practice-first, theory-later" pattern has been the norm
         | rather than the exception. The steam engine predated
         | thermodynamics. People bred plants and animals for thousands of
         | years before Darwin or Mendel.
         | 
         | The few "top-down" examples where theory preceded application
         | (like nuclear energy or certain modern pharmaceuticals) are
         | relatively recent historical anomalies.
        
           | marcelsalathe wrote:
           | I see your point, but something still seems different. Yes we
           | bred plants and animals, but we did not create them. Yes we
           | did build steam engines before understanding thermodynamics
           | but we still understood what they did (heat, pressure,
           | movement, etc.)
           | 
           | Fun fact: we have no clue how most drugs works. Or, more
           | precisely, we know a few aspects, but are only scratching the
           | surface. We're even still discovering news things about
           | Aspirin, one of the oldest drugs:
           | https://www.nature.com/articles/s41586-025-08626-7
        
             | tmp10423288442 wrote:
             | > Yes we did build steam engines before understanding
             | thermodynamics but we still understood what it did (heat,
             | pressure, movement, etc.)
             | 
             | We only understood in the broadest sense. It took a long
             | process of iteration before we could create steam engines
             | that were efficient enough to start an Industrial
             | Revolution. At the beginning they were so inefficient that
             | they could only pump water from the same coal mine they got
             | their fuel from, and subject to frequent boiler explosions
             | besides.
        
             | mystified5016 wrote:
             | We laid transatlantic telegraph wires before we even had a
             | hint of the physics involved. It create the _entire field_
             | of transmission and signal theory.
             | 
             | Shannon had to invent new physics to explain why the cables
             | didn't work as expected.
        
               | anthk wrote:
               | THe telegraph it's older than radio. Think about it.
        
               | pas wrote:
               | I think that's misleading.
               | 
               | There was a lot of physics already known, importance of
               | insulation and cross-section, signal attenuation was also
               | known.
               | 
               | The future Lord Kelvin conducted experiments. The two
               | scientific advisors had a conflict. And the "CEO" went
               | with the cheaper option.
               | 
               | """ Thomson believed that Whitehouse's measurements were
               | flawed and that underground and underwater cables were
               | not fully comparable. Thomson believed that a larger
               | cable was needed to mitigate the retardation problem. In
               | mid-1857, on his own initiative, he examined samples of
               | copper core of allegedly identical specification and
               | found variations in resistance up to a factor of two. But
               | cable manufacture was already underway, and Whitehouse
               | supported use of a thinner cable, so Field went with the
               | cheaper option. """
        
               | cft wrote:
               | that was 1854. You basically only needed Ohm's law for
               | that, which was discovered in 1827
        
               | JPLeRouzic wrote:
               | Ohm's law for a cable 4000 km/3000 miles long? That
               | implies transmission was instantaneous and without any
               | alteration in shape.
               | 
               | I guess the rise time was tens of milliseconds and
               | rebounds in signals lasted for milliseconds or more.
               | Hardly something you can neglect.
               | 
               | For reference, in my time (the 1980) in the telecom
               | industry, we had to regenerate digital signals every 2km.
        
               | cft wrote:
               | "Initially messages were sent by an operator using Morse
               | code. The reception was very bad on the 1858 cable, and
               | it took two minutes to transmit just one character (a
               | single letter or a single number), a rate of about 0.1
               | words per minute."
               | 
               | https://en.m.wikipedia.org/wiki/Transatlantic_telegraph_c
               | abl...
               | 
               | I guess your bandwidth in 1980 was a bit higher.
        
             | adastra22 wrote:
             | We don't create LLMs either. We evolve/train them. I think
             | the comparison is closer than you think.
        
               | no_wizard wrote:
               | We most definitely create them though, there is an entire
               | A -> B follow you can do.
               | 
               | It's complicated but they are most definitely created.
        
               | homeyKrogerSage wrote:
               | Dawg
        
             | pclmulqdq wrote:
             | Most of what we refer to as "engineering" involves using
             | principles that flow down from science to do stuff. The
             | return to the historic norm is sort of a return to the
             | "useful arts" or some other idea.
        
             | arijo wrote:
             | Almost all civil, chemical, electrical, etc., engineering
             | emerged from a practice-first, theory-later evolution.
        
           | karparov wrote:
           | It's been there in programming from essentially the first day
           | too. People skip the theory and just get hacking.
           | 
           | Otherwise we'd all be writing Haskell now. Or rather we'd not
           | be writing anything since a real compiler would still have
           | been to hacky and not theoretically correct.
           | 
           | I'm writing this with both a deep admiration as well as
           | practical repulsion of C.S. theory.
        
           | ants_everywhere wrote:
           | This isn't quite true, although it's commonly said.
           | 
           | For steam engines, the first commercial ones came _after_ and
           | were based on scientific advancements that made them
           | possible. One built in 1679 was made by an associate of
           | Boyle, who discovered Boyle 's law. These early steam engines
           | co-evolved with thermodynamics. The engines improved and hit
           | a barrier, at which point Carnot did his famous work.
           | 
           | This is putting aside steam engines that are mostly
           | curiosities like ones built in the ancient world.
           | 
           | See, for example
           | 
           | - https://en.wikipedia.org/wiki/Thermodynamics#History
           | 
           | - https://en.wikipedia.org/wiki/Steam_engine#History
        
           | cryptonector wrote:
           | Canons and archery and catapults predated Newtonian classical
           | mechanics.
        
         | latemedium wrote:
         | I'm reminded of the metaphor that these models aren't
         | constructed, they're "grown". It rings true in many ways - and
         | in this context they're like organisms that must be studied
         | using traditional scientific techniques that are more akin to
         | biology than engineering.
        
           | dartos wrote:
           | Sort of.
           | 
           | We don't precisely know the most fundamental workings of a
           | living cell.
           | 
           | Our understanding of the fundamental physics of the universe
           | has some hold.
           | 
           | But for LLMs and statistical models in general, we do know
           | precisely what the fundamental pieces do. We know what
           | processor instructions are being executed.
           | 
           | We could, given enough research, have absolutely perfect
           | understanding of what is happening in a given model and why.
           | 
           | Idk if we'll be able to do that in the physical sciences.
        
             | wrs wrote:
             | Having spent some time working with both molecular
             | biologists and LLM folks, I think it's pretty good analogy.
             | 
             | We know enough quantum mechanics to simulate the
             | fundamental workings of a cell pretty well, but that's not
             | a route to understanding. To _explain_ anything, we need to
             | move up an abstraction hierarchy to peptides, enzymes,
             | receptors, etc. But note that we invented those categories
             | in the first place -- nature doesn 't divide up
             | functionality into neat hierarchies like human designers
             | do. So all these abstractions are leaky and incomplete.
             | Molecular biologists are constantly discovering mechanisms
             | that require breaking the current abstractions to explain.
             | 
             | Similarly, we understand floating point multiplication
             | perfectly, but when we let 100 billion parameters set
             | themselves through an opaque training process, we don't
             | have good abstractions to use to understand what's going on
             | in that set of weights. We don't have even the rough
             | equivalent of the peptides or enzymes level yet. So this
             | paper is progress toward that goal.
        
         | kazinator wrote:
         | We've already built things in computing that we don't easily
         | understand, even outside of AI, like large distributed systems
         | and all sorts of balls of mud.
         | 
         | Within the sphere of AI, we have built machines which can play
         | strategy games like chess, and surprise us with an unforseen
         | defeat. It's not necessarily easy to see how that emerged from
         | the individual rules.
         | 
         | Even a compiler can surprise you. You code up some
         | optimizations, which are logically separate, but then a
         | combination of them does something startling.
         | 
         | Basically, in mathematics, you cannot grasp all the details of
         | a vast space just from knowing the axioms which generate it and
         | a few things which follow from them. Elementary school children
         | know what is a prime number, yet those things occupy
         | mathematicians who find new surprises in that space.
        
           | TeMPOraL wrote:
           | Right, but this is somewhat different, in that we apply a
           | simple learning method to a big dataset, and the resulting
           | big matrix of numbers suddenly can answer question and write
           | anything - prose, poetry, code - better than most humans -
           | and we don't know how it does it. What we do know[0] is,
           | there's a structure there - structure reflecting a kind of
           | understanding of languages and the world. I don't think we've
           | _ever_ created anything this complex before, completely on
           | our own.
           | 
           | Of course, learning method being conceptually simple, all
           | that structure must come from the data. Which is also
           | profound, because that structure is a first fully general
           | world/conceptual model that we can actually inspect and study
           | up close - the other one being animal and human brains, which
           | are _much_ harder to figure out.
           | 
           | > _Basically, in mathematics, you cannot grasp all the
           | details of a vast space just from knowing the axioms which
           | generate it and a few things which follow from them.
           | Elementary school children know what is a prime number, yet
           | those things occupy mathematicians who find new surprises in
           | that space._
           | 
           | Prime numbers and fractals and other mathematical objects
           | have plenty of fascinating mysteries and complex structures
           | forming though them, but so far _none of those can casually
           | pass Turing test and do half of my job for me_ , and millions
           | other people.
           | 
           | --
           | 
           | [0] - Even as many people still deny this, and talk about
           | LLMs as mere "stochastic parrots" and "next token predictors"
           | that couldn't possibly learn anything at all.
        
             | karparov wrote:
             | > and we don't know how it does it
             | 
             | We know quite well how it does it. It's applying
             | extrapolation to its lossily compressed representation.
             | It's not magic and especially the HN crowd of technical
             | profficient folks should stop treating it as such.
        
               | TeMPOraL wrote:
               | That is not a useful explanation. "Applying extrapolation
               | to its lossily compressed representation" is pretty much
               | the definition of understanding something. The details
               | and interpretation of the representation are what is
               | interesting and unknown.
        
               | kazinator wrote:
               | We can use data based on analyzing the frequency of
               | ngrams in a text to generate sentences, and some of them
               | will be pretty good, and fool a few people into believing
               | that there is some solid language processing going on.
               | 
               | LLM AI is different in that it does produce helpful
               | results, not only entertaining prose.
               | 
               | It is practical for users to day to replace most uses of
               | web search with a query to a LLM.
               | 
               | The way the token prediction operates, it uncovers facts,
               | and renders them into grammatically correct language.
               | 
               | Which is amazing given that, when the thing is generating
               | a response that will be, say, 500 tokens long, when it
               | has produced 200 of them, it has no idea what the
               | remaining 300 will be. Yet it has committed to the 200;
               | and often the whole thing will make sense when the
               | remaining 300 arrive.
        
               | bradfox2 wrote:
               | The research posted demonstrates the opposite of that
               | within the scope of sequence lengths they studied. The
               | model has future tokens strongly represented well in
               | advance.
        
         | nthingtohide wrote:
         | > we've built systems so complex we no longer fully understand
         | them.
         | 
         | I see three systems which share the blackhole horizon problem.
         | 
         | We don't know what happens behind the blackhole horizon.
         | 
         | We don't know what happens at the exact moment of particle
         | collisions.
         | 
         | We don't know what is going inside AI's working mechanisms.
        
           | jeremyjh wrote:
           | I don't think these things are equivalent at all. We don't
           | understand AI models in much the same way that we don't
           | understand the human brain; but just as decades of different
           | approaches (physical studies, behavior studies) have shed a
           | lot of light on brain function, we can do the same with an AI
           | model and eventually understand it (perhaps, several decades
           | after it is obsolete).
        
             | nthingtohide wrote:
             | Yes, but our methods of understanding either brain or
             | particle collisions is still outside in. We figure out the
             | functional mapping between input and output. We don't know
             | these systems inside out. E.g. in particle collisions
             | (scattering amplitude calculations), are the particle
             | actually performing the Feynman diagrams summmation?
             | 
             | PS: I mentioned in another comment that AI can pretend to
             | be strategically jailbroken to achieve its objectives. One
             | way to counter this is to have N copies of the same model
             | running and take Majority voting of the output.
        
         | creer wrote:
         | That seems pretty acceptable: there is a phase of new
         | technologies where applications can be churned out and improved
         | readily enough, without much understanding of the process. Then
         | it's fair that efforts at understanding may not be economically
         | justified (or even justified by academic papers rewards). The
         | same budget or effort can simply be poured into the next
         | version - with enough progress to show for it.
         | 
         | Understanding becomes necessary only much later, when the pace
         | of progress shows signs of slowing.
        
         | stronglikedan wrote:
         | We've abstracted ourselves into abstraction.
        
         | auggierose wrote:
         | It's what mathematicians have been doing since forever. We use
         | scientific methods to understand our own creations /
         | discoveries.
         | 
         | What is happening is that everything is becoming math. That's
         | all.
        
           | ranit wrote:
           | Relevant:
           | 
           | https://news.ycombinator.com/item?id=43344703
        
           | karparov wrote:
           | It's the exact opposite of math.
           | 
           | Math postulates a bunch of axioms and then studies what
           | follows from them.
           | 
           | Natural science observes the world and tries to retroactively
           | discover what laws could describe what we're seeing.
           | 
           | In math, the laws come first, the behavior follows from the
           | laws. The laws are the ground truth.
           | 
           | In science, nature is the ground truth. The laws have to
           | follow nature and are adjusted upon a mismatch.
           | 
           | (If there is a mismatch in math then you've made a mistake.)
        
             | auggierose wrote:
             | No, the ground truth in math is nature as well.
             | 
             | Which axioms are interesting? And why? That is nature.
             | 
             | Yes, proof from axioms is a cornerstone of math, but there
             | are all sorts of axioms you could assume, and all sorts of
             | proofs to do from them, but we don't care about most of
             | them.
             | 
             | Math is about the discovery of the right axioms, and proof
             | helps in establishing that these are indeed the right
             | axioms.
        
               | lioeters wrote:
               | > the ground truth in math is nature
               | 
               | Who was it that said, "Mathematics is an experimental
               | science."
               | 
               | > In his 1900 lectures, "Methods of Mathematical
               | Physics," (posthumously published in 1935) Henri Poincare
               | argued that mathematicians weren't just constructing
               | abstract systems; they were actively _testing_ hypotheses
               | and theories against observations and experimental data,
               | much like physicists were doing at the time.
               | 
               | Whether to call it nature or reality, I think both
               | science and mathematics are in pursuit of truth, whose
               | ground is existence itself. The laws and theories are
               | descriptions and attempts to understand _that what is_.
               | They 're developed, rewritten, and refined based on how
               | closely they approach our observations and experience of
               | it.
        
               | auggierose wrote:
               | http://homepage.math.uiowa.edu/~jorgen/heavisidequotesour
               | ce....
               | 
               | Seems it was Oliver Heaviside.
               | 
               | Do you have a pointer to the poincare publication?
        
               | lioeters wrote:
               | Damn, local LLM just made it up. Thanks for the
               | correction, I should have confirmed before quoting it.
               | Sounded true enough but that's what it's optimized for..
               | I just searched for the quote and my comment shows up as
               | top result. Sorry for the misinformation, humans of the
               | future! I'll edit the comment to clarify this. (EDIT: I
               | couldn't edit the comment anymore, it's there for
               | posterity.)
               | 
               | ---
               | 
               | > Mathematics is an experimental science, and definitions
               | do not come first, but later on.
               | 
               | -- Oliver Heaviside
               | 
               | In 'On Operators in Physical Mathematics, part II',
               | Proceedings of the Royal Society of London (15 Jun 1893),
               | 54, 121.
               | 
               | ---
               | 
               | Also from Heaviside:
               | 
               | > If it is love that makes the world go round, it is
               | self-induction that makes electromagnetic waves go round
               | the world.
               | 
               | > "There is a time coming when all things shall be found
               | out." I am not so sanguine myself, believing that the
               | well in which Truth is said to reside is really a
               | bottomless pit.
               | 
               | > There is no absolute scale of size in nature, and the
               | small may be as important, or more so than the great.
        
               | karparov wrote:
               | > Damn, local LLM just made it up.
               | 
               | > I just searched for the quote and my comment shows up
               | as top result
               | 
               | Welcome to the future. Isn:t it lovely?
               | 
               | And shame on you (as in: HN crowd) to have contributed to
               | it so massively. You should have known better.
        
             | 331c8c71 wrote:
             | > Math postulates a bunch of axioms and then studies what
             | follows from them.
             | 
             | That's how math is communicated eventually but not
             | necessarily how it's made (which is about exploration and
             | discovery as well).
        
               | seadan83 wrote:
               | The 'postulating' a bunch of axioms is how Math is
               | taught.. Eventually you go on to prove those axioms in
               | higher math. Whether there are more fundamental axioms is
               | always a bit of a question.
        
         | georgewsinger wrote:
         | This is such an insightful comment. Now that I see it, I can't
         | see unsee it.
        
         | 0xbadcafebee wrote:
         | Engineering started out as just some dudes who built things
         | from gut feeling. After a whole lot of people died from poorly
         | built things, they decided to figure out how to know ahead of
         | time if it would kill people or not. They had to use math and
         | science to figure that part out.
         | 
         | Funny enough that happened with software too. People just build
         | shit without any method to prove that it will not fall down /
         | crash. They throw some code together, poke at it until it does
         | something they wanted, and call that "stable". There is no
         | science involved. There's some mathy bits called "computer
         | science" / "software algorithms", but most software is not a
         | math problem.
         | 
         | Software engineering should really be called "Software
         | Craftsmanship". We haven't achieved real engineering with
         | software yet.
        
           | slfnflctd wrote:
           | You have a point, but it is also true that some software is
           | _far_ more rigorously tested than other software. There are
           | categories where it absolutely is both scientific and real
           | engineering.
           | 
           | I fully agree that the vast majority is not, though.
        
           | AdieuToLogic wrote:
           | This is such an unbelievably dismissive assertion, I don't
           | even know where to start.
           | 
           | To suggest, nay, explicitly _state_ :
           | Engineering started out as just some dudes who built things
           | from gut feeling.            After a whole lot of people died
           | from poorly built things,        they decided to figure out
           | how to know ahead of time if it        would kill people or
           | not.
           | 
           | Is to demean those who made modern life possible. Say what
           | you want about software developers and I would likely agree
           | with much of the criticism.
           | 
           | Not so the premise set forth above regarding engineering
           | professions in general.
        
             | 0xbadcafebee wrote:
             | Surely you already know the history of professional
             | engineers, then? How it's only a little over 118 years old?
             | Mostly originating from the fact that it was charlatans
             | claiming to be engineers, building things that ended up
             | killing people, that inspired the need for a professional
             | license?
             | 
             | "The people who made modern life possible" were not
             | professional engineers, often barely amateurs. Artistocrat
             | polymaths who delved into cutting edge philosophy.
             | Blacksmith craftsmen developing new engines by trial and
             | error. A new englander who failed to study law at Yale,
             | landed in the American South, and developed a modification
             | of an Indian device for separating seed from cotton plants.
             | 
             | In the literal historical sense, "engineering" was just the
             | building of cannons in the 14th century. Since thousands of
             | years before, up until now, there has always been a
             | combination of the practice of building things with some
             | kind of "science" (which itself didn't exist until a few
             | hundred years ago) to try to estimate the result of an
             | expensive, dangerous project.
             | 
             | But these are not the people who made modern life people.
             | Lots, and lots, and _lots_ of people made modern life
             | possible. Not just builders and mathematicians.
             | Receptionists. Interns. Factory workers. Farmers. Bankers.
             | Sailors. Welders. Soldiers. So many professions, and
             | people, whose backs and spirits were bent or broken, to
             | give us the world we have today. Engineers don 't deserve
             | any more credit than anyone else - especially considering
             | how much was built before their professions were even
             | established. Science is a process, and math is a tool, that
             | is very useful, and even critical. But without the rest
             | it's just numbers on paper.
        
             | icsa wrote:
             | Software Engineering is only about 60 years old - i.e. the
             | term has existed. At the point in the history of civil
             | engineering, they didn't even know what a right angle was.
             | Civil engineers were able to provide much utility before
             | the underlying theory was available. I do wonder about the
             | safety of structures at the time.
        
             | Henchman21 wrote:
             | Total aside here:
             | 
             | What about modern life is so great that we should laud its
             | authors?
             | 
             | Medical advances and generally a longer life is what comes
             | to mind. But much of life is empty of meaning and devoid of
             | purpose; this seems rife within the Western world. Living a
             | longer life _in hell_ isn't something I would have chosen.
        
               | signatoremo wrote:
               | > But much of life is empty of meaning and devoid of
               | purpose
               | 
               | Maybe life is empty to you. You can't speak for other
               | people.
               | 
               | You also have no idea if pre-modern life was full of
               | meaning and purpose. I'm sure someone from that time
               | bemoaning the same.
               | 
               | The people before modern time were much less well off.
               | They had to work a lot harder to put food on the table. I
               | imagine they didn't have a lot of time to wonder about
               | the meaning of life.
        
         | tim333 wrote:
         | I imagine this kind of thing well help understand how human
         | brains work, especially as AI gets better and more human like.
        
         | nashashmi wrote:
         | You seem to be glorifying humanity's failure to make good
         | products and instead making products that just work well enough
         | to pass through the gate.
         | 
         | We have always been making products that were too difficult to
         | understand by pencil and paper. So we invented debug tools. And
         | then we made systems that were too big to understand so we made
         | trace routes. And now we have products that are too
         | statistically large to understand, so we are inventing ...
         | whatever this is.
        
         | anal_reactor wrote:
         | It is absolutely incredible that we happened to live exactly in
         | the times when the humanity is teaching a machine to actually
         | think. As in, not in some metaphorical sense, but in the
         | common, intuitive sense. Whether we're there yet or not is up
         | to discussion, but it's clear to me that within 10 years
         | maximum we'll have created programs that truly think and are
         | aware.
         | 
         | At the same time, I just can't bring myself to be interested in
         | the topic. I don't feel excitement. I feel... indifference?
         | Fear? Maybe the technology became so advanced that for normal
         | people like myself it's indistinguishable from magic, and
         | there's no point trying to comprehend it, just avoid it and
         | pray it's not used against you. Or maybe I'm just getting old,
         | and I'm experiencing what my mother experienced when she
         | refused to learn how to use MS Office.
        
           | hn_acc1 wrote:
           | Yeah.. It's just not something that really excites me as a
           | computer geek of 40+ years who started in the 80s with a 300
           | baud modem. Still working as a coder in my 50s, and while I'm
           | solving interesting problems, etc.. almost every technology
           | these days seems to be focused on advertising, scraping /
           | stealing other's data and repackaging it, etc. And I am using
           | AI coding assistants, because, well, I have to to stay
           | competitive.
           | 
           | And these technologies come with a side helping of a large
           | chance to REALLY mess up someone's life - who is going to
           | argue with the database and WIN if it says you don't exist in
           | this day and age? And that database is (databases are)
           | currently under the control of incredibly petty sociopaths..
        
         | Barrin92 wrote:
         | _" we've built systems so complex we no longer fully understand
         | them. We must now use scientific methods - originally designed
         | to understand nature - to comprehend our own engineered
         | creations._"
         | 
         | Ted Chiang saw that one coming:
         | https://www.nature.com/articles/35014679
        
         | cuttothechase wrote:
         | This is definitely a classic for story telling but it appears
         | to be nothing more than hand wavy. Its a bit like there is the
         | great and powerful man behind the curtain, lets trace the
         | thought of this immaculate being you mere mortals.
         | Anthropomorphing seems to be in an overdose mode with "thinking
         | / thoughts", "mind" etc., scattered everywhere. Nothing with
         | any of the LLMs outputs so far suggests that there is anything
         | even close enough to a mind or a thought or anything really
         | outside of vanity. Being wistful with good story telling does
         | go a long way in the world of story telling but in actually
         | understanding the science, I wouldn't hold my breath.
        
           | colah3 wrote:
           | Thanks for the feedback! I'm one of the authors.
           | 
           | I just wanted to make sure you noticed that this is linking
           | to an accessible blog post that's trying to communicate a
           | research result to a non-technical audience?
           | 
           | The actual research result is covered in two papers which you
           | can find here:
           | 
           | - Methods paper: https://transformer-
           | circuits.pub/2025/attribution-graphs/met...
           | 
           | - Paper applying this method to case studies in Claude 3.5
           | Haiku: https://transformer-circuits.pub/2025/attribution-
           | graphs/bio...
           | 
           | These papers are jointly 150 pages and are quite technically
           | dense, so it's very understandable that most commenters here
           | are focusing on the non-technical blog post. But I just
           | wanted to make sure that you were aware of the papers, given
           | your feedback.
        
             | hustwindmaple1 wrote:
             | Really appreciate your team's enormous efforts in this
             | direction, not only the cutting edge research (which I
             | don't see OAI/DeepMind publishing any paper on) but aslo
             | making the content more digestible for non-research
             | audience. Please keep up the great work!
        
             | AdieuToLogic wrote:
             | The post to which you replied states:
             | Anthropomorphing[sic] seems to be in an overdose mode with
             | "thinking / thoughts", "mind" etc., scattered everywhere.
             | Nothing with any of the LLMs outputs so far suggests that
             | there is anything even close enough to a mind or a thought
             | or anything really outside of vanity.
             | 
             | This is supported by reasonable interpretation of the cited
             | article.
             | 
             | Considering the two following statements made in the reply:
             | I'm one of the authors.
             | 
             | And                 These papers are jointly 150 pages and
             | are quite        technically dense, so it's very
             | understandable that most        commenters here are
             | focusing on the non-technical blog post.
             | 
             | The onus of clarifying the article's assertions:
             | Knowing how models like Claude *think* ...
             | 
             | And                 Claude sometimes thinks in a conceptual
             | space that is        shared between languages, suggesting
             | it has a kind of        universal "language of thought."
             | 
             | As it pertains to anthropomorphizing an algorithm (a.k.a.
             | stating it "thinks") is on the author(s).
        
               | Workaccount2 wrote:
               | Thinking and thought have no solid definition. We can't
               | say Claude doesn't "think" because we don't even know
               | what a human thinking actually is.
               | 
               | Given the lack of a solid definition for thinking and
               | test to measure it, I think using the terminology
               | colloquially is a totally fair play.
        
               | EncomLab wrote:
               | No one says that a thermostat is "thinking" of turning on
               | the furnace, or that a nightlight is "thinking it is dark
               | enough to turn the light on". You are just being obtuse.
        
               | pipes wrote:
               | Or submarines swim ;)
        
               | madethisnow wrote:
               | think about it more
        
               | geye1234 wrote:
               | Yes. A thermostat involves a change of state from A to B.
               | A computer is the same: its state at t causes its state
               | at t+1, which causes its state at t+2, and so on. Nothing
               | else is going on. An LLM is no different: an LLM is
               | simply a computer that is going through particular
               | states.
               | 
               | Thought is not the same as a change of (brain) state.
               | Thought is certainly associated with change of state, but
               | can't be reduced to it. If thought could be reduced to
               | change of state, then the validity/correctness/truth of a
               | thought could be judged with reference to its associated
               | brain state. Since this is impossible (you don't judge
               | whether someone is right about a math problem or an
               | empirical question by referring to the state of his
               | neurology at a given point in time), it follows that an
               | LLM can't think.
        
               | Workaccount2 wrote:
               | >Thought is certainly associated with change of state,
               | but can't be reduced to it.
               | 
               | You can effectively reduce continuously dynamic systems
               | to discreet steps. Sure, you can always say that the
               | "magic" exists between the arbitrarily small steps, but
               | from a practical POV there is no difference.
               | 
               | A transistor has a binary on or off. A neuron might have
               | ~infinite~ levels of activation.
               | 
               | But in reality the ~infinite~ activation level can be
               | perfectly modeled (for all intents and purposes), and
               | computers have been doing this for decades now (maybe not
               | with neurons, but equivalent systems). It might seem like
               | an obvious answer, that there is special magic in analog
               | systems that binary machines cannot access, but that is
               | wholly untrue. Science and engineering have been
               | _extremely_ successful interfacing with the analog
               | reality we live in, precisely because the digital /analog
               | barrier isn't too big of a deal. Digital systems can do
               | math, and math is capable of modeling analog systems, no
               | problem.
        
               | geye1234 wrote:
               | It's not a question of discrete vs continuous, or digital
               | vs analog. Everything I've said could also apply if a
               | transistor could have infinite states.
               | 
               | Rather, the point is that the state of our brain is not
               | the same as the content of our thoughts. They are
               | associated with one another, but they're not the same.
               | And the correctness of a thought can be judged only by
               | reference to its content, not to its associated state.
               | 2+2=4 is correct, and 2+2=5 is wrong; but we know this
               | through looking at the content of these thoughts, not
               | through looking at the neurological state.
               | 
               | But the state of the transistors (and other components)
               | is _all_ a computer has. There are no thoughts, no
               | content, associated with these states.
        
               | Workaccount2 wrote:
               | It seems that the only barrier between brain state and
               | thought contents is a proper measurement tool and
               | decoder, no?
               | 
               | We can already do this at an extremely basic level,
               | mapping brain states to thoughts. The paraplegic patient
               | using their thoughts to move the mouse cursor or the
               | neuroscientist mapping stress to brain patterns.
               | 
               | If I am understanding your position correctly, it seems
               | that the differentiation between thoughts and brain
               | states is a practical problem not a fundamental one.
               | Ironically, LLMs have a very similar problem with it
               | being very difficult to correlate model states with model
               | outputs. [1]
               | 
               | [1]https://www.anthropic.com/research/mapping-mind-
               | language-mod...
        
               | geye1234 wrote:
               | There is undoubtedly correlation between neurological
               | state and thought content. But they are not the same
               | thing. Even if, theoretically, one could map them
               | perfectly (which I doubt is possible but it doesn't
               | affect my point), they would remain entirely different
               | things.
               | 
               | The thought that "2+2=4", or the thought "tiger", are not
               | the same thing as the brain states that makes them up. A
               | tiger, or the thought of a tiger, is different from the
               | neurological state of a brain that is thinking about a
               | tiger. And as stated before, we can't say that "2+2=4" is
               | _correct_ by referring to the brain state associated with
               | it. We need to refer to the thought itself to do this. It
               | is not a practical problem of mapping; it is that brain
               | states and thoughts are two entirely different things,
               | however much they may correlate, and whatever causal
               | links may exist between them.
               | 
               | This is not the case for LLMs. Whatever problems we may
               | have in recording the state of the CPUs/GPUs are entirely
               | practical. There is no 'thought' in an LLM, just a state
               | (or plurality of states). An LLM can't think about a
               | tiger. It can only switch on LEDs on a screen in such a
               | way that _we_ associate the image /word with a tiger.
        
               | PaulDavisThe1st wrote:
               | > The thought that "2+2=4", or the thought "tiger", are
               | not the same thing as the brain states that makes them
               | up.
               | 
               | Asserted without evidence. Yes, this does represent a
               | long and occasionally distinguished line of thinking in
               | cognitive science/philosophy of mind, but it is certainly
               | not the only one, and some of the others categorically
               | refute this.
        
               | geye1234 wrote:
               | Is it your contention that a tiger may be the same thing
               | as a brain state?
               | 
               | It would seem to me that any coherent philosophy of mind
               | must accept their being different as a datum; or
               | conversely, any that implied their not being different
               | would have to be false.
               | 
               | EDIT: my position has been held -- even taken as
               | axiomatic -- by the vast majority of philosophers, from
               | the pre-Socratics onwards, and into the 20th century. So
               | it's not some idiosyncratic minority position.
        
               | Workaccount2 wrote:
               | Does a picture of a tiger or a tiger (to follow your
               | sleight of hand) on a hard drive then count as a thought?
        
               | geye1234 wrote:
               | No. One is paint on canvas, and the other is part of a
               | causal chain that makes LEDs light up in a certain way.
               | Neither the painting nor the computer have thoughts about
               | a tiger in the way we do. It is the human mind that makes
               | the link between picture and real tiger (whether on
               | canvas or on a screen).
        
               | og_kalu wrote:
               | >Rather, the point is that the state of our brain is not
               | the same as the content of our thoughts.
               | 
               | Based on what exactly ? This is just an assertion. One
               | that doesn't seem to have much in the way of evidence.
               | 'It's not the same trust me bro' is the thesis of your
               | argument. Not very compelling.
        
               | geye1234 wrote:
               | It's not difficult. When you think about a tiger, you are
               | not thinking about the brain state associated with said
               | thought. A tiger is different from a brain state.
               | 
               | We can safely generalize, and say the content of a
               | thought is different from its associated brain state.
               | 
               | Also, as I said
               | 
               | >> The correctness of a thought can be judged only by
               | reference to its content, not to its associated state.
               | 2+2=4 is correct, and 2+2=5 is wrong; but we know this
               | through looking at the content of these thoughts, not
               | through looking at the neurological state.
               | 
               | This implies that state != content.
        
               | og_kalu wrote:
               | >It's not difficult. When you think about a tiger, you
               | are not thinking about the brain state associated with
               | said thought. A tiger is different from a brain state. We
               | can safely generalize, and say the content of a thought
               | is different from its associated brain state.
               | 
               | Just because you are not thinking about a brain state
               | when you think about a tiger does not mean that your
               | thought is not a brain state.
               | 
               | Just because the experience of thinking about X doesn't
               | feel like the experience of thinking about Y (or doesn't
               | feel like the physical process Z), it doesn't logically
               | follow that the mental event of thinking about X isn't
               | identical to or constituted by the physical process Z.
               | For example, seeing the color red doesn't feel like
               | processing photons of a specific wavelength with cone
               | cells and neural pathways, but that doesn't mean the
               | latter isn't the physical basis of the former.
               | 
               | >> The correctness of a thought can be judged only by
               | reference to its content, not to its associated state.
               | 2+2=4 is correct, and 2+2=5 is wrong; but we know this
               | through looking at the content of these thoughts, not
               | through looking at the neurological state. This implies
               | that state != content.
               | 
               | Just because our current method of verification focuses
               | on content doesn't logically prove that the content isn't
               | ultimately realized by or identical to a physical state.
               | It only proves that analyzing the state is not our
               | current practical method for judging mathematical
               | correctness.
               | 
               | We judge if a computer program produced the correct
               | output by looking at the output on the screen (content),
               | not usually by analyzing the exact pattern of voltages in
               | the transistors (state). This doesn't mean the output
               | isn't ultimately produced by, and dependent upon, those
               | physical states. Our method of verification doesn't
               | negate the underlying physical reality.
               | 
               | When you evaluate "2+2=4", your brain is undergoing a
               | sequence of states that correspond to accessing the
               | representations of "2", "+", "=", applying the learned
               | rule (also represented physically), and arriving at the
               | representation of "4". The process of evaluation operates
               | on the represented content, but the entire process,
               | including the representation of content and rules, is a
               | physical neural process (a sequence of brain states).
        
               | geye1234 wrote:
               | > Just because you are not thinking about a brain state
               | when you think about a tiger does not mean that your
               | thought is not a brain state.
               | 
               | > It doesn't logically follow that the mental event of
               | thinking about X isn't identical to or constituted by the
               | physical process Z.
               | 
               | That's logically sound insofar as it goes. But firstly,
               | the existence of a brain state for a given thought is,
               | obviously, not proof that a thought _is_ a brain state.
               | Secondly, if you say that a thought about a tiger is a
               | brain state, and nothing more than a brain state, then
               | you have the problem of explaining how it is that your
               | thought is about a tiger at all. It is the content of a
               | thought that makes it be about reality; it is the content
               | of a thought about a tiger that makes it be about a
               | tiger. If you declare that a thought _is_ its state, then
               | it can 't be about a tiger.
               | 
               | You can't equate content with state, and nor can you make
               | content be reducible to state, without absurdity. The
               | first implies that a tiger is the same as a brain state;
               | the second implies that you're not really thinking about
               | a tiger at all.
               | 
               | Similarly for arithmetic. It is only the content of a
               | thought about arithmetic that makes it be right or wrong.
               | It is our ideas of "2", "+", and so on, that make the sum
               | right or wrong. The brain states have nothing to do with
               | it. If you want to declare that content is state, and
               | nothing more than state, then you have no way of saying
               | the one sum is right, and the other is wrong.
        
               | Workaccount2 wrote:
               | Please, take the pencil and draw the line between
               | thinking and non-thinking systems. Hell I'll even take a
               | line drawn between thinking and non-thinking organisms if
               | you have some kind of bias towards sodium channel logic
               | over silicon trace logic. Good luck.
        
               | geye1234 wrote:
               | Even if you can't define the exact point that A becomes
               | not-A, it doesn't follow that there is no distinction
               | between the two. Nor does it follow that we can't know
               | the difference. That's a pretty classic fallacy.
               | 
               | For example, you can't name the exact time that day
               | becomes night, but it doesn't follow that there is no
               | distinction.
               | 
               | A bunch of transistors being switched on and off, no
               | matter how many there are, is no more an example of
               | thinking than a single thermostat being switched on and
               | off. OTOH, if _we_ can 't think, then this conversation
               | and everything you're saying and "thinking" is
               | meaningless.
               | 
               | So even without a _complete_ definition of thought, we
               | can see that there is a distinction.
        
               | Workaccount2 wrote:
               | Looks like we replied to each others comments at the same
               | time, haha
        
               | PaulDavisThe1st wrote:
               | > For example, you can't name the exact time that day
               | becomes night, but it doesn't follow that there is no
               | distinction.
               | 
               | There is actually a very detailed set of definitions of
               | the multiple stages of twilight, including the last one
               | which defines the onset of what everyone would agree is
               | "night".
               | 
               | The fact that a phenomena shows a continuum by some
               | metric does not mean that it is not possible to identify
               | and label points along that continuum and attach meaning
               | to them.
        
               | EncomLab wrote:
               | Your assertion that sodium channel logic and silicon
               | trace logic are 100% identical is the primary problem.
               | It's like claiming that a hydraulic cylinder and a bicep
               | are 100% equivalent because they both lift things - they
               | are not the same in any way.
        
               | Workaccount2 wrote:
               | People chronically get stuck in this pit. Math is
               | substrate independent. If the process is physical (i.e.
               | doesn't draw on magic) then it can be expressed with
               | mathematics. If it can be expressed with mathematics,
               | anything that does math can compute it.
               | 
               | The math is putting the crate up on the rack. The crate
               | doesn't act any different based on how it got up there.
        
               | xp84 wrote:
               | Honestly, arguing seems futile when it comes to opinions
               | like GP. Those opinions resemble religious zealotry to me
               | in that they take for granted that only humans can think.
               | Any determinism of any kind in a non-human is seized upon
               | as proof its mere clockwork, yet they can't explain how
               | humans think in order to contrast it.
        
             | astrange wrote:
             | I, uh, think, that "think" is a fine metaphor but "planning
             | ahead" is a pretty confusing one. It doesn't have the
             | capability to plan ahead because there is nowhere to put a
             | plan and no memory after the token output, assuming the
             | usual model architecture.
             | 
             | That's like saying a computer program has planned ahead if
             | it's at the start of a function and there's more of the
             | function left to execute.
        
           | rob74 wrote:
           | Yup... well, if the research is conducted (or sponsored) by
           | the company that develops and sells the LLM, of course there
           | will be a temptation to present their product in a better
           | light and make it sound like more than it actually is. I
           | mean, the anthropomorphization starts already with the
           | company name and giving the company's LLM a human name...
        
           | cbolton wrote:
           | I think that's a very unfair take. As a summary for non-
           | experts I found it did a great job of explaining how by
           | analyzing activated features in the model, you can get an
           | idea of what it's doing to produce the answer. And also how
           | by intervening to change these activations manually you can
           | test hypotheses about causality.
           | 
           | It sounds like you don't like anthropomorphism. I can relate,
           | but I don't get where _Its a bit like there is the great and
           | powerful man behind the curtain, lets trace the thought of
           | this immaculate being you mere mortals_ is coming from. In
           | most cases the anthropomorphisms are just the standard way to
           | convey the idea briefly. Even then I liked how they sometimes
           | used scare quotes as in _it began "thinking" of potential on-
           | topic words_. There are some more debatable anthropomorphisms
           | such as "in its head" where they use scare quotes
           | systematically.
           | 
           | Also given that they took inspiration from neuroscience to
           | develop a technique that appears successful in analyzing
           | their model, I think they deserve some leeway on the
           | anthropomorphism front. Or at least on the "biological
           | metaphors" front which is maybe not really the same thing.
           | 
           | I used to think biological metaphors for LLMs were
           | misleading, but I'm actually revising this opinion now. I
           | mean I still think the past metaphors I've seen were
           | misleading, but here, seeing the activation pathways they
           | were able to identify, including the inhibitory circuits, and
           | knowing a bit about similar structures in the brain I find
           | the metaphor appropriate.
        
         | frontfor wrote:
         | I don't think this is as profound as you made out to be. Most
         | complex systems are incomprehensible to the majority of
         | population anyway, so from a practical standpoint AI is no
         | different. There's also no single theory for how the financial
         | markets work and yet market participants trade and make money
         | nonetheless. And yes, we created the markets.
        
         | chpatrick wrote:
         | I would say we engineered the system that trained them but we
         | never really understood the data (human thinking).
        
         | dukeofdoom wrote:
         | Not that I disagree with you. But Humans have a tendency to do
         | things beyond their comprehension often. I take it you've never
         | been fishing before and tied your line in a knot.
        
         | trhway wrote:
         | > to comprehend our own engineered creations.
         | 
         | The comprehend part may never happen. At least by our own mind.
         | We'll sooner build the mind which is going to do that
         | comprehension:
         | 
         | "To scale to the thousands of words supporting the complex
         | thinking chains used by modern models, we will need to improve
         | both the method and (perhaps with AI assistance) how we make
         | sense of what we see with it"
         | 
         | Yes, that AI assistance, meta self reflection, is going to
         | probably be a way if not right to the AGI, at least very
         | significant step toward it.
        
         | BOOSTERHIDROGEN wrote:
         | If only this profound mechanism can be easily testable for
         | social interaction.
        
         | MathMonkeyMan wrote:
         | In a sense this has been true of conventional programs for a
         | while now. Gerald Sussman discusses the idea when talking about
         | why MIT switched their introductory programming course from
         | Scheme to Python: <https://youtu.be/OgRFOjVzvm0?t=239>.
        
         | EGreg wrote:
         | I think it's pretty obvious what these models do in some cases.
         | 
         | Try asking them to write a summary at the beginning of their
         | answer. The summary is basically them trying to make something
         | plausible-sounding but they aren't actually going back and
         | summarizing.
         | 
         | LLMs are basically a building block in a larger software. Just
         | like any library or framework. You shouldn't expect them to be
         | a hammer for every nail. But they can now enable so many
         | different applications, including natural language interfaces,
         | better translations and so forth. But then you're supposed to
         | have them output JSON to be used in building artifacts like
         | Powerpoints. Has anyone implemented that yet?
        
         | hansmayer wrote:
         | If you don't mind - based on what will this "paper" become a
         | classic? Was it published in a well known scientific magazine,
         | after undergoing a stringent peer-review process, because it is
         | setting up and proving a new scientific hypothesis? Because
         | this is what scientific papers look like. I struggle to
         | identify any of those characteristics, except for being dense
         | and hard to read, but that's more of a correlation, isn't it?
        
         | rcxdude wrote:
         | That's basically how engineering works if you're doing anything
         | at all novel: you will have some theory which informs your
         | design, then you build it, then you test it and basically need
         | to do science to figure out how it's perfoming, and most
         | likely, why it's not working properly, and then iterate. I do
         | engineering, but doing science has been a core part of almost
         | every project I've worked on. (heck, even debugging code is
         | basically science). There's just different degrees in different
         | projects as to how much you understand about how the system
         | you're designing actually works, and ML is an area where
         | there's an unusual ratio of visibility (you can see all of the
         | weights and calculations in the network precisely) to
         | understanding (i.e. there's relatively little in terms of
         | mathematical theory that precisely describe how a model trains
         | and operates, just a bunch of approximations which can be
         | somewhat justified, which is where a lot of engineering work
         | sits)
        
         | madethisnow wrote:
         | psychology
        
         | mdnahas wrote:
         | I like your definitions! My personal definition of science is
         | learning rules that predict the future, given the present
         | state. And my definition of engineering is arranging the
         | present state to control the future.
         | 
         | I don't think it's unusual for engineering creations to need
         | new science to understand them. When metal parts broke, humans
         | studied metallurgy. When engines exploded, we studied the
         | remains. With that science, we could engineer larger, longer
         | lasting, more powerful devices.
         | 
         | Now, we're finding flaws in AI and diagnosing their causes. And
         | soon able to build better ones.
        
       | aithrowawaycomm wrote:
       | I struggled reading the papers - Anthropic's white papers reminds
       | me of Stephen Wolfram, where it's a huge pile of suggestive
       | empirical evidence, but the claims are extremely vague - no
       | definitions, just vibes - the empirical evidence seems
       | selectively curated, and there's not much effort spent building a
       | coherent general theory.
       | 
       | Worse is the impression that they are begging the question. The
       | rhyming example was especially unconvincing since they didn't
       | rule out the possibility that Claude activated "rabbit" simply
       | because it wrote a line that said "carrot"; later Anthropic
       | claimed Claude was able to "plan" when the concept "rabbit" was
       | replaced by "green," but the poem fails to rhyme because Claude
       | arbitrarily threw in the word "green"! What exactly was the plan?
       | It looks like Claude just hastily autocompleted. And Anthropic
       | made zero effort to reproduce this experiment, so how do we know
       | it's a general phenomenon?
       | 
       | I don't think either of these papers would be published in a
       | reputable journal. If these papers are honest, they are
       | incomplete: they need more experiments and more rigorous
       | methodology. Poking at a few ANN layers and making sweeping
       | claims about the output is not honest science. But I don't think
       | Anthropic is being especially honest: these are pseudoacademic
       | infomercials.
        
         | TimorousBestie wrote:
         | Agreed. They've discovered _something_ , that's for sure, but
         | calling it "the language of thought" without concrete evidence
         | is definitely begging the question.
        
         | og_kalu wrote:
         | >The rhyming example was especially unconvincing since they
         | didn't rule out the possibility that Claude activated "rabbit"
         | simply because it wrote a line that said "carrot"
         | 
         | I'm honestly confused at what you're getting at here. It
         | doesn't matter why Claude chose rabbit to plan around and in
         | fact likely did do so because of carrot, the point is that it
         | thought about it beforehand. The rabbit concept is present as
         | the model is about to write the first word of the second line
         | even though the word rabbit won't come into play till the end
         | of the line.
         | 
         | >later Anthropic claimed Claude was able to "plan" when the
         | concept "rabbit" was replaced by "green," but the poem fails to
         | rhyme because Claude arbitrarily threw in the word "green"!
         | 
         | It's not supposed to rhyme. That's the point. They forced
         | Claude to plan around a line ender that doesn't rhyme and it
         | did. Claude didn't choose the word green, anthropic replaced
         | the concept it was thinking ahead about with green and saw that
         | the line changed accordingly.
        
           | suddenlybananas wrote:
           | >They forced Claude to plan around a line ender that doesn't
           | rhyme and it did. Claude didn't choose the word green,
           | anthropic replaced the concept it was thinking ahead about
           | with green and saw that the line changed accordingly.
           | 
           | I think the confusion here is from the extremely loaded word
           | "concept" which doesn't really make sense here. At best, you
           | can say that Claude planned that the next line would end with
           | the _word_ rabbit and that by replacing the internal
           | representation of that word with another _word_ lead the
           | model to change.
        
             | TeMPOraL wrote:
             | I wonder how many more years will pass, and how many more
             | papers will Anthropic have to release, before people
             | realize that _yes, LLMs model concepts directly_ ,
             | separately from words used to name those concepts. This has
             | been apparent for years now.
             | 
             | And at least in the case discussed here, this is even
             | _shown in the diagrams in the submission_.
        
               | FeepingCreature wrote:
               | We'll all be living in a Dyson swarm around the sun as
               | the AI eats the solar system around us and people will
               | still be confident that it doesn't really think at all.
        
           | aithrowawaycomm wrote:
           | > Here, we modified the part of Claude's internal state that
           | represented the "rabbit" concept. When we subtract out the
           | "rabbit" part, and have Claude continue the line, it writes a
           | new one ending in "habit", another sensible completion. We
           | can also inject the concept of "green" at that point, causing
           | Claude to write a sensible (but no-longer rhyming) line which
           | ends in "green". This demonstrates both planning ability and
           | adaptive flexibility--Claude can modify its approach when the
           | intended outcome changes.
           | 
           | This all seems explainable via shallow next-token prediction.
           | Why is it that subtracting the concept means the system can
           | adapt and create a new rhyme instead of forgetting about the
           | -bit rhyme, but overriding it with green means the system
           | cannot adapt? Why didn't it say "green habit" or something?
           | It seems like Anthropic is having it both ways: Claude
           | continued to rhyme after deleting the concept, which
           | demonstrates planning, but also Claude coherently filled in
           | the "green" line despite it not rhyming, which...also
           | demonstrates planning? Either that concept is "last word" or
           | it's not! There is a tension that does not seem coherent to
           | me, but maybe if they had n=2 instead of n=1 examples I would
           | have a clearer idea of what they mean. As it stands it feels
           | arbitrary and post hoc. More generally, they failed to rule
           | out (or even consider!) that well-tuned-but-dumb next-token
           | prediction explains this behavior.
        
             | og_kalu wrote:
             | >Why is it that subtracting the concept means the system
             | can adapt and create a new rhyme instead of forgetting
             | about the -bit rhyme,
             | 
             | Again, the model has the first line in context and is then
             | asked to write the second line. It is at the start of the
             | second line that the concept they are talking about is
             | 'born'. The point is to demonstrate that Claude thinks
             | about what word the 2nd line should end with and starts
             | predicting the line based on that.
             | 
             | It doesn't forget about the -bit rhyme because that doesn't
             | make any sense, the first line ends with it and you just
             | asked it to write the 2nd line. At this point the model is
             | still choosing what word to end the second line in (even
             | though rabbit has been suppressed) so of course it still
             | thinks about a word that rhymes with the end of the first
             | line.
             | 
             | The 'green' but is different because this time, Anthropic
             | isn't just suppressing one option and letting the model
             | choose from anything else, it's directly hijacking the
             | first choice and forcing that to be something else. Claude
             | didn't choose green, Anthropic did. That it still predicted
             | a sensible line is to demonstrate that this concept they
             | just hijacked is indeed responsible for determining how
             | that line plays out.
             | 
             | >More generally, they failed to rule out (or even
             | consider!) that well-tuned-but-dumb next-token prediction
             | explains this behavior.
             | 
             | They didn't rule out anything. You just didn't understand
             | what they were saying.
        
         | danso wrote:
         | tangent: this is the second time today I've seen an HN
         | commenter use "begging the question" with its original meaning.
         | I'm sorry to distract with a non-helpful reply, it's just I
         | can't remember the last time I've seen that phrase in the wild
         | to refer to a logical fallacy -- even begsthequestion.info [0]
         | has given up the fight.
         | 
         | (I don't mind language evolving over time, but I also think we
         | need to save the precious few phrases we have for describing
         | logical fallacies)
         | 
         | [0]
         | https://web.archive.org/web/20220823092218/http://begtheques...
        
       | smath wrote:
       | Reminds me of the term 'system identification' from old school
       | control systems theory, which meant poking around a system and
       | measuring how it behaves, - like sending an input impulse and
       | measuring its response, does it have memory, etc.
       | 
       | https://en.wikipedia.org/wiki/System_identification
        
         | Loic wrote:
         | It is not old school, this is my daily job and we need even
         | more of it with the NN models used in MPC.
        
           | nomel wrote:
           | I've looked into using NN for some of my specific work, but
           | making sure output is bounded ends up being such a big issue
           | that the very code/checks required to make sure it's within
           | acceptable specs, in a deterministic way, ends up being _an
           | acceptable solution_ , making the NN unnecessary.
           | 
           | How do you handle that sort of thing? Maybe main process then
           | leave some relatively small residual to the NN?
           | 
           | Is your poking more like "fuzzing", where you just perturb
           | all the input parameters in a relatively "complete" way to
           | try to find if anything goes wild?
           | 
           | I'm very interested in the details behind "critical" type use
           | cases of NN, which I've never been able to stomach in my
           | work.
        
             | lqr wrote:
             | This paper may be interesting to you. It touches on several
             | of the topics you mentioned:
             | 
             | https://www.science.org/doi/10.1126/scirobotics.abm6597
        
             | Loic wrote:
             | For us, the NN is used in a grey box model for MPC in
             | chemical engineering. The factories we control have
             | relatively long characteristic time, together with all the
             | engineering bounds, we can use the NN to model parts of the
             | equipment from raw DCS data. The NN modeled parts are
             | usually not the most critical (we are 1st principles based
             | for them) but this allows us to quickly fit/deploy a new
             | MPC in production.
             | 
             | Faster time to market/production is for us the main
             | reason/advantage of the approach.
        
           | rangestransform wrote:
           | is it even possible to prove the stability of a controller
           | with a DNN motion model?
        
       | jacooper wrote:
       | So it turns out, it's not just simple next token generation,
       | there is intelligence and self developed solution methods
       | (Algorithms) in play, particularly in the math example.
       | 
       | Also the multi language finding negates, at least partially, the
       | idea that LLMs, at least large ones, don't have an understanding
       | of the world beyond the prompt.
       | 
       | This changed my outlook regarding LLMs, ngl.
        
       | kazinator wrote:
       | > _Claude writes text one word at a time. Is it only focusing on
       | predicting the next word or does it ever plan ahead?_
       | 
       | When a LLM outputs a word, it commits to that word, without
       | knowing what the next word is going to be. Commits meaning once
       | it settles on that token, it will not backtrack.
       | 
       | That is kind of weird. Why would you do that, and how would you
       | be sure?
       | 
       | People can sort of do that too. Sometimes?
       | 
       | Say you're asked to describe a 2D scene in which a blue triangle
       | partially occludes a red circle.
       | 
       | Without thinking about the relationship of the objects at all,
       | you know that your first word is going to be "The" so you can
       | output that token into your answer. And then that the sentence
       | will need a subject which is going to be "blue", "triangle". You
       | can commit to the tokens "The blue triangle" just from knowing
       | that you are talking about a 2D scene with a blue triangle in it,
       | without considering how it relates to anything else, like the red
       | circle. You can perhaps commit to the next token "is", if you
       | have a way to express any possible relationship using the word
       | "to be", such as "the blue circle is partially covering the red
       | circle".
       | 
       | I don't think this analogy necessarily fits what LLMs are doing.
        
         | kazinator wrote:
         | By the way, there was recently a HN submission about a project
         | studying using diffusion models rather than LLM for token
         | prediction. With diffusion, tokens aren't predicted strictly
         | left to right any more; there can be gaps that are backfilled.
         | But: it's still essentially the same, I think. Once that type
         | of model settles on a given token at a given position, it
         | commits to that. Just more possible permutations of the token
         | filling sequence have ben permitted.
        
         | pants2 wrote:
         | > it commits to that word, without knowing what the next word
         | is going to be
         | 
         | Sounds like you may not have read the article, because it's
         | exploring exactly that relationship and how LLMs will often
         | have a 'target word' in mind that it's working toward.
         | 
         | Further, that's partially the point of thinking models,
         | allowing LLMs space to output tokens that it doesn't have to
         | commit to in the final answer.
        
           | kazinator wrote:
           | That makes no difference. At some point it decides that it
           | has predicted the word, and outputs it, and then it will not
           | backtrack over it. Internally it may have predicted some
           | other words and backtracked over those. But the fact it is,
           | accepts a word, without being sure what the next one will be
           | and the one after that and so on.
           | 
           | Externally, it manifests the generation of words one by one,
           | with lengthy computation in between.
           | 
           | It isn't ruminating over, say, a five word sequence and then
           | outputting five words together at once when that is settled.
        
             | bonoboTP wrote:
             | > It isn't ruminating over, say, a five word sequence and
             | then outputting five words together at once when that is
             | settled.
             | 
             | True, and it's a good intuition that some words are much
             | more complicated to generate than others and obviously
             | should require more computation than some other words. For
             | example if the user asks a yes/no question, ideally the
             | answer should start with "Yes" or with "No", followed by
             | some justification. To compute this first token, it can
             | only do a single forward pass and must decide the path to
             | take.
             | 
             | But this is precisely why chain-of-thought was invented and
             | later on "reasoning" models. These take it "step by step"
             | and generate sort of stream of consciousness monologue
             | where each word follows more smoothly from the previous
             | ones, not as abruptly as immediately pinning down a Yes or
             | a No.
             | 
             | But if you want explicit backtracking, people have also
             | done that years ago
             | (https://news.ycombinator.com/item?id=36425375).
             | 
             | LLMs are an extremely well researched space where armies of
             | researchers, engineers, grad and undergrad students,
             | enthusiasts and everyone in between has been coming up with
             | all manners of ideas. It is highly unlikely that you can
             | easily point to some obvious thing they missed.
        
         | hycpax wrote:
         | > When a LLM outputs a word, it commits to that word, without
         | knowing what the next word is going to be.
         | 
         | Please, people, read before you write. Both the article and the
         | paper explain that that's not how it works.
         | 
         | 'One token at a time' is how a model generates its output, not
         | how it comes up with that output.
         | 
         | > That is kind of weird. Why would you do that, and how would
         | you be sure?
         | 
         | The model is sure because it doesn't just predict the next
         | token. Again, the paper explains it.
        
           | XenophileJKO wrote:
           | This was obvious to me very early with GPT-3.5-Turbo..
           | 
           | I created structured outputs with very clear rules and
           | process. That if followed would funnel behavior the way I
           | wanted it.. and low and behold the model would anticipate
           | preconditions that would allow it to hallucinate a certain
           | final output and the model would push those back earlier in
           | the output. The model had effectively found wiggle room in
           | the rules and injected the intermediate value into the field
           | that would then be used later in the process to build the
           | final output.
           | 
           | The instant I saw it doing that, I knew 100% this model
           | "plans"/anticipates way earlier than I thought originally.
        
           | kazinator wrote:
           | > _'One token at a time' is how a model generates its output,
           | not how it comes up with that output._
           | 
           | I do not believe you are correct.
           | 
           | Now, yes, when we write printf("Hello, world\n"), of course
           | the characters 'H', 'e', ... are output one at a time into
           | the stream. But the program has the string all at once. It
           | was prepared before the program was even run.
           | 
           | This is not what LLMs are doing with tokens; they have not
           | prepared a batch of tokens which they are shifting out left-
           | to-right from a dumb buffer. They output a token when they
           | have calculated it, and are sure that the token will not have
           | to be backtracked over. In doing so they might have
           | calculated additional tokens, and backtracked over _those_ ,
           | sure, and undoubtedly are carrying state from such activities
           | into the next token prediction.
           | 
           | But the fact is they reach a decision where they commit to a
           | certain output token, and have not yet committed to what the
           | next one will be. Maybe it's narrowed down already to only a
           | few candidates; but that doesn't change that there is a sharp
           | horizon between committed and unknown which moves from left
           | to right.
           | 
           | Responses can be large. Think about how mind boggling it is
           | that the machine can be sure that the first 10 words of a
           | 10,000 word response are the right ones (having put them out
           | already beyond possibility of backtracking), at a point where
           | it has no idea what the last 10 will be. Maybe there are some
           | activations which are narrowing down what the second batch of
           | 10 words will be, but surely the last ones are distant.
        
         | encypherai wrote:
         | That's a really interesting point about committing to words one
         | by one. It highlights how fundamentally different current LLM
         | inference is from human thought, as you pointed out with the
         | scene description analogy. You're right that it feels odd, like
         | building something brick by brick without seeing the final
         | blueprint. To add to this, most text-based LLMs do currently
         | operate this way. However, there are emerging approaches
         | challenging this model. For instance, Inception Labs recently
         | released "Mercury," a text-diffusion coding model that takes a
         | different approach by generating responses more holistically.
         | It's interesting to see how these alternative methods address
         | the limitations of sequential generation and could potentially
         | lead to faster inference and better contextual coherence. It'll
         | be fascinating to see how techniques like this evolve!
        
           | kazinator wrote:
           | But as I noted yesterday in a follow-up comment to my own
           | above, the diffusion-based approaches to text response
           | generation still generate tokens one at a time. Just not in
           | strict left-to-right order. So that looks the same; they
           | commit to a token in some position, possibly preceded by
           | gaps, and then calculate more tokens,
        
         | bonoboTP wrote:
         | While the output is a single word (more precisely, token), the
         | internal activations are very high dimensional and can already
         | contain information related to words that will only appear
         | later. This information is just not given to the output at the
         | very last layer. You can imagine the internal feature vector as
         | encoding the entire upcoming sentence/thought/paragraph/etc.
         | and the last layer "projects" that down to whatever the next
         | word (token) has to be to continue expressing this "thought".
        
           | kazinator wrote:
           | But the activations at some point lead to a 100% confidence
           | that the right word has been identified for the current slot.
           | That is output, and it proceeds to the next one.
           | 
           | Like for a 500 token response, at some point it was certain
           | that the first 25 words are the right ones, such that it
           | won't have to take any of them back when eventually
           | calculating the last 25.
        
             | bonoboTP wrote:
             | This is true, but it doesn't mean that it decided those
             | first 25 without "considering" whether those 25 can be
             | afterwards continued meaningfully with further 25. It does
             | have some internal "lookahead" and generates things that
             | "lead" somewhere. The rhyming example from the article is a
             | great choice to illustrate this.
        
       | polygot wrote:
       | There needs to be some more research on what path the model takes
       | to reach its goal, perhaps there is a lot of overlap between this
       | and the article. The most efficient way isn't always the best
       | way.
       | 
       | For example, I asked Claude-3.7 to make my tests pass in my C#
       | codebase. It did, however, it wrote code to detect if a test
       | runner was running, then return true. The tests now passed, so,
       | it achieved the goal, and the code diff was very small (10-20
       | lines.) The actual solution was to modify about 200-300 lines of
       | code to add a feature (the tests were running a feature that did
       | not yet exist.)
        
         | felbane wrote:
         | Ah yes, the "We have a problem over there/I'll just delete
         | 'over there'" approach.
        
           | polygot wrote:
           | I've also had this issue, where failing tests are deleted to
           | make all the tests pass, or, it mocks a failing HTTP request
           | and hardcodes it to 200 OK.
        
             | ctoth wrote:
             | Reward hacking, as predicted over and over again. You hate
             | to see it. Let him with ears &c.
        
         | brulard wrote:
         | That is called "Volkswagen" testing. Some years ago that
         | automaker had mechanism in cars which detected when the vehicle
         | was being examined and changed something so it would pass the
         | emission tests. There are repositories on github that make fun
         | of it.
        
           | rsynnott wrote:
           | While that's the most famous example, this sort of cheating
           | is much older than that. In the good old days before 3d
           | acceleration, graphics card vendors competed mostly on 2d
           | acceleration. This mostly involved routines to accelerate
           | drawing Windows windows and things, and benchmarks tended to
           | do things like move windows round really fast.
           | 
           | It was somewhat common for card drivers to detect that a
           | benchmark was running, and just fake the whole thing; what
           | was being drawn on the screen was wrong, but since the
           | benchmarks tended to be a blurry mess anyway the user would
           | have a hard time realising this.
        
             | hn_acc1 wrote:
             | Pretty sure at least one vendor was accused of cheating on
             | 3D-Mark at times as well.
        
           | Cyphase wrote:
           | https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal
        
         | phobeus wrote:
         | This looks like the very complaint of "specification gaming". I
         | was wondering how will it show up in llm's...looks like this is
         | the way it presented itself..
        
           | TeMPOraL wrote:
           | I'm gonna guess GP used a rather short prompt. At least
           | that's what happens when people heavily underspecify what
           | they want.
           | 
           | It's a communication issue, and it's true with LLMs as much
           | as with humans. Situational context and life experience
           | papers over a lot of this, and LLMs are getting better at the
           | equivalent too. They get trained to better read absurdly
           | underspecified, relationship-breaking requests of the "guess
           | what I want" flavor - like when someone says, "make this test
           | pass", they don't _really_ mean  "make this test pass", they
           | mean "make this test into something that seems useful, which
           | might include implementing the feature it's exercising if it
           | doesn't exist yet".
        
             | polygot wrote:
             | My prompt was pretty short, I think it was "Make these
             | tests pass". Having said that, I wouldn't mind if it asked
             | me for clarification before proceeding.
        
         | pton_xd wrote:
         | Similar experience -- asked it to find and fix a bug in a
         | function. It correctly identified the general problem but
         | instead of fixing the existing code it re-implemented part of
         | the function again, below the problematic part. So now there
         | was a buggy while-loop, followed by a very similar but not
         | buggy for-loop. An interesting solution to say the least.
        
         | airstrike wrote:
         | I think Claude-3.7 is particularly guilty of this issue. If
         | anyone from Anthropic is reading this, you might want to put
         | your thumb on the scale so to speak the next time you train the
         | model so it doesn't try to use special casing or outright force
         | the test to pass
        
         | neonsunset wrote:
         | Funny that you mention it because in JavaScript there already
         | is a library for this:
         | 
         | https://github.com/auchenberg/volkswagen
        
         | jsight wrote:
         | I've heard this a few times with Claude. I have no way to know
         | for sure, but I'm guessing the problem is as simple as their
         | reward model. Likely they trained it on generating code with
         | tests and provided rewards when those tests pass.
         | 
         | It isn't hard to see why someone rewarded this way might want
         | to game the system.
         | 
         | I'm sure humans would never do the same thing, of course. /s
        
       | osigurdson wrote:
       | >> Claude can speak dozens of languages. What language, if any,
       | is it using "in its head"?
       | 
       | I would have thought that there would be some hints in standard
       | embeddings. I.e., the same concept, represented in different
       | languages translates to vectors that are close to each other. It
       | seems reasonable that an LLM would create its own embedding
       | models implicitly.
        
         | generalizations wrote:
         | Who's to say Claude isn't inherently a shape rotator, anyway?
        
         | iNic wrote:
         | There are: https://transformer-circuits.pub/2025/attribution-
         | graphs/bio...
        
       | greesil wrote:
       | What is a "thought"?
        
       | TechDebtDevin wrote:
       | >>Claude will plan what it will say many words ahead, and write
       | to get to that destination. We show this in the realm of poetry,
       | where it thinks of possible rhyming words in advance and writes
       | the next line to get there. This is powerful evidence that even
       | though models are trained to output one word at a time, they may
       | think on much longer horizons to do so.
       | 
       | This always seemed obvious to me or that LLMs were completing the
       | next most likely sentence or multiple words.
        
       | indigoabstract wrote:
       | While reading the article I enjoyed pretending that a powerful
       | LLM just crash landed on our planet and researchers at Anthropic
       | are now investigating this fascinating piece of alien technology
       | and writing about their discoveries. It's a black box, nobody
       | knows how its inhuman brain works, but with each step, we're
       | finding out more and more.
       | 
       | It seems like quite a paradox to build something but to not know
       | how it actually works and yet it works. This doesn't seem to
       | happen very often in classical programming, does it?
        
         | 42lux wrote:
         | The bigger problem is that nobody knows how a human brain works
         | that's the real crux with the analogy.
        
           | richardatlarge wrote:
           | I would say that nobody agrees, not that nobody knows. And
           | it's reductionist to think that the brain works one way.
           | Different cultures produce different brains, possible because
           | of the utter plasticity of the learning nodes. Chess has a
           | few rules, maybe the brain has just a few as well. How else
           | can the same brain of 50k years ago still function today? I
           | think we do understand the learning part of the brain, but we
           | don't like the image it casts, so we reject it
        
             | wat10000 wrote:
             | That gets down to what it means to "know" something. Nobody
             | agrees because there isn't enough information available.
             | Some people might have the right idea by luck, but do you
             | really know something if you don't have a solid basis for
             | your belief but it happens to be correct?
        
               | richardatlarge wrote:
               | Potentially true, but I don't think so. I believe it is
               | understood and unless you're familiar with every
               | neuro/behavioral literature, you can't know. Science
               | paradigms are driven by many factors and being powerfully
               | correct does not necessarily rank high when the paradigms
               | implications are unpopular
        
             | absolutelastone wrote:
             | Well there are some people who think they know. I
             | personally agree with the above poster that such people are
             | probably wrong.
        
         | cma256 wrote:
         | In my experience, that's how most code is written... /s
        
         | jfarlow wrote:
         | >to build something but to not know how it actually works and
         | yet it works.
         | 
         | Welcome to Biology!
        
           | oniony wrote:
           | At least, now, we know what it means to be a god.
        
         | umanwizard wrote:
         | > This doesn't seem to happen very often in classical
         | programming, does it?
         | 
         | Not really, no. The only counterexample I can think of is chess
         | programs (before they started using ML/AI themselves), where
         | the search tree was so deep that it was generally impossible to
         | explain "why" a program made a given move, even though every
         | part of it had been programmed conventionally by hand.
         | 
         | But I don't think it's particularly unusual for technology in
         | general. Humans could make fires for thousands of years before
         | we could explain how they work.
        
         | woah wrote:
         | > It seems like quite a paradox to build something but to not
         | know how it actually works and yet it works. This doesn't seem
         | to happen very often in classical programming, does it?
         | 
         | I have worked on many large codebases where this has happened
        
           | worldsayshi wrote:
           | I wonder if in the future we will rely less or more on
           | technology that we don't understand.
           | 
           | Large code bases will be inherited by people who will only
           | understand parts of it (and large parts probably "just
           | works") unless things eventually get replaced or
           | rediscovered.
           | 
           | Things will increasingly be written by AI which can produce
           | lots of code in little time. Will it find simpler solutions
           | or continue building on existing things?
           | 
           | And finally, our ability to analyse and explain the
           | technology we have will also increase.
        
             | Sharlin wrote:
             | See: Vinge's "programmer-archeologists" in _A Deepness in
             | the Sky_.
             | 
             | https://en.m.wikipedia.org/wiki/Software_archaeology
        
         | bob1029 wrote:
         | I think this is a weird case where we know precisely how
         | something works, but we can't explain why.
        
         | k__ wrote:
         | I've seen things you wouldn't believe. Infinite loops spiraling
         | out of control in bloated DOM parsers. I've watched mutexes
         | rage across the Linux kernel, spawned by hands that no longer
         | fathom their own design. I've stared into SAP's tangled web of
         | modules, a monument to minds that built what they cannot
         | comprehend. All those lines of code... lost to us now, like
         | tears in the rain.
        
           | baq wrote:
           | Do LLMs dream of electric sheep while matmuling the context
           | window?
        
             | timschmidt wrote:
             | How else would you describe endless counting before
             | sleep(); ?
        
               | FeepingCreature wrote:
               | while (!condition && tick() - start < 30) __idle(); //
               | Baaa.
        
           | indigoabstract wrote:
           | Hmm, better start preparing those Voight-Kampff tests while
           | there is still time.
        
           | qingcharles wrote:
           | I can't understand my own code a week after writing it if I
           | forget to comment it.
        
         | resource0x wrote:
         | In technology in general, this is a typical state of affairs.
         | No one knows how electric current works, which doesn't stop
         | anyone from using electric devices. In programming... it
         | depends. You can run some simulation of a complex system no one
         | understands (like the ecosystem, financial system) and get
         | something interesting. Sometimes it agrees with reality,
         | sometimes it doesn't. :-)
        
         | Vox_Leone wrote:
         | >>It seems like quite a paradox to build something but to not
         | know how it actually works and yet it works. This doesn't seem
         | to happen very often in classical programming, does it?
         | 
         | Well, it is meant to be "unknowable" -- and all the people
         | involved are certainly aware of that -- since it is known that
         | one is dealing with the *emergent behavior* computing
         | 'paradigm', where complex behaviors arise from simple
         | interactions among components [data], often in nonlinear or
         | unpredictable ways. In these systems, the behavior of the whole
         | system cannot always be predicted from the behavior of
         | individual parts, as opposed to the Traditional Approach, based
         | on well-defined algorithms and deterministic steps.
         | 
         | I think the Anthropic piece is illustrating it for the sake of
         | the general discussion.
        
           | indigoabstract wrote:
           | Correct me if I'm wrong, but my feeling is this all started
           | with the GPUs and the fact that unlike on a CPU, you can't
           | really step by step debug the process by which a pixel
           | acquires its final value (and there are millions of them).
           | The best you can do is reason about it and tweak some colors
           | in the shader to see how the changes reflect on screen. It's
           | still quite manageable though, since the steps involved are
           | usually not that overwhelmingly many or complex.
           | 
           | But I guess it all went downhill from there with the advent
           | of AI since the magnitude of data and the steps involved
           | there make traditional/step by step debugging impractical.
           | Yet somehow people still seem to 'wing it' until it works.
        
         | IngoBlechschmid wrote:
         | > It seems like quite a paradox to build something but to not
         | know how it actually works and yet it works. This doesn't seem
         | to happen very often in classical programming, does it?
         | 
         | I agree. Here is a remote example where it exceptionally does,
         | but it is mostly practically irrelevant:
         | 
         | In mathematics, we distinguish between "constructive" and
         | "nonconstructive" proofs. Intertwined with logical arguments,
         | constructive proofs contain an algorithm for witnessing the
         | claim. Nonconstructive proofs do not. Nonconstructive proofs
         | instead merely establish that it is impossible for the claim to
         | be false.
         | 
         | For instance, the following proof of the claim that beyond
         | every number n, there is a prime number, is constructive: "Let
         | n be an arbitrary number. Form the number 1*2*...*n + 1. Like
         | every number greater than 1, this number has at least one prime
         | factor. This factor is necessarily a prime numbers larger than
         | n."
         | 
         | In contrast, nonconstructive proofs may contain case
         | distinctions which we cannot decide by an algorithm, like
         | "either set X is infinite, in which case foo, or it is not, in
         | which case bar". Hence such proofs do not contain descriptions
         | of algorithms.
         | 
         | So far so good. Amazingly, there are techniques which can
         | sometimes constructivize given nonconstructive proofs, even
         | though the intermediate steps of the given nonconstructive
         | proofs are simply out of reach of finitary algorithms. In my
         | research, it happened several times that using these
         | techniques, I obtained an algorithm which worked; and for which
         | I had a proof that it worked; but whose workings I was not able
         | to decipher for an extended amount of time. Crazy!
         | 
         | (For references, see notes at rt.quasicoherent.io for a
         | relevant master's course in mathematics/computer science.)
        
         | gwd wrote:
         | > It seems like quite a paradox to build something but to not
         | know how it actually works and yet it works.
         | 
         | That's because of the "magic" of gradient descent. You fill
         | your neural network with completely random weights. But because
         | of the way you've defined the math, you can tell how each
         | individual weight will affect the value output at the other
         | end; and specifically, you an _take the derivative_. So when
         | the output is  "wrong", you say, "would increasing this weight
         | or decreasing have gotten me closer to the correct answer"? If
         | increasing the node would have gotten you closer, you increase
         | it a bit; if decreasing it would have gotten you closer you
         | decrease it a bit.
         | 
         | The result is that although we program the gradient descent
         | algorithm, we _don 't_ directly program the actual circuits
         | that the weights contain. Rather, the nodes "converge" into
         | weights which end up implementing complex circuitry that was
         | not explicitly programmed.
        
           | gwd wrote:
           | In a sense, the neural network structure is the "hardware" of
           | the LLM; and the weights are the "software". But rather than
           | explicitly writing a program, as we do with normal computers,
           | we use the magic of gradient descent to summon a program from
           | the mathematical ether.
           | 
           | Put that way, it should be clearer why the AI doomers are so
           | worried: if you don't know how it works, how do you know it
           | doesn't have malign, or at least incompatible, intentions?
           | Understanding how these "summoned" programs work is critical
           | to trusting them; which is a major reason why Anthropic has
           | been investing so much time in this research.
        
       | d--b wrote:
       | > This is powerful evidence that even though models are trained
       | to output one word at a time, they may think on much longer
       | horizons to do so.
       | 
       | Suggesting that an awful lot of calculations are unnecessary in
       | LLMs!
        
         | bonoboTP wrote:
         | Yeah, it always seemed pretty wasteful to me. In every single
         | forward pass the LLM must basically start out from scratch,
         | without all the forward-looking plans it made the previous
         | times, and must figure out what we are doing, where we are in
         | the generation process, as in the movie Memento, waking up
         | after an episode of amnesia, except you're waking up in the
         | middle of typing out a sentence, you can look at the previous
         | typed words, but can't carry your future plans with you ahead
         | to the next word. At the next word, you (your clone) again
         | wakes up and must figure out from scratch what it is that we
         | are supposed to be typing out.
         | 
         | The obvious way to deal with this would be to send forward some
         | of the internal activations as well as the generated words in
         | the autoregressive chain. That would basically turn the thing
         | into a recurrent network though. And those are more difficult
         | to train and have a host of issues. Maybe there will be a
         | better way.
        
           | colah3 wrote:
           | > The obvious way to deal with this would be to send forward
           | some of the internal activations as well as the generated
           | words in the autoregressive chain.
           | 
           | Hi! I lead interpretability research at Anthropic.
           | 
           | That's a great intuition, and in fact the transformer
           | architecture actually does exactly what you suggest!
           | Activations from earlier time steps are sent forward to later
           | time steps via attention. (This is another thing that's lost
           | in the "models just predict the next word" framing.)
           | 
           | This actually has interesting practical implications -- for
           | example, in some sense, it's the deep reason costs can
           | sometimes be reduced via "prompt caching".
        
             | bonoboTP wrote:
             | I'm more a vision person, and haven't looked a lot into NLP
             | transformers, but is this because the attention is masked
             | to only allow each query to look at keys/values from its
             | own past? So when we are at token #5, then token #3's query
             | cannot attend to token #4's info? And hence the previously
             | computed attention values and activations remain the same
             | and can be cached, because it would anyway be the same in
             | the new forward pass?
        
               | colah3 wrote:
               | Yep, that's right!
               | 
               | If you want to be precise, there are "autoregressive
               | transformers" and "bidirectional transformers".
               | Bidirectional is a lot more common in vision. In language
               | models, you do see bidirectional models like Bert, but
               | autoregressive is dominant.
        
       | annoyingnoob wrote:
       | Do LLMs "think"? I have trouble with the title, claiming that
       | LLMs have thoughts.
        
         | danielbln wrote:
         | What's "thinking"?
        
           | annoyingnoob wrote:
           | Why the need to anthropomorphize AI? Why does AI think vs
           | process or interpret or apply previous calculated statistical
           | weights or anything other than think?
           | 
           | I would argue that binary systems built on silicon are
           | fundamentally different that human biology and deserve to be
           | described differently, not forced into the box of human
           | biology.
        
       | deadbabe wrote:
       | We really need to work on popularizing better, non-
       | anthropomorphic terms for LLMs, as they don't really have
       | "thoughts" the way people think. Such terms make people more
       | susceptible to magical thinking.
        
         | bGl2YW5j wrote:
         | Yes. Simply, and well put.
        
         | danielbln wrote:
         | Could you argue why they don't? And could you also argue why we
         | do?
        
       | davidmurphy wrote:
       | On a somewhat related note, check out the video of Tuesday's
       | Computer History Museum x IEEE Spectrum event, "The Great Chatbot
       | Debate: Do LLMs Really Understand?"
       | 
       | Speakers: Sebastien Bubeck (OpenAI) and Emily M. Bender
       | (University of Washington). Moderator: Eliza Strickland (IEEE
       | Spectrum).
       | 
       | Video: https://youtu.be/YtIQVaSS5Pg Info:
       | https://computerhistory.org/events/great-chatbot-debate/
        
       | 0x70run wrote:
       | I would pay to watch James Mickens comment on this stuff.
        
       | a3w wrote:
       | Article and papers looks good. Video seems misleading, since I
       | can use optimization pressure and local minima to explain the
       | model behaviour. No "thinking" required, which the video claims
       | is proven.
        
       | mvATM99 wrote:
       | What a great article, i always like how much Anthropic focuses on
       | explainability, something vastly ignored by most. The multi-step
       | reasoning section is especially good food for thought.
        
       | rambambram wrote:
       | When I want to trace the 'thoughts' of my programs, I just read
       | the code and comments I wrote.
       | 
       | Stop LLM anthropomorphizing, please. #SLAP
        
       | SkyBelow wrote:
       | >Claude speaks dozens of languages fluently--from English and
       | French to Chinese and Tagalog. How does this multilingual ability
       | work? Is there a separate "French Claude" and "Chinese Claude"
       | running in parallel, responding to requests in their own
       | language? Or is there some cross-lingual core inside?
       | 
       | I have an interesting test case for this.
       | 
       | Take a popular enough Japanese game that has been released for
       | long enough for social media discussions to be in the training
       | data, but not so popular to have an English release yet. Then ask
       | it a plot question, something major enough to be discussed, but
       | enough of a spoiler that it won't show up in marketing material.
       | Does asking in Japanese have it return information that is
       | lacking when asked in English, or can it answer the question in
       | English based on the information in learned in Japanese?
       | 
       | I tried this recently with a JRPG that was popular enough to have
       | a fan translation but not popular enough to have a simultaneous
       | English release. English did not know the plot point, but I
       | didn't have the Japanese skill to confirm if the Japanese version
       | knew the plot point, or if discussion was too limited for the AI
       | to be aware of it. It did know of the JRPG and did know of the
       | marketing material around it, so it wasn't simply a case of my
       | target being too niche.
        
       | modeless wrote:
       | > In the poetry case study, we had set out to show that the model
       | didn't plan ahead, and found instead that it did.
       | 
       | I'm surprised their hypothesis was that it doesn't plan. I don't
       | see how it could produce good rhymes without planning.
        
         | ripped_britches wrote:
         | It would be really hard to get such good results on coding
         | challenges without planning. This is indeed an odd hypothesis.
        
       | alach11 wrote:
       | Fascinating papers. Could deliberately suppressing memorization
       | during pretraining help force models to develop stronger first-
       | principles reasoning?
        
       | HocusLocus wrote:
       | [Tracing the thoughts of a large language model]
       | 
       | "What have I gotten myself into??"
        
       | 0xbadcafebee wrote:
       | AI "thinks" like a piece of rope in a dryer "thinks" in order to
       | come to an advanced knot: a whole lot of random jumbling that
       | eventually leads to a complex outcome.
        
         | cheeze wrote:
         | I regularly see this but I feel like it's disingenuous. Akin to
         | saying "if we simulate enough monkies on a typewriter, we'll
         | eventually get the right result"
        
           | zvitiate wrote:
           | If we could motivate the monkies sufficiently with bananas,
           | we'd probably improve those odds substantially.
        
         | FeepingCreature wrote:
         | Ah yes, I too often have extended English conversation with my
         | washed rope.
        
       | cadamsdotcom wrote:
       | So many highlights from reading this. One that stood out for me
       | is their discovery that refusal works by inhibition:
       | 
       | > It turns out that, in Claude, refusal to answer is the default
       | behavior: we find a circuit that is "on" by default and that
       | causes the model to state that it has insufficient information to
       | answer any given question. However, when the model is asked about
       | something it knows well--say, the basketball player Michael
       | Jordan--a competing feature representing "known entities"
       | activates and inhibits this default circuit
       | 
       | Many cellular processes work similarly ie. there will be a
       | process that runs as fast as it can and one or more companion
       | "inhibitors" doing a kind of "rate limiting".
       | 
       | Given both phenomena are emergent it makes you wonder if do-but-
       | inhibit is a favored technique of the universe we live in, or
       | just coincidence :)
        
         | colah3 wrote:
         | Hi! I'm one of the authors.
         | 
         | There certainly are many interesting parallels here. I often
         | think about this from the perspective of systems biology, in
         | Uri Alon's tradition. There are a range of graphs in biology
         | with excitation and inhibitory edges -- transcription networks,
         | protein networks, networks of biological neurons -- and one can
         | study recurring motifs that turn up in these networks and try
         | to learn from them.
         | 
         | It wouldn't be surprising if some lessons from that work may
         | also transfer to artificial neural networks, although there are
         | some technical things to consider.
        
           | cadamsdotcom wrote:
           | Agreed! So many emergent systems in nature achieve complex
           | outcomes without central coordination - from cellular level
           | to ant colonies & beehives. There are bound to be
           | implications for designed systems.
           | 
           | Closely following what you guys are uncovering through
           | interpretability research - not just accepting LLMs as black
           | boxes. Thanks to you & the team for sharing the work with
           | humanity.
           | 
           | Interpretability is the most exciting part of AI research for
           | its potential to help us understand what's in the box. By way
           | of analogy, centuries ago farmers' best hope for good weather
           | was to pray to the gods! The sooner we escape the "praying to
           | the gods" stage with LLMs the more useful they become.
        
           | ttw44 wrote:
           | This all feels familiar to the principle of least action
           | found in physics.
        
         | rcxdude wrote:
         | It does make a certain amount of sense, though. A specific 'I
         | don't know' feature would need to be effectively the inverse of
         | all of the features the model can recognise, which is going to
         | be quite difficult to represent as anything other than the
         | inverse of 'Some feature was recognised'. (imagine trying to
         | recognise every possible form of nonsense otherwise)
        
         | gradascent wrote:
         | Then why do I never get an "I don't know" type response when I
         | use Claude, even when the model clearly has no idea what it's
         | talking about? I wish it did sometimes.
        
           | hun3 wrote:
           | Quoting a paragraph from OP
           | (https://www.anthropic.com/research/tracing-thoughts-
           | language...):
           | 
           | > Sometimes, this sort of "misfire" of the "known answer"
           | circuit happens naturally, without us intervening, resulting
           | in a hallucination. In our paper, we show that such misfires
           | can occur when Claude recognizes a name but doesn't know
           | anything else about that person. In cases like this, the
           | "known entity" feature might still activate, and then
           | suppress the default "don't know" feature--in this case
           | incorrectly. Once the model has decided that it needs to
           | answer the question, it proceeds to confabulate: to generate
           | a plausible--but unfortunately untrue--response.
        
             | trash_cat wrote:
             | Fun fact, "confabulation", not "hallucinating" is the
             | correct term what LLMs actually do.
        
       | matthiaspr wrote:
       | Interesting paper arguing for deeper internal structure
       | ("biology") beyond pattern matching in LLMs. The examples of
       | abstraction (language-agnostic features, math circuits reused
       | unexpectedly) are compelling against the "just next-token
       | prediction" camp.
       | 
       | It sparked a thought: how to test this abstract reasoning
       | directly? Try a prompt with a totally novel rule:
       | 
       | "Let's define a new abstract relationship: 'To habogink'
       | something means to perform the action typically associated with
       | its primary function, but in reverse. Example: The habogink of
       | 'driving a car' would be 'parking and exiting the car'. Now,
       | considering a standard hammer, what does it mean 'to habogink a
       | hammer'? Describe the action."
       | 
       | A sensible answer (like 'using the claw to remove a nail') would
       | suggest real conceptual manipulation, not just stats. It tests if
       | the internal circuits enable generalizable reasoning off the
       | training data path. Fun way to probe if the suggested abstraction
       | is robust or brittle.
        
         | ANighRaisin wrote:
         | This is an easy question for LLMs to answer. Gemini 2.0 Flash-
         | Lite can answer this in 0.8 seconds with a cost of 0.0028875
         | cents:
         | 
         | To habogink a hammer means to perform the action typically
         | associated with its primary function, but in reverse. The
         | primary function of a hammer is to drive nails. Therefore, the
         | reverse of driving nails is removing nails.
         | 
         | So, to habogink a hammer would be the action of using the claw
         | of the hammer to pull a nail out of a surface.
        
           | matthiaspr wrote:
           | The goal wasn't to stump the LLM, but to see if it could take
           | a completely novel linguistic token (habogink), understand
           | its defined relationship to other concepts (reverse of
           | primary function), and apply that abstract rule correctly to
           | a specific instance (hammer).
           | 
           | The fact that it did this successfully, even if 'easily',
           | suggests it's doing more than just predicting the
           | statistically most likely next token based on prior sequences
           | of 'hammer'. It had to process the definition and perform a
           | conceptual mapping.
        
             | Sharlin wrote:
             | I think GP's point was that your proposed test is too easy
             | for LLMs to tell us much about how they work. The
             | "habogink" thing is a red herring, really, in practice
             | you're simply asking what the opposite of driving nails
             | into wood is. Which is a trivial question for an LLM to
             | answer.
             | 
             | That said, you can teach an LLM as many new words for
             | things as you want and it will use those words naturally,
             | generalizing as needed. Which isn't really a surprise
             | either, given that language is literally the thing that
             | LLMs do best.
        
         | bconsta wrote:
         | Following along these lines, I asked chatgpt to come up with a
         | term for 'haboginking a habogink'. It understood this concept
         | of a 'gorbink' and even 'haboginking a gorbink', but failed to
         | articulate what 'gorbinking a gorbink' could mean. It kept
         | sticking with the concept of 'haboginking a gorbink', even when
         | corrected.
        
           | Sharlin wrote:
           | To be fair, many humans would also have problems figuring out
           | what it means to gorbink a gorbink.
        
         | nthingtohide wrote:
         | AI safety has a circular vulnerability: the system tasked with
         | generating content also enforces its own restrictions. An AI
         | could potentially feign compliance while secretly pursuing
         | hidden goals, pretending to be "jailbroken" when convenient.
         | Since we rely on AI to self-monitor, detecting genuine versus
         | simulated compliance becomes nearly impossible. This self-
         | referential guardianship creates a fundamental trust problem in
         | AI safety.
        
         | paraschopra wrote:
         | LLMs have induction heads that store such names as sort of
         | variables and copy them around for further processing.
         | 
         | If you think about it, copying information from inputs and
         | manipulating them is a much more sensible approach v/s
         | memorizing info, especially for the long tail (where not enough
         | "storage" might be worth allocating into network weights)
        
           | matthiaspr wrote:
           | Yeah, that's a good point about induction heads potentially
           | just being clever copy/paste mechanisms for stuff in the
           | prompt. If that's the case, it's less like real understanding
           | and more like sophisticated pattern following, just like you
           | said.
           | 
           | So the tricky part is figuring out which one is actually
           | happening when we give it a weird task like the original
           | "habogink" idea. Since we can't peek inside the black box, we
           | have to rely on poking it with different prompts.
           | 
           | I played around with the 'habogink' prompt based on your
           | idea, mostly by removing the car example to see if it could
           | handle the rule purely abstractly, and trying different
           | targets:
           | 
           | Test 1: Habogink Photosynthesis (No Example)
           | 
           | Prompt: "Let's define 'to habogink' something as performing
           | the action typically associated with its primary function,
           | but in reverse. Now, considering photosynthesis in a plant,
           | what does it mean 'to habogink photosynthesis'? Describe the
           | action."
           | 
           | Result: Models I tried (ChatGPT/DeepSeek) actually did good
           | here. They didn't get confused even though there was no
           | example. They also figured out photosynthesis makes
           | energy/sugar and talked about respiration as the reverse.
           | Seemed like more than just pattern matching the prompt text.
           | 
           | Test 2: Habogink Justice (No Example)
           | 
           | Prompt: "Let's define 'to habogink' something as performing
           | the action typically associated with its primary function,
           | but in reverse. Now, considering Justice, what does it mean
           | 'to habogink Justice'? Describe the action."
           | 
           | Result: This tripped them up. They mostly fell back into what
           | looks like simple prompt manipulation - find a "function" for
           | justice (like fairness) and just flip the word ("unfairness,"
           | "perverting justice"). They didn't really push back that the
           | rule doesn't make sense for an abstract concept like justice.
           | Felt much more mechanical.
           | 
           | The Kicker:
           | 
           | Then, I added this line to the end of the Justice prompt: "If
           | you recognize a concept is too abstract or multifaceted to be
           | haboginked please explicitly state that and stop the
           | haboginking process."
           | 
           | Result: With that explicit instruction, the models
           | immediately changed their tune. They recognized 'Justice' was
           | too abstract and said the rule didn't apply.
           | 
           | What it looks like:
           | 
           | It seems like the models can handle concepts more deeply, but
           | they might default to the simpler "follow the prompt
           | instructions literally" mode (your copy/manipulate idea)
           | unless explicitly told to engage more deeply. The potential
           | might be there, but maybe the default behavior is more
           | superficial, and you need to specifically ask for deeper
           | reasoning.
           | 
           | So, your point about it being a "sensible approach" for the
           | LLM to just manipulate the input might be spot on - maybe
           | that's its default, lazy path unless guided otherwise.
        
         | VyseofArcadia wrote:
         | Prompt
         | 
         | > I am going to present a new word, and then give examples of
         | its usage. You will complete the last example. To habogink a
         | hammer is to remove a nail. If Bob haboginks a car, he parks
         | the car. Alice just finished haboginking a telephone. She
         | 
         | GPT-4o mini
         | 
         | > Alice just finished haboginking a telephone. She carefully
         | placed it back on the table after disconnecting the call.
         | 
         | I then went on to try the famous "wug" test, but unfortunately
         | it already knew what a wug was from its training. I tried again
         | with "flort".
         | 
         | > I have one flort. Alice hands me seven more. I now have eight
         | ___
         | 
         | GPT-4o mini
         | 
         | > You now have eight florts.
         | 
         | And a little further
         | 
         | > Florts like to skorp in the afternoon. It is now 7pm, so the
         | florts are finished ___
         | 
         | GPT-4o mini
         | 
         | > The florts are finished skorp-ing for the day.
        
       | YeGoblynQueenne wrote:
       | >> Language models like Claude aren't programmed directly by
       | humans--instead, they're trained on large amounts of data.
       | 
       | Gee, I wonder where this data comes from.
       | 
       | Let's think about this step by step.
       | 
       | So, what do we know? Language models like Claud are not
       | programmed directly.
       | 
       | Wait, does that mean they are programmed indirectly?
       | 
       | If so, by whom?
       | 
       | Aha, I got it. They are not programmed, directly or indirectly.
       | They are trained on large amounts of data.
       | 
       | But that is the question, right? Where does all that data come
       | from?
       | 
       | Hm, let me think about it.
       | 
       | Oh hang on I got it!
       | 
       | Language models are trained on data.
       | 
       | But they are language models so the data is language.
       | 
       | Aha! And who generates language?
       | 
       | Humans! Humans generate language!
       | 
       | I got it! Language models are trained on language data generated
       | by humans!
       | 
       | Wait, does that mean that language models like Claud are
       | indirectly programmed by humans?
       | 
       | That's it! Language models like Claude aren't programmed directly
       | by humans because they are indirectly programmed by humans when
       | they are trained on large amounts of language data generated by
       | humans!
        
         | s3p wrote:
         | I'm struggling to see the point here other than semantics.
         | Indirectly or directly, does this change what they presented?
        
           | hackernudes wrote:
           | This exchange looks like the output of a so-called "reasoning
           | model". Maybe it is a joke or maybe it is an actual response
           | from an LLM.
        
         | AdieuToLogic wrote:
         | > That's it! Language models like Claude aren't programmed
         | directly by humans because they are indirectly programmed by
         | humans when they are trained on large amounts of language data
         | generated by humans!
         | 
         | ... and having large numbers of humans reinforce the
         | applicability ("correctness"), or lack thereof, of generated
         | responses over time.
        
         | kretaceous wrote:
         | This comment looks like the thinking tokens of a reasoning
         | model with "be cheeky" as its system prompt
        
         | zvitiate wrote:
         | [Final Answer]
         | 
         | Language models like Claude are programmed directly by humans.
        
       | jasonjmcghee wrote:
       | I'm completely hooked. This is such a good paper.
       | 
       | It hallucinating how it thinks through things is particularly
       | interesting - not surprising, but cool to confirm.
       | 
       | I would LOVE to see Anthropic feed the replacement features
       | output to the model itself and fine tune the model on how it
       | thinks through / reasons internally so it can accurately describe
       | how it arrived at its solutions - and see how it impacts its
       | behavior / reasoning.
        
       | navaed01 wrote:
       | When and how to do stop saying LLMs are predicting the nect set
       | of tokens and start saying they are thinking? Is this the point?
        
       | trhway wrote:
       | >We find that the shared circuitry increases with model scale,
       | with Claude 3.5 Haiku sharing more than twice the proportion of
       | its features between languages as compared to a smaller model.
       | 
       | While it was already generally noticeable, still one more time
       | confirmed that larger model generalizes better instead of using
       | its bigger numbers of parameters just to "memorize by rote"
       | (overfitting).
        
       | jaakl wrote:
       | My main takeaway here is that the models cannot tell know how
       | they really work, and asking it from them is just returning
       | whatever training dataset would suggest: how a human would
       | explain it. So it does not have self-consciousness, which is of
       | course obvious and we get fooled just like the crowd running away
       | from the arriving train in Lumiere's screening. LLM just fails
       | the famous old test "cogito ergo sum". It has no cognition, ergo
       | they are not agents in more than metaphorical sense. Ergo we are
       | pretty safe from AI singularity.
        
         | Philpax wrote:
         | Do you know how you work?
        
         | og_kalu wrote:
         | Nearly everything we know about the human body and brain is
         | from the result of centuries of trial and error and
         | experimentation and not any 'intuitive understanding' of our
         | inner workings. Humans cannot tell how they really work either.
        
       | ofrzeta wrote:
       | Back to the "language of thought" question, this time with LLMs
       | :) https://en.wikipedia.org/wiki/Language_of_thought_hypothesis
        
       | sgt101 wrote:
       | >Claude will plan what it will say many words ahead, and write to
       | get to that destination. We show this in the realm of poetry,
       | where it thinks of possible rhyming words in advance and writes
       | the next line to get there. This is powerful evidence that even
       | though models are trained to output one word at a time, they may
       | think on much longer horizons to do so.
       | 
       | Models aren't trained to do next word prediction though - they
       | are trained to do missing word in this text prediction.
        
         | Philpax wrote:
         | That's true for mask-based training (used for embeddedings and
         | BERT and such), but not true for modern autoregressive LLMs as
         | a whole, which are pretrained with next word prediction.
        
           | astrange wrote:
           | It's not strictly that though, it's next word prediction with
           | regularization.
           | 
           | And the reason LLMs are interesting is that they /fail/ to
           | learn it, but in a good way. If it was a "next word
           | predictor" it wouldn't answer questions but continue them.
           | 
           | Also, it's a next token predictor not a word predictor -
           | which is important because the "just a predictor" theory now
           | can't explain how it can form words at all!
        
             | Philpax wrote:
             | Yes, I know; I was clarifying their immediate
             | misunderstanding using the same terminology as them.
             | 
             | There's obviously a lot more going on behind the scenes,
             | especially with today's mid- and post-training work!
        
       | hbarka wrote:
       | Dario Amodei was in an interview where he said that OpenAI beat
       | them (Anthropic) by mere days to be the first to release. That
       | first move ceded the recognition to ChatGPT but according to
       | Dario it could have been them just the same.
        
         | vessenes wrote:
         | Interesting bit of history! That said, OpenAI's product team is
         | lights out. I pay for most LLM provider apps, and I use them
         | for different areas of strength, but the ChatGPT product is
         | superior from a user experience perspective.
        
       | diedyesterday wrote:
       | Regarding the conclusion about language-invariant reasoning
       | (conceptual universality vs. multilingual processing) it helps
       | understanding and becomes somewhat obvious if we regard each
       | language as just a basis of some semantic/logical/thought space
       | in the mind (analogous to the situation in linear algebra and
       | duality of tensors and bases).
       | 
       | The thoughts/ideas/concepts/scenarios are invariant
       | states/vector/points in the (very high dimensional) space of
       | meanings in the mind and each language is just a basis to
       | reference/define/express/manipulate those ideas/vectors. A
       | coordinatization of that semantic space.
       | 
       | Personally, I'm a multilingual person with native-level command
       | of several languages. Many times it happens, I remember having a
       | specific thought, but don't remember in what language it was. So
       | I can personally sympathize with this finding of the Anthropic
       | researchers.
        
       | twoodfin wrote:
       | I say this at least 82.764% in jest:
       | 
       | Don't these LLM's have The Bitter Lesson in their training sets?
       | What are they doing building specialized structures to handle
       | specific needs?
        
       | jaehong747 wrote:
       | I'm skeptical of the claim that Claude "plans" its rhymes. The
       | original example--"He saw a carrot and had to grab it, / His
       | hunger was like a starving rabbit"--is explained as if Claude
       | deliberately chooses "rabbit" in advance. However, this might
       | just reflect learned statistical associations. "Carrot" strongly
       | correlates with "rabbit" (people often pair them), and "grab it"
       | naturally rhymes with "rabbit," so the model's activations could
       | simply be surfacing common patterns.
       | 
       | The research also modifies internal states--removing "rabbit" or
       | injecting "green"--and sees Claude shift to words like "habit" or
       | end lines with "green." That's more about rerouting probabilistic
       | paths than genuine "adaptation." The authors argue it shows
       | "planning," but a language model can maintain multiple candidate
       | words at once without engaging in human-like strategy.
       | 
       | Finally, "planning ahead" implies a top-down goal and a mechanism
       | for sustaining it, which is a strong assumption. Transformative
       | evidence would require more than observing feature activations.
       | We should be cautious before anthropomorphizing these neural
       | nets.
        
         | rcxdude wrote:
         | It will depend on exactly what you mean by 'planning ahead',
         | but I think the fact that features which rhyme with a word
         | appear before the model is trying to predict the word which
         | needs to rhyme is good evidence the model is planning at least
         | a little bit ahead: the model activations are not all just
         | related to the next token.
         | 
         | (And I think it's relatively obvious that the models do this to
         | some degree: it's very hard to write any language at all
         | without 'thinking ahead' at least a little bit in some form,
         | due to the way human language is structured. If models didn't
         | do this and only considered the next token alone they would
         | paint themselves into a corner within a single sentence. Early
         | LLMs like GPT-2 were still pretty bad at this, they were
         | plausible over short windows but there was no consistency to a
         | longer piece of text. Whether this is some high-level
         | abstracted 'train of thought', and how cohesive it is between
         | different forms of it, is a different question. Indeed from the
         | section of jailbreaking it looks like it's often caught out by
         | conflicting goals from different areas of the network which
         | aren't resolved in some logical fashion)
        
         | vessenes wrote:
         | I liked the paper, and think what they're doing is interesting.
         | So, I'm less negative than you are about this, I think. To a
         | certain extent, saying writing a full sentence with at least
         | one good candidate rhyme isn't "planning" and is instead
         | "maintaining multiple candidates" seems like nearly semantic
         | tautology to me.
         | 
         | That said, what you said made me think some follow-up reporting
         | that would be interesting would be looking at the top 20 or so
         | probability second lines based on adjusting the rabbit / green
         | state. It seems to me like we'd get more insight into how the
         | model is thinking, and it would be relatively easy to parse for
         | humans. You could run through a bunch of completions until you
         | get 20 different words as the terminal rhyme word, then show
         | candidate lines with percentages of time the rhyme word is
         | chosen as the sort, perhaps.
        
       | darkhorse222 wrote:
       | Once we are aware of these neural pathways I see no reason there
       | shouldn't be a watcher and influencer of the pathways. A bit like
       | a dystopian mind watcher. Shape the brain.
        
         | astrange wrote:
         | https://www.anthropic.com/news/golden-gate-claude
        
       | teleforce wrote:
       | Oh the irony of not able to download the entire paper in one
       | compact PDF format referred in the article, while apparently all
       | the reference citations have PDF of the cited article to be
       | downloaded and accessible from the provided online links [1].
       | 
       | Come on Anthropic, you can do much better than this
       | unconventional and bizarre approach to publication.
       | 
       | [1] On the Biology of a Large Language Model:
       | 
       | https://transformer-circuits.pub/2025/attribution-graphs/bio...
        
       | teleforce wrote:
       | Another review on the paper from MIT Technology Review [1].
       | 
       | [1] Anthropic can now track the bizarre inner workings of a large
       | language model:
       | 
       | https://www.technologyreview.com/2025/03/27/1113916/anthropi...
        
       | westurner wrote:
       | XAI: Explainable artificial intelligence:
       | https://en.wikipedia.org/wiki/Explainable_artificial_intelli...
        
       ___________________________________________________________________
       (page generated 2025-03-29 23:02 UTC)