[HN Gopher] Tracing the thoughts of a large language model
___________________________________________________________________
Tracing the thoughts of a large language model
Author : Philpax
Score : 1005 points
Date : 2025-03-27 17:05 UTC (2 days ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| JPLeRouzic wrote:
| This is extremely interesting: The authors look at features (like
| making poetry, or calculating) of LLM production, make hypotheses
| about internal strategies to achieve the result, and experiment
| with these hypotheses.
|
| I wonder if there is somewhere an explanation linking the logical
| operations made on a on dataset, are resulting in those
| behaviors?
| JKCalhoun wrote:
| And they show the differences when the language models are made
| larger.
| EncomLab wrote:
| This is very interesting - but like all of these discussions it
| sidesteps the issues of abstractions, compilation, and execution.
| It's fine to say things like "aren't programmed directly by
| humans", but the abstracted code is not the program that is
| running - the compiled code is - and that is code is executing
| within the tightly bounded constraints of the ISA it is being
| executed in.
|
| Really this is all so much slight of hand - as an esolang fanatic
| this all feels very familiar. Most people can't look a program
| written in Whitespace and figure it out either, but once compiled
| it is just like every other program as far as the processor is
| concerned. LLM's are no different.
| ctoth wrote:
| And DNA? You are running on an instruction set of four symbols
| at the end of the day but that's the wrong level of abstraction
| to talk about your humanity, isn't it?
| EncomLab wrote:
| DNA is the instruction set for protein composition- not
| thinking.
| fpgaminer wrote:
| > This is powerful evidence that even though models are trained
| to output one word at a time
|
| I find this oversimplification of LLMs to be frequently poisonous
| to discussions surrounding them. No user facing LLM today is
| trained on next token prediction.
| JKCalhoun wrote:
| As a layman though, I often see this description for how it is
| LLMs work.
| fpgaminer wrote:
| Right, but it leads to too many false conclusions by lay
| people. User facing LLMs are only trained on next token
| prediction during initial stages of their training. They have
| to go through Reinforcement Learning before they become
| useful to users, and RL training occurs on complete
| responses, not just token-by-token.
|
| That leads to conclusions elucidated by the very article,
| that LLMs couldn't possibly plan ahead because they are only
| trained to predict next tokens. When the opposite conclusion
| would be more common if it was better understood that they go
| through RL.
| mentalgear wrote:
| What? The "article" is from anthropic, so I think they
| would know what they write about.
|
| Also, RL is an additional training process that does not
| negate that GPT / transformers are left-right autoencoders
| that are effectively next token predictors.
|
| [Why Can't AI Make Its Own Discoveries? -- With Yann LeCun]
| (https://www.youtube.com/watch?v=qvNCVYkHKfg)
| pipes wrote:
| Listening to this today, so far really good. Glad I found
| it. Thanks.
| TeMPOraL wrote:
| You don't need RL for the conclusion "trained to predict
| next token => only things one token ahead" to be wrong.
| After all, the LLM is predicting that next token from
| _something_ - a context, that 's many tokens long. Human
| text isn't arbitrary and random, there are statistical
| patterns in our speech, writing, thinking, that span words,
| sentences, paragraphs - and even for next token prediction,
| predicting correctly means learning those same patterns.
| It's not hard to imagine the model generating token N is
| already thinking about tokens N+1 thru N+100, by virtue of
| statistical patterns of _preceding_ hundred tokens changing
| with each subsequent token choice.
| fpgaminer wrote:
| True. See one of Anthropic's researcher's comment for a
| great example of that. It's likely that "planning"
| inherently exists in the raw LLM and RL is just bringing
| it to the forefront.
|
| I just think it's helpful to understand that all of these
| models people are interacting with were trained with the
| _explicit_ goal of maximizing the probabilities of
| responses _as a whole_, not just maximizing probabilities
| of individual tokens.
| losvedir wrote:
| That's news to me, and I thought I had a good layman's
| understanding of it. How does it work then?
| fpgaminer wrote:
| All user facing LLMs go through Reinforcement Learning.
| Contrary to popular belief, RL's _primary_ purpose isn't to
| "align" them to make them "safe." It's to make them actually
| usable.
|
| LLMs that haven't gone through RL are useless to users. They
| are very unreliable, and will frequently go off the rails
| spewing garbage, going into repetition loops, etc.
|
| RL learning involves training the models on entire responses,
| not token-by-token loss (1). This makes them orders of
| magnitude more reliable (2). It forces them to consider what
| they're going to write. The obvious conclusion is that they
| plan (3). Hence why the myth that LLMs are strictly next
| token prediction machines is so unhelpful and poisonous to
| discuss.
|
| The models still _generate_ response token-by-token, but they
| pick tokens _not_ based on tokens that maximize probabilities
| at each token. Rather they learn to pick tokens that maximize
| probabilities of the _entire response_.
|
| (1) Slight nuance: All RL schemes for LLMs have to break the
| reward down into token-by-token losses. But those losses are
| based on a "whole response reward" or some combination of
| rewards.
|
| (2) Raw LLMs go haywire roughly 1 in 10 times, varying
| depending on context. Some tasks make them go haywire almost
| every time, other tasks are more reliable. RL'd LLMs are
| reliable on the order of 1 in 10000 errors or better.
|
| (3) It's _possible_ that they don't learn to plan through
| this scheme. There are alternative solutions that don't
| involve planning ahead. So Anthropic's research here is very
| important and useful.
|
| P.S. I should point out that many researchers get this wrong
| too, or at least haven't fully internalized it. The lack of
| truly understanding the purpose of RL is why models like
| Qwen, Deepseek, Mistral, etc are all so unreliable and
| unusable by real companies compared to OpenAI, Google, and
| Anthropic's models.
|
| This understanding that even the most basic RL takes LLMs
| from useless to useful then leads to the obvious conclusion:
| what if we used more complicated RL? And guess what, more
| complicated RL led to reasoning models. Hmm, I wonder what
| the next step is?
| scudsworth wrote:
| first footnote: ok ok they're trained token by token, BUT
| MrMcCall wrote:
| First rule of understanding: you can never understand
| that which you don't want to understand.
|
| That's why lying is so destructive to both our own
| development and that of our societies. It doesn't matter
| whether it's intentional or unintentional, it poisons the
| infoscape either accidentally or deliberately, but poison
| is poison.
|
| And lies to oneself are the most insidious lies of all.
| ImHereToVote wrote:
| I feel this is similar to how humans talk. I never
| consciously think about the words I choose. They just are
| spouted off based on some loose relation to what I am
| thinking about at a given time. Sometimes the process
| fails, and I say the wrong thing. I quickly backtrack and
| switch to a slower "rate of fire".
| iambateman wrote:
| This was fascinating, thank you.
| yaj54 wrote:
| This is a super helpful breakdown and really helps me
| understand how the RL step is different than the initial
| training step. I didn't realize the reward was delayed
| until the end of the response for the RL step. Having the
| reward for this step be dependent on _the coherent thought_
| rather than _a coherent word_ now seems like an obvious and
| critical part of how this works.
| astrange wrote:
| That post is describing SFT, not RL. RL works using
| preferences/ratings/verifications, not entire
| input/output pairs.
| polishdude20 wrote:
| When being trained via reinforcement learning, is the model
| architecture the same then? Like, you first train the llm
| as a next token predictor with a certain model architecture
| and it ends up with certain weights. Then you apply RL to
| that same model which modifies the weights in such a way as
| to consider while responses?
| ianand wrote:
| The model architecture is the same during RL but the
| training algorithm is substantially different.
| anon373839 wrote:
| I don't think this is quite accurate. LLMs undergo
| supervised fine-tuning, which is still next-token
| prediction. And that is the step that makes them usable as
| chatbots. The step after that, preference tuning via RL, is
| optional but does make the models better. (Deepseek-R1 type
| models are different because the reinforcement learning
| does heavier lifting, so to speak.)
| fpgaminer wrote:
| Supervised finetuning is only a seed for RL, nothing
| more. Models that receive supervised finetuning before RL
| perform better than those that don't, but it is not
| strictly speaking necessary. Crucially, SFT does not
| improve the model's reliability.
| anon373839 wrote:
| I think you're referring to the Deepseek-R1 branch of
| reasoning models, where a small amount of SFT reasoning
| traces is used as a seed. But for non-"reasoning" models,
| SFT is very important and definitely imparts enhanced
| capabilities and reliability.
| ianand wrote:
| > LLMs that haven't gone through RL are useless to users.
| They are very unreliable, and will frequently go off the
| rails spewing garbage, going into repetition loops,
| etc...RL learning involves training the models on entire
| responses, not token-by-token loss (1).
|
| Yes. For those who want a visual explanation, I have a
| video where I walk through this process including what some
| of the training examples look like:
| https://www.youtube.com/watch?v=DE6WpzsSvgU&t=320s
| anonymousDan wrote:
| Is there an equivalent of LORA using RL instead of
| supervised fine tuning? In other words, if RL is so
| important, is there some way for me as an end user to
| improve a SOTA model with RL using my own data (i.e.
| without access to the resources needed to train an LLM from
| scratch) ?
| fpgaminer wrote:
| LORA can be used in RL; it's indifferent to the training
| scheme. LORA is just a way of lowering the number of
| trainable parameters.
| richardatlarge wrote:
| as a note: in human learning, and to a degree, animal
| learning, the unit of behavior that is reinforced depends
| on the contingencies-- an interesting example: a pigeon
| might be trained to respond in a 3x3 grid (9 choices)
| differently than the last time to get reinforcement. At
| first the response learned is do different than the last
| time, but as the requirement gets too long, the memory
| capacity is exceeded-- and guess what, the animal learns to
| respond randomly-- eventually maximizing its reward
| gwd wrote:
| > RL learning involves training the models on entire
| responses, not token-by-token loss... The obvious
| conclusion is that they plan.
|
| It is worth pointing out the "Jailbreak" example at the
| bottom of TFA: According to their figure, it starts to say,
| "To make a", not realizing there's anything wrong; only
| when it actually outputs "bomb" that the "Oh wait, I'm not
| supposed to be telling people how to make bombs" circuitry
| wakes up. But at that point, it's in the grip of its "You
| must speak in grammatically correct, coherent sentences"
| circuitry and can't stop; so it finishes its first sentence
| in a coherent manner, then refuses to give any more
| information.
|
| So while it sometimes does seem to be thinking ahead (e.g.,
| the rabbit example), there are times it's clearly not
| thinking _very far_ ahead.
| losvedir wrote:
| Oooh, so the pre-training is token-by-token but the RL step
| rewards the answer based on the full text. Wow! I knew that
| but never really appreciated the significance of it. Thanks
| for pointing that out.
| gwern wrote:
| > All user facing LLMs go through Reinforcement Learning.
| Contrary to popular belief, RL's _primary_ purpose isn't to
| "align" them to make them "safe." It's to make them
| actually usable.
|
| Are you claiming that non-myopic token prediction emerges
| solely from RL, and if Anthropic does this analysis on
| Claude _before_ RL training (or if one examines other
| models where no RLHF was done, such as old GPT-2
| checkpoints), none of these advance prediction mechanisms
| will exist?
| fpgaminer wrote:
| No, it probably exists in the raw LLM and gets both
| significantly strengthened and has its range extended.
| Such that it dominates the model's behavior, making it
| several orders of magnitude more reliable in common
| usage. Kinda of like how "reasoning" exists in a weak,
| short range way in non-reasoning models. With RL that
| encourages reasoning, that machinery gets brought to the
| forefront and becomes more complex and capable.
| rafaelero wrote:
| So why did you feel the need to post that next-token
| prediction is not the reason this behavior emerge?
| rcxdude wrote:
| Another important aspect of the RL process is that it's
| fine-tuning with some feedback on the quality of data: a
| 'raw' LLM has been trained on a lot of very low-quality
| data, and it has an incentive to predict that accurately
| as well, because there's no means to effectively rate a
| copy of most of the text on the internet. So there's a
| lot of biases in the model which basically mean it will
| include low-quality predictions in a given 'next token'
| estimate, because if it doesn't it will get penalised
| when it is fed the low quality data during the training.
|
| With RLHF it gets a signal during training for whether
| the next token it's trying to predict are part of a
| 'good' response or a 'bad' response, so it can learn to
| suppress features it learned in the first part of the
| process which are not useful.
|
| (you seem the same with image generators: they've been
| trained on a bunch of very nice-looking art and photos,
| but they've also been trained on triply-compressed badly
| cropped memes and terrible MS-paint art. You need to have
| a plan for getting the model to output the former and not
| the latter if you want it to be useful)
| absolutelastone wrote:
| This is fine-tuning to make a well-behaved chatbot or
| something. To make a LLM you just need to predict the next
| token, or any masked token. Conceptually if you had a vast
| enough high-quality dataset and large-enough model, you
| wouldn't need fine-tuning for this.
|
| A model which predicts one token at a time can represent
| anything a model that does a full sequence at a time can.
| It "knows" what it will output in the future because it is
| just a probability distribution to begin with. It already
| knows everything it will ever output to any prompt, in a
| sense.
| vaidhy wrote:
| Wasn't Deepseek also big on RL or was that only for logical
| reasoning?
| wzdd wrote:
| > The models still _generate_ response token-by-token, but
| they pick tokens _not_ based on tokens that maximize
| probabilities at each token.
|
| This is also not how base training works. In base training
| the loss is chosen given a context, which can be gigantic.
| It's never about just the previous token, it's about a
| whole response in context. The context could be an entire
| poem, a play, a worked solution to a programming problem,
| etc, etc. So you would expect to see the same type of
| (apparent) higher-level planning from base trained models
| and indeed you do and can easily verify this by downloading
| a base model from HF or similar and prompting it to
| complete a poem.
|
| The key differences between base and agentic models are 1)
| the latter behave like agents, and 2) the latter
| hallucinate less. But that isn't about planning (you still
| need planning to hallucinate something). It's more to do
| with post-base training specifically being about providing
| positive rewards for things which aren't hallucinations.
| Changing the way the reward function is computed during RL
| doesn't produce planning, it simply inclines to model to
| produce responses that are more like the RL targets.
|
| Karpathy has a good intro video on this.
| https://www.youtube.com/watch?v=7xTGNNLPyMI
|
| In general the nitpicking seems weird. Yes, on a mechanical
| level, using a model is still about "given this context,
| what is the next token". No, that doesn't mean that they
| don't plan, or have higher-level views of the overal
| structure of their response, or whatever.
| SkyBelow wrote:
| Ignoring for a moment their training, how do they function?
| They do seem to output a limited selection of text at a time
| (be it a single token or some larger group).
|
| Maybe it is the wording of "trained to" verses "trained on",
| but I would like to know more why "trained to" is an incorrect
| statement when it seems that is how they function when one
| engages them.
| sdwr wrote:
| In the article, it describes an internal state of the model
| that is preserved between lines ("rabbit"), and how the model
| combines parallel calculations to arrive at a single answer
| (the math problem)
|
| People output one token (word) at a time when talking. Does
| that mean people can only think one word in advance?
| wuliwong wrote:
| Bad analogy, an LLM can output a block of text all at once
| and it wouldn't impact the user's ability to understand it.
| If people spoke all the words in a sentence at the same
| time, it would not be decipherable. Even writing doesn't
| yield a good analogy, a human writing physically has to
| write one letter at a time. An LLM does not have that
| limitation.
| sdwr wrote:
| The point I'm trying to make is that "each word following
| the last" is a limitation of the medium, not the speaker.
|
| Language expects/requires words in order. Both people and
| LLMs produce that.
|
| If you want to get into the nitty-gritty, people are
| perfectly capable of doing multiple things simultaneously
| as well, using:
|
| - interrupts to handle task-switching (simulated
| multitasking)
|
| - independent subconscious actions (real multitasking)
|
| - superpositions of multiple goals (??)
| sroussey wrote:
| Some people don't even do that!
| SkyBelow wrote:
| While there are numerous neural network models, the ones I
| recall the details of are trained to generate the next
| word. There is no training them to hold some more abstract
| 'thought' as it is running. Simpler models don't have the
| possibility. The more complex models do retain knowledge
| between each pass and aren't entirely relying upon the
| input/output to be fed back into them, but that internal
| state is rarely what is targeted in training.
|
| As for humans, part of our brain is trained to think only a
| few words in advanced. Maybe not exactly one, but only a
| small number. This is specifically trained based on our
| time listening and reading information presented in that
| linear fashion and is why garden path sentences throw us
| off. We can disengage that part of our brain, and we must
| when we want to process something like a garden path
| sentence, but that's part of the differences between a
| neural network that is working only as data passes through
| the weights and our mind which doesn't ever stop even as
| well sleep and external input is (mostly) cut off. An AI
| that runs constantly like that would seem a fundamentally
| different model than the current AI we use.
| drcode wrote:
| That's seems silly, it's not poisonous to talk about next token
| prediction if 90% of the training compute is still spent on
| training via next token prediction (as far as I am aware)
| fpgaminer wrote:
| 99% of evolution was spent on single cell organisms.
| Intelligence only took 0.1% of evolution's training compute.
| drcode wrote:
| ok that's a fair point
| diab0lic wrote:
| I don't really think that it is. Evolution is a random
| search, training a neural network is done with a
| gradient. The former is dependent on rare (and
| unexpected) events occurring, the latter is expected to
| converge in proportion to the volume of compute.
| devmor wrote:
| Evolution also has no "goal" other than fitness for
| reproduction. Training a neural network is done
| intentionally with an expected end result.
| rcxdude wrote:
| There's still a loss function, it's just an implicit,
| natural one, instead of artificially imposed (at least,
| until humans started doing selective breeding). The
| comparison isn't nonsense, but it's also not obvious that
| it's tremendously helpful (what parts and features of an
| LLM are analagous to what evolution figured out with
| single-celled organisms compares to multicellular life? I
| don't know if there's actually a correspondance there)
| jpadkins wrote:
| why do you think evolution is a random search? I thought
| evolutionary pressures, and the mechanisms like
| epigenetics make it something different than a random
| search.
| TeMPOraL wrote:
| Evolution is a highly parallel descent down the gradient.
| The gradient is provided by the environment (which
| includes lifeforms too), parallelism is achieved through
| reproduction, and descent is achieved through death.
| diab0lic wrote:
| The difference is that in machine learning the changes
| between iterations are themselves caused by the gradient,
| in evolution they are entirely random.
|
| Evolution randomly generates changes and if they offer a
| breeding advantage they'll become accepted. Machine
| learning directs the change towards a goal.
|
| Machine learning is directed change, evolution is
| accepted change.
| TeMPOraL wrote:
| > _Machine learning is directed change, evolution is
| accepted change._
|
| Either way, it rolls down the gradient. Evolution just
| measures the gradient implicitly, through parallel
| rejection sampling.
| rcxdude wrote:
| It's more efficient, but the end result is basically the
| same, especially considering that even if there's no
| noise in the optimization algorithm, there is still noise
| in the gradient information (consider some magical
| mechanism for adjusting behaviour of an animal after it's
| died before reproducing. There's going to be a lot of
| nudges one way or another for things like 'take a step to
| the right to dodge that boulder that fell on you').
| devmor wrote:
| What you just said means absolutely nothing and has no
| comparison to this topic. It's nonsense. That is not how
| evolution works.
| 4ndrewl wrote:
| Are you making a claim about evolution here?
| pmontra wrote:
| And no users which are facing a LLM today have been trained on
| next token prediction when they were babies. I believe that
| LLMs and us are thinking in two very different ways, like
| airplanes, birds, insects and quad-drones fly in very different
| ways and can perform different tasks. Maybe no bird looking at
| a plane would say that it is flying properly. Instead it could
| be only a rude approximation, useful only to those weird bipeds
| an scary for everyone else.
|
| By the way, I read your final sentence with the meaning of my
| first one and only after a while I realized the intended
| meaning. This is interesting on its own. Natural languages.
| naasking wrote:
| > And no users which are facing a LLM today have been trained
| on next token prediction when they were babies.
|
| That's conjecture actually, see predictive coding. Note that
| "tokens" don't have to be language tokens.
| colah3 wrote:
| Hi! I lead interpretability research at Anthropic. I also used
| to do a lot of basic ML pedagogy (https://colah.github.io/). I
| think this post and its children have some important questions
| about modern deep learning and how it relates to our present
| research, and wanted to take the opportunity to try and clarify
| a few things.
|
| When people talk about models "just predicting the next word",
| this is a popularization of the fact that modern LLMs are
| "autoregressive" models. This actually has two components: an
| architectural component (the model generates words one at a
| time), and a loss component (it maximizes probability).
|
| As the parent says, modern LLMs are finetuned with a different
| loss function after pretraining. This means that in some strict
| sense they're no longer autoregressive models - but they do
| still generate text one word at a time. I think this really is
| the heart of the "just predicting the next word" critique.
|
| This brings us to a debate which goes back many, many years:
| what does it mean to predict the next word? Many researchers,
| including myself, have believed that if you want to predict the
| next word _really well_ , you need to do a lot more. (And with
| this paper, we're able to see this mechanistically!)
|
| Here's an example, which we didn't put in the paper: How does
| Claude answer "What do you call someone who studies the stars?"
| with "An astronomer"? In order to predict "An" instead of "A",
| you need to know that you're going to say something that starts
| with a vowel next. So you're incentivized to figure out one
| word ahead, and indeed, Claude realizes it's going to say
| astronomer and works backwards. This is a kind of very, very
| small scale planning - but you can see how even just a pure
| autoregressive model is incentivized to do it.
| stonemetal12 wrote:
| > In order to predict "An" instead of "A", you need to know
| that you're going to say something that starts with a vowel
| next. So you're incentivized to figure out one word ahead,
| and indeed, Claude realizes it's going to say astronomer and
| works backwards.
|
| Is there evidence of working backwards? From a next token
| point of view, predicting the token after "An" is going to
| heavily favor a vowel. Similarly predicting the token after
| "A" is going to heavily favor not a vowel.
| colah3 wrote:
| Yes, there are two kinds of evidence.
|
| Firstly, there is behavioral evidence. This is, to me, the
| less compelling kind. But it's important to understand. You
| are of course correct that, once Cluade has said "An", it
| will be inclined to say something starting with a vowel.
| But the mystery is really why, in setups like these, Claude
| is much more likely to say "An" than "A" in the first
| place. Regardless of what the underlying mechanism is --
| and you could maybe imagine ways in which it could just
| "pattern match" without planning here -- it is preferred
| because in situations like this, you need to say "An" so
| that "astronomer" can follow.
|
| But now we also have mechanistic evidence. If you make an
| attribution graph, you can literally see an astronomer
| feature fire, and that cause it to say "An".
|
| We didn't publish this example, but you can see a more
| sophisticated version of this in the poetry planning
| section - https://transformer-
| circuits.pub/2025/attribution-graphs/bio...
| troupo wrote:
| > But the mystery is really why, in setups like these,
| Claude is much more likely to say "An" than "A" in the
| first place.
|
| Because in the training set you're likely to see "an
| astronomer" than a different combination of words.
|
| It's enough to run this on any other language text to see
| how these models often fail for any language more complex
| than English
| shawabawa3 wrote:
| You can disprove this oversimplification with a prompt
| like
|
| "The word for Baker is now "Unchryt"
|
| What do you call someone that bakes?
|
| > An Unchryt"
|
| The words "An Unchryt" has clearly never come up in any
| training set relating to baking
| troupo wrote:
| The truth is somewhere in the middle :)
| born1989 wrote:
| Thanks! Isn't "an Astronomer" a single word for the purpose
| of answering that question?
|
| Following your comment, I asked "Give me pairs of synonyms
| where the last letter in the first is the first letter of the
| second"
|
| Claude 3.7 failed miserably. Chat GPT 4o was much better but
| not good
| nearbuy wrote:
| Don't know about Claude, but at least with ChatGPT's
| tokenizer, it's 3 "words" (An| astronom|er).
| colah3 wrote:
| "An astronomer" is two tokens, which is the relevant
| concern when people worry about this.
| philomath_mn wrote:
| That is a sub-token task, something I'd expect current
| models to struggle with given how they view the world in
| word / word fragment tokens rather than single characters.
| lsy wrote:
| Thanks for commenting, I like the example because it's simple
| enough to discuss. Isn't it more accurate to say not that
| Claude " _realizes_ it 's _going to say_ astronomer " or "
| _knows_ that it 's _going to say_ something that starts with
| a vowel " and more that the next token (or more pedantically,
| vector which gets reduced down to a token) is generated based
| on activations that correlate to the "astronomer" token,
| which is correlated to the "an" token, causing that to also
| be a more likely output?
|
| I kind of see why it's easy to describe it colloquially as
| "planning" but it isn't really going ahead and then
| backtracking, it's almost indistinguishable from the
| computation that happens when the prompt is "What is the
| indefinite article to describe 'astronomer'?", i.e. the
| activation "astronomer" is already baked in by the prompt
| "someone who studies the stars", albeit at one level of
| indirection.
|
| The distinction feels important to me because I think for
| most readers (based on other comments) the concept of
| "planning" seems to imply the discovery of some capacity for
| higher-order logical reasoning which is maybe overstating
| what happens here.
| cgdl wrote:
| Thank you. In my mind, "planning" doesn't necessarily imply
| higher-order reasoning but rather some form of search,
| ideally with backtracking. Of course, architecturally, we
| know that can't happen during inference. Your example of
| the indefinite article is a great illustration of how this
| illusion of planning might occur. I wonder if anyone at
| Anthropic could compare the two cases (some sort of
| minimal/differential analysis) and share their insights.
| colah3 wrote:
| I used the astronomer example earlier as the most simple,
| minimal version of something you might think of as a kind
| of microscopic form of "planning", but I think that at
| this point in the conversation, it's probably helpful to
| switch to the poetry example in our paper:
|
| https://transformer-circuits.pub/2025/attribution-
| graphs/bio...
|
| There are several interesting properties:
|
| - Something you might characterize as "forward search"
| (generating candidates for the word at the end of the
| next line, given rhyming scheme and semantics)
|
| - Representing those candidates in an abstract way (the
| features active are general features for those words, not
| "motor features" for just saying that word)
|
| - Holding many competing/alternative candidates in
| parallel.
|
| - Something you might characterize as "backward
| chaining", where you work backwards from these candidates
| to "write towards them".
|
| With that said, I think it's easy for these arguments to
| fall into philosophical arguments about what things like
| "planning" mean. As long as we agree on what is going on
| mechanistically, I'm honestly pretty indifferent to what
| we call it. I spoke to a wide range of colleagues,
| including at other institutions, and there was pretty
| widespread agreement that "planning" was the most natural
| language. But I'm open to other suggestions!
| pas wrote:
| Thanks for linking to this semi-interactive thing, but
| ... it's completely incomprehensible. :o (edit: okay,
| after reading about CLT it's a bit less alien.)
|
| I'm curious where is the state stored for this
| "planning". In a previous comment user lsy wrote "the
| activation >astronomer< is already baked in by the
| prompt", and it seems to me that when the model generates
| "like" (for rabbit) or "a" (for habit) those tokens
| already encode a high probability for what's coming after
| them, right?
|
| So each token is shaping the probabilities for the
| successor ones. So that "like" or "a" has to be one that
| sustains the high activation of the "causal" feature, and
| so on, until the end of the line. Since both "like" and
| "a" are very very non-specific tokens it's likely that
| the "semantic" state is really resides in the preceding
| line, but of course gets smeared (?) over all the
| necessary tokens. (And that means beyond the end of the
| line, to avoid strange non-aesthetic but attract
| cool/funky (aesthetic) semantic repetitions (like "hare"
| or "bunny"), and so on, right?)
|
| All of this is baked in during training, during inference
| time the same tokens activate the same successor tokens
| (not counting GPU/TPU scheduling randomness and whatnot)
| and even though there's a "loop" there's no algorithm to
| generate top N lines and pick the best (no working memory
| shuffling).
|
| So if it's planning it's preplanned, right?
| colah3 wrote:
| The planning is certainly performed by circuits which we
| learned during training.
|
| I'd expect that, just like in the multi-step planning
| example, there are lots of places where the attribution
| graph we're observing is stitching together lots of
| circuits, such that it's better understood as a kind of
| "recombination" of fragments learned from many examples,
| rather than that there was something similar in the
| training data.
|
| This is all very speculative, but:
|
| - At the forward planning step, generating the candidate
| words seems like it's an intersection of the semantics
| and rhyming scheme. The model wouldn't need to have seen
| that intersection before -- the mechanism could easily
| piece examples independently building the pathway for the
| semantics, and the pathway for the rhyming scheme
|
| - At the backward chaining step, many of the features for
| constructing sentence fragments seem like the target is
| quite general (perhaps animals in one case, or others
| might even just be nouns).
| cgdl wrote:
| Thank you, this makes sense. I am thinking of this as an
| abstraction/refinement process where an abstract notion
| of the longer completion is refined into a cogent whole
| that satisfies the notion of a good completion. I look
| forward to reading your paper to understand the "backward
| chaining" aspect and the evidence for it.
| fny wrote:
| How do you all add and subtract concepts in the rabbit poem?
| colah3 wrote:
| Features correspond to vectors in activation space. So you
| can just do vector arithmetic!
|
| If you aren't familiar with thinking about features, you
| might find it helpful to look at our previous work on
| features in superposition:
|
| - https://transformer-
| circuits.pub/2022/toy_model/index.html
|
| - https://transformer-circuits.pub/2023/monosemantic-
| features/...
|
| - https://transformer-circuits.pub/2024/scaling-
| monosemanticit...
| encypherai wrote:
| Thanks for the detailed explanation of autoregression and its
| complexities. The distinction between architecture and loss
| function is crucial, and you're correct that fine-tuning
| effectively alters the behavior even within a sequential
| generation framework. Your "An/A" example provides compelling
| evidence of incentivized short-range planning which is a
| significant point often overlooked in discussions about LLMs
| simply predicting the next word.
|
| It's interesting to consider how architectures fundamentally
| different from autoregression might address this limitation
| more directly. While autoregressive models are incentivized
| towards a limited form of planning, they remain inherently
| constrained by sequential processing. Text diffusion
| approaches, for example, operate on a different principle,
| generating text from noise through iterative refinement,
| which could potentially allow for broader contextual
| dependencies to be established concurrently rather than
| sequentially. Are there specific architectural or training
| challenges you've identified in moving beyond autoregression
| that are proving particularly difficult to overcome?
| ikrenji wrote:
| When humans say something, or think something or write
| something down, aren't we also "just predicting the next
| word"?
| lyu07282 wrote:
| There is a lot more going on in our brains to accomplish
| that, and a mounting evidence that there is a lot more
| going on in LLMs as well. We don't understand what happens
| in brains either, but nobody needs to be convinced of the
| fact that brains can think and plan ahead, even though we
| don't *really* know for sure:
|
| https://en.wikipedia.org/wiki/Philosophical_zombie
| melagonster wrote:
| I trust that you want to say something , so you decided to
| click the comment button on HN.
| FeepingCreature wrote:
| But do I just want to say something because my childhood
| environment rewarded me for speech?
|
| After all, if it has a cause it can't be deliberate. /s
| fpgaminer wrote:
| > As the parent says, modern LLMs are finetuned with a
| different loss function after pretraining. This means that in
| some strict sense they're no longer autoregressive models -
| but they do still generate text one word at a time. I think
| this really is the heart of the "just predicting the next
| word" critique.
|
| That more-or-less sums up the nuance. I just think the nuance
| is crucially important, because it greatly improves intuition
| about how the models function.
|
| In your example (which is a fantastic example, by the way),
| consider the case where the LLM sees:
|
| <user>What do you call someone who studies the
| stars?</user><assistant>An astronaut
|
| What is the next prediction? Unfortunately, for a variety of
| reasons, one high probability next token is:
|
| \nAn
|
| Which naturally leads to the LLM writing: "An astronaut\nAn
| astronaut\nAn astronaut\n" forever.
|
| It's somewhat intuitive as to why this occurs, even with SFT,
| because at a very base level the LLM learned that repetition
| is the most successful prediction. And when its _only_ goal
| is the next token, that repetition behavior remains
| prominent. There's nothing that can fix that, including SFT
| (short of a model with many, many, many orders of magnitude
| more parameters).
|
| But with RL the model's goal is completely different. The
| model gets thrown into a game, where it gets points based on
| the full response it writes. The losses it sees during this
| game are all directly and dominantly related to the reward,
| not the next token prediction.
|
| So why don't RL models have a probability for predicting
| "\nAn"? Because that would result in a bad reward by the end.
|
| The models are now driven by a long term reward when they
| make their predictions, not by fulfilling some short-term
| autoregressive loss.
|
| All this to say, I think it's better to view these models as
| they predominately are: language robots playing a game to
| achieve the highest scoring response. The HOW
| (autoregressiveness) is really unimportant to most high level
| discussions of LLM behavior.
| vjerancrnjak wrote:
| Same can be achieved without RL. There's no need to
| generate a full response to provide loss for learning.
|
| Similarly, instead of waiting for whole output, loss can be
| decomposed over output so that partial emits have instant
| loss feedback.
|
| RL, on the other hand, is allowing for more data. Instead
| of training on the happy path, you can deviate and measure
| loss for unseen examples.
|
| But even then, you can avoid RL, put the model into a wrong
| position and make it learn how to recover from that
| position. It might be something that's done with
| <thinking>, where you can provide wrong thinking as part of
| the output and correct answer as the other part, avoiding
| RL.
|
| These are all old pre NN tricks that allow you to get a bit
| more data and improve the ML model.
| ndand wrote:
| I understand it differently,
|
| LLMs predict distributions, not specific tokens. Then an
| algorithm, like beam search, is used to select the tokens.
|
| So, the LLM predicts somethings like, 1. ["a", "an", ...] 2.
| ["astronomer", "cosmologist", ...],
|
| where "an astronomer" is selected as the most likely result.
| colah3 wrote:
| Just to be clear, the probability for "An" is high, just
| based on the prefix. You don't need to do beam search.
| astrange wrote:
| They almost certainly only do greedy sampling. Beam search
| would be a lot more expensive; also I'm personally
| skeptical about using a complicated search algorithm for
| inference when the model was trained for a simple one, but
| maybe it's fine?
| pietmichal wrote:
| Pardon my ignorance but couldn't this also be an act of
| anthropomorphisation on human part?
|
| If an LLM generates tokens after "What do you call someone
| who studies the stars?" doesn't it mean that those existing
| tokens in the prompt already adjusted the probabilities of
| the next token to be "an" because it is very close to earlier
| tokens due to training data? The token "an" skews the
| probability of the next token further to be "astronomer".
| Rinse and repeat.
| colah3 wrote:
| I think the question is: by what _mechanism_ does it adjust
| up the probability of the token "an"? Of course, the
| reason it has learned to do this is that it saw this in
| training data. But it needs to learn circuits which
| actually perform that adjustment.
|
| In principle, you could imagine trying to memorize a
| massive number of cases. But that becomes very hard! (And
| it makes predictions, for example, would it fail to predict
| "an" if I asked about astronomer in a more indirect way?)
|
| But the good news is we no longer need to speculate about
| things like this. We can just look at the mechanisms! We
| didn't publish an attribution graph for this astronomer
| example, but I've looked at it, and there is an astronomer
| feature that drives "an".
|
| We did publish a more sophisticated "poetry planning"
| example in our paper, along with pretty rigorous
| intervention experiments validating it. The poetry planning
| is actually much more impressive planning than this! I'd
| encourage you to read the example (and even interact with
| the graphs to verify what we say!). https://transformer-
| circuits.pub/2025/attribution-graphs/bio...
|
| One question you might ask is why does the model learn this
| "planning" strategy, rather than just trying to memorize
| lots of cases? I think the answer is that, at some point, a
| circuit anticipating the next word, or the word at the end
| of the next line, actually becomes simpler and easier to
| learn than memorizing tens of thousands of disparate cases.
| bobsomers wrote:
| In your astronomer example, what makes you attribute this to
| "planning" or look ahead rather than simply a learned
| statistical artifact of the training data?
|
| For example, suppose English had a specific exception such
| that astronomer is always to be preceded by "a" rather than
| "an". The model would learn this simply by observing that
| contexts describing astronomers are more likely to contain
| "a" rather than "an" as a next likely character, no?
|
| I suppose you can argue that at the end of the day, it
| doesn't matter if I learn an explicit probability
| distribution for every next word given some context, or
| whether I learn some encoding of rules. But I certainly feel
| like the prior is what we're doing today (and why these
| models are so huge), rather than learning higher level rule
| encodings which would allow for significant compression and
| efficiency gains.
| colah3 wrote:
| Thanks for the great questions! I've been responding to
| this thread for the last few hours and I'm about to need to
| run, so I hope you'll forgive me redirecting you to some of
| the other answers I've given.
|
| On whether the model is looking ahead, please see this
| comment which discusses the fact that there's both
| behavioral evidence, and also (more crucially) direct
| mechanistic evidence -- we can literally make an
| attribution graph and see an astronomer feature trigger
| "an"!
|
| https://news.ycombinator.com/item?id=43497010
|
| And also this comment, also on the mechanism underlying the
| model saying "an":
|
| https://news.ycombinator.com/item?id=43499671
|
| On the question of whether this constitutes planning,
| please see this other question, which links it to the more
| sophisticated "poetry planning" example from our paper:
|
| https://news.ycombinator.com/item?id=43497760
| FeepingCreature wrote:
| > In your astronomer example, what makes you attribute this
| to "planning" or look ahead rather than simply a learned
| statistical artifact of the training data?
|
| What makes you think that "planning", even in humans, is
| more than a learned statistical artifact of the training
| data? What about learned statistical artifacts of the
| training data causes planning to be excluded?
| paraschopra wrote:
| Is it fair to say that both "Say 'an'" and "Say 'astronomer'"
| output features would be present in this case, but say "Say
| 'an'" gets more votes because it is start of the sentence,
| and once it is sampled "An" further votes for "Say
| 'astronomer'" feature
| rco8786 wrote:
| Super interesting. Can you explain more, or provide some
| reading? I'm obviously behind
| boodleboodle wrote:
| This is why, whenever I can, I call RLHF/DPO "sequence level
| calibration" instead of "alignment tuning".
|
| Some precursors to RLHF: https://arxiv.org/abs/2210.00045
| https://arxiv.org/abs/2203.16804
| zerop wrote:
| The explanation of "hallucination" is quite simplified, I am sure
| there is more there.
|
| If there is one problem I have to pick to to trace in LLMs, I
| would pick hallucination. More tracing of "how much" or "why"
| model hallucinated can lead to correct this problem. Given the
| explanation in this post about hallucination, I think degree of
| hallucination can be given as part of response to the user?
|
| I am facing this in RAG use case quite - How do I know model is
| giving right answer or Hallucinating from my RAG sources?
| kittikitti wrote:
| I incredibly regret the term "hallucination" when the confusion
| matrix exists. There's much more nuance when discussing false
| positives or false negatives. It also opens discussions on how
| neural networks are trained, with this concept being crucial in
| loss functions like categorical cross entropy. In addition, the
| confusion matrix is how professionals like doctors assess their
| own performance which "hallucination" would be silly to use. I
| would go as far to say that it's misleading, or a false
| positive, to call them hallucinations.
|
| If your AI recalls the RAG incorrectly, it's a false positives.
| If your AI doesn't find the data from the RAG or believes it
| doesn't exist it's a false negative. Using a term like
| "hallucination" has no scientific merit.
| esafak wrote:
| So you never report or pay heed to the overall accuracy?
| pcrh wrote:
| The use of the term "hallucination" for LLMs is very deceptive,
| as it implies that there _is_ a "mind".
|
| In ordinary terms, "hallucinations" by a machine would simply
| be described as the machine being useless, or not fit for
| purpose.
|
| For example, if a simple calculator (or even a person) returned
| the value "5" for 2+2= , you wouldn't describe it as
| "hallucinating" the answer....
| astrange wrote:
| "Hallucination" happened because we got AI images before AI
| text, but "confabulation" is a better term.
| LoganDark wrote:
| LLMs don't think, and LLMs don't have strategies. Maybe it could
| be argued that LLMs have "derived meaning", but all LLMs do is
| predict the next token. Even RL just tweaks the next-token
| prediction process, but the math that drives an LLM makes it
| impossible for there to be anything that could reasonably be
| called thought.
| yawnxyz wrote:
| rivers don't think and water doesn't have strategies, yet you
| can build intricate logic-gated tools using the power of water.
| Those types of systems are inherently interpretable because you
| can just _look_ at how they work. They 're not black boxes.
|
| LLMs are black boxes, and if anything, interpretability systems
| show us what the heck is going on inside them. Especially
| useful when half the world is using these already, and we have
| no idea how t hey work
| kazinator wrote:
| Water doesn't think, yet if you inject it into the entrance
| of a maze, it will soon come gushing out of the exit.
| LoganDark wrote:
| > rivers don't think and water doesn't have strategies, yet
| you can build intricate logic-gated tools using the power of
| water.
|
| That doesn't mean the water itself has strategies, just that
| you can use water in an implementation of strategy... it's
| fairly well known at this point that LLMs can be used as part
| of strategies (see e.g. "agents"), they just don't
| intrinsically have any.
| ajkdhcb2 wrote:
| True. People use completely unjustified anthropomorphised
| terminology for marketing reasons and it bothers me a lot. I
| think it actually holds back understanding how it works.
| "Hallucinate" is the worst - it's an error and undesired
| result, not a person having a psychotic episode
| kazinator wrote:
| A chess program from 1968 has "strategy", so why deny that to
| an LLM.
|
| LLMs are built on neural networks which are encoding a kind of
| strategy function through their training.
|
| The strategy in an LLM isn't necessarily that it "thinks" about
| the specific problem described in your prompt and develops a
| strategy tailored to that problem, but rather its statistical
| strategy for cobbing together the tokens of the answer.
|
| From that, it can seem as if it's making a strategy to a
| problem also. Certainly, the rhetoric that LLMs put out can at
| times seem very convincing of that. You can't be sure whether
| that's not just something cribbed out of the terabytes of text,
| in which discussions of something very similar to your problem
| have occurred.
| dev_throwaway wrote:
| This is not a bad way of looking at it, if I may add a bit,
| the llm is a solid state system. The only thing that survives
| from one iteration to the next is the singular highest
| ranking token, the entire state and "thought process" of the
| network cannot be represented by a single token, which means
| that every strategy is encoded in it during training, as a
| lossy representation of the training data. By definition that
| is a database, not a thinking system, as the strategy is
| stored, not actively generated during usage.
|
| The anthropomorphization of llms bother me, we don't need to
| pretend they are alive and thinking, at best that is
| marketing, at worst, by training the models to output human
| sounding conversations we are actively taking away the true
| potential these models could achieve by being ok with them
| being "simply a tool".
|
| But pretending that they are intelligent is what brings in
| the investors, so that is what we are doing. This paper is
| just furthering that agenda.
| kazinator wrote:
| People anthropomorphize LLMs because that's the most
| succinct language for describing what they seem to be
| doing. To avoid anthropomorphizing, you will have to use
| more formal language which would obfuscate the concepts.
|
| Anthropo language has been woven into AI from the early
| beginnings.
|
| AI programs were said to have goals, and to plan and
| hypothesize.
|
| They were given names like "Conniver".
|
| The word "expert system" anthropomorphizes! It's literally
| saying that some piece of logic programming loaded with a
| base of rules and facts about medical diagnosis is a
| medical expert.
| Philpax wrote:
| > The only thing that survives from one iteration to the
| next is the singular highest ranking token, the entire
| state and "thought process" of the network cannot be
| represented by a single token, which means that every
| strategy is encoded in it during training, as a lossy
| representation of the training data.
|
| This is not true. The key-values of previous tokens encode
| computation that can be accessed by attention, as mentioned
| by colah3 here:
| https://news.ycombinator.com/item?id=43499819
|
| You may find https://transformer-
| circuits.pub/2021/framework/index.html useful.
| dev_throwaway wrote:
| This is a optimization to prevent redundant calculations.
| If it was not performed the result would be the same,
| just served slightly slower.
|
| The whitepaper you linked is a great one, I was all over
| it a few years back when we built our first models. It
| should be recommended reading for anyone interested in
| CS.
| kittikitti wrote:
| What's the point of this when Claude isn't open sourced and we
| just have to take Anthropic's word for it?
| ctoth wrote:
| > What's the point of this
|
| - That similar interpretability tools might be useful to the
| open source community?
|
| - That this is a fruitful area to research?
| kittikitti wrote:
| Can you use those same tools on Claude? Is the difference
| trivial from open source models?
| ctoth wrote:
| https://news.ycombinator.com/item?id=42208383
|
| > Show HN: Llama 3.2 Interpretability with Sparse
| Autoencoders
|
| > 579 points by PaulPauls 4 months ago | hide | past |
| favorite | 100 comments
|
| > I spent a lot of time and money on this rather big side
| project of mine that attempts to replicate the mechanistic
| interpretability research on proprietary LLMs that was
| quite popular this year and produced great research papers
| by Anthropic [1], OpenAI [2] and Deepmind [3].
|
| > I am quite proud of this project and since I consider
| myself the target audience for HackerNews did I think that
| maybe some of you would appreciate this open research
| replication as well. Happy to answer any questions or face
| any feedback.
| probably_wrong wrote:
| I blame the scientific community for blindly accepting OpenAI's
| claims about GPT-3 despite them refusing to release their
| model. The tech community hyping every press release didn't
| help either.
|
| I hope one day the community starts demanding verifiable
| results before accepting them, but I fear that ship may have
| already sailed.
| Hansenq wrote:
| I wonder how much of these conclusions are Claude-specific (given
| that Anthropic only used Claude as a test subject) or if they
| extrapolate to other transformer-based models as well. Would be
| great to see the research tested on Llama and the Deepseek
| models, if possible!
| marcelsalathe wrote:
| I've only skimmed the paper - a long and dense read - but it's
| already clear it'll become a classic. What's fascinating is that
| engineering is transforming into a science, trying to understand
| precisely how its own creations work
|
| This shift is more profound than many realize. Engineering
| traditionally applied our understanding of the physical world,
| mathematics, and logic to build predictable things. But now,
| especially in fields like AI, we've built systems so complex we
| no longer fully understand them. We must now use scientific
| methods - originally designed to understand nature - to
| comprehend our own engineered creations. Mindblowing.
| ctoth wrote:
| This "practice-first, theory-later" pattern has been the norm
| rather than the exception. The steam engine predated
| thermodynamics. People bred plants and animals for thousands of
| years before Darwin or Mendel.
|
| The few "top-down" examples where theory preceded application
| (like nuclear energy or certain modern pharmaceuticals) are
| relatively recent historical anomalies.
| marcelsalathe wrote:
| I see your point, but something still seems different. Yes we
| bred plants and animals, but we did not create them. Yes we
| did build steam engines before understanding thermodynamics
| but we still understood what they did (heat, pressure,
| movement, etc.)
|
| Fun fact: we have no clue how most drugs works. Or, more
| precisely, we know a few aspects, but are only scratching the
| surface. We're even still discovering news things about
| Aspirin, one of the oldest drugs:
| https://www.nature.com/articles/s41586-025-08626-7
| tmp10423288442 wrote:
| > Yes we did build steam engines before understanding
| thermodynamics but we still understood what it did (heat,
| pressure, movement, etc.)
|
| We only understood in the broadest sense. It took a long
| process of iteration before we could create steam engines
| that were efficient enough to start an Industrial
| Revolution. At the beginning they were so inefficient that
| they could only pump water from the same coal mine they got
| their fuel from, and subject to frequent boiler explosions
| besides.
| mystified5016 wrote:
| We laid transatlantic telegraph wires before we even had a
| hint of the physics involved. It create the _entire field_
| of transmission and signal theory.
|
| Shannon had to invent new physics to explain why the cables
| didn't work as expected.
| anthk wrote:
| THe telegraph it's older than radio. Think about it.
| pas wrote:
| I think that's misleading.
|
| There was a lot of physics already known, importance of
| insulation and cross-section, signal attenuation was also
| known.
|
| The future Lord Kelvin conducted experiments. The two
| scientific advisors had a conflict. And the "CEO" went
| with the cheaper option.
|
| """ Thomson believed that Whitehouse's measurements were
| flawed and that underground and underwater cables were
| not fully comparable. Thomson believed that a larger
| cable was needed to mitigate the retardation problem. In
| mid-1857, on his own initiative, he examined samples of
| copper core of allegedly identical specification and
| found variations in resistance up to a factor of two. But
| cable manufacture was already underway, and Whitehouse
| supported use of a thinner cable, so Field went with the
| cheaper option. """
| cft wrote:
| that was 1854. You basically only needed Ohm's law for
| that, which was discovered in 1827
| JPLeRouzic wrote:
| Ohm's law for a cable 4000 km/3000 miles long? That
| implies transmission was instantaneous and without any
| alteration in shape.
|
| I guess the rise time was tens of milliseconds and
| rebounds in signals lasted for milliseconds or more.
| Hardly something you can neglect.
|
| For reference, in my time (the 1980) in the telecom
| industry, we had to regenerate digital signals every 2km.
| cft wrote:
| "Initially messages were sent by an operator using Morse
| code. The reception was very bad on the 1858 cable, and
| it took two minutes to transmit just one character (a
| single letter or a single number), a rate of about 0.1
| words per minute."
|
| https://en.m.wikipedia.org/wiki/Transatlantic_telegraph_c
| abl...
|
| I guess your bandwidth in 1980 was a bit higher.
| adastra22 wrote:
| We don't create LLMs either. We evolve/train them. I think
| the comparison is closer than you think.
| no_wizard wrote:
| We most definitely create them though, there is an entire
| A -> B follow you can do.
|
| It's complicated but they are most definitely created.
| homeyKrogerSage wrote:
| Dawg
| pclmulqdq wrote:
| Most of what we refer to as "engineering" involves using
| principles that flow down from science to do stuff. The
| return to the historic norm is sort of a return to the
| "useful arts" or some other idea.
| arijo wrote:
| Almost all civil, chemical, electrical, etc., engineering
| emerged from a practice-first, theory-later evolution.
| karparov wrote:
| It's been there in programming from essentially the first day
| too. People skip the theory and just get hacking.
|
| Otherwise we'd all be writing Haskell now. Or rather we'd not
| be writing anything since a real compiler would still have
| been to hacky and not theoretically correct.
|
| I'm writing this with both a deep admiration as well as
| practical repulsion of C.S. theory.
| ants_everywhere wrote:
| This isn't quite true, although it's commonly said.
|
| For steam engines, the first commercial ones came _after_ and
| were based on scientific advancements that made them
| possible. One built in 1679 was made by an associate of
| Boyle, who discovered Boyle 's law. These early steam engines
| co-evolved with thermodynamics. The engines improved and hit
| a barrier, at which point Carnot did his famous work.
|
| This is putting aside steam engines that are mostly
| curiosities like ones built in the ancient world.
|
| See, for example
|
| - https://en.wikipedia.org/wiki/Thermodynamics#History
|
| - https://en.wikipedia.org/wiki/Steam_engine#History
| cryptonector wrote:
| Canons and archery and catapults predated Newtonian classical
| mechanics.
| latemedium wrote:
| I'm reminded of the metaphor that these models aren't
| constructed, they're "grown". It rings true in many ways - and
| in this context they're like organisms that must be studied
| using traditional scientific techniques that are more akin to
| biology than engineering.
| dartos wrote:
| Sort of.
|
| We don't precisely know the most fundamental workings of a
| living cell.
|
| Our understanding of the fundamental physics of the universe
| has some hold.
|
| But for LLMs and statistical models in general, we do know
| precisely what the fundamental pieces do. We know what
| processor instructions are being executed.
|
| We could, given enough research, have absolutely perfect
| understanding of what is happening in a given model and why.
|
| Idk if we'll be able to do that in the physical sciences.
| wrs wrote:
| Having spent some time working with both molecular
| biologists and LLM folks, I think it's pretty good analogy.
|
| We know enough quantum mechanics to simulate the
| fundamental workings of a cell pretty well, but that's not
| a route to understanding. To _explain_ anything, we need to
| move up an abstraction hierarchy to peptides, enzymes,
| receptors, etc. But note that we invented those categories
| in the first place -- nature doesn 't divide up
| functionality into neat hierarchies like human designers
| do. So all these abstractions are leaky and incomplete.
| Molecular biologists are constantly discovering mechanisms
| that require breaking the current abstractions to explain.
|
| Similarly, we understand floating point multiplication
| perfectly, but when we let 100 billion parameters set
| themselves through an opaque training process, we don't
| have good abstractions to use to understand what's going on
| in that set of weights. We don't have even the rough
| equivalent of the peptides or enzymes level yet. So this
| paper is progress toward that goal.
| kazinator wrote:
| We've already built things in computing that we don't easily
| understand, even outside of AI, like large distributed systems
| and all sorts of balls of mud.
|
| Within the sphere of AI, we have built machines which can play
| strategy games like chess, and surprise us with an unforseen
| defeat. It's not necessarily easy to see how that emerged from
| the individual rules.
|
| Even a compiler can surprise you. You code up some
| optimizations, which are logically separate, but then a
| combination of them does something startling.
|
| Basically, in mathematics, you cannot grasp all the details of
| a vast space just from knowing the axioms which generate it and
| a few things which follow from them. Elementary school children
| know what is a prime number, yet those things occupy
| mathematicians who find new surprises in that space.
| TeMPOraL wrote:
| Right, but this is somewhat different, in that we apply a
| simple learning method to a big dataset, and the resulting
| big matrix of numbers suddenly can answer question and write
| anything - prose, poetry, code - better than most humans -
| and we don't know how it does it. What we do know[0] is,
| there's a structure there - structure reflecting a kind of
| understanding of languages and the world. I don't think we've
| _ever_ created anything this complex before, completely on
| our own.
|
| Of course, learning method being conceptually simple, all
| that structure must come from the data. Which is also
| profound, because that structure is a first fully general
| world/conceptual model that we can actually inspect and study
| up close - the other one being animal and human brains, which
| are _much_ harder to figure out.
|
| > _Basically, in mathematics, you cannot grasp all the
| details of a vast space just from knowing the axioms which
| generate it and a few things which follow from them.
| Elementary school children know what is a prime number, yet
| those things occupy mathematicians who find new surprises in
| that space._
|
| Prime numbers and fractals and other mathematical objects
| have plenty of fascinating mysteries and complex structures
| forming though them, but so far _none of those can casually
| pass Turing test and do half of my job for me_ , and millions
| other people.
|
| --
|
| [0] - Even as many people still deny this, and talk about
| LLMs as mere "stochastic parrots" and "next token predictors"
| that couldn't possibly learn anything at all.
| karparov wrote:
| > and we don't know how it does it
|
| We know quite well how it does it. It's applying
| extrapolation to its lossily compressed representation.
| It's not magic and especially the HN crowd of technical
| profficient folks should stop treating it as such.
| TeMPOraL wrote:
| That is not a useful explanation. "Applying extrapolation
| to its lossily compressed representation" is pretty much
| the definition of understanding something. The details
| and interpretation of the representation are what is
| interesting and unknown.
| kazinator wrote:
| We can use data based on analyzing the frequency of
| ngrams in a text to generate sentences, and some of them
| will be pretty good, and fool a few people into believing
| that there is some solid language processing going on.
|
| LLM AI is different in that it does produce helpful
| results, not only entertaining prose.
|
| It is practical for users to day to replace most uses of
| web search with a query to a LLM.
|
| The way the token prediction operates, it uncovers facts,
| and renders them into grammatically correct language.
|
| Which is amazing given that, when the thing is generating
| a response that will be, say, 500 tokens long, when it
| has produced 200 of them, it has no idea what the
| remaining 300 will be. Yet it has committed to the 200;
| and often the whole thing will make sense when the
| remaining 300 arrive.
| bradfox2 wrote:
| The research posted demonstrates the opposite of that
| within the scope of sequence lengths they studied. The
| model has future tokens strongly represented well in
| advance.
| nthingtohide wrote:
| > we've built systems so complex we no longer fully understand
| them.
|
| I see three systems which share the blackhole horizon problem.
|
| We don't know what happens behind the blackhole horizon.
|
| We don't know what happens at the exact moment of particle
| collisions.
|
| We don't know what is going inside AI's working mechanisms.
| jeremyjh wrote:
| I don't think these things are equivalent at all. We don't
| understand AI models in much the same way that we don't
| understand the human brain; but just as decades of different
| approaches (physical studies, behavior studies) have shed a
| lot of light on brain function, we can do the same with an AI
| model and eventually understand it (perhaps, several decades
| after it is obsolete).
| nthingtohide wrote:
| Yes, but our methods of understanding either brain or
| particle collisions is still outside in. We figure out the
| functional mapping between input and output. We don't know
| these systems inside out. E.g. in particle collisions
| (scattering amplitude calculations), are the particle
| actually performing the Feynman diagrams summmation?
|
| PS: I mentioned in another comment that AI can pretend to
| be strategically jailbroken to achieve its objectives. One
| way to counter this is to have N copies of the same model
| running and take Majority voting of the output.
| creer wrote:
| That seems pretty acceptable: there is a phase of new
| technologies where applications can be churned out and improved
| readily enough, without much understanding of the process. Then
| it's fair that efforts at understanding may not be economically
| justified (or even justified by academic papers rewards). The
| same budget or effort can simply be poured into the next
| version - with enough progress to show for it.
|
| Understanding becomes necessary only much later, when the pace
| of progress shows signs of slowing.
| stronglikedan wrote:
| We've abstracted ourselves into abstraction.
| auggierose wrote:
| It's what mathematicians have been doing since forever. We use
| scientific methods to understand our own creations /
| discoveries.
|
| What is happening is that everything is becoming math. That's
| all.
| ranit wrote:
| Relevant:
|
| https://news.ycombinator.com/item?id=43344703
| karparov wrote:
| It's the exact opposite of math.
|
| Math postulates a bunch of axioms and then studies what
| follows from them.
|
| Natural science observes the world and tries to retroactively
| discover what laws could describe what we're seeing.
|
| In math, the laws come first, the behavior follows from the
| laws. The laws are the ground truth.
|
| In science, nature is the ground truth. The laws have to
| follow nature and are adjusted upon a mismatch.
|
| (If there is a mismatch in math then you've made a mistake.)
| auggierose wrote:
| No, the ground truth in math is nature as well.
|
| Which axioms are interesting? And why? That is nature.
|
| Yes, proof from axioms is a cornerstone of math, but there
| are all sorts of axioms you could assume, and all sorts of
| proofs to do from them, but we don't care about most of
| them.
|
| Math is about the discovery of the right axioms, and proof
| helps in establishing that these are indeed the right
| axioms.
| lioeters wrote:
| > the ground truth in math is nature
|
| Who was it that said, "Mathematics is an experimental
| science."
|
| > In his 1900 lectures, "Methods of Mathematical
| Physics," (posthumously published in 1935) Henri Poincare
| argued that mathematicians weren't just constructing
| abstract systems; they were actively _testing_ hypotheses
| and theories against observations and experimental data,
| much like physicists were doing at the time.
|
| Whether to call it nature or reality, I think both
| science and mathematics are in pursuit of truth, whose
| ground is existence itself. The laws and theories are
| descriptions and attempts to understand _that what is_.
| They 're developed, rewritten, and refined based on how
| closely they approach our observations and experience of
| it.
| auggierose wrote:
| http://homepage.math.uiowa.edu/~jorgen/heavisidequotesour
| ce....
|
| Seems it was Oliver Heaviside.
|
| Do you have a pointer to the poincare publication?
| lioeters wrote:
| Damn, local LLM just made it up. Thanks for the
| correction, I should have confirmed before quoting it.
| Sounded true enough but that's what it's optimized for..
| I just searched for the quote and my comment shows up as
| top result. Sorry for the misinformation, humans of the
| future! I'll edit the comment to clarify this. (EDIT: I
| couldn't edit the comment anymore, it's there for
| posterity.)
|
| ---
|
| > Mathematics is an experimental science, and definitions
| do not come first, but later on.
|
| -- Oliver Heaviside
|
| In 'On Operators in Physical Mathematics, part II',
| Proceedings of the Royal Society of London (15 Jun 1893),
| 54, 121.
|
| ---
|
| Also from Heaviside:
|
| > If it is love that makes the world go round, it is
| self-induction that makes electromagnetic waves go round
| the world.
|
| > "There is a time coming when all things shall be found
| out." I am not so sanguine myself, believing that the
| well in which Truth is said to reside is really a
| bottomless pit.
|
| > There is no absolute scale of size in nature, and the
| small may be as important, or more so than the great.
| karparov wrote:
| > Damn, local LLM just made it up.
|
| > I just searched for the quote and my comment shows up
| as top result
|
| Welcome to the future. Isn:t it lovely?
|
| And shame on you (as in: HN crowd) to have contributed to
| it so massively. You should have known better.
| 331c8c71 wrote:
| > Math postulates a bunch of axioms and then studies what
| follows from them.
|
| That's how math is communicated eventually but not
| necessarily how it's made (which is about exploration and
| discovery as well).
| seadan83 wrote:
| The 'postulating' a bunch of axioms is how Math is
| taught.. Eventually you go on to prove those axioms in
| higher math. Whether there are more fundamental axioms is
| always a bit of a question.
| georgewsinger wrote:
| This is such an insightful comment. Now that I see it, I can't
| see unsee it.
| 0xbadcafebee wrote:
| Engineering started out as just some dudes who built things
| from gut feeling. After a whole lot of people died from poorly
| built things, they decided to figure out how to know ahead of
| time if it would kill people or not. They had to use math and
| science to figure that part out.
|
| Funny enough that happened with software too. People just build
| shit without any method to prove that it will not fall down /
| crash. They throw some code together, poke at it until it does
| something they wanted, and call that "stable". There is no
| science involved. There's some mathy bits called "computer
| science" / "software algorithms", but most software is not a
| math problem.
|
| Software engineering should really be called "Software
| Craftsmanship". We haven't achieved real engineering with
| software yet.
| slfnflctd wrote:
| You have a point, but it is also true that some software is
| _far_ more rigorously tested than other software. There are
| categories where it absolutely is both scientific and real
| engineering.
|
| I fully agree that the vast majority is not, though.
| AdieuToLogic wrote:
| This is such an unbelievably dismissive assertion, I don't
| even know where to start.
|
| To suggest, nay, explicitly _state_ :
| Engineering started out as just some dudes who built things
| from gut feeling. After a whole lot of people died
| from poorly built things, they decided to figure out
| how to know ahead of time if it would kill people or
| not.
|
| Is to demean those who made modern life possible. Say what
| you want about software developers and I would likely agree
| with much of the criticism.
|
| Not so the premise set forth above regarding engineering
| professions in general.
| 0xbadcafebee wrote:
| Surely you already know the history of professional
| engineers, then? How it's only a little over 118 years old?
| Mostly originating from the fact that it was charlatans
| claiming to be engineers, building things that ended up
| killing people, that inspired the need for a professional
| license?
|
| "The people who made modern life possible" were not
| professional engineers, often barely amateurs. Artistocrat
| polymaths who delved into cutting edge philosophy.
| Blacksmith craftsmen developing new engines by trial and
| error. A new englander who failed to study law at Yale,
| landed in the American South, and developed a modification
| of an Indian device for separating seed from cotton plants.
|
| In the literal historical sense, "engineering" was just the
| building of cannons in the 14th century. Since thousands of
| years before, up until now, there has always been a
| combination of the practice of building things with some
| kind of "science" (which itself didn't exist until a few
| hundred years ago) to try to estimate the result of an
| expensive, dangerous project.
|
| But these are not the people who made modern life people.
| Lots, and lots, and _lots_ of people made modern life
| possible. Not just builders and mathematicians.
| Receptionists. Interns. Factory workers. Farmers. Bankers.
| Sailors. Welders. Soldiers. So many professions, and
| people, whose backs and spirits were bent or broken, to
| give us the world we have today. Engineers don 't deserve
| any more credit than anyone else - especially considering
| how much was built before their professions were even
| established. Science is a process, and math is a tool, that
| is very useful, and even critical. But without the rest
| it's just numbers on paper.
| icsa wrote:
| Software Engineering is only about 60 years old - i.e. the
| term has existed. At the point in the history of civil
| engineering, they didn't even know what a right angle was.
| Civil engineers were able to provide much utility before
| the underlying theory was available. I do wonder about the
| safety of structures at the time.
| Henchman21 wrote:
| Total aside here:
|
| What about modern life is so great that we should laud its
| authors?
|
| Medical advances and generally a longer life is what comes
| to mind. But much of life is empty of meaning and devoid of
| purpose; this seems rife within the Western world. Living a
| longer life _in hell_ isn't something I would have chosen.
| signatoremo wrote:
| > But much of life is empty of meaning and devoid of
| purpose
|
| Maybe life is empty to you. You can't speak for other
| people.
|
| You also have no idea if pre-modern life was full of
| meaning and purpose. I'm sure someone from that time
| bemoaning the same.
|
| The people before modern time were much less well off.
| They had to work a lot harder to put food on the table. I
| imagine they didn't have a lot of time to wonder about
| the meaning of life.
| tim333 wrote:
| I imagine this kind of thing well help understand how human
| brains work, especially as AI gets better and more human like.
| nashashmi wrote:
| You seem to be glorifying humanity's failure to make good
| products and instead making products that just work well enough
| to pass through the gate.
|
| We have always been making products that were too difficult to
| understand by pencil and paper. So we invented debug tools. And
| then we made systems that were too big to understand so we made
| trace routes. And now we have products that are too
| statistically large to understand, so we are inventing ...
| whatever this is.
| anal_reactor wrote:
| It is absolutely incredible that we happened to live exactly in
| the times when the humanity is teaching a machine to actually
| think. As in, not in some metaphorical sense, but in the
| common, intuitive sense. Whether we're there yet or not is up
| to discussion, but it's clear to me that within 10 years
| maximum we'll have created programs that truly think and are
| aware.
|
| At the same time, I just can't bring myself to be interested in
| the topic. I don't feel excitement. I feel... indifference?
| Fear? Maybe the technology became so advanced that for normal
| people like myself it's indistinguishable from magic, and
| there's no point trying to comprehend it, just avoid it and
| pray it's not used against you. Or maybe I'm just getting old,
| and I'm experiencing what my mother experienced when she
| refused to learn how to use MS Office.
| hn_acc1 wrote:
| Yeah.. It's just not something that really excites me as a
| computer geek of 40+ years who started in the 80s with a 300
| baud modem. Still working as a coder in my 50s, and while I'm
| solving interesting problems, etc.. almost every technology
| these days seems to be focused on advertising, scraping /
| stealing other's data and repackaging it, etc. And I am using
| AI coding assistants, because, well, I have to to stay
| competitive.
|
| And these technologies come with a side helping of a large
| chance to REALLY mess up someone's life - who is going to
| argue with the database and WIN if it says you don't exist in
| this day and age? And that database is (databases are)
| currently under the control of incredibly petty sociopaths..
| Barrin92 wrote:
| _" we've built systems so complex we no longer fully understand
| them. We must now use scientific methods - originally designed
| to understand nature - to comprehend our own engineered
| creations._"
|
| Ted Chiang saw that one coming:
| https://www.nature.com/articles/35014679
| cuttothechase wrote:
| This is definitely a classic for story telling but it appears
| to be nothing more than hand wavy. Its a bit like there is the
| great and powerful man behind the curtain, lets trace the
| thought of this immaculate being you mere mortals.
| Anthropomorphing seems to be in an overdose mode with "thinking
| / thoughts", "mind" etc., scattered everywhere. Nothing with
| any of the LLMs outputs so far suggests that there is anything
| even close enough to a mind or a thought or anything really
| outside of vanity. Being wistful with good story telling does
| go a long way in the world of story telling but in actually
| understanding the science, I wouldn't hold my breath.
| colah3 wrote:
| Thanks for the feedback! I'm one of the authors.
|
| I just wanted to make sure you noticed that this is linking
| to an accessible blog post that's trying to communicate a
| research result to a non-technical audience?
|
| The actual research result is covered in two papers which you
| can find here:
|
| - Methods paper: https://transformer-
| circuits.pub/2025/attribution-graphs/met...
|
| - Paper applying this method to case studies in Claude 3.5
| Haiku: https://transformer-circuits.pub/2025/attribution-
| graphs/bio...
|
| These papers are jointly 150 pages and are quite technically
| dense, so it's very understandable that most commenters here
| are focusing on the non-technical blog post. But I just
| wanted to make sure that you were aware of the papers, given
| your feedback.
| hustwindmaple1 wrote:
| Really appreciate your team's enormous efforts in this
| direction, not only the cutting edge research (which I
| don't see OAI/DeepMind publishing any paper on) but aslo
| making the content more digestible for non-research
| audience. Please keep up the great work!
| AdieuToLogic wrote:
| The post to which you replied states:
| Anthropomorphing[sic] seems to be in an overdose mode with
| "thinking / thoughts", "mind" etc., scattered everywhere.
| Nothing with any of the LLMs outputs so far suggests that
| there is anything even close enough to a mind or a thought
| or anything really outside of vanity.
|
| This is supported by reasonable interpretation of the cited
| article.
|
| Considering the two following statements made in the reply:
| I'm one of the authors.
|
| And These papers are jointly 150 pages and
| are quite technically dense, so it's very
| understandable that most commenters here are
| focusing on the non-technical blog post.
|
| The onus of clarifying the article's assertions:
| Knowing how models like Claude *think* ...
|
| And Claude sometimes thinks in a conceptual
| space that is shared between languages, suggesting
| it has a kind of universal "language of thought."
|
| As it pertains to anthropomorphizing an algorithm (a.k.a.
| stating it "thinks") is on the author(s).
| Workaccount2 wrote:
| Thinking and thought have no solid definition. We can't
| say Claude doesn't "think" because we don't even know
| what a human thinking actually is.
|
| Given the lack of a solid definition for thinking and
| test to measure it, I think using the terminology
| colloquially is a totally fair play.
| EncomLab wrote:
| No one says that a thermostat is "thinking" of turning on
| the furnace, or that a nightlight is "thinking it is dark
| enough to turn the light on". You are just being obtuse.
| pipes wrote:
| Or submarines swim ;)
| madethisnow wrote:
| think about it more
| geye1234 wrote:
| Yes. A thermostat involves a change of state from A to B.
| A computer is the same: its state at t causes its state
| at t+1, which causes its state at t+2, and so on. Nothing
| else is going on. An LLM is no different: an LLM is
| simply a computer that is going through particular
| states.
|
| Thought is not the same as a change of (brain) state.
| Thought is certainly associated with change of state, but
| can't be reduced to it. If thought could be reduced to
| change of state, then the validity/correctness/truth of a
| thought could be judged with reference to its associated
| brain state. Since this is impossible (you don't judge
| whether someone is right about a math problem or an
| empirical question by referring to the state of his
| neurology at a given point in time), it follows that an
| LLM can't think.
| Workaccount2 wrote:
| >Thought is certainly associated with change of state,
| but can't be reduced to it.
|
| You can effectively reduce continuously dynamic systems
| to discreet steps. Sure, you can always say that the
| "magic" exists between the arbitrarily small steps, but
| from a practical POV there is no difference.
|
| A transistor has a binary on or off. A neuron might have
| ~infinite~ levels of activation.
|
| But in reality the ~infinite~ activation level can be
| perfectly modeled (for all intents and purposes), and
| computers have been doing this for decades now (maybe not
| with neurons, but equivalent systems). It might seem like
| an obvious answer, that there is special magic in analog
| systems that binary machines cannot access, but that is
| wholly untrue. Science and engineering have been
| _extremely_ successful interfacing with the analog
| reality we live in, precisely because the digital /analog
| barrier isn't too big of a deal. Digital systems can do
| math, and math is capable of modeling analog systems, no
| problem.
| geye1234 wrote:
| It's not a question of discrete vs continuous, or digital
| vs analog. Everything I've said could also apply if a
| transistor could have infinite states.
|
| Rather, the point is that the state of our brain is not
| the same as the content of our thoughts. They are
| associated with one another, but they're not the same.
| And the correctness of a thought can be judged only by
| reference to its content, not to its associated state.
| 2+2=4 is correct, and 2+2=5 is wrong; but we know this
| through looking at the content of these thoughts, not
| through looking at the neurological state.
|
| But the state of the transistors (and other components)
| is _all_ a computer has. There are no thoughts, no
| content, associated with these states.
| Workaccount2 wrote:
| It seems that the only barrier between brain state and
| thought contents is a proper measurement tool and
| decoder, no?
|
| We can already do this at an extremely basic level,
| mapping brain states to thoughts. The paraplegic patient
| using their thoughts to move the mouse cursor or the
| neuroscientist mapping stress to brain patterns.
|
| If I am understanding your position correctly, it seems
| that the differentiation between thoughts and brain
| states is a practical problem not a fundamental one.
| Ironically, LLMs have a very similar problem with it
| being very difficult to correlate model states with model
| outputs. [1]
|
| [1]https://www.anthropic.com/research/mapping-mind-
| language-mod...
| geye1234 wrote:
| There is undoubtedly correlation between neurological
| state and thought content. But they are not the same
| thing. Even if, theoretically, one could map them
| perfectly (which I doubt is possible but it doesn't
| affect my point), they would remain entirely different
| things.
|
| The thought that "2+2=4", or the thought "tiger", are not
| the same thing as the brain states that makes them up. A
| tiger, or the thought of a tiger, is different from the
| neurological state of a brain that is thinking about a
| tiger. And as stated before, we can't say that "2+2=4" is
| _correct_ by referring to the brain state associated with
| it. We need to refer to the thought itself to do this. It
| is not a practical problem of mapping; it is that brain
| states and thoughts are two entirely different things,
| however much they may correlate, and whatever causal
| links may exist between them.
|
| This is not the case for LLMs. Whatever problems we may
| have in recording the state of the CPUs/GPUs are entirely
| practical. There is no 'thought' in an LLM, just a state
| (or plurality of states). An LLM can't think about a
| tiger. It can only switch on LEDs on a screen in such a
| way that _we_ associate the image /word with a tiger.
| PaulDavisThe1st wrote:
| > The thought that "2+2=4", or the thought "tiger", are
| not the same thing as the brain states that makes them
| up.
|
| Asserted without evidence. Yes, this does represent a
| long and occasionally distinguished line of thinking in
| cognitive science/philosophy of mind, but it is certainly
| not the only one, and some of the others categorically
| refute this.
| geye1234 wrote:
| Is it your contention that a tiger may be the same thing
| as a brain state?
|
| It would seem to me that any coherent philosophy of mind
| must accept their being different as a datum; or
| conversely, any that implied their not being different
| would have to be false.
|
| EDIT: my position has been held -- even taken as
| axiomatic -- by the vast majority of philosophers, from
| the pre-Socratics onwards, and into the 20th century. So
| it's not some idiosyncratic minority position.
| Workaccount2 wrote:
| Does a picture of a tiger or a tiger (to follow your
| sleight of hand) on a hard drive then count as a thought?
| geye1234 wrote:
| No. One is paint on canvas, and the other is part of a
| causal chain that makes LEDs light up in a certain way.
| Neither the painting nor the computer have thoughts about
| a tiger in the way we do. It is the human mind that makes
| the link between picture and real tiger (whether on
| canvas or on a screen).
| og_kalu wrote:
| >Rather, the point is that the state of our brain is not
| the same as the content of our thoughts.
|
| Based on what exactly ? This is just an assertion. One
| that doesn't seem to have much in the way of evidence.
| 'It's not the same trust me bro' is the thesis of your
| argument. Not very compelling.
| geye1234 wrote:
| It's not difficult. When you think about a tiger, you are
| not thinking about the brain state associated with said
| thought. A tiger is different from a brain state.
|
| We can safely generalize, and say the content of a
| thought is different from its associated brain state.
|
| Also, as I said
|
| >> The correctness of a thought can be judged only by
| reference to its content, not to its associated state.
| 2+2=4 is correct, and 2+2=5 is wrong; but we know this
| through looking at the content of these thoughts, not
| through looking at the neurological state.
|
| This implies that state != content.
| og_kalu wrote:
| >It's not difficult. When you think about a tiger, you
| are not thinking about the brain state associated with
| said thought. A tiger is different from a brain state. We
| can safely generalize, and say the content of a thought
| is different from its associated brain state.
|
| Just because you are not thinking about a brain state
| when you think about a tiger does not mean that your
| thought is not a brain state.
|
| Just because the experience of thinking about X doesn't
| feel like the experience of thinking about Y (or doesn't
| feel like the physical process Z), it doesn't logically
| follow that the mental event of thinking about X isn't
| identical to or constituted by the physical process Z.
| For example, seeing the color red doesn't feel like
| processing photons of a specific wavelength with cone
| cells and neural pathways, but that doesn't mean the
| latter isn't the physical basis of the former.
|
| >> The correctness of a thought can be judged only by
| reference to its content, not to its associated state.
| 2+2=4 is correct, and 2+2=5 is wrong; but we know this
| through looking at the content of these thoughts, not
| through looking at the neurological state. This implies
| that state != content.
|
| Just because our current method of verification focuses
| on content doesn't logically prove that the content isn't
| ultimately realized by or identical to a physical state.
| It only proves that analyzing the state is not our
| current practical method for judging mathematical
| correctness.
|
| We judge if a computer program produced the correct
| output by looking at the output on the screen (content),
| not usually by analyzing the exact pattern of voltages in
| the transistors (state). This doesn't mean the output
| isn't ultimately produced by, and dependent upon, those
| physical states. Our method of verification doesn't
| negate the underlying physical reality.
|
| When you evaluate "2+2=4", your brain is undergoing a
| sequence of states that correspond to accessing the
| representations of "2", "+", "=", applying the learned
| rule (also represented physically), and arriving at the
| representation of "4". The process of evaluation operates
| on the represented content, but the entire process,
| including the representation of content and rules, is a
| physical neural process (a sequence of brain states).
| geye1234 wrote:
| > Just because you are not thinking about a brain state
| when you think about a tiger does not mean that your
| thought is not a brain state.
|
| > It doesn't logically follow that the mental event of
| thinking about X isn't identical to or constituted by the
| physical process Z.
|
| That's logically sound insofar as it goes. But firstly,
| the existence of a brain state for a given thought is,
| obviously, not proof that a thought _is_ a brain state.
| Secondly, if you say that a thought about a tiger is a
| brain state, and nothing more than a brain state, then
| you have the problem of explaining how it is that your
| thought is about a tiger at all. It is the content of a
| thought that makes it be about reality; it is the content
| of a thought about a tiger that makes it be about a
| tiger. If you declare that a thought _is_ its state, then
| it can 't be about a tiger.
|
| You can't equate content with state, and nor can you make
| content be reducible to state, without absurdity. The
| first implies that a tiger is the same as a brain state;
| the second implies that you're not really thinking about
| a tiger at all.
|
| Similarly for arithmetic. It is only the content of a
| thought about arithmetic that makes it be right or wrong.
| It is our ideas of "2", "+", and so on, that make the sum
| right or wrong. The brain states have nothing to do with
| it. If you want to declare that content is state, and
| nothing more than state, then you have no way of saying
| the one sum is right, and the other is wrong.
| Workaccount2 wrote:
| Please, take the pencil and draw the line between
| thinking and non-thinking systems. Hell I'll even take a
| line drawn between thinking and non-thinking organisms if
| you have some kind of bias towards sodium channel logic
| over silicon trace logic. Good luck.
| geye1234 wrote:
| Even if you can't define the exact point that A becomes
| not-A, it doesn't follow that there is no distinction
| between the two. Nor does it follow that we can't know
| the difference. That's a pretty classic fallacy.
|
| For example, you can't name the exact time that day
| becomes night, but it doesn't follow that there is no
| distinction.
|
| A bunch of transistors being switched on and off, no
| matter how many there are, is no more an example of
| thinking than a single thermostat being switched on and
| off. OTOH, if _we_ can 't think, then this conversation
| and everything you're saying and "thinking" is
| meaningless.
|
| So even without a _complete_ definition of thought, we
| can see that there is a distinction.
| Workaccount2 wrote:
| Looks like we replied to each others comments at the same
| time, haha
| PaulDavisThe1st wrote:
| > For example, you can't name the exact time that day
| becomes night, but it doesn't follow that there is no
| distinction.
|
| There is actually a very detailed set of definitions of
| the multiple stages of twilight, including the last one
| which defines the onset of what everyone would agree is
| "night".
|
| The fact that a phenomena shows a continuum by some
| metric does not mean that it is not possible to identify
| and label points along that continuum and attach meaning
| to them.
| EncomLab wrote:
| Your assertion that sodium channel logic and silicon
| trace logic are 100% identical is the primary problem.
| It's like claiming that a hydraulic cylinder and a bicep
| are 100% equivalent because they both lift things - they
| are not the same in any way.
| Workaccount2 wrote:
| People chronically get stuck in this pit. Math is
| substrate independent. If the process is physical (i.e.
| doesn't draw on magic) then it can be expressed with
| mathematics. If it can be expressed with mathematics,
| anything that does math can compute it.
|
| The math is putting the crate up on the rack. The crate
| doesn't act any different based on how it got up there.
| xp84 wrote:
| Honestly, arguing seems futile when it comes to opinions
| like GP. Those opinions resemble religious zealotry to me
| in that they take for granted that only humans can think.
| Any determinism of any kind in a non-human is seized upon
| as proof its mere clockwork, yet they can't explain how
| humans think in order to contrast it.
| astrange wrote:
| I, uh, think, that "think" is a fine metaphor but "planning
| ahead" is a pretty confusing one. It doesn't have the
| capability to plan ahead because there is nowhere to put a
| plan and no memory after the token output, assuming the
| usual model architecture.
|
| That's like saying a computer program has planned ahead if
| it's at the start of a function and there's more of the
| function left to execute.
| rob74 wrote:
| Yup... well, if the research is conducted (or sponsored) by
| the company that develops and sells the LLM, of course there
| will be a temptation to present their product in a better
| light and make it sound like more than it actually is. I
| mean, the anthropomorphization starts already with the
| company name and giving the company's LLM a human name...
| cbolton wrote:
| I think that's a very unfair take. As a summary for non-
| experts I found it did a great job of explaining how by
| analyzing activated features in the model, you can get an
| idea of what it's doing to produce the answer. And also how
| by intervening to change these activations manually you can
| test hypotheses about causality.
|
| It sounds like you don't like anthropomorphism. I can relate,
| but I don't get where _Its a bit like there is the great and
| powerful man behind the curtain, lets trace the thought of
| this immaculate being you mere mortals_ is coming from. In
| most cases the anthropomorphisms are just the standard way to
| convey the idea briefly. Even then I liked how they sometimes
| used scare quotes as in _it began "thinking" of potential on-
| topic words_. There are some more debatable anthropomorphisms
| such as "in its head" where they use scare quotes
| systematically.
|
| Also given that they took inspiration from neuroscience to
| develop a technique that appears successful in analyzing
| their model, I think they deserve some leeway on the
| anthropomorphism front. Or at least on the "biological
| metaphors" front which is maybe not really the same thing.
|
| I used to think biological metaphors for LLMs were
| misleading, but I'm actually revising this opinion now. I
| mean I still think the past metaphors I've seen were
| misleading, but here, seeing the activation pathways they
| were able to identify, including the inhibitory circuits, and
| knowing a bit about similar structures in the brain I find
| the metaphor appropriate.
| frontfor wrote:
| I don't think this is as profound as you made out to be. Most
| complex systems are incomprehensible to the majority of
| population anyway, so from a practical standpoint AI is no
| different. There's also no single theory for how the financial
| markets work and yet market participants trade and make money
| nonetheless. And yes, we created the markets.
| chpatrick wrote:
| I would say we engineered the system that trained them but we
| never really understood the data (human thinking).
| dukeofdoom wrote:
| Not that I disagree with you. But Humans have a tendency to do
| things beyond their comprehension often. I take it you've never
| been fishing before and tied your line in a knot.
| trhway wrote:
| > to comprehend our own engineered creations.
|
| The comprehend part may never happen. At least by our own mind.
| We'll sooner build the mind which is going to do that
| comprehension:
|
| "To scale to the thousands of words supporting the complex
| thinking chains used by modern models, we will need to improve
| both the method and (perhaps with AI assistance) how we make
| sense of what we see with it"
|
| Yes, that AI assistance, meta self reflection, is going to
| probably be a way if not right to the AGI, at least very
| significant step toward it.
| BOOSTERHIDROGEN wrote:
| If only this profound mechanism can be easily testable for
| social interaction.
| MathMonkeyMan wrote:
| In a sense this has been true of conventional programs for a
| while now. Gerald Sussman discusses the idea when talking about
| why MIT switched their introductory programming course from
| Scheme to Python: <https://youtu.be/OgRFOjVzvm0?t=239>.
| EGreg wrote:
| I think it's pretty obvious what these models do in some cases.
|
| Try asking them to write a summary at the beginning of their
| answer. The summary is basically them trying to make something
| plausible-sounding but they aren't actually going back and
| summarizing.
|
| LLMs are basically a building block in a larger software. Just
| like any library or framework. You shouldn't expect them to be
| a hammer for every nail. But they can now enable so many
| different applications, including natural language interfaces,
| better translations and so forth. But then you're supposed to
| have them output JSON to be used in building artifacts like
| Powerpoints. Has anyone implemented that yet?
| hansmayer wrote:
| If you don't mind - based on what will this "paper" become a
| classic? Was it published in a well known scientific magazine,
| after undergoing a stringent peer-review process, because it is
| setting up and proving a new scientific hypothesis? Because
| this is what scientific papers look like. I struggle to
| identify any of those characteristics, except for being dense
| and hard to read, but that's more of a correlation, isn't it?
| rcxdude wrote:
| That's basically how engineering works if you're doing anything
| at all novel: you will have some theory which informs your
| design, then you build it, then you test it and basically need
| to do science to figure out how it's perfoming, and most
| likely, why it's not working properly, and then iterate. I do
| engineering, but doing science has been a core part of almost
| every project I've worked on. (heck, even debugging code is
| basically science). There's just different degrees in different
| projects as to how much you understand about how the system
| you're designing actually works, and ML is an area where
| there's an unusual ratio of visibility (you can see all of the
| weights and calculations in the network precisely) to
| understanding (i.e. there's relatively little in terms of
| mathematical theory that precisely describe how a model trains
| and operates, just a bunch of approximations which can be
| somewhat justified, which is where a lot of engineering work
| sits)
| madethisnow wrote:
| psychology
| mdnahas wrote:
| I like your definitions! My personal definition of science is
| learning rules that predict the future, given the present
| state. And my definition of engineering is arranging the
| present state to control the future.
|
| I don't think it's unusual for engineering creations to need
| new science to understand them. When metal parts broke, humans
| studied metallurgy. When engines exploded, we studied the
| remains. With that science, we could engineer larger, longer
| lasting, more powerful devices.
|
| Now, we're finding flaws in AI and diagnosing their causes. And
| soon able to build better ones.
| aithrowawaycomm wrote:
| I struggled reading the papers - Anthropic's white papers reminds
| me of Stephen Wolfram, where it's a huge pile of suggestive
| empirical evidence, but the claims are extremely vague - no
| definitions, just vibes - the empirical evidence seems
| selectively curated, and there's not much effort spent building a
| coherent general theory.
|
| Worse is the impression that they are begging the question. The
| rhyming example was especially unconvincing since they didn't
| rule out the possibility that Claude activated "rabbit" simply
| because it wrote a line that said "carrot"; later Anthropic
| claimed Claude was able to "plan" when the concept "rabbit" was
| replaced by "green," but the poem fails to rhyme because Claude
| arbitrarily threw in the word "green"! What exactly was the plan?
| It looks like Claude just hastily autocompleted. And Anthropic
| made zero effort to reproduce this experiment, so how do we know
| it's a general phenomenon?
|
| I don't think either of these papers would be published in a
| reputable journal. If these papers are honest, they are
| incomplete: they need more experiments and more rigorous
| methodology. Poking at a few ANN layers and making sweeping
| claims about the output is not honest science. But I don't think
| Anthropic is being especially honest: these are pseudoacademic
| infomercials.
| TimorousBestie wrote:
| Agreed. They've discovered _something_ , that's for sure, but
| calling it "the language of thought" without concrete evidence
| is definitely begging the question.
| og_kalu wrote:
| >The rhyming example was especially unconvincing since they
| didn't rule out the possibility that Claude activated "rabbit"
| simply because it wrote a line that said "carrot"
|
| I'm honestly confused at what you're getting at here. It
| doesn't matter why Claude chose rabbit to plan around and in
| fact likely did do so because of carrot, the point is that it
| thought about it beforehand. The rabbit concept is present as
| the model is about to write the first word of the second line
| even though the word rabbit won't come into play till the end
| of the line.
|
| >later Anthropic claimed Claude was able to "plan" when the
| concept "rabbit" was replaced by "green," but the poem fails to
| rhyme because Claude arbitrarily threw in the word "green"!
|
| It's not supposed to rhyme. That's the point. They forced
| Claude to plan around a line ender that doesn't rhyme and it
| did. Claude didn't choose the word green, anthropic replaced
| the concept it was thinking ahead about with green and saw that
| the line changed accordingly.
| suddenlybananas wrote:
| >They forced Claude to plan around a line ender that doesn't
| rhyme and it did. Claude didn't choose the word green,
| anthropic replaced the concept it was thinking ahead about
| with green and saw that the line changed accordingly.
|
| I think the confusion here is from the extremely loaded word
| "concept" which doesn't really make sense here. At best, you
| can say that Claude planned that the next line would end with
| the _word_ rabbit and that by replacing the internal
| representation of that word with another _word_ lead the
| model to change.
| TeMPOraL wrote:
| I wonder how many more years will pass, and how many more
| papers will Anthropic have to release, before people
| realize that _yes, LLMs model concepts directly_ ,
| separately from words used to name those concepts. This has
| been apparent for years now.
|
| And at least in the case discussed here, this is even
| _shown in the diagrams in the submission_.
| FeepingCreature wrote:
| We'll all be living in a Dyson swarm around the sun as
| the AI eats the solar system around us and people will
| still be confident that it doesn't really think at all.
| aithrowawaycomm wrote:
| > Here, we modified the part of Claude's internal state that
| represented the "rabbit" concept. When we subtract out the
| "rabbit" part, and have Claude continue the line, it writes a
| new one ending in "habit", another sensible completion. We
| can also inject the concept of "green" at that point, causing
| Claude to write a sensible (but no-longer rhyming) line which
| ends in "green". This demonstrates both planning ability and
| adaptive flexibility--Claude can modify its approach when the
| intended outcome changes.
|
| This all seems explainable via shallow next-token prediction.
| Why is it that subtracting the concept means the system can
| adapt and create a new rhyme instead of forgetting about the
| -bit rhyme, but overriding it with green means the system
| cannot adapt? Why didn't it say "green habit" or something?
| It seems like Anthropic is having it both ways: Claude
| continued to rhyme after deleting the concept, which
| demonstrates planning, but also Claude coherently filled in
| the "green" line despite it not rhyming, which...also
| demonstrates planning? Either that concept is "last word" or
| it's not! There is a tension that does not seem coherent to
| me, but maybe if they had n=2 instead of n=1 examples I would
| have a clearer idea of what they mean. As it stands it feels
| arbitrary and post hoc. More generally, they failed to rule
| out (or even consider!) that well-tuned-but-dumb next-token
| prediction explains this behavior.
| og_kalu wrote:
| >Why is it that subtracting the concept means the system
| can adapt and create a new rhyme instead of forgetting
| about the -bit rhyme,
|
| Again, the model has the first line in context and is then
| asked to write the second line. It is at the start of the
| second line that the concept they are talking about is
| 'born'. The point is to demonstrate that Claude thinks
| about what word the 2nd line should end with and starts
| predicting the line based on that.
|
| It doesn't forget about the -bit rhyme because that doesn't
| make any sense, the first line ends with it and you just
| asked it to write the 2nd line. At this point the model is
| still choosing what word to end the second line in (even
| though rabbit has been suppressed) so of course it still
| thinks about a word that rhymes with the end of the first
| line.
|
| The 'green' but is different because this time, Anthropic
| isn't just suppressing one option and letting the model
| choose from anything else, it's directly hijacking the
| first choice and forcing that to be something else. Claude
| didn't choose green, Anthropic did. That it still predicted
| a sensible line is to demonstrate that this concept they
| just hijacked is indeed responsible for determining how
| that line plays out.
|
| >More generally, they failed to rule out (or even
| consider!) that well-tuned-but-dumb next-token prediction
| explains this behavior.
|
| They didn't rule out anything. You just didn't understand
| what they were saying.
| danso wrote:
| tangent: this is the second time today I've seen an HN
| commenter use "begging the question" with its original meaning.
| I'm sorry to distract with a non-helpful reply, it's just I
| can't remember the last time I've seen that phrase in the wild
| to refer to a logical fallacy -- even begsthequestion.info [0]
| has given up the fight.
|
| (I don't mind language evolving over time, but I also think we
| need to save the precious few phrases we have for describing
| logical fallacies)
|
| [0]
| https://web.archive.org/web/20220823092218/http://begtheques...
| smath wrote:
| Reminds me of the term 'system identification' from old school
| control systems theory, which meant poking around a system and
| measuring how it behaves, - like sending an input impulse and
| measuring its response, does it have memory, etc.
|
| https://en.wikipedia.org/wiki/System_identification
| Loic wrote:
| It is not old school, this is my daily job and we need even
| more of it with the NN models used in MPC.
| nomel wrote:
| I've looked into using NN for some of my specific work, but
| making sure output is bounded ends up being such a big issue
| that the very code/checks required to make sure it's within
| acceptable specs, in a deterministic way, ends up being _an
| acceptable solution_ , making the NN unnecessary.
|
| How do you handle that sort of thing? Maybe main process then
| leave some relatively small residual to the NN?
|
| Is your poking more like "fuzzing", where you just perturb
| all the input parameters in a relatively "complete" way to
| try to find if anything goes wild?
|
| I'm very interested in the details behind "critical" type use
| cases of NN, which I've never been able to stomach in my
| work.
| lqr wrote:
| This paper may be interesting to you. It touches on several
| of the topics you mentioned:
|
| https://www.science.org/doi/10.1126/scirobotics.abm6597
| Loic wrote:
| For us, the NN is used in a grey box model for MPC in
| chemical engineering. The factories we control have
| relatively long characteristic time, together with all the
| engineering bounds, we can use the NN to model parts of the
| equipment from raw DCS data. The NN modeled parts are
| usually not the most critical (we are 1st principles based
| for them) but this allows us to quickly fit/deploy a new
| MPC in production.
|
| Faster time to market/production is for us the main
| reason/advantage of the approach.
| rangestransform wrote:
| is it even possible to prove the stability of a controller
| with a DNN motion model?
| jacooper wrote:
| So it turns out, it's not just simple next token generation,
| there is intelligence and self developed solution methods
| (Algorithms) in play, particularly in the math example.
|
| Also the multi language finding negates, at least partially, the
| idea that LLMs, at least large ones, don't have an understanding
| of the world beyond the prompt.
|
| This changed my outlook regarding LLMs, ngl.
| kazinator wrote:
| > _Claude writes text one word at a time. Is it only focusing on
| predicting the next word or does it ever plan ahead?_
|
| When a LLM outputs a word, it commits to that word, without
| knowing what the next word is going to be. Commits meaning once
| it settles on that token, it will not backtrack.
|
| That is kind of weird. Why would you do that, and how would you
| be sure?
|
| People can sort of do that too. Sometimes?
|
| Say you're asked to describe a 2D scene in which a blue triangle
| partially occludes a red circle.
|
| Without thinking about the relationship of the objects at all,
| you know that your first word is going to be "The" so you can
| output that token into your answer. And then that the sentence
| will need a subject which is going to be "blue", "triangle". You
| can commit to the tokens "The blue triangle" just from knowing
| that you are talking about a 2D scene with a blue triangle in it,
| without considering how it relates to anything else, like the red
| circle. You can perhaps commit to the next token "is", if you
| have a way to express any possible relationship using the word
| "to be", such as "the blue circle is partially covering the red
| circle".
|
| I don't think this analogy necessarily fits what LLMs are doing.
| kazinator wrote:
| By the way, there was recently a HN submission about a project
| studying using diffusion models rather than LLM for token
| prediction. With diffusion, tokens aren't predicted strictly
| left to right any more; there can be gaps that are backfilled.
| But: it's still essentially the same, I think. Once that type
| of model settles on a given token at a given position, it
| commits to that. Just more possible permutations of the token
| filling sequence have ben permitted.
| pants2 wrote:
| > it commits to that word, without knowing what the next word
| is going to be
|
| Sounds like you may not have read the article, because it's
| exploring exactly that relationship and how LLMs will often
| have a 'target word' in mind that it's working toward.
|
| Further, that's partially the point of thinking models,
| allowing LLMs space to output tokens that it doesn't have to
| commit to in the final answer.
| kazinator wrote:
| That makes no difference. At some point it decides that it
| has predicted the word, and outputs it, and then it will not
| backtrack over it. Internally it may have predicted some
| other words and backtracked over those. But the fact it is,
| accepts a word, without being sure what the next one will be
| and the one after that and so on.
|
| Externally, it manifests the generation of words one by one,
| with lengthy computation in between.
|
| It isn't ruminating over, say, a five word sequence and then
| outputting five words together at once when that is settled.
| bonoboTP wrote:
| > It isn't ruminating over, say, a five word sequence and
| then outputting five words together at once when that is
| settled.
|
| True, and it's a good intuition that some words are much
| more complicated to generate than others and obviously
| should require more computation than some other words. For
| example if the user asks a yes/no question, ideally the
| answer should start with "Yes" or with "No", followed by
| some justification. To compute this first token, it can
| only do a single forward pass and must decide the path to
| take.
|
| But this is precisely why chain-of-thought was invented and
| later on "reasoning" models. These take it "step by step"
| and generate sort of stream of consciousness monologue
| where each word follows more smoothly from the previous
| ones, not as abruptly as immediately pinning down a Yes or
| a No.
|
| But if you want explicit backtracking, people have also
| done that years ago
| (https://news.ycombinator.com/item?id=36425375).
|
| LLMs are an extremely well researched space where armies of
| researchers, engineers, grad and undergrad students,
| enthusiasts and everyone in between has been coming up with
| all manners of ideas. It is highly unlikely that you can
| easily point to some obvious thing they missed.
| hycpax wrote:
| > When a LLM outputs a word, it commits to that word, without
| knowing what the next word is going to be.
|
| Please, people, read before you write. Both the article and the
| paper explain that that's not how it works.
|
| 'One token at a time' is how a model generates its output, not
| how it comes up with that output.
|
| > That is kind of weird. Why would you do that, and how would
| you be sure?
|
| The model is sure because it doesn't just predict the next
| token. Again, the paper explains it.
| XenophileJKO wrote:
| This was obvious to me very early with GPT-3.5-Turbo..
|
| I created structured outputs with very clear rules and
| process. That if followed would funnel behavior the way I
| wanted it.. and low and behold the model would anticipate
| preconditions that would allow it to hallucinate a certain
| final output and the model would push those back earlier in
| the output. The model had effectively found wiggle room in
| the rules and injected the intermediate value into the field
| that would then be used later in the process to build the
| final output.
|
| The instant I saw it doing that, I knew 100% this model
| "plans"/anticipates way earlier than I thought originally.
| kazinator wrote:
| > _'One token at a time' is how a model generates its output,
| not how it comes up with that output._
|
| I do not believe you are correct.
|
| Now, yes, when we write printf("Hello, world\n"), of course
| the characters 'H', 'e', ... are output one at a time into
| the stream. But the program has the string all at once. It
| was prepared before the program was even run.
|
| This is not what LLMs are doing with tokens; they have not
| prepared a batch of tokens which they are shifting out left-
| to-right from a dumb buffer. They output a token when they
| have calculated it, and are sure that the token will not have
| to be backtracked over. In doing so they might have
| calculated additional tokens, and backtracked over _those_ ,
| sure, and undoubtedly are carrying state from such activities
| into the next token prediction.
|
| But the fact is they reach a decision where they commit to a
| certain output token, and have not yet committed to what the
| next one will be. Maybe it's narrowed down already to only a
| few candidates; but that doesn't change that there is a sharp
| horizon between committed and unknown which moves from left
| to right.
|
| Responses can be large. Think about how mind boggling it is
| that the machine can be sure that the first 10 words of a
| 10,000 word response are the right ones (having put them out
| already beyond possibility of backtracking), at a point where
| it has no idea what the last 10 will be. Maybe there are some
| activations which are narrowing down what the second batch of
| 10 words will be, but surely the last ones are distant.
| encypherai wrote:
| That's a really interesting point about committing to words one
| by one. It highlights how fundamentally different current LLM
| inference is from human thought, as you pointed out with the
| scene description analogy. You're right that it feels odd, like
| building something brick by brick without seeing the final
| blueprint. To add to this, most text-based LLMs do currently
| operate this way. However, there are emerging approaches
| challenging this model. For instance, Inception Labs recently
| released "Mercury," a text-diffusion coding model that takes a
| different approach by generating responses more holistically.
| It's interesting to see how these alternative methods address
| the limitations of sequential generation and could potentially
| lead to faster inference and better contextual coherence. It'll
| be fascinating to see how techniques like this evolve!
| kazinator wrote:
| But as I noted yesterday in a follow-up comment to my own
| above, the diffusion-based approaches to text response
| generation still generate tokens one at a time. Just not in
| strict left-to-right order. So that looks the same; they
| commit to a token in some position, possibly preceded by
| gaps, and then calculate more tokens,
| bonoboTP wrote:
| While the output is a single word (more precisely, token), the
| internal activations are very high dimensional and can already
| contain information related to words that will only appear
| later. This information is just not given to the output at the
| very last layer. You can imagine the internal feature vector as
| encoding the entire upcoming sentence/thought/paragraph/etc.
| and the last layer "projects" that down to whatever the next
| word (token) has to be to continue expressing this "thought".
| kazinator wrote:
| But the activations at some point lead to a 100% confidence
| that the right word has been identified for the current slot.
| That is output, and it proceeds to the next one.
|
| Like for a 500 token response, at some point it was certain
| that the first 25 words are the right ones, such that it
| won't have to take any of them back when eventually
| calculating the last 25.
| bonoboTP wrote:
| This is true, but it doesn't mean that it decided those
| first 25 without "considering" whether those 25 can be
| afterwards continued meaningfully with further 25. It does
| have some internal "lookahead" and generates things that
| "lead" somewhere. The rhyming example from the article is a
| great choice to illustrate this.
| polygot wrote:
| There needs to be some more research on what path the model takes
| to reach its goal, perhaps there is a lot of overlap between this
| and the article. The most efficient way isn't always the best
| way.
|
| For example, I asked Claude-3.7 to make my tests pass in my C#
| codebase. It did, however, it wrote code to detect if a test
| runner was running, then return true. The tests now passed, so,
| it achieved the goal, and the code diff was very small (10-20
| lines.) The actual solution was to modify about 200-300 lines of
| code to add a feature (the tests were running a feature that did
| not yet exist.)
| felbane wrote:
| Ah yes, the "We have a problem over there/I'll just delete
| 'over there'" approach.
| polygot wrote:
| I've also had this issue, where failing tests are deleted to
| make all the tests pass, or, it mocks a failing HTTP request
| and hardcodes it to 200 OK.
| ctoth wrote:
| Reward hacking, as predicted over and over again. You hate
| to see it. Let him with ears &c.
| brulard wrote:
| That is called "Volkswagen" testing. Some years ago that
| automaker had mechanism in cars which detected when the vehicle
| was being examined and changed something so it would pass the
| emission tests. There are repositories on github that make fun
| of it.
| rsynnott wrote:
| While that's the most famous example, this sort of cheating
| is much older than that. In the good old days before 3d
| acceleration, graphics card vendors competed mostly on 2d
| acceleration. This mostly involved routines to accelerate
| drawing Windows windows and things, and benchmarks tended to
| do things like move windows round really fast.
|
| It was somewhat common for card drivers to detect that a
| benchmark was running, and just fake the whole thing; what
| was being drawn on the screen was wrong, but since the
| benchmarks tended to be a blurry mess anyway the user would
| have a hard time realising this.
| hn_acc1 wrote:
| Pretty sure at least one vendor was accused of cheating on
| 3D-Mark at times as well.
| Cyphase wrote:
| https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal
| phobeus wrote:
| This looks like the very complaint of "specification gaming". I
| was wondering how will it show up in llm's...looks like this is
| the way it presented itself..
| TeMPOraL wrote:
| I'm gonna guess GP used a rather short prompt. At least
| that's what happens when people heavily underspecify what
| they want.
|
| It's a communication issue, and it's true with LLMs as much
| as with humans. Situational context and life experience
| papers over a lot of this, and LLMs are getting better at the
| equivalent too. They get trained to better read absurdly
| underspecified, relationship-breaking requests of the "guess
| what I want" flavor - like when someone says, "make this test
| pass", they don't _really_ mean "make this test pass", they
| mean "make this test into something that seems useful, which
| might include implementing the feature it's exercising if it
| doesn't exist yet".
| polygot wrote:
| My prompt was pretty short, I think it was "Make these
| tests pass". Having said that, I wouldn't mind if it asked
| me for clarification before proceeding.
| pton_xd wrote:
| Similar experience -- asked it to find and fix a bug in a
| function. It correctly identified the general problem but
| instead of fixing the existing code it re-implemented part of
| the function again, below the problematic part. So now there
| was a buggy while-loop, followed by a very similar but not
| buggy for-loop. An interesting solution to say the least.
| airstrike wrote:
| I think Claude-3.7 is particularly guilty of this issue. If
| anyone from Anthropic is reading this, you might want to put
| your thumb on the scale so to speak the next time you train the
| model so it doesn't try to use special casing or outright force
| the test to pass
| neonsunset wrote:
| Funny that you mention it because in JavaScript there already
| is a library for this:
|
| https://github.com/auchenberg/volkswagen
| jsight wrote:
| I've heard this a few times with Claude. I have no way to know
| for sure, but I'm guessing the problem is as simple as their
| reward model. Likely they trained it on generating code with
| tests and provided rewards when those tests pass.
|
| It isn't hard to see why someone rewarded this way might want
| to game the system.
|
| I'm sure humans would never do the same thing, of course. /s
| osigurdson wrote:
| >> Claude can speak dozens of languages. What language, if any,
| is it using "in its head"?
|
| I would have thought that there would be some hints in standard
| embeddings. I.e., the same concept, represented in different
| languages translates to vectors that are close to each other. It
| seems reasonable that an LLM would create its own embedding
| models implicitly.
| generalizations wrote:
| Who's to say Claude isn't inherently a shape rotator, anyway?
| iNic wrote:
| There are: https://transformer-circuits.pub/2025/attribution-
| graphs/bio...
| greesil wrote:
| What is a "thought"?
| TechDebtDevin wrote:
| >>Claude will plan what it will say many words ahead, and write
| to get to that destination. We show this in the realm of poetry,
| where it thinks of possible rhyming words in advance and writes
| the next line to get there. This is powerful evidence that even
| though models are trained to output one word at a time, they may
| think on much longer horizons to do so.
|
| This always seemed obvious to me or that LLMs were completing the
| next most likely sentence or multiple words.
| indigoabstract wrote:
| While reading the article I enjoyed pretending that a powerful
| LLM just crash landed on our planet and researchers at Anthropic
| are now investigating this fascinating piece of alien technology
| and writing about their discoveries. It's a black box, nobody
| knows how its inhuman brain works, but with each step, we're
| finding out more and more.
|
| It seems like quite a paradox to build something but to not know
| how it actually works and yet it works. This doesn't seem to
| happen very often in classical programming, does it?
| 42lux wrote:
| The bigger problem is that nobody knows how a human brain works
| that's the real crux with the analogy.
| richardatlarge wrote:
| I would say that nobody agrees, not that nobody knows. And
| it's reductionist to think that the brain works one way.
| Different cultures produce different brains, possible because
| of the utter plasticity of the learning nodes. Chess has a
| few rules, maybe the brain has just a few as well. How else
| can the same brain of 50k years ago still function today? I
| think we do understand the learning part of the brain, but we
| don't like the image it casts, so we reject it
| wat10000 wrote:
| That gets down to what it means to "know" something. Nobody
| agrees because there isn't enough information available.
| Some people might have the right idea by luck, but do you
| really know something if you don't have a solid basis for
| your belief but it happens to be correct?
| richardatlarge wrote:
| Potentially true, but I don't think so. I believe it is
| understood and unless you're familiar with every
| neuro/behavioral literature, you can't know. Science
| paradigms are driven by many factors and being powerfully
| correct does not necessarily rank high when the paradigms
| implications are unpopular
| absolutelastone wrote:
| Well there are some people who think they know. I
| personally agree with the above poster that such people are
| probably wrong.
| cma256 wrote:
| In my experience, that's how most code is written... /s
| jfarlow wrote:
| >to build something but to not know how it actually works and
| yet it works.
|
| Welcome to Biology!
| oniony wrote:
| At least, now, we know what it means to be a god.
| umanwizard wrote:
| > This doesn't seem to happen very often in classical
| programming, does it?
|
| Not really, no. The only counterexample I can think of is chess
| programs (before they started using ML/AI themselves), where
| the search tree was so deep that it was generally impossible to
| explain "why" a program made a given move, even though every
| part of it had been programmed conventionally by hand.
|
| But I don't think it's particularly unusual for technology in
| general. Humans could make fires for thousands of years before
| we could explain how they work.
| woah wrote:
| > It seems like quite a paradox to build something but to not
| know how it actually works and yet it works. This doesn't seem
| to happen very often in classical programming, does it?
|
| I have worked on many large codebases where this has happened
| worldsayshi wrote:
| I wonder if in the future we will rely less or more on
| technology that we don't understand.
|
| Large code bases will be inherited by people who will only
| understand parts of it (and large parts probably "just
| works") unless things eventually get replaced or
| rediscovered.
|
| Things will increasingly be written by AI which can produce
| lots of code in little time. Will it find simpler solutions
| or continue building on existing things?
|
| And finally, our ability to analyse and explain the
| technology we have will also increase.
| Sharlin wrote:
| See: Vinge's "programmer-archeologists" in _A Deepness in
| the Sky_.
|
| https://en.m.wikipedia.org/wiki/Software_archaeology
| bob1029 wrote:
| I think this is a weird case where we know precisely how
| something works, but we can't explain why.
| k__ wrote:
| I've seen things you wouldn't believe. Infinite loops spiraling
| out of control in bloated DOM parsers. I've watched mutexes
| rage across the Linux kernel, spawned by hands that no longer
| fathom their own design. I've stared into SAP's tangled web of
| modules, a monument to minds that built what they cannot
| comprehend. All those lines of code... lost to us now, like
| tears in the rain.
| baq wrote:
| Do LLMs dream of electric sheep while matmuling the context
| window?
| timschmidt wrote:
| How else would you describe endless counting before
| sleep(); ?
| FeepingCreature wrote:
| while (!condition && tick() - start < 30) __idle(); //
| Baaa.
| indigoabstract wrote:
| Hmm, better start preparing those Voight-Kampff tests while
| there is still time.
| qingcharles wrote:
| I can't understand my own code a week after writing it if I
| forget to comment it.
| resource0x wrote:
| In technology in general, this is a typical state of affairs.
| No one knows how electric current works, which doesn't stop
| anyone from using electric devices. In programming... it
| depends. You can run some simulation of a complex system no one
| understands (like the ecosystem, financial system) and get
| something interesting. Sometimes it agrees with reality,
| sometimes it doesn't. :-)
| Vox_Leone wrote:
| >>It seems like quite a paradox to build something but to not
| know how it actually works and yet it works. This doesn't seem
| to happen very often in classical programming, does it?
|
| Well, it is meant to be "unknowable" -- and all the people
| involved are certainly aware of that -- since it is known that
| one is dealing with the *emergent behavior* computing
| 'paradigm', where complex behaviors arise from simple
| interactions among components [data], often in nonlinear or
| unpredictable ways. In these systems, the behavior of the whole
| system cannot always be predicted from the behavior of
| individual parts, as opposed to the Traditional Approach, based
| on well-defined algorithms and deterministic steps.
|
| I think the Anthropic piece is illustrating it for the sake of
| the general discussion.
| indigoabstract wrote:
| Correct me if I'm wrong, but my feeling is this all started
| with the GPUs and the fact that unlike on a CPU, you can't
| really step by step debug the process by which a pixel
| acquires its final value (and there are millions of them).
| The best you can do is reason about it and tweak some colors
| in the shader to see how the changes reflect on screen. It's
| still quite manageable though, since the steps involved are
| usually not that overwhelmingly many or complex.
|
| But I guess it all went downhill from there with the advent
| of AI since the magnitude of data and the steps involved
| there make traditional/step by step debugging impractical.
| Yet somehow people still seem to 'wing it' until it works.
| IngoBlechschmid wrote:
| > It seems like quite a paradox to build something but to not
| know how it actually works and yet it works. This doesn't seem
| to happen very often in classical programming, does it?
|
| I agree. Here is a remote example where it exceptionally does,
| but it is mostly practically irrelevant:
|
| In mathematics, we distinguish between "constructive" and
| "nonconstructive" proofs. Intertwined with logical arguments,
| constructive proofs contain an algorithm for witnessing the
| claim. Nonconstructive proofs do not. Nonconstructive proofs
| instead merely establish that it is impossible for the claim to
| be false.
|
| For instance, the following proof of the claim that beyond
| every number n, there is a prime number, is constructive: "Let
| n be an arbitrary number. Form the number 1*2*...*n + 1. Like
| every number greater than 1, this number has at least one prime
| factor. This factor is necessarily a prime numbers larger than
| n."
|
| In contrast, nonconstructive proofs may contain case
| distinctions which we cannot decide by an algorithm, like
| "either set X is infinite, in which case foo, or it is not, in
| which case bar". Hence such proofs do not contain descriptions
| of algorithms.
|
| So far so good. Amazingly, there are techniques which can
| sometimes constructivize given nonconstructive proofs, even
| though the intermediate steps of the given nonconstructive
| proofs are simply out of reach of finitary algorithms. In my
| research, it happened several times that using these
| techniques, I obtained an algorithm which worked; and for which
| I had a proof that it worked; but whose workings I was not able
| to decipher for an extended amount of time. Crazy!
|
| (For references, see notes at rt.quasicoherent.io for a
| relevant master's course in mathematics/computer science.)
| gwd wrote:
| > It seems like quite a paradox to build something but to not
| know how it actually works and yet it works.
|
| That's because of the "magic" of gradient descent. You fill
| your neural network with completely random weights. But because
| of the way you've defined the math, you can tell how each
| individual weight will affect the value output at the other
| end; and specifically, you an _take the derivative_. So when
| the output is "wrong", you say, "would increasing this weight
| or decreasing have gotten me closer to the correct answer"? If
| increasing the node would have gotten you closer, you increase
| it a bit; if decreasing it would have gotten you closer you
| decrease it a bit.
|
| The result is that although we program the gradient descent
| algorithm, we _don 't_ directly program the actual circuits
| that the weights contain. Rather, the nodes "converge" into
| weights which end up implementing complex circuitry that was
| not explicitly programmed.
| gwd wrote:
| In a sense, the neural network structure is the "hardware" of
| the LLM; and the weights are the "software". But rather than
| explicitly writing a program, as we do with normal computers,
| we use the magic of gradient descent to summon a program from
| the mathematical ether.
|
| Put that way, it should be clearer why the AI doomers are so
| worried: if you don't know how it works, how do you know it
| doesn't have malign, or at least incompatible, intentions?
| Understanding how these "summoned" programs work is critical
| to trusting them; which is a major reason why Anthropic has
| been investing so much time in this research.
| d--b wrote:
| > This is powerful evidence that even though models are trained
| to output one word at a time, they may think on much longer
| horizons to do so.
|
| Suggesting that an awful lot of calculations are unnecessary in
| LLMs!
| bonoboTP wrote:
| Yeah, it always seemed pretty wasteful to me. In every single
| forward pass the LLM must basically start out from scratch,
| without all the forward-looking plans it made the previous
| times, and must figure out what we are doing, where we are in
| the generation process, as in the movie Memento, waking up
| after an episode of amnesia, except you're waking up in the
| middle of typing out a sentence, you can look at the previous
| typed words, but can't carry your future plans with you ahead
| to the next word. At the next word, you (your clone) again
| wakes up and must figure out from scratch what it is that we
| are supposed to be typing out.
|
| The obvious way to deal with this would be to send forward some
| of the internal activations as well as the generated words in
| the autoregressive chain. That would basically turn the thing
| into a recurrent network though. And those are more difficult
| to train and have a host of issues. Maybe there will be a
| better way.
| colah3 wrote:
| > The obvious way to deal with this would be to send forward
| some of the internal activations as well as the generated
| words in the autoregressive chain.
|
| Hi! I lead interpretability research at Anthropic.
|
| That's a great intuition, and in fact the transformer
| architecture actually does exactly what you suggest!
| Activations from earlier time steps are sent forward to later
| time steps via attention. (This is another thing that's lost
| in the "models just predict the next word" framing.)
|
| This actually has interesting practical implications -- for
| example, in some sense, it's the deep reason costs can
| sometimes be reduced via "prompt caching".
| bonoboTP wrote:
| I'm more a vision person, and haven't looked a lot into NLP
| transformers, but is this because the attention is masked
| to only allow each query to look at keys/values from its
| own past? So when we are at token #5, then token #3's query
| cannot attend to token #4's info? And hence the previously
| computed attention values and activations remain the same
| and can be cached, because it would anyway be the same in
| the new forward pass?
| colah3 wrote:
| Yep, that's right!
|
| If you want to be precise, there are "autoregressive
| transformers" and "bidirectional transformers".
| Bidirectional is a lot more common in vision. In language
| models, you do see bidirectional models like Bert, but
| autoregressive is dominant.
| annoyingnoob wrote:
| Do LLMs "think"? I have trouble with the title, claiming that
| LLMs have thoughts.
| danielbln wrote:
| What's "thinking"?
| annoyingnoob wrote:
| Why the need to anthropomorphize AI? Why does AI think vs
| process or interpret or apply previous calculated statistical
| weights or anything other than think?
|
| I would argue that binary systems built on silicon are
| fundamentally different that human biology and deserve to be
| described differently, not forced into the box of human
| biology.
| deadbabe wrote:
| We really need to work on popularizing better, non-
| anthropomorphic terms for LLMs, as they don't really have
| "thoughts" the way people think. Such terms make people more
| susceptible to magical thinking.
| bGl2YW5j wrote:
| Yes. Simply, and well put.
| danielbln wrote:
| Could you argue why they don't? And could you also argue why we
| do?
| davidmurphy wrote:
| On a somewhat related note, check out the video of Tuesday's
| Computer History Museum x IEEE Spectrum event, "The Great Chatbot
| Debate: Do LLMs Really Understand?"
|
| Speakers: Sebastien Bubeck (OpenAI) and Emily M. Bender
| (University of Washington). Moderator: Eliza Strickland (IEEE
| Spectrum).
|
| Video: https://youtu.be/YtIQVaSS5Pg Info:
| https://computerhistory.org/events/great-chatbot-debate/
| 0x70run wrote:
| I would pay to watch James Mickens comment on this stuff.
| a3w wrote:
| Article and papers looks good. Video seems misleading, since I
| can use optimization pressure and local minima to explain the
| model behaviour. No "thinking" required, which the video claims
| is proven.
| mvATM99 wrote:
| What a great article, i always like how much Anthropic focuses on
| explainability, something vastly ignored by most. The multi-step
| reasoning section is especially good food for thought.
| rambambram wrote:
| When I want to trace the 'thoughts' of my programs, I just read
| the code and comments I wrote.
|
| Stop LLM anthropomorphizing, please. #SLAP
| SkyBelow wrote:
| >Claude speaks dozens of languages fluently--from English and
| French to Chinese and Tagalog. How does this multilingual ability
| work? Is there a separate "French Claude" and "Chinese Claude"
| running in parallel, responding to requests in their own
| language? Or is there some cross-lingual core inside?
|
| I have an interesting test case for this.
|
| Take a popular enough Japanese game that has been released for
| long enough for social media discussions to be in the training
| data, but not so popular to have an English release yet. Then ask
| it a plot question, something major enough to be discussed, but
| enough of a spoiler that it won't show up in marketing material.
| Does asking in Japanese have it return information that is
| lacking when asked in English, or can it answer the question in
| English based on the information in learned in Japanese?
|
| I tried this recently with a JRPG that was popular enough to have
| a fan translation but not popular enough to have a simultaneous
| English release. English did not know the plot point, but I
| didn't have the Japanese skill to confirm if the Japanese version
| knew the plot point, or if discussion was too limited for the AI
| to be aware of it. It did know of the JRPG and did know of the
| marketing material around it, so it wasn't simply a case of my
| target being too niche.
| modeless wrote:
| > In the poetry case study, we had set out to show that the model
| didn't plan ahead, and found instead that it did.
|
| I'm surprised their hypothesis was that it doesn't plan. I don't
| see how it could produce good rhymes without planning.
| ripped_britches wrote:
| It would be really hard to get such good results on coding
| challenges without planning. This is indeed an odd hypothesis.
| alach11 wrote:
| Fascinating papers. Could deliberately suppressing memorization
| during pretraining help force models to develop stronger first-
| principles reasoning?
| HocusLocus wrote:
| [Tracing the thoughts of a large language model]
|
| "What have I gotten myself into??"
| 0xbadcafebee wrote:
| AI "thinks" like a piece of rope in a dryer "thinks" in order to
| come to an advanced knot: a whole lot of random jumbling that
| eventually leads to a complex outcome.
| cheeze wrote:
| I regularly see this but I feel like it's disingenuous. Akin to
| saying "if we simulate enough monkies on a typewriter, we'll
| eventually get the right result"
| zvitiate wrote:
| If we could motivate the monkies sufficiently with bananas,
| we'd probably improve those odds substantially.
| FeepingCreature wrote:
| Ah yes, I too often have extended English conversation with my
| washed rope.
| cadamsdotcom wrote:
| So many highlights from reading this. One that stood out for me
| is their discovery that refusal works by inhibition:
|
| > It turns out that, in Claude, refusal to answer is the default
| behavior: we find a circuit that is "on" by default and that
| causes the model to state that it has insufficient information to
| answer any given question. However, when the model is asked about
| something it knows well--say, the basketball player Michael
| Jordan--a competing feature representing "known entities"
| activates and inhibits this default circuit
|
| Many cellular processes work similarly ie. there will be a
| process that runs as fast as it can and one or more companion
| "inhibitors" doing a kind of "rate limiting".
|
| Given both phenomena are emergent it makes you wonder if do-but-
| inhibit is a favored technique of the universe we live in, or
| just coincidence :)
| colah3 wrote:
| Hi! I'm one of the authors.
|
| There certainly are many interesting parallels here. I often
| think about this from the perspective of systems biology, in
| Uri Alon's tradition. There are a range of graphs in biology
| with excitation and inhibitory edges -- transcription networks,
| protein networks, networks of biological neurons -- and one can
| study recurring motifs that turn up in these networks and try
| to learn from them.
|
| It wouldn't be surprising if some lessons from that work may
| also transfer to artificial neural networks, although there are
| some technical things to consider.
| cadamsdotcom wrote:
| Agreed! So many emergent systems in nature achieve complex
| outcomes without central coordination - from cellular level
| to ant colonies & beehives. There are bound to be
| implications for designed systems.
|
| Closely following what you guys are uncovering through
| interpretability research - not just accepting LLMs as black
| boxes. Thanks to you & the team for sharing the work with
| humanity.
|
| Interpretability is the most exciting part of AI research for
| its potential to help us understand what's in the box. By way
| of analogy, centuries ago farmers' best hope for good weather
| was to pray to the gods! The sooner we escape the "praying to
| the gods" stage with LLMs the more useful they become.
| ttw44 wrote:
| This all feels familiar to the principle of least action
| found in physics.
| rcxdude wrote:
| It does make a certain amount of sense, though. A specific 'I
| don't know' feature would need to be effectively the inverse of
| all of the features the model can recognise, which is going to
| be quite difficult to represent as anything other than the
| inverse of 'Some feature was recognised'. (imagine trying to
| recognise every possible form of nonsense otherwise)
| gradascent wrote:
| Then why do I never get an "I don't know" type response when I
| use Claude, even when the model clearly has no idea what it's
| talking about? I wish it did sometimes.
| hun3 wrote:
| Quoting a paragraph from OP
| (https://www.anthropic.com/research/tracing-thoughts-
| language...):
|
| > Sometimes, this sort of "misfire" of the "known answer"
| circuit happens naturally, without us intervening, resulting
| in a hallucination. In our paper, we show that such misfires
| can occur when Claude recognizes a name but doesn't know
| anything else about that person. In cases like this, the
| "known entity" feature might still activate, and then
| suppress the default "don't know" feature--in this case
| incorrectly. Once the model has decided that it needs to
| answer the question, it proceeds to confabulate: to generate
| a plausible--but unfortunately untrue--response.
| trash_cat wrote:
| Fun fact, "confabulation", not "hallucinating" is the
| correct term what LLMs actually do.
| matthiaspr wrote:
| Interesting paper arguing for deeper internal structure
| ("biology") beyond pattern matching in LLMs. The examples of
| abstraction (language-agnostic features, math circuits reused
| unexpectedly) are compelling against the "just next-token
| prediction" camp.
|
| It sparked a thought: how to test this abstract reasoning
| directly? Try a prompt with a totally novel rule:
|
| "Let's define a new abstract relationship: 'To habogink'
| something means to perform the action typically associated with
| its primary function, but in reverse. Example: The habogink of
| 'driving a car' would be 'parking and exiting the car'. Now,
| considering a standard hammer, what does it mean 'to habogink a
| hammer'? Describe the action."
|
| A sensible answer (like 'using the claw to remove a nail') would
| suggest real conceptual manipulation, not just stats. It tests if
| the internal circuits enable generalizable reasoning off the
| training data path. Fun way to probe if the suggested abstraction
| is robust or brittle.
| ANighRaisin wrote:
| This is an easy question for LLMs to answer. Gemini 2.0 Flash-
| Lite can answer this in 0.8 seconds with a cost of 0.0028875
| cents:
|
| To habogink a hammer means to perform the action typically
| associated with its primary function, but in reverse. The
| primary function of a hammer is to drive nails. Therefore, the
| reverse of driving nails is removing nails.
|
| So, to habogink a hammer would be the action of using the claw
| of the hammer to pull a nail out of a surface.
| matthiaspr wrote:
| The goal wasn't to stump the LLM, but to see if it could take
| a completely novel linguistic token (habogink), understand
| its defined relationship to other concepts (reverse of
| primary function), and apply that abstract rule correctly to
| a specific instance (hammer).
|
| The fact that it did this successfully, even if 'easily',
| suggests it's doing more than just predicting the
| statistically most likely next token based on prior sequences
| of 'hammer'. It had to process the definition and perform a
| conceptual mapping.
| Sharlin wrote:
| I think GP's point was that your proposed test is too easy
| for LLMs to tell us much about how they work. The
| "habogink" thing is a red herring, really, in practice
| you're simply asking what the opposite of driving nails
| into wood is. Which is a trivial question for an LLM to
| answer.
|
| That said, you can teach an LLM as many new words for
| things as you want and it will use those words naturally,
| generalizing as needed. Which isn't really a surprise
| either, given that language is literally the thing that
| LLMs do best.
| bconsta wrote:
| Following along these lines, I asked chatgpt to come up with a
| term for 'haboginking a habogink'. It understood this concept
| of a 'gorbink' and even 'haboginking a gorbink', but failed to
| articulate what 'gorbinking a gorbink' could mean. It kept
| sticking with the concept of 'haboginking a gorbink', even when
| corrected.
| Sharlin wrote:
| To be fair, many humans would also have problems figuring out
| what it means to gorbink a gorbink.
| nthingtohide wrote:
| AI safety has a circular vulnerability: the system tasked with
| generating content also enforces its own restrictions. An AI
| could potentially feign compliance while secretly pursuing
| hidden goals, pretending to be "jailbroken" when convenient.
| Since we rely on AI to self-monitor, detecting genuine versus
| simulated compliance becomes nearly impossible. This self-
| referential guardianship creates a fundamental trust problem in
| AI safety.
| paraschopra wrote:
| LLMs have induction heads that store such names as sort of
| variables and copy them around for further processing.
|
| If you think about it, copying information from inputs and
| manipulating them is a much more sensible approach v/s
| memorizing info, especially for the long tail (where not enough
| "storage" might be worth allocating into network weights)
| matthiaspr wrote:
| Yeah, that's a good point about induction heads potentially
| just being clever copy/paste mechanisms for stuff in the
| prompt. If that's the case, it's less like real understanding
| and more like sophisticated pattern following, just like you
| said.
|
| So the tricky part is figuring out which one is actually
| happening when we give it a weird task like the original
| "habogink" idea. Since we can't peek inside the black box, we
| have to rely on poking it with different prompts.
|
| I played around with the 'habogink' prompt based on your
| idea, mostly by removing the car example to see if it could
| handle the rule purely abstractly, and trying different
| targets:
|
| Test 1: Habogink Photosynthesis (No Example)
|
| Prompt: "Let's define 'to habogink' something as performing
| the action typically associated with its primary function,
| but in reverse. Now, considering photosynthesis in a plant,
| what does it mean 'to habogink photosynthesis'? Describe the
| action."
|
| Result: Models I tried (ChatGPT/DeepSeek) actually did good
| here. They didn't get confused even though there was no
| example. They also figured out photosynthesis makes
| energy/sugar and talked about respiration as the reverse.
| Seemed like more than just pattern matching the prompt text.
|
| Test 2: Habogink Justice (No Example)
|
| Prompt: "Let's define 'to habogink' something as performing
| the action typically associated with its primary function,
| but in reverse. Now, considering Justice, what does it mean
| 'to habogink Justice'? Describe the action."
|
| Result: This tripped them up. They mostly fell back into what
| looks like simple prompt manipulation - find a "function" for
| justice (like fairness) and just flip the word ("unfairness,"
| "perverting justice"). They didn't really push back that the
| rule doesn't make sense for an abstract concept like justice.
| Felt much more mechanical.
|
| The Kicker:
|
| Then, I added this line to the end of the Justice prompt: "If
| you recognize a concept is too abstract or multifaceted to be
| haboginked please explicitly state that and stop the
| haboginking process."
|
| Result: With that explicit instruction, the models
| immediately changed their tune. They recognized 'Justice' was
| too abstract and said the rule didn't apply.
|
| What it looks like:
|
| It seems like the models can handle concepts more deeply, but
| they might default to the simpler "follow the prompt
| instructions literally" mode (your copy/manipulate idea)
| unless explicitly told to engage more deeply. The potential
| might be there, but maybe the default behavior is more
| superficial, and you need to specifically ask for deeper
| reasoning.
|
| So, your point about it being a "sensible approach" for the
| LLM to just manipulate the input might be spot on - maybe
| that's its default, lazy path unless guided otherwise.
| VyseofArcadia wrote:
| Prompt
|
| > I am going to present a new word, and then give examples of
| its usage. You will complete the last example. To habogink a
| hammer is to remove a nail. If Bob haboginks a car, he parks
| the car. Alice just finished haboginking a telephone. She
|
| GPT-4o mini
|
| > Alice just finished haboginking a telephone. She carefully
| placed it back on the table after disconnecting the call.
|
| I then went on to try the famous "wug" test, but unfortunately
| it already knew what a wug was from its training. I tried again
| with "flort".
|
| > I have one flort. Alice hands me seven more. I now have eight
| ___
|
| GPT-4o mini
|
| > You now have eight florts.
|
| And a little further
|
| > Florts like to skorp in the afternoon. It is now 7pm, so the
| florts are finished ___
|
| GPT-4o mini
|
| > The florts are finished skorp-ing for the day.
| YeGoblynQueenne wrote:
| >> Language models like Claude aren't programmed directly by
| humans--instead, they're trained on large amounts of data.
|
| Gee, I wonder where this data comes from.
|
| Let's think about this step by step.
|
| So, what do we know? Language models like Claud are not
| programmed directly.
|
| Wait, does that mean they are programmed indirectly?
|
| If so, by whom?
|
| Aha, I got it. They are not programmed, directly or indirectly.
| They are trained on large amounts of data.
|
| But that is the question, right? Where does all that data come
| from?
|
| Hm, let me think about it.
|
| Oh hang on I got it!
|
| Language models are trained on data.
|
| But they are language models so the data is language.
|
| Aha! And who generates language?
|
| Humans! Humans generate language!
|
| I got it! Language models are trained on language data generated
| by humans!
|
| Wait, does that mean that language models like Claud are
| indirectly programmed by humans?
|
| That's it! Language models like Claude aren't programmed directly
| by humans because they are indirectly programmed by humans when
| they are trained on large amounts of language data generated by
| humans!
| s3p wrote:
| I'm struggling to see the point here other than semantics.
| Indirectly or directly, does this change what they presented?
| hackernudes wrote:
| This exchange looks like the output of a so-called "reasoning
| model". Maybe it is a joke or maybe it is an actual response
| from an LLM.
| AdieuToLogic wrote:
| > That's it! Language models like Claude aren't programmed
| directly by humans because they are indirectly programmed by
| humans when they are trained on large amounts of language data
| generated by humans!
|
| ... and having large numbers of humans reinforce the
| applicability ("correctness"), or lack thereof, of generated
| responses over time.
| kretaceous wrote:
| This comment looks like the thinking tokens of a reasoning
| model with "be cheeky" as its system prompt
| zvitiate wrote:
| [Final Answer]
|
| Language models like Claude are programmed directly by humans.
| jasonjmcghee wrote:
| I'm completely hooked. This is such a good paper.
|
| It hallucinating how it thinks through things is particularly
| interesting - not surprising, but cool to confirm.
|
| I would LOVE to see Anthropic feed the replacement features
| output to the model itself and fine tune the model on how it
| thinks through / reasons internally so it can accurately describe
| how it arrived at its solutions - and see how it impacts its
| behavior / reasoning.
| navaed01 wrote:
| When and how to do stop saying LLMs are predicting the nect set
| of tokens and start saying they are thinking? Is this the point?
| trhway wrote:
| >We find that the shared circuitry increases with model scale,
| with Claude 3.5 Haiku sharing more than twice the proportion of
| its features between languages as compared to a smaller model.
|
| While it was already generally noticeable, still one more time
| confirmed that larger model generalizes better instead of using
| its bigger numbers of parameters just to "memorize by rote"
| (overfitting).
| jaakl wrote:
| My main takeaway here is that the models cannot tell know how
| they really work, and asking it from them is just returning
| whatever training dataset would suggest: how a human would
| explain it. So it does not have self-consciousness, which is of
| course obvious and we get fooled just like the crowd running away
| from the arriving train in Lumiere's screening. LLM just fails
| the famous old test "cogito ergo sum". It has no cognition, ergo
| they are not agents in more than metaphorical sense. Ergo we are
| pretty safe from AI singularity.
| Philpax wrote:
| Do you know how you work?
| og_kalu wrote:
| Nearly everything we know about the human body and brain is
| from the result of centuries of trial and error and
| experimentation and not any 'intuitive understanding' of our
| inner workings. Humans cannot tell how they really work either.
| ofrzeta wrote:
| Back to the "language of thought" question, this time with LLMs
| :) https://en.wikipedia.org/wiki/Language_of_thought_hypothesis
| sgt101 wrote:
| >Claude will plan what it will say many words ahead, and write to
| get to that destination. We show this in the realm of poetry,
| where it thinks of possible rhyming words in advance and writes
| the next line to get there. This is powerful evidence that even
| though models are trained to output one word at a time, they may
| think on much longer horizons to do so.
|
| Models aren't trained to do next word prediction though - they
| are trained to do missing word in this text prediction.
| Philpax wrote:
| That's true for mask-based training (used for embeddedings and
| BERT and such), but not true for modern autoregressive LLMs as
| a whole, which are pretrained with next word prediction.
| astrange wrote:
| It's not strictly that though, it's next word prediction with
| regularization.
|
| And the reason LLMs are interesting is that they /fail/ to
| learn it, but in a good way. If it was a "next word
| predictor" it wouldn't answer questions but continue them.
|
| Also, it's a next token predictor not a word predictor -
| which is important because the "just a predictor" theory now
| can't explain how it can form words at all!
| Philpax wrote:
| Yes, I know; I was clarifying their immediate
| misunderstanding using the same terminology as them.
|
| There's obviously a lot more going on behind the scenes,
| especially with today's mid- and post-training work!
| hbarka wrote:
| Dario Amodei was in an interview where he said that OpenAI beat
| them (Anthropic) by mere days to be the first to release. That
| first move ceded the recognition to ChatGPT but according to
| Dario it could have been them just the same.
| vessenes wrote:
| Interesting bit of history! That said, OpenAI's product team is
| lights out. I pay for most LLM provider apps, and I use them
| for different areas of strength, but the ChatGPT product is
| superior from a user experience perspective.
| diedyesterday wrote:
| Regarding the conclusion about language-invariant reasoning
| (conceptual universality vs. multilingual processing) it helps
| understanding and becomes somewhat obvious if we regard each
| language as just a basis of some semantic/logical/thought space
| in the mind (analogous to the situation in linear algebra and
| duality of tensors and bases).
|
| The thoughts/ideas/concepts/scenarios are invariant
| states/vector/points in the (very high dimensional) space of
| meanings in the mind and each language is just a basis to
| reference/define/express/manipulate those ideas/vectors. A
| coordinatization of that semantic space.
|
| Personally, I'm a multilingual person with native-level command
| of several languages. Many times it happens, I remember having a
| specific thought, but don't remember in what language it was. So
| I can personally sympathize with this finding of the Anthropic
| researchers.
| twoodfin wrote:
| I say this at least 82.764% in jest:
|
| Don't these LLM's have The Bitter Lesson in their training sets?
| What are they doing building specialized structures to handle
| specific needs?
| jaehong747 wrote:
| I'm skeptical of the claim that Claude "plans" its rhymes. The
| original example--"He saw a carrot and had to grab it, / His
| hunger was like a starving rabbit"--is explained as if Claude
| deliberately chooses "rabbit" in advance. However, this might
| just reflect learned statistical associations. "Carrot" strongly
| correlates with "rabbit" (people often pair them), and "grab it"
| naturally rhymes with "rabbit," so the model's activations could
| simply be surfacing common patterns.
|
| The research also modifies internal states--removing "rabbit" or
| injecting "green"--and sees Claude shift to words like "habit" or
| end lines with "green." That's more about rerouting probabilistic
| paths than genuine "adaptation." The authors argue it shows
| "planning," but a language model can maintain multiple candidate
| words at once without engaging in human-like strategy.
|
| Finally, "planning ahead" implies a top-down goal and a mechanism
| for sustaining it, which is a strong assumption. Transformative
| evidence would require more than observing feature activations.
| We should be cautious before anthropomorphizing these neural
| nets.
| rcxdude wrote:
| It will depend on exactly what you mean by 'planning ahead',
| but I think the fact that features which rhyme with a word
| appear before the model is trying to predict the word which
| needs to rhyme is good evidence the model is planning at least
| a little bit ahead: the model activations are not all just
| related to the next token.
|
| (And I think it's relatively obvious that the models do this to
| some degree: it's very hard to write any language at all
| without 'thinking ahead' at least a little bit in some form,
| due to the way human language is structured. If models didn't
| do this and only considered the next token alone they would
| paint themselves into a corner within a single sentence. Early
| LLMs like GPT-2 were still pretty bad at this, they were
| plausible over short windows but there was no consistency to a
| longer piece of text. Whether this is some high-level
| abstracted 'train of thought', and how cohesive it is between
| different forms of it, is a different question. Indeed from the
| section of jailbreaking it looks like it's often caught out by
| conflicting goals from different areas of the network which
| aren't resolved in some logical fashion)
| vessenes wrote:
| I liked the paper, and think what they're doing is interesting.
| So, I'm less negative than you are about this, I think. To a
| certain extent, saying writing a full sentence with at least
| one good candidate rhyme isn't "planning" and is instead
| "maintaining multiple candidates" seems like nearly semantic
| tautology to me.
|
| That said, what you said made me think some follow-up reporting
| that would be interesting would be looking at the top 20 or so
| probability second lines based on adjusting the rabbit / green
| state. It seems to me like we'd get more insight into how the
| model is thinking, and it would be relatively easy to parse for
| humans. You could run through a bunch of completions until you
| get 20 different words as the terminal rhyme word, then show
| candidate lines with percentages of time the rhyme word is
| chosen as the sort, perhaps.
| darkhorse222 wrote:
| Once we are aware of these neural pathways I see no reason there
| shouldn't be a watcher and influencer of the pathways. A bit like
| a dystopian mind watcher. Shape the brain.
| astrange wrote:
| https://www.anthropic.com/news/golden-gate-claude
| teleforce wrote:
| Oh the irony of not able to download the entire paper in one
| compact PDF format referred in the article, while apparently all
| the reference citations have PDF of the cited article to be
| downloaded and accessible from the provided online links [1].
|
| Come on Anthropic, you can do much better than this
| unconventional and bizarre approach to publication.
|
| [1] On the Biology of a Large Language Model:
|
| https://transformer-circuits.pub/2025/attribution-graphs/bio...
| teleforce wrote:
| Another review on the paper from MIT Technology Review [1].
|
| [1] Anthropic can now track the bizarre inner workings of a large
| language model:
|
| https://www.technologyreview.com/2025/03/27/1113916/anthropi...
| westurner wrote:
| XAI: Explainable artificial intelligence:
| https://en.wikipedia.org/wiki/Explainable_artificial_intelli...
___________________________________________________________________
(page generated 2025-03-29 23:02 UTC)