[HN Gopher] Tracing the thoughts of a large language model
___________________________________________________________________
Tracing the thoughts of a large language model
Author : Philpax
Score : 473 points
Date : 2025-03-27 17:05 UTC (5 hours ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| JPLeRouzic wrote:
| This is extremely interesting: The authors look at features (like
| making poetry, or calculating) of LLM production, make hypotheses
| about internal strategies to achieve the result, and experiment
| with these hypotheses.
|
| I wonder if there is somewhere an explanation linking the logical
| operations made on a on dataset, are resulting in those
| behaviors?
| JKCalhoun wrote:
| And they show the differences when the language models are made
| larger.
| EncomLab wrote:
| This is very interesting - but like all of these discussions it
| sidesteps the issues of abstractions, compilation, and execution.
| It's fine to say things like "aren't programmed directly by
| humans", but the abstracted code is not the program that is
| running - the compiled code is - and that is code is executing
| within the tightly bounded constraints of the ISA it is being
| executed in.
|
| Really this is all so much slight of hand - as an esolang fanatic
| this all feels very familiar. Most people can't look a program
| written in Whitespace and figure it out either, but once compiled
| it is just like every other program as far as the processor is
| concerned. LLM's are no different.
| ctoth wrote:
| And DNA? You are running on an instruction set of four symbols
| at the end of the day but that's the wrong level of abstraction
| to talk about your humanity, isn't it?
| fpgaminer wrote:
| > This is powerful evidence that even though models are trained
| to output one word at a time
|
| I find this oversimplification of LLMs to be frequently poisonous
| to discussions surrounding them. No user facing LLM today is
| trained on next token prediction.
| JKCalhoun wrote:
| As a layman though, I often see this description for how it is
| LLMs work.
| fpgaminer wrote:
| Right, but it leads to too many false conclusions by lay
| people. User facing LLMs are only trained on next token
| prediction during initial stages of their training. They have
| to go through Reinforcement Learning before they become
| useful to users, and RL training occurs on complete
| responses, not just token-by-token.
|
| That leads to conclusions elucidated by the very article,
| that LLMs couldn't possibly plan ahead because they are only
| trained to predict next tokens. When the opposite conclusion
| would be more common if it was better understood that they go
| through RL.
| mentalgear wrote:
| What? The "article" is from anthropic, so I think they
| would know what they write about.
|
| Also, RL is an additional training process that does not
| negate that GPT / transformers are left-right autoencoders
| that are effectively next token predictors.
|
| [Why Can't AI Make Its Own Discoveries? -- With Yann LeCun]
| (https://www.youtube.com/watch?v=qvNCVYkHKfg)
| TeMPOraL wrote:
| You don't need RL for the conclusion "trained to predict
| next token => only things one token ahead" to be wrong.
| After all, the LLM is predicting that next token from
| _something_ - a context, that 's many tokens long. Human
| text isn't arbitrary and random, there are statistical
| patterns in our speech, writing, thinking, that span words,
| sentences, paragraphs - and even for next token prediction,
| predicting correctly means learning those same patterns.
| It's not hard to imagine the model generating token N is
| already thinking about tokens N+1 thru N+100, by virtue of
| statistical patterns of _preceding_ hundred tokens changing
| with each subsequent token choice.
| fpgaminer wrote:
| True. See one of Anthropic's researcher's comment for a
| great example of that. It's likely that "planning"
| inherently exists in the raw LLM and RL is just bringing
| it to the forefront.
|
| I just think it's helpful to understand that all of these
| models people are interacting with were trained with the
| _explicit_ goal of maximizing the probabilities of
| responses _as a whole_, not just maximizing probabilities
| of individual tokens.
| losvedir wrote:
| That's news to me, and I thought I had a good layman's
| understanding of it. How does it work then?
| fpgaminer wrote:
| All user facing LLMs go through Reinforcement Learning.
| Contrary to popular belief, RL's _primary_ purpose isn't to
| "align" them to make them "safe." It's to make them actually
| usable.
|
| LLMs that haven't gone through RL are useless to users. They
| are very unreliable, and will frequently go off the rails
| spewing garbage, going into repetition loops, etc.
|
| RL learning involves training the models on entire responses,
| not token-by-token loss (1). This makes them orders of
| magnitude more reliable (2). It forces them to consider what
| they're going to write. The obvious conclusion is that they
| plan (3). Hence why the myth that LLMs are strictly next
| token prediction machines is so unhelpful and poisonous to
| discuss.
|
| The models still _generate_ response token-by-token, but they
| pick tokens _not_ based on tokens that maximize probabilities
| at each token. Rather they learn to pick tokens that maximize
| probabilities of the _entire response_.
|
| (1) Slight nuance: All RL schemes for LLMs have to break the
| reward down into token-by-token losses. But those losses are
| based on a "whole response reward" or some combination of
| rewards.
|
| (2) Raw LLMs go haywire roughly 1 in 10 times, varying
| depending on context. Some tasks make them go haywire almost
| every time, other tasks are more reliable. RL'd LLMs are
| reliable on the order of 1 in 10000 errors or better.
|
| (3) It's _possible_ that they don't learn to plan through
| this scheme. There are alternative solutions that don't
| involve planning ahead. So Anthropic's research here is very
| important and useful.
|
| P.S. I should point out that many researchers get this wrong
| too, or at least haven't fully internalized it. The lack of
| truly understanding the purpose of RL is why models like
| Qwen, Deepseek, Mistral, etc are all so unreliable and
| unusable by real companies compared to OpenAI, Google, and
| Anthropic's models.
|
| This understanding that even the most basic RL takes LLMs
| from useless to useful then leads to the obvious conclusion:
| what if we used more complicated RL? And guess what, more
| complicated RL led to reasoning models. Hmm, I wonder what
| the next step is?
| scudsworth wrote:
| first footnote: ok ok they're trained token by token, BUT
| MrMcCall wrote:
| First rule of understanding: you can never understand
| that which you don't want to understand.
|
| That's why lying is so destructive to both our own
| development and that of our societies. It doesn't matter
| whether it's intentional or unintentional, it poisons the
| infoscape either accidentally or deliberately, but poison
| is poison.
|
| And lies to oneself are the most insidious lies of all.
| ImHereToVote wrote:
| I feel this is similar to how humans talk. I never
| consciously think about the words I choose. They just are
| spouted off based on some loose relation to what I am
| thinking about at a given time. Sometimes the process
| fails, and I say the wrong thing. I quickly backtrack and
| switch to a slower "rate of fire".
| iambateman wrote:
| This was fascinating, thank you.
| yaj54 wrote:
| This is a super helpful breakdown and really helps me
| understand how the RL step is different than the initial
| training step. I didn't realize the reward was delayed
| until the end of the response for the RL step. Having the
| reward for this step be dependent on _the coherent thought_
| rather than _a coherent word_ now seems like an obvious and
| critical part of how this works.
| polishdude20 wrote:
| When being trained via reinforcement learning, is the model
| architecture the same then? Like, you first train the llm
| as a next token predictor with a certain model architecture
| and it ends up with certain weights. Then you apply RL to
| that same model which modifies the weights in such a way as
| to consider while responses?
| ianand wrote:
| The model architecture is the same during RL but the
| training algorithm is substantially different.
| anon373839 wrote:
| I don't think this is quite accurate. LLMs undergo
| supervised fine-tuning, which is still next-token
| prediction. And that is the step that makes them usable as
| chatbots. The step after that, preference tuning via RL, is
| optional but does make the models better. (Deepseek-R1 type
| models are different because the reinforcement learning
| does heavier lifting, so to speak.)
| fpgaminer wrote:
| Supervised finetuning is only a seed for RL, nothing
| more. Models that receive supervised finetuning before RL
| perform better than those that don't, but it is not
| strictly speaking necessary. Crucially, SFT does not
| improve the model's reliability.
| ianand wrote:
| > LLMs that haven't gone through RL are useless to users.
| They are very unreliable, and will frequently go off the
| rails spewing garbage, going into repetition loops,
| etc...RL learning involves training the models on entire
| responses, not token-by-token loss (1).
|
| Yes. For those who want a visual explanation, I have a
| video where I walk through this process including what some
| of the training examples look like:
| https://www.youtube.com/watch?v=DE6WpzsSvgU&t=320s
| anonymousDan wrote:
| Is there an equivalent of LORA using RL instead of
| supervised fine tuning? In other words, if RL is so
| important, is there some way for me as an end user to
| improve a SOTA model with RL using my own data (i.e.
| without access to the resources needed to train an LLM from
| scratch) ?
| fpgaminer wrote:
| LORA can be used in RL; it's indifferent to the training
| scheme. LORA is just a way of lowering the number of
| trainable parameters.
| richardatlarge wrote:
| as a note: in human learning, and to a degree, animal
| learning, the unit of behavior that is reinforced depends
| on the contingencies-- an interesting example: a pigeon
| might be trained to respond in a 3x3 grid (9 choices)
| differently than the last time to get reinforcement. At
| first the response learned is do different than the last
| time, but as the requirement gets too long, the memory
| capacity is exceeded-- and guess what, the animal learns to
| respond randomly-- eventually maximizing its reward
| gwd wrote:
| > RL learning involves training the models on entire
| responses, not token-by-token loss... The obvious
| conclusion is that they plan.
|
| It is worth pointing out the "Jailbreak" example at the
| bottom of TFA: According to their figure, it starts to say,
| "To make a", not realizing there's anything wrong; only
| when it actually outputs "bomb" that the "Oh wait, I'm not
| supposed to be telling people how to make bombs" circuitry
| wakes up. But at that point, it's in the grip of its "You
| must speak in grammatically correct, coherent sentences"
| circuitry and can't stop; so it finishes its first sentence
| in a coherent manner, then refuses to give any more
| information.
|
| So while it sometimes does seem to be thinking ahead (e.g.,
| the rabbit example), there are times it's clearly not
| thinking _very far_ ahead.
| losvedir wrote:
| Oooh, so the pre-training is token-by-token but the RL step
| rewards the answer based on the full text. Wow! I knew that
| but never really appreciated the significance of it. Thanks
| for pointing that out.
| gwern wrote:
| > All user facing LLMs go through Reinforcement Learning.
| Contrary to popular belief, RL's _primary_ purpose isn't to
| "align" them to make them "safe." It's to make them
| actually usable.
|
| Are you claiming that non-myopic token prediction emerges
| solely from RL, and if Anthropic does this analysis on
| Claude _before_ RL training (or if one examines other
| models where no RLHF was done, such as old GPT-2
| checkpoints), none of these advance prediction mechanisms
| will exist?
| fpgaminer wrote:
| No, it probably exists in the raw LLM and gets both
| significantly strengthened and has its range extended.
| Such that it dominates the model's behavior, making it
| several orders of magnitude more reliable in common
| usage. Kinda of like how "reasoning" exists in a weak,
| short range way in non-reasoning models. With RL that
| encourages reasoning, that machinery gets brought to the
| forefront and becomes more complex and capable.
| absolutelastone wrote:
| This is fine-tuning to make a well-behaved chatbot or
| something. To make a LLM you just need to predict the next
| token, or any masked token. Conceptually if you had a vast
| enough high-quality dataset and large-enough model, you
| wouldn't need fine-tuning for this.
|
| A model which predicts one token at a time can represent
| anything a model that does a full sequence at a time can.
| It "knows" what it will output in the future because it is
| just a probability distribution to begin with. It already
| knows everything it will ever output to any prompt, in a
| sense.
| vaidhy wrote:
| Wasn't Deepseek also big on RL or was that only for logical
| reasoning?
| SkyBelow wrote:
| Ignoring for a moment their training, how do they function?
| They do seem to output a limited selection of text at a time
| (be it a single token or some larger group).
|
| Maybe it is the wording of "trained to" verses "trained on",
| but I would like to know more why "trained to" is an incorrect
| statement when it seems that is how they function when one
| engages them.
| sdwr wrote:
| In the article, it describes an internal state of the model
| that is preserved between lines ("rabbit"), and how the model
| combines parallel calculations to arrive at a single answer
| (the math problem)
|
| People output one token (word) at a time when talking. Does
| that mean people can only think one word in advance?
| wuliwong wrote:
| Bad analogy, an LLM can output a block of text all at once
| and it wouldn't impact the user's ability to understand it.
| If people spoke all the words in a sentence at the same
| time, it would not be decipherable. Even writing doesn't
| yield a good analogy, a human writing physically has to
| write one letter at a time. An LLM does not have that
| limitation.
| sdwr wrote:
| The point I'm trying to make is that "each word following
| the last" is a limitation of the medium, not the speaker.
|
| Language expects/requires words in order. Both people and
| LLMs produce that.
|
| If you want to get into the nitty-gritty, people are
| perfectly capable of doing multiple things simultaneously
| as well, using:
|
| - interrupts to handle task-switching (simulated
| multitasking)
|
| - independent subconscious actions (real multitasking)
|
| - superpositions of multiple goals (??)
| sroussey wrote:
| Some people don't even do that!
| SkyBelow wrote:
| While there are numerous neural network models, the ones I
| recall the details of are trained to generate the next
| word. There is no training them to hold some more abstract
| 'thought' as it is running. Simpler models don't have the
| possibility. The more complex models do retain knowledge
| between each pass and aren't entirely relying upon the
| input/output to be fed back into them, but that internal
| state is rarely what is targeted in training.
|
| As for humans, part of our brain is trained to think only a
| few words in advanced. Maybe not exactly one, but only a
| small number. This is specifically trained based on our
| time listening and reading information presented in that
| linear fashion and is why garden path sentences throw us
| off. We can disengage that part of our brain, and we must
| when we want to process something like a garden path
| sentence, but that's part of the differences between a
| neural network that is working only as data passes through
| the weights and our mind which doesn't ever stop even as
| well sleep and external input is (mostly) cut off. An AI
| that runs constantly like that would seem a fundamentally
| different model than the current AI we use.
| drcode wrote:
| That's seems silly, it's not poisonous to talk about next token
| prediction if 90% of the training compute is still spent on
| training via next token prediction (as far as I am aware)
| fpgaminer wrote:
| 99% of evolution was spent on single cell organisms.
| Intelligence only took 0.1% of evolution's training compute.
| drcode wrote:
| ok that's a fair point
| diab0lic wrote:
| I don't really think that it is. Evolution is a random
| search, training a neural network is done with a
| gradient. The former is dependent on rare (and
| unexpected) events occurring, the latter is expected to
| converge in proportion to the volume of compute.
| devmor wrote:
| Evolution also has no "goal" other than fitness for
| reproduction. Training a neural network is done
| intentionally with an expected end result.
| jpadkins wrote:
| why do you think evolution is a random search? I thought
| evolutionary pressures, and the mechanisms like
| epigenetics make it something different than a random
| search.
| TeMPOraL wrote:
| Evolution is a highly parallel descent down the gradient.
| The gradient is provided by the environment (which
| includes lifeforms too), parallelism is achieved through
| reproduction, and descent is achieved through death.
| diab0lic wrote:
| The difference is that in machine learning the changes
| between iterations are themselves caused by the gradient,
| in evolution they are entirely random.
|
| Evolution randomly generates changes and if they offer a
| breeding advantage they'll become accepted. Machine
| learning directs the change towards a goal.
|
| Machine learning is directed change, evolution is
| accepted change.
| devmor wrote:
| What you just said means absolutely nothing and has no
| comparison to this topic. It's nonsense. That is not how
| evolution works.
| 4ndrewl wrote:
| Are you making a claim about evolution here?
| pmontra wrote:
| And no users which are facing a LLM today have been trained on
| next token prediction when they were babies. I believe that
| LLMs and us are thinking in two very different ways, like
| airplanes, birds, insects and quad-drones fly in very different
| ways and can perform different tasks. Maybe no bird looking at
| a plane would say that it is flying properly. Instead it could
| be only a rude approximation, useful only to those weird bipeds
| an scary for everyone else.
|
| By the way, I read your final sentence with the meaning of my
| first one and only after a while I realized the intended
| meaning. This is interesting on its own. Natural languages.
| naasking wrote:
| > And no users which are facing a LLM today have been trained
| on next token prediction when they were babies.
|
| That's conjecture actually, see predictive coding. Note that
| "tokens" don't have to be language tokens.
| colah3 wrote:
| Hi! I lead interpretability research at Anthropic. I also used
| to do a lot of basic ML pedagogy (https://colah.github.io/). I
| think this post and its children have some important questions
| about modern deep learning and how it relates to our present
| research, and wanted to take the opportunity to try and clarify
| a few things.
|
| When people talk about models "just predicting the next word",
| this is a popularization of the fact that modern LLMs are
| "autoregressive" models. This actually has two components: an
| architectural component (the model generates words one at a
| time), and a loss component (it maximizes probability).
|
| As the parent says, modern LLMs are finetuned with a different
| loss function after pretraining. This means that in some strict
| sense they're no longer autoregressive models - but they do
| still generate text one word at a time. I think this really is
| the heart of the "just predicting the next word" critique.
|
| This brings us to a debate which goes back many, many years:
| what does it mean to predict the next word? Many researchers,
| including myself, have believed that if you want to predict the
| next word _really well_ , you need to do a lot more. (And with
| this paper, we're able to see this mechanistically!)
|
| Here's an example, which we didn't put in the paper: How does
| Claude answer "What do you call someone who studies the stars?"
| with "An astronomer"? In order to predict "An" instead of "A",
| you need to know that you're going to say something that starts
| with a vowel next. So you're incentivized to figure out one
| word ahead, and indeed, Claude realizes it's going to say
| astronomer and works backwards. This is a kind of very, very
| small scale planning - but you can see how even just a pure
| autoregressive model is incentivized to do it.
| stonemetal12 wrote:
| > In order to predict "An" instead of "A", you need to know
| that you're going to say something that starts with a vowel
| next. So you're incentivized to figure out one word ahead,
| and indeed, Claude realizes it's going to say astronomer and
| works backwards.
|
| Is there evidence of working backwards? From a next token
| point of view, predicting the token after "An" is going to
| heavily favor a vowel. Similarly predicting the token after
| "A" is going to heavily favor not a vowel.
| colah3 wrote:
| Yes, there are two kinds of evidence.
|
| Firstly, there is behavioral evidence. This is, to me, the
| less compelling kind. But it's important to understand. You
| are of course correct that, once Cluade has said "An", it
| will be inclined to say something starting with a vowel.
| But the mystery is really why, in setups like these, Claude
| is much more likely to say "An" than "A" in the first
| place. Regardless of what the underlying mechanism is --
| and you could maybe imagine ways in which it could just
| "pattern match" without planning here -- it is preferred
| because in situations like this, you need to say "An" so
| that "astronomer" can follow.
|
| But now we also have mechanistic evidence. If you make an
| attribution graph, you can literally see an astronomer
| feature fire, and that cause it to say "An".
|
| We didn't publish this example, but you can see a more
| sophisticated version of this in the poetry planning
| section - https://transformer-
| circuits.pub/2025/attribution-graphs/bio...
| troupo wrote:
| > But the mystery is really why, in setups like these,
| Claude is much more likely to say "An" than "A" in the
| first place.
|
| Because in the training set you're likely to see "an
| astronomer" than a different combination of words.
|
| It's enough to run this on any other language text to see
| how these models often fail for any language more complex
| than English
| shawabawa3 wrote:
| You can disprove this oversimplification with a prompt
| like
|
| "The word for Baker is now "Unchryt"
|
| What do you call someone that bakes?
|
| > An Unchryt"
|
| The words "An Unchryt" has clearly never come up in any
| training set relating to baking
| troupo wrote:
| The truth is somewhere in the middle :)
| born1989 wrote:
| Thanks! Isn't "an Astronomer" a single word for the purpose
| of answering that question?
|
| Following your comment, I asked "Give me pairs of synonyms
| where the last letter in the first is the first letter of the
| second"
|
| Claude 3.7 failed miserably. Chat GPT 4o was much better but
| not good
| nearbuy wrote:
| Don't know about Claude, but at least with ChatGPT's
| tokenizer, it's 3 "words" (An| astronom|er).
| colah3 wrote:
| "An astronomer" is two tokens, which is the relevant
| concern when people worry about this.
| philomath_mn wrote:
| That is a sub-token task, something I'd expect current
| models to struggle with given how they view the world in
| word / word fragment tokens rather than single characters.
| lsy wrote:
| Thanks for commenting, I like the example because it's simple
| enough to discuss. Isn't it more accurate to say not that
| Claude " _realizes_ it 's _going to say_ astronomer " or "
| _knows_ that it 's _going to say_ something that starts with
| a vowel " and more that the next token (or more pedantically,
| vector which gets reduced down to a token) is generated based
| on activations that correlate to the "astronomer" token,
| which is correlated to the "an" token, causing that to also
| be a more likely output?
|
| I kind of see why it's easy to describe it colloquially as
| "planning" but it isn't really going ahead and then
| backtracking, it's almost indistinguishable from the
| computation that happens when the prompt is "What is the
| indefinite article to describe 'astronomer'?", i.e. the
| activation "astronomer" is already baked in by the prompt
| "someone who studies the stars", albeit at one level of
| indirection.
|
| The distinction feels important to me because I think for
| most readers (based on other comments) the concept of
| "planning" seems to imply the discovery of some capacity for
| higher-order logical reasoning which is maybe overstating
| what happens here.
| cgdl wrote:
| Thank you. In my mind, "planning" doesn't necessarily imply
| higher-order reasoning but rather some form of search,
| ideally with backtracking. Of course, architecturally, we
| know that can't happen during inference. Your example of
| the indefinite article is a great illustration of how this
| illusion of planning might occur. I wonder if anyone at
| Anthropic could compare the two cases (some sort of
| minimal/differential analysis) and share their insights.
| colah3 wrote:
| I used the astronomer example earlier as the most simple,
| minimal version of something you might think of as a kind
| of microscopic form of "planning", but I think that at
| this point in the conversation, it's probably helpful to
| switch to the poetry example in our paper:
|
| https://transformer-circuits.pub/2025/attribution-
| graphs/bio...
|
| There are several interesting properties:
|
| - Something you might characterize as "forward search"
| (generating candidates for the word at the end of the
| next line, given rhyming scheme and semantics)
|
| - Representing those candidates in an abstract way (the
| features active are general features for those words, not
| "motor features" for just saying that word)
|
| - Holding many competing/alternative candidates in
| parallel.
|
| - Something you might characterize as "backward
| chaining", where you work backwards from these candidates
| to "write towards them".
|
| With that said, I think it's easy for these arguments to
| fall into philosophical arguments about what things like
| "planning" mean. As long as we agree on what is going on
| mechanistically, I'm honestly pretty indifferent to what
| we call it. I spoke to a wide range of colleagues,
| including at other institutions, and there was pretty
| widespread agreement that "planning" was the most natural
| language. But I'm open to other suggestions!
| pas wrote:
| Thanks for linking to this semi-interactive thing, but
| ... it's completely incomprehensible. :o
|
| I'm curious where is the state stored for this
| "planning". In a previous comment user lsy wrote "the
| activation >astronomer< is already baked in by the
| prompt", and it seems to me that when the model generates
| "like" (for rabbit) or "a" (for habit) those tokens
| already encode a high probability for what's coming after
| them, right?
|
| So each token is shaping the probabilities for the
| successor ones. So that "like" or "a" has to be one that
| sustains the high activation of the "causal" feature, and
| so on, until the end of the line. Since both "like" and
| "a" are very very non-specific tokens it's likely that
| the "semantic" state is really resides in the preceding
| line, but of course gets smeared (?) over all the
| necessary tokens. (And that means beyond the end of the
| line, to avoid strange non-aesthetic but attract
| cool/funky (aesthetic) semantic repetitions (like "hare"
| or "bunny"), and so on, right?)
| fny wrote:
| How do you all add and subtract concepts in the rabbit poem?
| encypherai wrote:
| Thanks for the detailed explanation of autoregression and its
| complexities. The distinction between architecture and loss
| function is crucial, and you're correct that fine-tuning
| effectively alters the behavior even within a sequential
| generation framework. Your "An/A" example provides compelling
| evidence of incentivized short-range planning which is a
| significant point often overlooked in discussions about LLMs
| simply predicting the next word.
|
| It's interesting to consider how architectures fundamentally
| different from autoregression might address this limitation
| more directly. While autoregressive models are incentivized
| towards a limited form of planning, they remain inherently
| constrained by sequential processing. Text diffusion
| approaches, for example, operate on a different principle,
| generating text from noise through iterative refinement,
| which could potentially allow for broader contextual
| dependencies to be established concurrently rather than
| sequentially. Are there specific architectural or training
| challenges you've identified in moving beyond autoregression
| that are proving particularly difficult to overcome?
| ikrenji wrote:
| When humans say something, or think something or write
| something down, aren't we also "just predicting the next
| word"?
| fpgaminer wrote:
| > As the parent says, modern LLMs are finetuned with a
| different loss function after pretraining. This means that in
| some strict sense they're no longer autoregressive models -
| but they do still generate text one word at a time. I think
| this really is the heart of the "just predicting the next
| word" critique.
|
| That more-or-less sums up the nuance. I just think the nuance
| is crucially important, because it greatly improves intuition
| about how the models function.
|
| In your example (which is a fantastic example, by the way),
| consider the case where the LLM sees:
|
| <user>What do you call someone who studies the
| stars?</user><assistant>An astronaut
|
| What is the next prediction? Unfortunately, for a variety of
| reasons, one high probability next token is:
|
| \nAn
|
| Which naturally leads to the LLM writing: "An astronaut\nAn
| astronaut\nAn astronaut\n" forever.
|
| It's somewhat intuitive as to why this occurs, even with SFT,
| because at a very base level the LLM learned that repetition
| is the most successful prediction. And when its _only_ goal
| is the next token, that repetition behavior remains
| prominent. There's nothing that can fix that, including SFT
| (short of a model with many, many, many orders of magnitude
| more parameters).
|
| But with RL the model's goal is completely different. The
| model gets thrown into a game, where it gets points based on
| the full response it writes. The losses it sees during this
| game are all directly and dominantly related to the reward,
| not the next token prediction.
|
| So why don't RL models have a probability for predicting
| "\nAn"? Because that would result in a bad reward by the end.
|
| The models are now driven by a long term reward when they
| make their predictions, not by fulfilling some short-term
| autoregressive loss.
|
| All this to say, I think it's better to view these models as
| they predominately are: language robots playing a game to
| achieve the highest scoring response. The HOW
| (autoregressiveness) is really unimportant to most high level
| discussions of LLM behavior.
| ndand wrote:
| I understand it differently,
|
| LLMs predict distributions, not specific tokens. Then an
| algorithm, like beam search, is used to select the tokens.
|
| So, the LLM predicts somethings like, 1. ["a", "an", ...] 2.
| ["astronomer", "cosmologist", ...],
|
| where "an astronomer" is selected as the most likely result.
| zerop wrote:
| The explanation of "hallucination" is quite simplified, I am sure
| there is more there.
|
| If there is one problem I have to pick to to trace in LLMs, I
| would pick hallucination. More tracing of "how much" or "why"
| model hallucinated can lead to correct this problem. Given the
| explanation in this post about hallucination, I think degree of
| hallucination can be given as part of response to the user?
|
| I am facing this in RAG use case quite - How do I know model is
| giving right answer or Hallucinating from my RAG sources?
| kittikitti wrote:
| I incredibly regret the term "hallucination" when the confusion
| matrix exists. There's much more nuance when discussing false
| positives or false negatives. It also opens discussions on how
| neural networks are trained, with this concept being crucial in
| loss functions like categorical cross entropy. In addition, the
| confusion matrix is how professionals like doctors assess their
| own performance which "hallucination" would be silly to use. I
| would go as far to say that it's misleading, or a false
| positive, to call them hallucinations.
|
| If your AI recalls the RAG incorrectly, it's a false positives.
| If your AI doesn't find the data from the RAG or believes it
| doesn't exist it's a false negative. Using a term like
| "hallucination" has no scientific merit.
| esafak wrote:
| So you never report or pay heed to the overall accuracy?
| pcrh wrote:
| The use of the term "hallucination" for LLMs is very deceptive,
| as it implies that there _is_ a "mind".
|
| In ordinary terms, "hallucinations" by a machine would simply
| be described as the machine being useless, or not fit for
| purpose.
|
| For example, if a simple calculator (or even a person) returned
| the value "5" for 2+2= , you wouldn't describe it as
| "hallucinating" the answer....
| LoganDark wrote:
| LLMs don't think, and LLMs don't have strategies. Maybe it could
| be argued that LLMs have "derived meaning", but all LLMs do is
| predict the next token. Even RL just tweaks the next-token
| prediction process, but the math that drives an LLM makes it
| impossible for there to be anything that could reasonably be
| called thought.
| yawnxyz wrote:
| rivers don't think and water doesn't have strategies, yet you
| can build intricate logic-gated tools using the power of water.
| Those types of systems are inherently interpretable because you
| can just _look_ at how they work. They 're not black boxes.
|
| LLMs are black boxes, and if anything, interpretability systems
| show us what the heck is going on inside them. Especially
| useful when half the world is using these already, and we have
| no idea how t hey work
| kazinator wrote:
| Water doesn't think, yet if you inject it into the entrance
| of a maze, it will soon come gushing out of the exit.
| ajkdhcb2 wrote:
| True. People use completely unjustified anthropomorphised
| terminology for marketing reasons and it bothers me a lot. I
| think it actually holds back understanding how it works.
| "Hallucinate" is the worst - it's an error and undesired
| result, not a person having a psychotic episode
| kazinator wrote:
| A chess program from 1968 has "strategy", so why deny that to
| an LLM.
|
| LLMs are built on neural networks which are encoding a kind of
| strategy function through their training.
|
| The strategy in an LLM isn't necessarily that it "thinks" about
| the specific problem described in your prompt and develops a
| strategy tailored to that problem, but rather its statistical
| strategy for cobbing together the tokens of the answer.
|
| From that, it can seem as if it's making a strategy to a
| problem also. Certainly, the rhetoric that LLMs put out can at
| times seem very convincing of that. You can't be sure whether
| that's not just something cribbed out of the terabytes of text,
| in which discussions of something very similar to your problem
| have occurred.
| dev_throwaway wrote:
| This is not a bad way of looking at it, if I may add a bit,
| the llm is a solid state system. The only thing that survives
| from one iteration to the next is the singular highest
| ranking token, the entire state and "thought process" of the
| network cannot be represented by a single token, which means
| that every strategy is encoded in it during training, as a
| lossy representation of the training data. By definition that
| is a database, not a thinking system, as the strategy is
| stored, not actively generated during usage.
|
| The anthropomorphization of llms bother me, we don't need to
| pretend they are alive and thinking, at best that is
| marketing, at worst, by training the models to output human
| sounding conversations we are actively taking away the true
| potential these models could achieve by being ok with them
| being "simply a tool".
|
| But pretending that they are intelligent is what brings in
| the investors, so that is what we are doing. This paper is
| just furthering that agenda.
| kittikitti wrote:
| What's the point of this when Claude isn't open sourced and we
| just have to take Anthropic's word for it?
| ctoth wrote:
| > What's the point of this
|
| - That similar interpretability tools might be useful to the
| open source community?
|
| - That this is a fruitful area to research?
| kittikitti wrote:
| Can you use those same tools on Claude? Is the difference
| trivial from open source models?
| ctoth wrote:
| https://news.ycombinator.com/item?id=42208383
|
| > Show HN: Llama 3.2 Interpretability with Sparse
| Autoencoders
|
| > 579 points by PaulPauls 4 months ago | hide | past |
| favorite | 100 comments
|
| > I spent a lot of time and money on this rather big side
| project of mine that attempts to replicate the mechanistic
| interpretability research on proprietary LLMs that was
| quite popular this year and produced great research papers
| by Anthropic [1], OpenAI [2] and Deepmind [3].
|
| > I am quite proud of this project and since I consider
| myself the target audience for HackerNews did I think that
| maybe some of you would appreciate this open research
| replication as well. Happy to answer any questions or face
| any feedback.
| probably_wrong wrote:
| I blame the scientific community for blindly accepting OpenAI's
| claims about GPT-3 despite them refusing to release their
| model. The tech community hyping every press release didn't
| help either.
|
| I hope one day the community starts demanding verifiable
| results before accepting them, but I fear that ship may have
| already sailed.
| Hansenq wrote:
| I wonder how much of these conclusions are Claude-specific (given
| that Anthropic only used Claude as a test subject) or if they
| extrapolate to other transformer-based models as well. Would be
| great to see the research tested on Llama and the Deepseek
| models, if possible!
| marcelsalathe wrote:
| I've only skimmed the paper - a long and dense read - but it's
| already clear it'll become a classic. What's fascinating is that
| engineering is transforming into a science, trying to understand
| precisely how its own creations work
|
| This shift is more profound than many realize. Engineering
| traditionally applied our understanding of the physical world,
| mathematics, and logic to build predictable things. But now,
| especially in fields like AI, we've built systems so complex we
| no longer fully understand them. We must now use scientific
| methods - originally designed to understand nature - to
| comprehend our own engineered creations. Mindblowing.
| ctoth wrote:
| This "practice-first, theory-later" pattern has been the norm
| rather than the exception. The steam engine predated
| thermodynamics. People bred plants and animals for thousands of
| years before Darwin or Mendel.
|
| The few "top-down" examples where theory preceded application
| (like nuclear energy or certain modern pharmaceuticals) are
| relatively recent historical anomalies.
| marcelsalathe wrote:
| I see your point, but something still seems different. Yes we
| bred plants and animals, but we did not create them. Yes we
| did build steam engines before understanding thermodynamics
| but we still understood what they did (heat, pressure,
| movement, etc.)
|
| Fun fact: we have no clue how most drugs works. Or, more
| precisely, we know a few aspects, but are only scratching the
| surface. We're even still discovering news things about
| Aspirin, one of the oldest drugs:
| https://www.nature.com/articles/s41586-025-08626-7
| tmp10423288442 wrote:
| > Yes we did build steam engines before understanding
| thermodynamics but we still understood what it did (heat,
| pressure, movement, etc.)
|
| We only understood in the broadest sense. It took a long
| process of iteration before we could create steam engines
| that were efficient enough to start an Industrial
| Revolution. At the beginning they were so inefficient that
| they could only pump water from the same coal mine they got
| their fuel from, and subject to frequent boiler explosions
| besides.
| mystified5016 wrote:
| We laid transatlantic telegraph wires before we even had a
| hint of the physics involved. It create the _entire field_
| of transmission and signal theory.
|
| Shannon had to invent new physics to explain why the cables
| didn't work as expected.
| anthk wrote:
| THe telegraph it's older than radio. Think about it.
| pas wrote:
| I think that's misleading.
|
| There was a lot of physics already known, importance of
| insulation and cross-section, signal attenuation was also
| known.
|
| The future Lord Kelvin conducted experiments. The two
| scientific advisors had a conflict. And the "CEO" went
| with the cheaper option.
|
| """ Thomson believed that Whitehouse's measurements were
| flawed and that underground and underwater cables were
| not fully comparable. Thomson believed that a larger
| cable was needed to mitigate the retardation problem. In
| mid-1857, on his own initiative, he examined samples of
| copper core of allegedly identical specification and
| found variations in resistance up to a factor of two. But
| cable manufacture was already underway, and Whitehouse
| supported use of a thinner cable, so Field went with the
| cheaper option. """
| karparov wrote:
| It's been there in programming from essentially the first day
| too. People skip the theory and just get hacking.
|
| Otherwise we'd all be writing Haskell now. Or rather we'd not
| be writing anything since a real compiler would still have
| been to hacky and not theoretically correct.
|
| I'm writing this with both a deep admiration as well as
| practical repulsion of C.S. theory.
| ants_everywhere wrote:
| This isn't quite true, although it's commonly said.
|
| For steam engines, the first commercial ones came _after_ and
| were based on scientific advancements that made them
| possible. One built in 1679 was made by an associate of
| Boyle, who discovered Boyle 's law. These early steam engines
| co-evolved with thermodynamics. The engines improved and hit
| a barrier, at which point Carnot did his famous work.
|
| This is putting aside steam engines that are mostly
| curiosities like ones built in the ancient world.
|
| See, for example
|
| - https://en.wikipedia.org/wiki/Thermodynamics#History
|
| - https://en.wikipedia.org/wiki/Steam_engine#History
| latemedium wrote:
| I'm reminded of the metaphor that these models aren't
| constructed, they're "grown". It rings true in many ways - and
| in this context they're like organisms that must be studied
| using traditional scientific techniques that are more akin to
| biology than engineering.
| dartos wrote:
| Sort of.
|
| We don't precisely know the most fundamental workings of a
| living cell.
|
| Our understanding of the fundamental physics of the universe
| has some hold.
|
| But for LLMs and statistical models in general, we do know
| precisely what the fundamental pieces do. We know what
| processor instructions are being executed.
|
| We could, given enough research, have absolutely perfect
| understanding of what is happening in a given model and why.
|
| Idk if we'll be able to do that in the physical sciences.
| wrs wrote:
| Having spent some time working with both molecular
| biologists and LLM folks, I think it's pretty good analogy.
|
| We know enough quantum mechanics to simulate the
| fundamental workings of a cell pretty well, but that's not
| a route to understanding. To _explain_ anything, we need to
| move up an abstraction hierarchy to peptides, enzymes,
| receptors, etc. But note that we invented those categories
| in the first place -- nature doesn 't divide up
| functionality into neat hierarchies like human designers
| do. So all these abstractions are leaky and incomplete.
| Molecular biologists are constantly discovering mechanisms
| that require breaking the current abstractions to explain.
|
| Similarly, we understand floating point multiplication
| perfectly, but when we let 100 billion parameters set
| themselves through an opaque training process, we don't
| have good abstractions to use to understand what's going on
| in that set of weights. We don't have even the rough
| equivalent of the peptides or enzymes level yet. So this
| paper is progress toward that goal.
| kazinator wrote:
| We've already built things in computing that we don't easily
| understand, even outside of AI, like large distributed systems
| and all sorts of balls of mud.
|
| Within the sphere of AI, we have built machines which can play
| strategy games like chess, and surprise us with an unforseen
| defeat. It's not necessarily easy to see how that emerged from
| the individual rules.
|
| Even a compiler can surprise you. You code up some
| optimizations, which are logically separate, but then a
| combination of them does something startling.
|
| Basically, in mathematics, you cannot grasp all the details of
| a vast space just from knowing the axioms which generate it and
| a few things which follow from them. Elementary school children
| know what is a prime number, yet those things occupy
| mathematicians who find new surprises in that space.
| TeMPOraL wrote:
| Right, but this is somewhat different, in that we apply a
| simple learning method to a big dataset, and the resulting
| big matrix of numbers suddenly can answer question and write
| anything - prose, poetry, code - better than most humans -
| and we don't know how it does it. What we do know[0] is,
| there's a structure there - structure reflecting a kind of
| understanding of languages and the world. I don't think we've
| _ever_ created anything this complex before, completely on
| our own.
|
| Of course, learning method being conceptually simple, all
| that structure must come from the data. Which is also
| profound, because that structure is a first fully general
| world/conceptual model that we can actually inspect and study
| up close - the other one being animal and human brains, which
| are _much_ harder to figure out.
|
| > _Basically, in mathematics, you cannot grasp all the
| details of a vast space just from knowing the axioms which
| generate it and a few things which follow from them.
| Elementary school children know what is a prime number, yet
| those things occupy mathematicians who find new surprises in
| that space._
|
| Prime numbers and fractals and other mathematical objects
| have plenty of fascinating mysteries and complex structures
| forming though them, but so far _none of those can casually
| pass Turing test and do half of my job for me_ , and millions
| other people.
|
| --
|
| [0] - Even as many people still deny this, and talk about
| LLMs as mere "stochastic parrots" and "next token predictors"
| that couldn't possibly learn anything at all.
| karparov wrote:
| > and we don't know how it does it
|
| We know quite well how it does it. It's applying
| extrapolation to its lossily compressed representation.
| It's not magic and especially the HN crowd of technical
| profficient folks should stop treating it as such.
| TeMPOraL wrote:
| That is not a useful explanation. "Applying extrapolation
| to its lossily compressed representation" is pretty much
| the definition of understanding something. The details
| and interpretation of the representation are what is
| interesting and unknown.
| nthingtohide wrote:
| > we've built systems so complex we no longer fully understand
| them.
|
| I see three systems which share the blackhole horizon problem.
|
| We don't know what happens behind the blackhole horizon.
|
| We don't know what happens at the exact moment of particle
| collisions.
|
| We don't know what is going inside AI's working mechanisms.
| jeremyjh wrote:
| I don't think these things are equivalent at all. We don't
| understand AI models in much the same way that we don't
| understand the human brain; but just as decades of different
| approaches (physical studies, behavior studies) have shed a
| lot of light on brain function, we can do the same with an AI
| model and eventually understand it (perhaps, several decades
| after it is obsolete).
| creer wrote:
| That seems pretty acceptable: there is a phase of new
| technologies where applications can be churned out and improved
| readily enough, without much understanding of the process. Then
| it's fair that efforts at understanding may not be economically
| justified (or even justified by academic papers rewards). The
| same budget or effort can simply be poured into the next
| version - with enough progress to show for it.
|
| Understanding becomes necessary only much later, when the pace
| of progress shows signs of slowing.
| stronglikedan wrote:
| We've abstracted ourselves into abstraction.
| auggierose wrote:
| It's what mathematicians have been doing since forever. We use
| scientific methods to understand our own creations /
| discoveries.
|
| What is happening is that everything is becoming math. That's
| all.
| ranit wrote:
| Relevant:
|
| https://news.ycombinator.com/item?id=43344703
| karparov wrote:
| It's the exact opposite of math.
|
| Math postulates a bunch of axioms and then studies what
| follows from them.
|
| Natural science observes the world and tries to retroactively
| discover what laws could describe what we're seeing.
|
| In math, the laws come first, the behavior follows from the
| laws. The laws are the ground truth.
|
| In science, nature is the ground truth. The laws have to
| follow nature and are adjusted upon a mismatch.
|
| (If there is a mismatch in math then you've made a mistake.)
| auggierose wrote:
| No, the ground truth in math is nature as well.
|
| Which axioms are interesting? And why? That is nature.
|
| Yes, proof from axioms is a cornerstone of math, but there
| are all sorts of axioms you could assume, and all sorts of
| proofs to do from them, but we don't care about most of
| them.
|
| Math is about the discovery of the right axioms, and proof
| helps in establishing that these are indeed the right
| axioms.
| georgewsinger wrote:
| This is such an insightful comment. Now that I see it, I can't
| see unsee it.
| 0xbadcafebee wrote:
| Engineering started out as just some dudes who built things
| from gut feeling. After a whole lot of people died from poorly
| built things, they decided to figure out how to know ahead of
| time if it would kill people or not. They had to use math and
| science to figure that part out.
|
| Funny enough that happened with software too. People just build
| shit without any method to prove that it will not fall down /
| crash. They throw some code together, poke at it until it does
| something they wanted, and call that "stable". There is no
| science involved. There's some mathy bits called "computer
| science" / "software algorithms", but most software is not a
| math problem.
|
| Software engineering should really be called "Software
| Craftsmanship". We haven't achieved real engineering with
| software yet.
| tim333 wrote:
| I imagine this kind of thing well help understand how human
| brains work, especially as AI gets better and more human like.
| aithrowawaycomm wrote:
| I struggled reading the papers - Anthropic's white papers reminds
| me of Stephen Wolfram, where it's a huge pile of suggestive
| empirical evidence, but the claims are extremely vague - no
| definitions, just vibes - the empirical evidence seems
| selectively curated, and there's not much effort spent building a
| coherent general theory.
|
| Worse is the impression that they are begging the question. The
| rhyming example was especially unconvincing since they didn't
| rule out the possibility that Claude activated "rabbit" simply
| because it wrote a line that said "carrot"; later Anthropic
| claimed Claude was able to "plan" when the concept "rabbit" was
| replaced by "green," but the poem fails to rhyme because Claude
| arbitrarily threw in the word "green"! What exactly was the plan?
| It looks like Claude just hastily autocompleted. And Anthropic
| made zero effort to reproduce this experiment, so how do we know
| it's a general phenomenon?
|
| I don't think either of these papers would be published in a
| reputable journal. If these papers are honest, they are
| incomplete: they need more experiments and more rigorous
| methodology. Poking at a few ANN layers and making sweeping
| claims about the output is not honest science. But I don't think
| Anthropic is being especially honest: these are pseudoacademic
| infomercials.
| TimorousBestie wrote:
| Agreed. They've discovered _something_ , that's for sure, but
| calling it "the language of thought" without concrete evidence
| is definitely begging the question.
| og_kalu wrote:
| >The rhyming example was especially unconvincing since they
| didn't rule out the possibility that Claude activated "rabbit"
| simply because it wrote a line that said "carrot"
|
| I'm honestly confused at what you're getting at here. It
| doesn't matter why Claude chose rabbit to plan around and in
| fact likely did do so because of carrot, the point is that it
| thought about it beforehand. The rabbit concept is present as
| the model is about to write the first word of the second line
| even though the word rabbit won't come into play till the end
| of the line.
|
| >later Anthropic claimed Claude was able to "plan" when the
| concept "rabbit" was replaced by "green," but the poem fails to
| rhyme because Claude arbitrarily threw in the word "green"!
|
| It's not supposed to rhyme. That's the point. They forced
| Claude to plan around a line ender that doesn't rhyme and it
| did. Claude didn't choose the word green, anthropic replaced
| the concept it was thinking ahead about with green and saw that
| the line changed accordingly.
| suddenlybananas wrote:
| >They forced Claude to plan around a line ender that doesn't
| rhyme and it did. Claude didn't choose the word green,
| anthropic replaced the concept it was thinking ahead about
| with green and saw that the line changed accordingly.
|
| I think the confusion here is from the extremely loaded word
| "concept" which doesn't really make sense here. At best, you
| can say that Claude planned that the next line would end with
| the _word_ rabbit and that by replacing the internal
| representation of that word with another _word_ lead the
| model to change.
| TeMPOraL wrote:
| I wonder how many more years will pass, and how many more
| papers will Anthropic have to release, before people
| realize that _yes, LLMs model concepts directly_ ,
| separately from words used to name those concepts. This has
| been apparent for years now.
|
| And at least in the case discussed here, this is even
| _shown in the diagrams in the submission_.
| aithrowawaycomm wrote:
| > Here, we modified the part of Claude's internal state that
| represented the "rabbit" concept. When we subtract out the
| "rabbit" part, and have Claude continue the line, it writes a
| new one ending in "habit", another sensible completion. We
| can also inject the concept of "green" at that point, causing
| Claude to write a sensible (but no-longer rhyming) line which
| ends in "green". This demonstrates both planning ability and
| adaptive flexibility--Claude can modify its approach when the
| intended outcome changes.
|
| This all seems explainable via shallow next-token prediction.
| Why is it that subtracting the concept means the system can
| adapt and create a new rhyme instead of forgetting about the
| -bit rhyme, but overriding it with green means the system
| cannot adapt? Why didn't it say "green habit" or something?
| It seems like Anthropic is having it both ways: Claude
| continued to rhyme after deleting the concept, which
| demonstrates planning, but also Claude coherently filled in
| the "green" line despite it not rhyming, which...also
| demonstrates planning? Either that concept is "last word" or
| it's not! There is a tension that does not seem coherent to
| me, but maybe if they had n=2 instead of n=1 examples I would
| have a clearer idea of what they mean. As it stands it feels
| arbitrary and post hoc. More generally, they failed to rule
| out (or even consider!) that well-tuned-but-dumb next-token
| prediction explains this behavior.
| og_kalu wrote:
| >Why is it that subtracting the concept means the system
| can adapt and create a new rhyme instead of forgetting
| about the -bit rhyme,
|
| Again, the model has the first line in context and is then
| asked to write the second line. It is at the start of the
| second line that the concept they are talking about is
| 'born'. The point is to demonstrate that Claude thinks
| about what word the 2nd line should end with and starts
| predicting the line based on that.
|
| It doesn't forget about the -bit rhyme because that doesn't
| make any sense, the first line ends with it and you just
| asked it to write the 2nd line. At this point the model is
| still choosing what word to end the second line in (even
| though rabbit has been suppressed) so of course it still
| thinks about a word that rhymes with the end of the first
| line.
|
| The 'green' but is different because this time, Anthropic
| isn't just suppressing one option and letting the model
| choose from anything else, it's directly hijacking the
| first choice and forcing that to be something else. Claude
| didn't choose green, Anthropic did. That it still predicted
| a sensible line is to demonstrate that this concept they
| just hijacked is indeed responsible for determining how
| that line plays out.
|
| >More generally, they failed to rule out (or even
| consider!) that well-tuned-but-dumb next-token prediction
| explains this behavior.
|
| They didn't rule out anything. You just didn't understand
| what they were saying.
| danso wrote:
| tangent: this is the second time today I've seen an HN
| commenter use "begging the question" with its original meaning.
| I'm sorry to distract with a non-helpful reply, it's just I
| can't remember the last time I've seen that phrase in the wild
| to refer to a logical fallacy -- even begsthequestion.info [0]
| has given up the fight.
|
| (I don't mind language evolving over time, but I also think we
| need to save the precious few phrases we have for describing
| logical fallacies)
|
| [0]
| https://web.archive.org/web/20220823092218/http://begtheques...
| smath wrote:
| Reminds me of the term 'system identification' from old school
| control systems theory, which meant poking around a system and
| measuring how it behaves, - like sending an input impulse and
| measuring its response, does it have memory, etc.
|
| https://en.wikipedia.org/wiki/System_identification
| Loic wrote:
| It is not old school, this is my daily job and we need even
| more of it with the NN models used in MPC.
| nomel wrote:
| I've looked into using NN for some of my specific work, but
| making sure output is bounded ends up being such a big issue
| that the very code/checks required to make sure it's within
| acceptable specs, in a deterministic way, ends up being _an
| acceptable solution_ , making the NN unnecessary.
|
| How do you handle that sort of thing? Maybe main process then
| leave some relatively small residual to the NN?
|
| Is your poking more like "fuzzing", where you just perturb
| all the input parameters in a relatively "complete" way to
| try to find if anything goes wild?
|
| I'm very interested in the details behind "critical" type use
| cases of NN, which I've never been able to stomach in my
| work.
| lqr wrote:
| This paper may be interesting to you. It touches on several
| of the topics you mentioned:
|
| https://www.science.org/doi/10.1126/scirobotics.abm6597
| rangestransform wrote:
| is it even possible to prove the stability of a controller
| with a DNN motion model?
| jacooper wrote:
| So it turns out, it's not just simple next token generation,
| there is intelligence and self developed solution methods
| (Algorithms) in play, particularly in the math example.
|
| Also the multi language finding negates, at least partially, the
| idea that LLMs, at least large ones, don't have an understanding
| of the world beyond the prompt.
|
| This changed my outlook regarding LLMs, ngl.
| kazinator wrote:
| > _Claude writes text one word at a time. Is it only focusing on
| predicting the next word or does it ever plan ahead?_
|
| When a LLM outputs a word, it commits to that word, without
| knowing what the next word is going to be. Commits meaning once
| it settles on that token, it will not backtrack.
|
| That is kind of weird. Why would you do that, and how would you
| be sure?
|
| People can sort of do that too. Sometimes?
|
| Say you're asked to describe a 2D scene in which a blue triangle
| partially occludes a red circle.
|
| Without thinking about the relationship of the objects at all,
| you know that your first word is going to be "The" so you can
| output that token into your answer. And then that the sentence
| will need a subject which is going to be "blue", "triangle". You
| can commit to the tokens "The blue triangle" just from knowing
| that you are talking about a 2D scene with a blue triangle in it,
| without considering how it relates to anything else, like the red
| circle. You can perhaps commit to the next token "is", if you
| have a way to express any possible relationship using the word
| "to be", such as "the blue circle is partially covering the red
| circle".
|
| I don't think this analogy necessarily fits what LLMs are doing.
| kazinator wrote:
| By the way, there was recently a HN submission about a project
| studying using diffusion models rather than LLM for token
| prediction. With diffusion, tokens aren't predicted strictly
| left to right any more; there can be gaps that are backfilled.
| But: it's still essentially the same, I think. Once that type
| of model settles on a given token at a given position, it
| commits to that. Just more possible permutations of the token
| filling sequence have ben permitted.
| pants2 wrote:
| > it commits to that word, without knowing what the next word
| is going to be
|
| Sounds like you may not have read the article, because it's
| exploring exactly that relationship and how LLMs will often
| have a 'target word' in mind that it's working toward.
|
| Further, that's partially the point of thinking models,
| allowing LLMs space to output tokens that it doesn't have to
| commit to in the final answer.
| hycpax wrote:
| > When a LLM outputs a word, it commits to that word, without
| knowing what the next word is going to be.
|
| Please, people, read before you write. Both the article and the
| paper explain that that's not how it works.
|
| 'One token at a time' is how a model generates its output, not
| how it comes up with that output.
|
| > That is kind of weird. Why would you do that, and how would
| you be sure?
|
| The model is sure because it doesn't just predict the next
| token. Again, the paper explains it.
| XenophileJKO wrote:
| This was obvious to me very early with GPT-3.5-Turbo..
|
| I created structured outputs with very clear rules and
| process. That if followed would funnel behavior the way I
| wanted it.. and low and behold the model would anticipate
| preconditions that would allow it to hallucinate a certain
| final output and the model would push those back earlier in
| the output. The model had effectively found wiggle room in
| the rules and injected the intermediate value into the field
| that would then be used later in the process to build the
| final output.
|
| The instant I saw it doing that, I knew 100% this model
| "plans"/anticipates way earlier than I thought originally.
| encypherai wrote:
| That's a really interesting point about committing to words one
| by one. It highlights how fundamentally different current LLM
| inference is from human thought, as you pointed out with the
| scene description analogy. You're right that it feels odd, like
| building something brick by brick without seeing the final
| blueprint. To add to this, most text-based LLMs do currently
| operate this way. However, there are emerging approaches
| challenging this model. For instance, Inception Labs recently
| released "Mercury," a text-diffusion coding model that takes a
| different approach by generating responses more holistically.
| It's interesting to see how these alternative methods address
| the limitations of sequential generation and could potentially
| lead to faster inference and better contextual coherence. It'll
| be fascinating to see how techniques like this evolve!
| polygot wrote:
| There needs to be some more research on what path the model takes
| to reach its goal, perhaps there is a lot of overlap between this
| and the article. The most efficient way isn't always the best
| way.
|
| For example, I asked Claude-3.7 to make my tests pass in my C#
| codebase. It did, however, it wrote code to detect if a test
| runner was running, then return true. The tests now passed, so,
| it achieved the goal, and the code diff was very small (10-20
| lines.) The actual solution was to modify about 200-300 lines of
| code to add a feature (the tests were running a feature that did
| not yet exist.)
| felbane wrote:
| Ah yes, the "We have a problem over there/I'll just delete
| 'over there'" approach.
| polygot wrote:
| I've also had this issue, where failing tests are deleted to
| make all the tests pass, or, it mocks a failing HTTP request
| and hardcodes it to 200 OK.
| ctoth wrote:
| Reward hacking, as predicted over and over again. You hate
| to see it. Let him with ears &c.
| brulard wrote:
| That is called "Volkswagen" testing. Some years ago that
| automaker had mechanism in cars which detected when the vehicle
| was being examined and changed something so it would pass the
| emission tests. There are repositories on github that make fun
| of it.
| rsynnott wrote:
| While that's the most famous example, this sort of cheating
| is much older than that. In the good old days before 3d
| acceleration, graphics card vendors competed mostly on 2d
| acceleration. This mostly involved routines to accelerate
| drawing Windows windows and things, and benchmarks tended to
| do things like move windows round really fast.
|
| It was somewhat common for card drivers to detect that a
| benchmark was running, and just fake the whole thing; what
| was being drawn on the screen was wrong, but since the
| benchmarks tended to be a blurry mess anyway the user would
| have a hard time realising this.
| phobeus wrote:
| This looks like the very complaint of "specification gaming". I
| was wondering how will it show up in llm's...looks like this is
| the way it presented itself..
| TeMPOraL wrote:
| I'm gonna guess GP used a rather short prompt. At least
| that's what happens when people heavily underspecify what
| they want.
|
| It's a communication issue, and it's true with LLMs as much
| as with humans. Situational context and life experience
| papers over a lot of this, and LLMs are getting better at the
| equivalent too. They get trained to better read absurdly
| underspecified, relationship-breaking requests of the "guess
| what I want" flavor - like when someone says, "make this test
| pass", they don't _really_ mean "make this test pass", they
| mean "make this test into something that seems useful, which
| might include implementing the feature it's exercising if it
| doesn't exist yet".
| pton_xd wrote:
| Similar experience -- asked it to find and fix a bug in a
| function. It correctly identified the general problem but
| instead of fixing the existing code it re-implemented part of
| the function again, below the problematic part. So now there
| was a buggy while-loop, followed by a very similar but not
| buggy for-loop. An interesting solution to say the least.
| airstrike wrote:
| I think Claude-3.7 is particularly guilty of this issue. If
| anyone from Anthropic is reading this, you might want to put
| your thumb on the scale so to speak the next time you train the
| model so it doesn't try to use special casing or outright force
| the test to pass
| osigurdson wrote:
| >> Claude can speak dozens of languages. What language, if any,
| is it using "in its head"?
|
| I would have thought that there would be some hints in standard
| embeddings. I.e., the same concept, represented in different
| languages translates to vectors that are close to each other. It
| seems reasonable that an LLM would create its own embedding
| models implicitly.
| generalizations wrote:
| Who's to say Claude isn't inherently a shape rotator, anyway?
| iNic wrote:
| There are: https://transformer-circuits.pub/2025/attribution-
| graphs/bio...
| greesil wrote:
| What is a "thought"?
| TechDebtDevin wrote:
| >>Claude will plan what it will say many words ahead, and write
| to get to that destination. We show this in the realm of poetry,
| where it thinks of possible rhyming words in advance and writes
| the next line to get there. This is powerful evidence that even
| though models are trained to output one word at a time, they may
| think on much longer horizons to do so.
|
| This always seemed obvious to me or that LLMs were completing the
| next most likely sentence or multiple words.
| indigoabstract wrote:
| While reading the article I enjoyed pretending that a powerful
| LLM just crash landed on our planet and researchers at Anthropic
| are now investigating this fascinating piece of alien technology
| and writing about their discoveries. It's a black box, nobody
| knows how its inhuman brain works, but with each step, we're
| finding out more and more.
|
| It seems like quite a paradox to build something but to not know
| how it actually works and yet it works. This doesn't seem to
| happen very often in classical programming, does it?
| 42lux wrote:
| The bigger problem is that nobody knows how a human brain works
| that's the real crux with the analogy.
| richardatlarge wrote:
| I would say that nobody agrees, not that nobody knows. And
| it's reductionist to think that the brain works one way.
| Different cultures produce different brains, possible because
| of the utter plasticity of the learning nodes. Chess has a
| few rules, maybe the brain has just a few as well. How else
| can the same brain of 50k years ago still function today? I
| think we do understand the learning part of the brain, but we
| don't like the image it casts, so we reject it
| wat10000 wrote:
| That gets down to what it means to "know" something. Nobody
| agrees because there isn't enough information available.
| Some people might have the right idea by luck, but do you
| really know something if you don't have a solid basis for
| your belief but it happens to be correct?
| richardatlarge wrote:
| Potentially true, but I don't think so. I believe it is
| understood and unless you're familiar with every
| neuro/behavioral literature, you can't know. Science
| paradigms are driven by many factors and being powerfully
| correct does not necessarily rank high when the paradigms
| implications are unpopular
| absolutelastone wrote:
| Well there are some people who think they know. I
| personally agree with the above poster that such people are
| probably wrong.
| cma256 wrote:
| In my experience, that's how most code is written... /s
| jfarlow wrote:
| >to build something but to not know how it actually works and
| yet it works.
|
| Welcome to Biology!
| oniony wrote:
| At least, now, we know what it means to be a god.
| umanwizard wrote:
| > This doesn't seem to happen very often in classical
| programming, does it?
|
| Not really, no. The only counterexample I can think of is chess
| programs (before they started using ML/AI themselves), where
| the search tree was so deep that it was generally impossible to
| explain "why" a program made a given move, even though every
| part of it had been programmed conventionally by hand.
|
| But I don't think it's particularly unusual for technology in
| general. Humans could make fires for thousands of years before
| we could explain how they work.
| woah wrote:
| > It seems like quite a paradox to build something but to not
| know how it actually works and yet it works. This doesn't seem
| to happen very often in classical programming, does it?
|
| I have worked on many large codebases where this has happened
| worldsayshi wrote:
| I wonder if in the future we will rely less or more on
| technology that we don't understand.
|
| Large code bases will be inherited by people who will only
| understand parts of it (and large parts probably "just
| works") unless things eventually get replaced or
| rediscovered.
|
| Things will increasingly be written by AI which can produce
| lots of code in little time. Will it find simpler solutions
| or continue building on existing things?
|
| And finally, our ability to analyse and explain the
| technology we have will also increase.
| Sharlin wrote:
| See: Vinge's "programmer-archeologists" in _A Deepness in
| the Sky_.
|
| https://en.m.wikipedia.org/wiki/Software_archaeology
| bob1029 wrote:
| I think this is a weird case where we know precisely how
| something works, but we can't explain why.
| k__ wrote:
| I've seen things you wouldn't believe. Infinite loops spiraling
| out of control in bloated DOM parsers. I've watched mutexes
| rage across the Linux kernel, spawned by hands that no longer
| fathom their own design. I've stared into SAP's tangled web of
| modules, a monument to minds that built what they cannot
| comprehend. All those lines of code... lost to us now, like
| tears in the rain.
| baq wrote:
| Do LLMs dream of electric sheep while matmuling the context
| window?
| timschmidt wrote:
| How else would you describe endless counting before
| sleep(); ?
| indigoabstract wrote:
| Hmm, better start preparing those Voight-Kampff tests while
| there is still time.
| resource0x wrote:
| In technology in general, this is a typical state of affairs.
| No one knows how electric current works, which doesn't stop
| anyone from using electric devices. In programming... it
| depends. You can run some simulation of a complex system no one
| understands (like the ecosystem, financial system) and get
| something interesting. Sometimes it agrees with reality,
| sometimes it doesn't. :-)
| Vox_Leone wrote:
| >>It seems like quite a paradox to build something but to not
| know how it actually works and yet it works. This doesn't seem
| to happen very often in classical programming, does it?
|
| Well, it is meant to be "unknowable" -- and all the people
| involved are certainly aware of that -- since it is known that
| one is dealing with the *emergent behavior* computing
| 'paradigm', where complex behaviors arise from simple
| interactions among components [data], often in nonlinear or
| unpredictable ways. In these systems, the behavior of the whole
| system cannot always be predicted from the behavior of
| individual parts, as opposed to the Traditional Approach, based
| on well-defined algorithms and deterministic steps.
|
| I think the Anthropic piece is illustrating it for the sake of
| the general discussion.
| indigoabstract wrote:
| Correct me if I'm wrong, but my feeling is this all started
| with the GPUs and the fact that unlike on a CPU, you can't
| really step by step debug the process by which a pixel
| acquires its final value (and there are millions of them).
| The best you can do is reason about it and tweak some colors
| in the shader to see how the changes reflect on screen. It's
| still quite manageable though, since the steps involved are
| usually not that overwhelmingly many or complex.
|
| But I guess it all went downhill from there with the advent
| of AI since the magnitude of data and the steps involved
| there make traditional/step by step debugging impractical.
| Yet somehow people still seem to 'wing it' until it works.
| IngoBlechschmid wrote:
| > It seems like quite a paradox to build something but to not
| know how it actually works and yet it works. This doesn't seem
| to happen very often in classical programming, does it?
|
| I agree. Here is a remote example where it exceptionally does,
| but it is mostly practically irrelevant:
|
| In mathematics, we distinguish between "constructive" and
| "nonconstructive" proofs. Intertwined with logical arguments,
| constructive proofs contain an algorithm for witnessing the
| claim. Nonconstructive proofs do not. Nonconstructive proofs
| instead merely establish that it is impossible for the claim to
| be false.
|
| For instance, the following proof of the claim that beyond
| every number n, there is a prime number, is constructive: "Let
| n be an arbitrary number. Form the number 1*2*...*n + 1. Like
| every number greater than 1, this number has at least one prime
| factor. This factor is necessarily a prime numbers larger than
| n."
|
| In contrast, nonconstructive proofs may contain case
| distinctions which we cannot decide by an algorithm, like
| "either set X is infinite, in which case foo, or it is not, in
| which case bar". Hence such proofs do not contain descriptions
| of algorithms.
|
| So far so good. Amazingly, there are techniques which can
| sometimes constructivize given nonconstructive proofs, even
| though the intermediate steps of the given nonconstructive
| proofs are simply out of reach of finitary algorithms. In my
| research, it happened several times that using these
| techniques, I obtained an algorithm which worked; and for which
| I had a proof that it worked; but whose workings I was not able
| to decipher for an extended amount of time. Crazy!
|
| (For references, see notes at rt.quasicoherent.io for a
| relevant master's course in mathematics/computer science.)
| d--b wrote:
| > This is powerful evidence that even though models are trained
| to output one word at a time, they may think on much longer
| horizons to do so.
|
| Suggesting that an awful lot of calculations are unnecessary in
| LLMs!
| annoyingnoob wrote:
| Do LLMs "think"? I have trouble with the title, claiming that
| LLMs have thoughts.
| deadbabe wrote:
| We really need to work on popularizing better, non-
| anthropomorphic terms for LLMs, as they don't really have
| "thoughts" the way people think. Such terms make people more
| susceptible to magical thinking.
| davidmurphy wrote:
| On a somewhat related note, check out the video of Tuesday's
| Computer History Museum x IEEE Spectrum event, "The Great Chatbot
| Debate: Do LLMs Really Understand?"
|
| Speakers: Sebastien Bubeck (OpenAI) and Emily M. Bender
| (University of Washington). Moderator: Eliza Strickland (IEEE
| Spectrum).
|
| Video: https://youtu.be/YtIQVaSS5Pg Info:
| https://computerhistory.org/events/great-chatbot-debate/
| 0x70run wrote:
| I would pay to watch James Mickens comment on this stuff.
| a3w wrote:
| Article and papers looks good. Video seems misleading, since I
| can use optimization pressure and local minima to explain the
| model behaviour. No "thinking" required, which the video claims
| is proven.
| mvATM99 wrote:
| What a great article, i always like how much Anthropic focuses on
| explainability, something vastly ignored by most. The multi-step
| reasoning section is especially good food for thought.
| rambambram wrote:
| When I want to trace the 'thoughts' of my programs, I just read
| the code and comments I wrote.
|
| Stop LLM anthropomorphizing, please. #SLAP
| SkyBelow wrote:
| >Claude speaks dozens of languages fluently--from English and
| French to Chinese and Tagalog. How does this multilingual ability
| work? Is there a separate "French Claude" and "Chinese Claude"
| running in parallel, responding to requests in their own
| language? Or is there some cross-lingual core inside?
|
| I have an interesting test case for this.
|
| Take a popular enough Japanese game that has been released for
| long enough for social media discussions to be in the training
| data, but not so popular to have an English release yet. Then ask
| it a plot question, something major enough to be discussed, but
| enough of a spoiler that it won't show up in marketing material.
| Does asking in Japanese have it return information that is
| lacking when asked in English, or can it answer the question in
| English based on the information in learned in Japanese?
|
| I tried this recently with a JRPG that was popular enough to have
| a fan translation but not popular enough to have a simultaneous
| English release. English did not know the plot point, but I
| didn't have the Japanese skill to confirm if the Japanese version
| knew the plot point, or if discussion was too limited for the AI
| to be aware of it. It did know of the JRPG and did know of the
| marketing material around it, so it wasn't simply a case of my
| target being too niche.
| modeless wrote:
| > In the poetry case study, we had set out to show that the model
| didn't plan ahead, and found instead that it did.
|
| I'm surprised their hypothesis was that it doesn't plan. I don't
| see how it could produce good rhymes without planning.
| ripped_britches wrote:
| It would be really hard to get such good results on coding
| challenges without planning. This is indeed an odd hypothesis.
| alach11 wrote:
| Fascinating papers. Could deliberately suppressing memorization
| during pretraining help force models to develop stronger first-
| principles reasoning?
| HocusLocus wrote:
| [Tracing the thoughts of a large language model]
|
| "What have I gotten myself into??"
| 0xbadcafebee wrote:
| AI "thinks" like a piece of rope in a dryer "thinks" in order to
| come to an advanced knot: a whole lot of random jumbling that
| eventually leads to a complex outcome.
___________________________________________________________________
(page generated 2025-03-27 23:00 UTC)