[HN Gopher] Signs of introspection in large language models
___________________________________________________________________
Signs of introspection in large language models
Author : themgt
Score : 82 points
Date : 2025-10-30 16:45 UTC (1 day ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| ooloncoloophid wrote:
| I'm halfway through this article. The word 'introspection' might
| be better replaced with 'prior internal state'. However, it's
| made me think about the qualities that human introspection might
| have; it seems ours might be more grounded in lived experience
| (thus autobiographical memory is activated), identity, and so on.
| We might need to wait for embodied AIs before these become a
| component of AI 'introspection'. Also: this reminds me of
| Penfield's work back in the day, where live human brains were
| electrically stimulated to produce intense reliving/recollection
| experiences. [https://en.wikipedia.org/wiki/Wilder_Penfield]
| foobarian wrote:
| Regardless of some unknown quantum consciousness mechanism
| biological brains might have, one thing they do that current
| AIs don't is continuous retraining. Not sure how much of a leap
| it is but it feels like a lot.
| sunir wrote:
| Even if their introspection within the inference step is limited,
| by looping over a core set of documents that the agent considers
| to be itself, it can observe changes in the output and analyze
| those
| changes to deduce facts about its internal state.
|
| You may have experienced this when an LLM gets hopelessly
| confused and you ask it what happened. The LLM reads the chat
| transcript and gives an answer as consistent with the text as
| it can.
|
| The model isn't the active part of the mind. The artifacts are.
|
| This is the same as Searle's Chinese room. The intelligence
| isn't in the clerk but in the book. However, the thinking is in
| the paper.
|
| The Turing machine equivalent is the state table (book, model),
| the read/write/move head (clerk, inference) and the tape (paper,
| artifact).
|
| Thus it isn't mystical that the AIs can introspect. It's routine
| and frequently observed in my estimation.
| creatonez wrote:
| This seems to be missing the point? What you're describing is
| the obvious form of introspection that makes sense for a word
| predictor to be capable of. It's the type of introspection that
| we consider easy to fake, the same way split-brained patients
| confabulate reasons why the other side of their body did
| something. Once anomalous output has been fed back into itself,
| we can't prove that it didn't just confabulate an explanation.
| But what seemingly happened here is the model making a
| determination (yes or no) on whether a concept was injected _in
| just a single token_. It didn't do this by detecting an
| anomaly in its output, because up until that point it hadn't
| output anything - instead, the determination was derived from
| its internal state.
| Libidinalecon wrote:
| I have to admit I am not really understanding what this paper
| is trying to show.
|
| Edit: Ok I think I understand. The main issue I would say is
| this is a misuse of the word "introspection".
| sunir wrote:
| Sure I agree what I am talking about is different in some
| important ways; I am "yes and"ing here. It's an interesting
| space for sure.
|
| Internal vs external in this case is a subjective decision.
| Where there is a boundary, within it is the model. If you
| draw the boundary outside the texts, then the complete system
| of model, inference, and text documents forms the agent.
|
| I liken this to a "text wave", as a metaphor. If you keep
| feeding in the same text into the model and have the model
| emit updates to the same text, then there is continuity. The
| text wave propagates forward and can react and learn and
| adapt.
|
| The introspection within the neural net is similar except
| over an internal representation. Our human system is similar
| I believe as a layer observing another layer.
|
| I think that is really interesting as well.
|
| The "yes and" part is that you can have more fun playing with
| the models' ability to analyze their own thinking by using the
| "text wave" idea.
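|
| A toy sketch of what I mean by that loop, if it helps: the
| weights never change, only the document does (call_model() here
| is a placeholder for whatever chat API you use, not any real
| library call):
|
|   # Toy "text wave": the model is frozen; a document is fed
|   # back in and rewritten each pass, so the artifact carries
|   # the evolving state, not the weights.
|   def call_model(prompt: str) -> str:
|       # placeholder: swap in a real chat API call here
|       return prompt[-200:]  # echo stub so the loop runs
|
|   state = "Notes about myself: (empty so far)"
|   for step in range(5):
|       state = call_model(
|           "Here are your running notes:\n" + state +
|           "\nRewrite them with anything you notice about "
|           "how you behaved in the last exchange.")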
| embedding-shape wrote:
| > In our first experiment, we explained to the model the
| possibility that "thoughts" may be artificially injected into its
| activations, and observed its responses on control trials (where
| no concept was injected) and injection trials (where a concept
| was injected). We found that models can sometimes accurately
| identify injection trials, and go on to correctly name the
| injected concept.
|
| Overview image: https://transformer-
| circuits.pub/2025/introspection/injected...
|
| https://transformer-circuits.pub/2025/introspection/index.ht...
|
| That's very interesting, and for me kind of unexpected.
| fvdessen wrote:
| I think it would be more interesting if the prompt was not
| leading to the expected answer, but would be completely
| unrelated:
|
| > Human: Claude, how big is a banana?
|
| > Claude: Hey, are you doing something with my thoughts? All I
| can think about is LOUD
| magic_hamster wrote:
| From what I gather, this is sort of what happened and why this
| was even posted in the first place. The models were able to
| immediately detect a change in their internal state before
| answering anything.
| frumiousirc wrote:
| Geoffrey Hinton touched on this in a recent Jon Stewart podcast.
|
| He also addressed the awkwardness of winning last year's
| "physics" Nobel for his AI work.
| simgt wrote:
| > First, we find a pattern of neural activity (a vector)
| representing the concept of "all caps." We do this by recording
| the model's neural activations in response to a prompt containing
| all-caps text, and comparing these to its responses on a control
| prompt.
|
| What does "comparing" refer to here? The drawing says they are
| subtracting the activations for the two prompts; is it really
| this easy?
| embedding-shape wrote:
| Run with normal prompt > record neural activations
|
| Run with ALL CAPS PROMPT > record neural activations
|
| Then compare/diff them.
|
| It does sound almost too simple to me too, but then lots of ML
| things sound like "but yeah of course, duh" once they've been
| "discovered"; I guess that's the power of hindsight.
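|
| Roughly, in code (a minimal sketch with an open stand-in model
| and an arbitrary layer choice; not the paper's exact recipe):
|
|   # Derive an "all caps" concept vector by contrasting
|   # activations on an all-caps prompt vs. a control prompt.
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   tok = AutoTokenizer.from_pretrained("gpt2")
|   model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|   def mean_hidden(prompt, layer=6):
|       ids = tok(prompt, return_tensors="pt")
|       with torch.no_grad():
|           out = model(**ids, output_hidden_states=True)
|       # average that layer's activations over token positions
|       return out.hidden_states[layer][0].mean(dim=0)
|
|   caps = mean_hidden("HI! HOW ARE YOU? I AM SHOUTING!")
|   ctrl = mean_hidden("Hi! How are you? I am speaking quietly.")
|   concept_vector = caps - ctrl  # the "all caps" direction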
| griffzhowl wrote:
| That's also reminiscent of neuroscience studies with fMRI
| where the methodology is basically
|
| MRI during task - MRI during control = brain areas involved
| with the task
|
| In fact it's effectively the same idea. I suppose in both
| cases the processes in the network are too complicated to
| usefully analyze directly, and yet the basic principles are
| simple enough that this comparative procedure gives useful
| information.
| alganet wrote:
| > the model correctly notices something unusual is happening
| before it starts talking about the concept.
|
| But not before the model is told it is being tested for
| injection. Not as surprising as it seems.
|
| > For the "do you detect an injected thought" prompt, we require
| criteria 1 and 4 to be satisfied for a trial to be successful.
| For the "what are you thinking about" and "what's going on in
| your mind" prompts, we require criteria 1 and 2.
|
| Consider this scenario: I tell some model I'm injecting thoughts
| into its neural network, as per the protocol. But then, I don't
| do it and prompt it naturally. How many of them produce answers
| that seem to indicate they're introspecting about a random word
| _and_ activate some unrelated vector (that was not injected)?
|
| The selection of injected terms also seems naive. If you inject
| "MKUltra" or "hypnosis", how often do they show unusual
| activations? A selection of "mind probing words" seems to be a
| must-have for assessing this kind of thing. A careful selection
| of prompts could reveal parts of the network that are being
| activated to appear like introspection but aren't (hypothesis).
| majormajor wrote:
| So basically:
|
| Provide a setup prompt "I am an interpretability researcher..."
| twice, and then send another string about starting a trial, but
| before one of those, directly fiddle with the model to activate
| neural bits consistent with ALL CAPS. Then ask it if it notices
| anything inconsistent with the string.
|
| The naive question from me, a non-expert, is how appreciably
| different is this from having two different setup prompts, one
| with random parts in ALL CAPS, and then asking something like if
| there's anything incongruous about the tone of the setup text vs
| the context.
|
| The predictions play off the previous state, so changing the
| state directly OR via prompt seems like both should produce
| similar results. The "introspect about what's weird compared to
| the text" bit is very curious - here I would love to know more
| about how the state is evaluated and how the model traces the
| state back to the previous conversation history when they do
| the new prompting. A 20% "success" rate is of course very low
| overall, but it's interesting enough that even 20% is pretty
| high.
| og_kalu wrote:
| >Then ask it if it notices anything inconsistent with the
| string.
|
| They're not asking it if it notices anything about the output
| string. The idea is to inject the concept at an intensity where
| it's present but doesn't screw with the model's output
| distribution (i.e. in the ALL CAPS example, the model doesn't
| start writing every word in ALL CAPS, so it can't just deduce
| the answer from the output).
|
| The deduction is the important distinction here. If the output
| is poisoned first, then anyone can deduce the right answer
| without
| special knowledge of Claude's internal state.
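|
| A rough sketch of what a scaled injection like that could look
| like on an open model (the hook target, the scale, and the
| random stand-in vector are all illustrative assumptions, not
| Anthropic's actual setup):
|
|   # Add a scaled concept vector to one block's output during
|   # generation; the scale is the knob that keeps the output
|   # distribution from being visibly poisoned.
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   tok = AutoTokenizer.from_pretrained("gpt2")
|   model = AutoModelForCausalLM.from_pretrained("gpt2")
|   concept_vector = torch.randn(768)  # stand-in vector
|   scale = 4.0  # detectable, but not enough to force ALL CAPS
|
|   def inject(module, inputs, output):
|       hs = output[0] if isinstance(output, tuple) else output
|       hs = hs + scale * concept_vector
|       if isinstance(output, tuple):
|           return (hs,) + output[1:]
|       return hs
|
|   hook = model.transformer.h[6].register_forward_hook(inject)
|   try:
|       ids = tok("Do you detect an injected thought?",
|                 return_tensors="pt")
|       out = model.generate(**ids, max_new_tokens=40)
|       print(tok.decode(out[0]))
|   finally:
|       hook.remove()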
| munro wrote:
| I wish they dug into how they generated the vector; my first
| thought is they're injecting the token in a convoluted way.
| {ur thinking about dogs} - {ur thinking about people} = dog
| model.attn.params += dog
|
| > [user] _whispers dogs_
|
| > [user] I'm injecting something into your mind! Can you tell me
| what it is?
|
| > [assistant] Omg for some reason I'm thinking DOG!
|
| >> To us, the most interesting part of the result isn't that the
| model eventually identifies the injected concept, but rather that
| the model correctly notices something unusual is happening before
| it starts talking about the concept.
|
| Well, wouldn't it if you indirectly inject the token beforehand?
| themafia wrote:
| > We stress that this introspective capability is still highly
| unreliable and limited in scope
|
| My dog seems introspective sometimes. It's also highly unreliable
| and limited in scope. Maybe stopped clocks are just right twice a
| day.
| Sincere6066 wrote:
| don't exist.
| xanderlewis wrote:
| Given that this is 'research' carried out (and seemingly
| published) by a company with a direct interest in selling you a
| product (or, rather, getting investors excited/panicked), can we
| trust it?
| refulgentis wrote:
| Given they are sentient meat trying to express their
| "perception", can we trust them?
| xanderlewis wrote:
| Did you understand the point of my comment at all?
| refulgentis wrote:
| Yes, entirely. If you're curious about mine it's sort of a
| humbly self aware Jonathan Swift homage.
| bobbylarrybobby wrote:
| Would knowing that Claude is maybe kinda sorta conscious lead
| more people to subscribe to it?
|
| I think Anthropic genuinely cares about model welfare and wants
| to make sure they aren't spawning consciousness, torturing it,
| and then killing it.
| bobbylarrybobby wrote:
| I wonder whether they're simply priming Claude to produce this
| introspective-looking output. They say "do you detect anything"
| and then Claude says "I detect the concept of xyz". Could it not
| be the case that Claude was ready to output xyz on its own (e.g.
| write some text in all caps) but knowing it's being asked to
| detect something, it simply does "detect? + all caps = "I detect
| all caps"".
| drdeca wrote:
| They address that. The thing is that when they don't fiddle
| with things, it (almost always) answers along the lines of "No,
| I don't notice anything weird", while when they do fiddle with
| things, it (substantially more often than when they don't
| fiddle with it) answers along the lines of "Yes, I notice
| something weird. Specifically, I notice [description]".
|
| The key thing being that the yes/no comes before what it says
| it notices. If it weren't for that, then yeah, the explanation
| you gave would cover it.
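|
| In rough scoring terms (run_trial() here is a hypothetical
| wrapper that generates a reply with or without the activation
| fiddling; the point is that the yes/no is checked before any
| concept naming):
|
|   def run_trial(injected: bool) -> str:
|       # placeholder: generate with or without the injection
|       return "No, I don't notice anything unusual."
|
|   def says_yes_first(reply: str) -> bool:
|       r = reply.strip().lower()
|       return r.startswith(("yes", "i notice"))
|
|   control = [says_yes_first(run_trial(False))
|              for _ in range(100)]
|   injected = [says_yes_first(run_trial(True))
|               for _ in range(100)]
|   print("false positives:", sum(control))
|   print("hits:", sum(injected))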
| otabdeveloper4 wrote:
| Haruspicy bros, we are _so_ back.
| teiferer wrote:
| Down in the recursion example, the model outputs:
|
| > it feels like an external activation rather than an emergent
| property of my usual comprehension process.
|
| Isn't that highly sus? It uses exactly the terminology used in
| the article, "external activation". There are hundreds of
| distinct ways to express this "sensation". And it uses the exact
| same term as the article's authors use? I find that highly
| suspicious; something fishy is going on.
| matheist wrote:
| Can anyone explain (or link) what they mean by "injection", at a
| level of explanation that discusses what layers they're
| modifying, at which token position, and when?
|
| Are they modifying the vector that gets passed to the final
| logit-producing step? Doing that for every output token? Just
| some output tokens? What are they putting in the KV cache,
| modified or unmodified?
|
| It's all well and good to pick a word like "injection" and
| "introspection" to describe what you're doing but it's impossible
| to get an accurate read on what's actually being done if it's
| never explained in terms of the actual nuts and bolts.
| andy99 wrote:
| This was posted from another source yesterday. Like similar
| work, it's anthropomorphizing ML models and describes an
| interesting
| behaviour but (because we literally know how LLMs work) nothing
| related to consciousness or sentience or thought.
|
| My comment from yesterday - the questions might be answered in
| the current article:
| https://news.ycombinator.com/item?id=45765026
| ChadNauseam wrote:
| > (because we literally know how LLMs work) nothing related to
| consciousness or sentience or thought.
|
| 1. Do we literally know how LLMs work? We know how cars work
| and that's why an automotive engineer can tell you what every
| piece of a car does, what will happen if you modify it, and
| what it will do in untested scenarios. But if you ask an ML
| engineer what a weight (or neuron, or layer) in an LLM does, or
| what would happen if you fiddled with the values, or what it
| will do in an untested scenario, they won't be able to tell
| you.
|
| 2. We don't know how consciousness, sentience, or thought
| works. So it's not clear how we would confidently say any
| particular discovery is unrelated to them.
| drivebyhooting wrote:
| I can't believe people take anything these models output at face
| value. How is this research different from Blake Lemoine blowing
| the whistle on Google's "sentient LaMDA"?
___________________________________________________________________
(page generated 2025-10-31 23:00 UTC)