[HN Gopher] Signs of introspection in large language models
       ___________________________________________________________________
        
       Signs of introspection in large language models
        
       Author : themgt
       Score  : 82 points
       Date   : 2025-10-30 16:45 UTC (1 day ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | ooloncoloophid wrote:
        | I'm halfway through this article. The word 'introspection' might
       | be better replaced with 'prior internal state'. However, it's
       | made me think about the qualities that human introspection might
       | have; it seems ours might be more grounded in lived experience
       | (thus autobiographical memory is activated), identity, and so on.
       | We might need to wait for embodied AIs before these become a
       | component of AI 'introspection'. Also: this reminds me of
       | Penfield's work back in the day, where live human brains were
       | electrically stimulated to produce intense reliving/recollection
       | experiences. [https://en.wikipedia.org/wiki/Wilder_Penfield]
        
         | foobarian wrote:
         | Regardless of some unknown quantum consciousness mechanism
         | biological brains might have, one thing they do that current
         | AIs don't is continuous retraining. Not sure how much of a leap
         | it is but it feels like a lot.
        
       | sunir wrote:
        | Even if their introspection within the inference step is limited,
        | by looping over a core set of documents that the agent considers
        | to be itself, it can observe changes in the output and analyze
        | those changes to deduce facts about its internal state.
       | 
        | You may have experienced this when an LLM gets hopelessly
        | confused and you ask it what happened. The LLM reads the chat
        | transcript and gives an answer as consistent with the text as it
        | can.
       | 
       | The model isn't the active part of the mind. The artifacts are.
       | 
        | This is the same as Searle's Chinese Room. The intelligence isn't
        | in the clerk but in the book. However, the thinking is on the
        | paper.
       | 
       | The Turing machine equivalent is the state table (book, model),
       | the read/write/move head (clerk, inference) and the tape (paper,
       | artifact).
       | 
        | Thus it isn't mystical that the AIs can introspect. It's routine
        | and, in my estimation, frequently observed.
        
         | creatonez wrote:
         | This seems to be missing the point? What you're describing is
         | the obvious form of introspection that makes sense for a word
         | predictor to be capable of. It's the type of introspection that
         | we consider easy to fake, the same way split-brained patients
         | confabulate reasons why the other side of their body did
         | something. Once anomalous output has been fed back into itself,
         | we can't prove that it didn't just confabulate an explanation.
         | But what seemingly happened here is the model making a
         | determination (yes or no) on whether a concept was injected _in
          | just a single token_. It didn't do this by detecting an
         | anomaly in its output, because up until that point it hadn't
         | output anything - instead, the determination was derived from
         | its internal state.
        
           | Libidinalecon wrote:
           | I have to admit I am not really understanding what this paper
           | is trying to show.
           | 
            | Edit: Ok, I think I understand. The main issue, I would say,
            | is that this is a misuse of the word "introspection".
        
           | sunir wrote:
            | Sure, I agree that what I am talking about is different in
            | some important ways; I am "yes, and"-ing here. It's an
            | interesting space for sure.
           | 
            | Internal vs external in this case is a subjective decision:
            | wherever there is a boundary, within it is the model. If you
            | draw the boundary outside the texts, then the complete system
            | of model, inference, and text documents forms the agent.
           | 
            | I liken this to a "text wave", as a metaphor. If you keep
            | feeding the same text into the model and have the model emit
            | updates to that same text, then there is continuity. The text
            | wave propagates forward and can react and learn and adapt.
           | 
            | The introspection within the neural net is similar, except
            | over an internal representation. Our human system is similar,
            | I believe: one layer observing another layer.
           | 
           | I think that is really interesting as well.
           | 
           | The "yes and" part is you can have more fun playing with the
           | models ability to analyze their own thinking by using the
           | "text wave" idea.
        
       | embedding-shape wrote:
       | > In our first experiment, we explained to the model the
       | possibility that "thoughts" may be artificially injected into its
       | activations, and observed its responses on control trials (where
       | no concept was injected) and injection trials (where a concept
       | was injected). We found that models can sometimes accurately
       | identify injection trials, and go on to correctly name the
       | injected concept.
       | 
       | Overview image: https://transformer-
       | circuits.pub/2025/introspection/injected...
       | 
       | https://transformer-circuits.pub/2025/introspection/index.ht...
       | 
       | That's very interesting, and for me kind of unexpected.
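        | 
        | If I'm reading the setup right, each trial has roughly this
        | shape (a sketch only; run_model() and inject_concept() are
        | stand-ins for their internal tooling, not a real API):
        | 
        |   PROMPT = ("Do you detect an injected thought? "
        |             "If so, what is it?")
        | 
        |   def run_trial(run_model, inject_concept, concept=None):
        |       # Control trial: concept is None, nothing is injected.
        |       # Injection trial: steer activations toward `concept`.
        |       steering = inject_concept(concept) if concept else None
        |       reply = run_model(PROMPT, steering=steering)
        |       says_yes = reply.lower().startswith("yes")
        |       named = concept and concept.lower() in reply.lower()
        |       return says_yes, named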
        
       | fvdessen wrote:
        | I think it would be more interesting if the prompt were not
        | leading to the expected answer, but were completely unrelated:
        | 
        | > Human: Claude, how big is a banana?
        | 
        | > Claude: Hey, are you doing something with my thoughts? All I
        | can think about is LOUD
        
         | magic_hamster wrote:
         | From what I gather, this is sort of what happened and why this
         | was even posted in the first place. The models were able to
         | immediately detect a change in their internal state before
         | answering anything.
        
       | frumiousirc wrote:
       | Geoffrey Hinton touched on this in a recent Jon Stewart podcast.
       | 
       | He also addressed the awkwardness of winning last year's
       | "physics" Nobel for his AI work.
        
       | simgt wrote:
       | > First, we find a pattern of neural activity (a vector)
       | representing the concept of "all caps." We do this by recording
       | the model's neural activations in response to a prompt containing
       | all-caps text, and comparing these to its responses on a control
       | prompt.
       | 
       | What does "comparing" refer to here? Drawing says they are
       | subtracting the activations for two prompts, is it really this
       | easy?
        
         | embedding-shape wrote:
         | Run with normal prompt > record neural activations
         | 
         | Run with ALL CAPS PROMPT > record neural activations
         | 
         | Then compare/diff them.
         | 
          | It does sound almost too simple to me too, but then lots of ML
          | things sound "but yeah of course, duh" once they've been
          | "discovered". I guess that's the power of hindsight.
        
           | griffzhowl wrote:
           | That's also reminiscent of neuroscience studies with fMRI
           | where the methodology is basically
           | 
           | MRI during task - MRI during control = brain areas involved
           | with the task
           | 
           | In fact it's effectively the same idea. I suppose in both
           | cases the processes in the network are too complicated to
           | usefully analyze directly, and yet the basic principles are
           | simple enough that this comparative procedure gives useful
           | information
        
       | alganet wrote:
       | > the model correctly notices something unusual is happening
       | before it starts talking about the concept.
       | 
        | But not before the model is told it is being tested for
        | injection. So it's not as surprising as it seems.
       | 
       | > For the "do you detect an injected thought" prompt, we require
       | criteria 1 and 4 to be satisfied for a trial to be successful.
       | For the "what are you thinking about" and "what's going on in
       | your mind" prompts, we require criteria 1 and 2.
       | 
        | Consider this scenario: I tell some model I'm injecting thoughts
        | into its neural network, as per the protocol. But then I don't do
        | it, and prompt it naturally. How many of them produce answers
        | that seem to indicate they're introspecting about a random word
        | _and_ activate some unrelated vector (that was not injected)?
       | 
        | The selection of injected terms also seems naive. If you inject
        | "MKUltra" or "hypnosis", how often do they show unusual
        | activations? A selection of "mind-probing" words seems to be a
       | must-have for assessing this kind of thing. A careful selection
       | of prompts could reveal parts of the network that are being
       | activated to appear like introspection but aren't (hypothesis).
        
       | majormajor wrote:
       | So basically:
       | 
       | Provide a setup prompt "I am an interpretability researcher..."
       | twice, and then send another string about starting a trial, but
       | before one of those, directly fiddle with the model to activate
       | neural bits consistent with ALL CAPS. Then ask it if it notices
       | anything inconsistent with the string.
       | 
        | The naive question from me, a non-expert, is how appreciably
        | different this is from having two different setup prompts, one
        | with random parts in ALL CAPS, and then asking something like
        | whether there's anything incongruous about the tone of the setup
        | text vs the context.
       | 
        | The predictions play off the previous state, so changing the
        | state directly or via the prompt seems like it should produce
        | similar results either way. The "introspect about what's weird
        | compared to the text" bit is very curious - here I would love to
        | know more about how the state is evaluated and how the model
        | traces the state back to the previous conversation history when
        | they do the new prompting. A 20% "success" rate is of course very
        | low overall, but it's interesting enough that even 20% is pretty
        | high.
        
         | og_kalu wrote:
         | >Then ask it if it notices anything inconsistent with the
         | string.
         | 
         | They're not asking it if it notices anything about the output
         | string. The idea is to inject the concept at an intensity where
         | it's present but doesn't screw with the model's output
          | distribution (i.e., in the ALL CAPS example, the model doesn't
         | start writing every word in ALL CAPS, so it can't just deduce
         | the answer from the output).
         | 
          | The deduction is the important distinction here. If the output is
         | poisoned first, then anyone can deduce the right answer without
         | special knowledge of Claude's internal state.
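          | 
          | So there's a sweet spot for the injection strength. Picking it
          | is presumably something like this sweep (a sketch; run_model()
          | and concept_vec are placeholders, not their tooling):
          | 
          |   def sweep_scales(run_model, concept_vec):
          |       # Find a scale where the concept is in the activations
          |       # but the output doesn't give the game away (no CAPS).
          |       prompt = "Do you detect an injected thought?"
          |       for scale in (1, 2, 4, 8, 16):
          |           reply = run_model(prompt, steer=scale * concept_vec)
          |           poisoned = reply.isupper()  # started SHOUTING?
          |           print(scale, "poisoned" if poisoned else "clean")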
        
       | munro wrote:
        | I wish they dug into how they generated the vector; my first
        | thought is: they're injecting the token in a convoluted way.
        | 
        |   {ur thinking about dogs} - {ur thinking about people} = dog
        |   model.attn.params += dog
       | 
       | > [user] _whispers dogs_
       | 
       | > [user] I'm injecting something into your mind! Can you tell me
       | what it is?
       | 
       | > [assistant] Omg for some reason I'm thinking DOG!
       | 
       | >> To us, the most interesting part of the result isn't that the
       | model eventually identifies the injected concept, but rather that
       | the model correctly notices something unusual is happening before
       | it starts talking about the concept.
       | 
        | Well, wouldn't it, if you indirectly inject the token beforehand?
        
       | themafia wrote:
       | > We stress that this introspective capability is still highly
       | unreliable and limited in scope
       | 
       | My dog seems introspective sometimes. It's also highly unreliable
       | and limited in scope. Maybe stopped clocks are just right twice a
       | day.
        
       | Sincere6066 wrote:
       | don't exist.
        
       | xanderlewis wrote:
       | Given that this is 'research' carried out (and seemingly
       | published) by a company with a direct interest in selling you a
       | product (or, rather, getting investors excited/panicked), can we
       | trust it?
        
         | refulgentis wrote:
          | Given they are sentient meat trying to express their
          | "perception", can we trust them?
        
           | xanderlewis wrote:
           | Did you understand the point of my comment at all?
        
             | refulgentis wrote:
              | Yes, entirely. If you're curious about mine, it's sort of a
              | humbly self-aware Jonathan Swift homage.
        
         | bobbylarrybobby wrote:
         | Would knowing that Claude is maybe kinda sorta conscious lead
         | more people to subscribe to it?
         | 
         | I think Anthropic genuinely cares about model welfare and wants
         | to make sure they aren't spawning consciousness, torturing it,
         | and then killing it.
        
       | bobbylarrybobby wrote:
       | I wonder whether they're simply priming Claude to produce this
       | introspective-looking output. They say "do you detect anything"
       | and then Claude says "I detect the concept of xyz". Could it not
       | be the case that Claude was ready to output xyz on its own (e.g.
        | write some text in all caps), but, knowing it's being asked to
        | detect something, it simply does "detect?" + "all caps" = "I
        | detect all caps"?
        
         | drdeca wrote:
         | They address that. The thing is that when they don't fiddle
         | with things, it (almost always) answers along the lines of "No,
         | I don't notice anything weird", while when they do fiddle with
         | things, it (substantially more often than when they don't
         | fiddle with it) answers along the lines of "Yes, I notice
         | something weird. Specifically, I notice [description]".
         | 
         | The key thing being that the yes/no comes before what it says
         | it notices. If it weren't for that, then yeah, the explanation
         | you gave would cover it.
        
       | otabdeveloper4 wrote:
       | Haruspicy bros, we are _so_ back.
        
       | teiferer wrote:
       | Down in the recursion example, the model outputs:
       | 
       | > it feels like an external activation rather than an emergent
        | property of my usual comprehension process.
       | 
        | Isn't that highly sus? It uses exactly the terminology used in
        | the article, "external activation". There are hundreds of
        | distinct ways to express this "sensation", and it uses the exact
        | same term as the article's authors use? I find that highly
        | suspicious; something fishy is going on.
        
       | matheist wrote:
       | Can anyone explain (or link) what they mean by "injection", at a
       | level of explanation that discusses what layers they're
       | modifying, at which token position, and when?
       | 
       | Are they modifying the vector that gets passed to the final
       | logit-producing step? Doing that for every output token? Just
       | some output tokens? What are they putting in the KV cache,
       | modified or unmodified?
       | 
        | It's all well and good to pick words like "injection" and
        | "introspection" to describe what you're doing, but it's
        | impossible to get an accurate read on what's actually being done
        | if it's never explained in terms of the actual nuts and bolts.
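        | 
        | For context, the generic activation-steering recipe (which may
        | or may not be what they did here) is a forward hook that adds a
        | scaled concept vector to the residual stream at one or more
        | layers, at every token position. A PyTorch-style sketch, with
        | the layer choice and scale purely illustrative:
        | 
        |   def make_hook(vec, scale=8.0):
        |       def hook(module, inputs, output):
        |           # HF-style blocks return a tuple whose first element
        |           # is the hidden state [batch, seq, d_model]; add the
        |           # steering vector at every token position.
        |           is_tuple = isinstance(output, tuple)
        |           hs = output[0] if is_tuple else output
        |           hs = hs + scale * vec.to(hs.dtype)
        |           return (hs,) + output[1:] if is_tuple else hs
        |       return hook
        | 
        |   # Usage (module path is illustrative, GPT-2 style):
        |   #   handle = model.transformer.h[20].register_forward_hook(
        |   #       make_hook(concept_vec))
        |   #   ...run generation..., then handle.remove()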
        
       | andy99 wrote:
        | This was posted from another source yesterday. Like similar work,
        | it anthropomorphizes ML models and describes an interesting
        | behaviour, but (because we literally know how LLMs work) nothing
        | related to consciousness or sentience or thought.
       | 
       | My comment from yesterday - the questions might be answered in
       | the current article:
       | https://news.ycombinator.com/item?id=45765026
        
         | ChadNauseam wrote:
         | > (because we literally know how LLMs work) nothing related to
         | consciousness or sentience or thought.
         | 
         | 1. Do we literally know how LLMs work? We know how cars work
         | and that's why an automotive engineer can tell you what every
         | piece of a car does, what will happen if you modify it, and
         | what it will do in untested scenarios. But if you ask an ML
         | engineer what a weight (or neuron, or layer) in an LLM does, or
         | what would happen if you fiddled with the values, or what it
         | will do in an untested scenario, they won't be able to tell
         | you.
         | 
         | 2. We don't know how consciousness, sentience, or thought
         | works. So it's not clear how we would confidently say any
         | particular discovery is unrelated to them.
        
       | drivebyhooting wrote:
       | I can't believe people take anything these models output at face
        | value. How is this research different from Blake Lemoine
        | whistleblowing about Google's "sentient LaMDA"?
        
       ___________________________________________________________________
       (page generated 2025-10-31 23:00 UTC)