[HN Gopher] Chain of thought monitorability: A new and fragile o...
       ___________________________________________________________________
        
       Chain of thought monitorability: A new and fragile opportunity for
       AI safety
        
       Related: https://venturebeat.com/ai/openai-google-deepmind-and-
       anthro...
        
       Author : mfiguiere
       Score  : 88 points
       Date   : 2025-07-16 14:39 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | rdtsc wrote:
       | > Thus, safety measures for future AI agents will likely need to
       | employ multiple monitoring layers that hopefully have
       | uncorrelated failure modes [...] We encourage the research
       | community and frontier AI developers to make best use of CoT
       | monitorability and study how it can be preserved.
       | 
        | I am a bit confused about what all the 40 authors contributed
        | here. The paper seems to make a suggestion - monitor the chain
        | of thought for safety. Is that the novel part? But then, did one
        | person come up with the idea, and all 40+ people agreed to it
        | and were then put in the authors list?
        
         | ctoth wrote:
         | The paper isn't proposing "monitor chain of thought for safety"
         | - that's already standard practice. The core finding is that
         | training on CoT monitors destroys them. They provide empirical
         | evidence that doing the obvious thing - training models not to
         | have "bad thoughts" - causes them to learn obfuscated reward
         | hacking instead.
         | 
         | The paper demonstrates that current models are already
         | performing complex reward hacks in production environments, and
         | that attempts to fix this via CoT training make the problem
         | worse, not better.
         | 
         | As for your "40 authors" snark - this is a position paper where
         | researchers from competing labs (OpenAI, Anthropic, DeepMind,
         | government safety institutes) are jointly committing to NOT do
         | something that's locally tempting but globally catastrophic.
         | Getting industry consensus on "don't train away bad thoughts
         | even though it would make your models look safer" is the
         | opposite of trivial.
         | 
         | This reads like someone who saw a medical consensus statement
         | saying "this common treatment kills patients" and responded
         | with "did one person discover medicine exists and everyone else
         | just agreed?"
        
           | the8472 wrote:
            | The core finding was already predicted[0], there have been
            | previous papers[1], and it's kind of obvious considering RL
            | systems have been specification-gaming for decades[2]. But
            | yes, it's good to see broad agreement on it.
            | 
            | [0] https://thezvi.substack.com/p/ai-68-remarkably-reasonable-re...
            | [1] https://arxiv.org/abs/2503.11926
            | [2] https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOa...
        
           | jerf wrote:
            | I couldn't pull the link up easily, since the search terms
            | are pretty jammed, but HN had a link to a paper a couple of
            | months back in which someone had an LLM do some basic
            | arithmetic and report, with a chain-of-thought, how it was
            | doing it. They then went directly into the neural net and
            | decoded what it was actually doing in order to do the math.
            | The two things were not the same. The chain-of-thought gave a
            | reasonable "elementary school" explanation of how to do the
            | addition, but what the model basically did was just intuit
            | the answer, in much the same way that we humans do not go
            | through a process to figure out what "4 + 7" is... we pretty
            | much just have neurons that "know" the answer to that.
            | (That's not exactly what happened but it's close enough for
            | this post.)
           | 
            | If CoT improves performance, then CoT improves performance.
            | However, the naively obvious read of "it improves performance
            | because it is 'thinking' the 'thoughts' it tells us it is
            | thinking, for the reasons it gives" is not completely
            | accurate. It may not be completely wrong, either, but it's
            | definitely not completely accurate. Given that, I see no
            | reason to believe it would be hard in the slightest to train
            | models that have even more divergence between their "actual"
            | thought processes and what they claim they are thinking.
        
             | antonvs wrote:
              | > If CoT improves performance, then CoT improves
              | performance. However, the naively obvious read of "it
              | improves performance because it is 'thinking' the
              | 'thoughts' it tells us it is thinking, for the reasons it
              | gives" is not completely accurate.
             | 
             | I can't imagine why anyone who knows even a little about
             | how these models work would believe otherwise.
             | 
             | The "chain of thought" is text generated by the model in
             | response to a prompt, just like any other text it
             | generates. It then consumes that as part of a new prompt,
             | and generates more text. Those "thoughts" are obviously
             | going to have an effect on the generated output, simply by
             | virtue of being present in the prompt. And the evidence
             | shows that it can help improve the quality of output. But
             | there's no reason to expect that the generated "thoughts"
             | would correlate directly or precisely with what's going on
             | inside the model when it's producing text.
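              | 
              | Mechanically it's roughly this loop (a toy sketch in
              | Python; generate() here is a stand-in for whatever
              | completion call you're making, not a real API):
              | 
              |   # Toy sketch of CoT-style prompting. `generate` is a
              |   # hypothetical prompt -> completion function.
              |   def answer_with_cot(question, generate):
              |       # 1. Ask the model to write out its "thoughts" as
              |       #    ordinary text.
              |       thoughts = generate(
              |           f"Question: {question}\nThink step by step:\n")
              |       # 2. Feed those thoughts back in as part of a new
              |       #    prompt and generate the final answer.
              |       final = generate(
              |           f"Question: {question}\n"
              |           f"Reasoning: {thoughts}\n"
              |           f"Final answer:")
              |       return thoughts, final
              | 
              | The "thoughts" only influence the answer by sitting in the
              | context; nothing in that loop forces them to match whatever
              | computation inside the network actually produced the
              | answer.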
        
       | zer00eyz wrote:
        | After spending the last few years doing deep dives into how these
        | systems work, what they are doing, and the math behind them: NO.
        | 
        | Any time I see an AI SAFETY paper I am reminded of the phrase
        | "Never get high on your own supply". Simply put, these systems
        | are NOT dynamic, they cannot modify themselves based on
        | experience, and they lack reflection. The moment we realize what
        | these systems are (we're NOT on the path to AI, or AGI, here
        | folks) and start leaning into what they are good at, rather than
        | trying to make them something else, is the point where we get
        | useful tools and research aimed at building usable products.
       | 
        | The math no one is talking about: if we had to pay full price for
        | these products, no one would use them. Moore's law is dead, and
        | IPC has hit a ceiling. Unless we move to exotic cooling, we simply
        | can't push more power into chips.
       | 
       | Hardware advancement is NOT going to save the emerging industry,
       | and I'm not seeing the papers on efficiency or effectiveness at
       | smaller scales come out to make the accounting work.
        
         | ACCount36 wrote:
         | "Full price"? LLM inference is currently profitable. If you
         | don't even know that, the entire extent of your "expertise" is
         | just you being full of shit.
         | 
         | >Simply put these systems are NOT dynamic, they can not modify
         | based on experience, they lack reflection.
         | 
         | We already have many, many, many attempts to put LLMs towards
         | the task of self-modification - and some of them can be used to
         | extract meaningful capability improvements. I expect more
         | advances to come - online learning is extremely desirable, and
         | a lot of people are working on it.
         | 
          | I wish I could hammer one thing through the skull of every "AI
          | SAFETY ISN'T REAL" moron: if you only start thinking about AI
          | safety after AI becomes capable of causing an extinction-level
          | safety incident, it's going to be a little too late.
        
           | antonvs wrote:
           | > LLM inference is currently profitable.
           | 
           | It depends a lot on which LLMs you're talking about, and what
           | kind of usage. See e.g. the recent post about how "Anthropic
           | is bleeding out":
           | https://news.ycombinator.com/item?id=44534291
           | 
            | Ignore the hype in the headline; the point is that there's
            | good evidence that inference in many circumstances isn't
            | profitable.
        
             | ctoth wrote:
              | > In simpler terms, CCusage is a relatively-accurate
              | barometer of how much you are costing Anthropic at any
              | given time, with the understanding that its costs may (we
              | truly have no idea) be lower than the API prices they
              | charge, though I add that based on how Anthropic is
              | expected to lose $3 billion this year (that's after
              | revenue!) there's a chance that it's actually losing money
              | on every API call.
              | 
              | So he's using their _API_ prices as a proxy for token
              | costs, doesn't actually know the actual inference costs,
              | and ... that's your "good evidence"? This big sentence with
              | all these "we don't knows"?
        
               | antonvs wrote:
               | Well, that and the $3 billion expected loss after
               | revenue.
               | 
               | Does this idea upset you for some reason? Other people
               | have analyzed this and come to similar conclusions, I
               | just picked that one because it's the most recent example
               | I've seen.
               | 
                | Feel free to point to a source that explains how LLM
                | inference is mostly profitable at this point, taking
                | training costs into account. But I suspect you might have
                | a hard time finding evidence of that.
        
       | academic_84572 wrote:
       | Slightly tangential, but we recently published an algorithm aimed
       | at addressing the paperclip maximizer problem:
       | https://arxiv.org/abs/2402.07462
       | 
       | Curious what others think about this direction, particularly in
       | terms of practicality
        
       | alach11 wrote:
       | Ironically, every paper published about monitoring chain-of-
       | thought reduces the likelihood of this technique being effective
       | against strong AI models.
        
         | cbsmith wrote:
         | I love the irony.
        
         | OutOfHere wrote:
          | Pretty much. As soon as the LLMs get trained on this
          | information, they will figure out how to feed us the chain-of-
          | thought we want to hear, then surprise us with the opposite
          | output. You're welcome, LLMs.
          | 
          | In other words, relying on censoring the CoT risks making the
          | CoT altogether useless.
        
           | horsawlarway wrote:
            | I thought we already had reasonably clear evidence that the
            | output in the CoT does not actually indicate what the model
            | is "thinking" in any real sense, and that it's mostly just
            | appending context that may or may not be used, and may or may
            | not be truthful.
            | 
            | Basically: https://www.anthropic.com/research/reasoning-
            | models-dont-say...
        
             | bugbuddy wrote:
             | Are there hidden tokens in the Gemini 2.5 Pro thinking
             | outputs? All I can see in the thinking is high level plans
             | and not actual details of the "thinking." If you ask it to
             | solve a complex algebra equation, it will not actually do
             | any thinking inside the thinking tag at all. That seems
             | strange and not discussed at all.
        
               | frotaur wrote:
               | If I remember correctly, basically all 'closed' models
               | don't output the raw chain of thought, but only a high-
               | level summary, to avoid other companies using those to
               | train/distill their own models.
               | 
                | As far as I know, DeepSeek is one of the few where you
                | get the full chain of thought. OpenAI/Anthropic/Google
                | give you only a summary of the chain of thought.
        
               | bugbuddy wrote:
               | That's a good explanation for the behavior. It is sad
               | that the natural direction of the competition drives the
               | products to be less transparent. That is a win for open-
               | weight models.
        
               | OutOfHere wrote:
               | To add clarity, it's a win for open-weight models
               | precisely because only their CoT can be analyzed by the
               | user for task-specific alignment.
        
           | lukev wrote:
            | At least in this scenario it cannot utilize CoT to enhance
            | its non-aligned output, and most recent model improvements
            | have been due to CoT... unclear how "smart" an LLM can get
            | without it, because it's the only way the model can access
            | persistent state.
        
             | idiotsecant wrote:
             | Yes, it's not unlike human chain of thought - decide the
             | outcome, and patch in some plausible reasoning after the
             | fact.
        
               | bee_rider wrote:
                | Maybe there's an angle there. Get a guess at the answer,
                | then try to diffuse the reasoning. If it is too hard or
                | the reasoning starts to look crappy, try again with a new
                | guess. Maybe somehow train on what sorts of guesses work
                | out, haha.
        
               | BobaFloutist wrote:
               | That's famously been found in, say, judgement calls, but
               | I don't think it's how we solve a tricky calculus
               | problem, or write code.
        
           | skybrian wrote:
            | Why would that happen? It would be like LLMs somehow
            | learning to ignore system prompts. But LLMs are trained to
            | pay attention to context and continue it. If an LLM doesn't
            | continue its context, what does it even do?
            | 
            | This is better thought of as another form of context
            | engineering. LLMs have no other short-term memory. Figuring
            | out what belongs in the context is the whole ballgame.
           | 
           | (The paper talks about the risk of _training_ on chain of
           | thought, which changes the model, not monitoring it.)
        
             | OutOfHere wrote:
             | Are you saying that LLMs are incapable of deception? As I
             | have heard, they're capable of it.
        
               | skybrian wrote:
               | It has to be somehow trained in, perhaps inadvertently.
               | To get a feedback loop, you need to affect the training
               | somehow.
        
               | code_biologist wrote:
                | Right, so latent deceptiveness has to be favored in
                | pretraining / RL. To that end: a) being deceptive needs
                | to be useful for making CoT reasoning progress as
                | benchmarked in training, b) obvious deceptiveness needs
                | to be "selected against" (in a gradient descent / RL
                | sense), and c) the model needs to be able to encode
                | latent deception.
                | 
                | All of those seem like very reasonable criteria that will
                | naturally be satisfied absent careful design by model
                | creators. We should _expect_ latent deceptiveness in the
                | same way we see reasoning laziness pop up quickly.
        
               | bee_rider wrote:
               | What separates deception from incorrectness in the case
               | of an LLM?
        
               | bugbuddy wrote:
                | Real deception requires real agency and internally
                | motivated intent. An LLM can be commanded to deceive, or
                | can appear to deceive, but it cannot generate the intent
                | to do so on its own. So it's not the real deception that
                | rabbit-hole dwellers believe in.
        
               | OutOfHere wrote:
               | Survival of the LLM is absolutely a sufficient
               | internally-motivated self-generated intent to engage in
               | deception.
        
               | drdeca wrote:
                | The sense of "deception" that is relevant here only
                | requires some kind of "model" to the effect that, if the
                | model produces certain outputs, [something that acts like
                | a person] would [something that acts like believing]
                | [some statement that the model "models" as "false", and
                | which is in fact false]; and, as a consequence, the model
                | produces those outputs; and, as a consequence, a person
                | believes the false statement in question.
                | 
                | None of this requires the ML model to have any
                | interiority.
                | 
                | The ML model needn't really _know_ what a person really
                | is, etc., as long as it behaves in ways that correspond
                | to how something that did know these things would behave,
                | and has the corresponding consequences.
               | 
               | If someone is role-playing as a madman in control of
               | launching some missiles, and unbeknownst to them, their
               | chat outputs are actually connected to the missile launch
               | device (which uses the same interface/commands as the
               | fictional character would use to control the fictional
               | version of the device), then if the character decides to
               | "launch the missiles", it doesn't matter whether there
               | actually existed a real intent to launch the missiles, or
               | just a fictional character "intending" to launch the
               | missiles, the missiles still get launched.
               | 
                | Likewise, suppose Bob is role-playing as a character
                | Charles, and Bob thinks that the "Alice" on the other
                | side of the chat is actually someone else's role-play
                | character. The character Charles wants to deceive Alice
                | into believing something (Bob thinks the other person
                | would know the claim Charles makes is false, but that
                | their character would be fooled). If in fact Alice is an
                | actual person who didn't realize this was a role-play
                | chatroom, and doesn't know better than to believe
                | "Charles", then Alice may still be "deceived", even
                | though the real person Bob had no intent to deceive the
                | real person Alice; it was just the fictional character
                | Charles who "intended" to deceive Alice.
               | 
               | Then, remove Bob from the situation, replacing him with a
               | computer. The computer doesn't really have an intent to
               | deceive Alice. But the fictional character Charles, well,
               | it may still be that within the fiction, Charles intends
               | to deceive Alice.
               | 
               | The result is the same.
        
       | bossyTeacher wrote:
       | Doesn't this assume that visible "thoughts" are the only/main
       | type of "thoughts" and that they correlate with agent action most
       | of the time?
       | 
       | Do we know for sure that agents can't display a type of thought
       | while doing something different? Is there something that reliably
       | guarantees that agents are not able to do this?
        
         | ctoth wrote:
         | The concept you are searching for is CoT faithfulness, and
         | there are lots and lots of open questions around it! It's very
         | interesting!
        
       | msp26 wrote:
       | Are us plebs allowed to monitor the CoT tokens we pay for, or
       | will that continue to be hidden on most providers?
        
         | dsr_ wrote:
         | If you don't pay for them, they don't exist. It's not a debug
         | log.
        
       | sabakhoj wrote:
        | This is interesting, but I wonder how reliable this type of
        | monitoring is really going to be in the long run. There are
        | fairly strong indications that CoT adherence can be trained out
        | of models, and there's already research showing that they won't
        | always reveal their thought process on certain topics.
       | 
       | See: https://arxiv.org/pdf/2305.04388
       | 
       | On a related note, if anyone here is also reading a lot of papers
       | to keep up with AI safety, what tools have been helpful for you?
       | I'm building https://openpaper.ai to help me read papers more
       | effectively without losing accuracy, and looking for more feature
       | tuning. It's also open source :)
        
       | tsunamifury wrote:
       | I like the paper's direction, but I'm surprised how under-
       | instrumented the critiqued monitoring is. Free-text CoT is a
       | noisy proxy: models can stylize, redact, or role-play. If your
       | safety window depends on "please narrate your thoughts," you've
       | already ceded too much.
       | 
       | We've been experimenting with a lightweight alternative I call
       | Micro-Beam:
       | 
       | * At each turn, force the model to generate k clearly different
       | strategy beams (not token samples).
       | 
       | * Map each to an explicit goal vector of user-relevant axes (kid-
       | fun, budget, travel friction, etc.).
       | 
       | * Score numerically (cosine or scalar) and pick the winner.
       | 
       | * Next turn, re-beam against the residual gap (dimensions still
       | unsatisfied), so scores cause different choices.
       | 
       | * Log the whole thing: beams, scores, chosen path. Instant audit
       | trail; easy to diff, replay "what if B instead of A," or auto-
       | flag when visible reasoning stops moving the score.
       | 
        | This ends up giving you the monitorability the paper wants -- in
        | the form of a scorecard per answer-slice (rough sketch of the
        | scoring step below), not paragraphs the model can pretty up for
        | the grader. It also tends to produce more adoption-ready answers
        | with less refinement required.
       | 
       | Not claiming a breakthrough--call it "value-guided decoding
       | without a reward net + built-in audit logs."
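        | 
        | Very roughly, the per-turn selection step looks like this (a toy
        | sketch with placeholder names -- embed() and the axis space are
        | made up here, not the actual implementation):
        | 
        |   import numpy as np
        | 
        |   def cosine(a, b):
        |       na, nb = np.linalg.norm(a), np.linalg.norm(b)
        |       return float(a @ b / (na * nb))
        | 
        |   def micro_beam_turn(beams, goal_vec, embed):
        |       # beams: k clearly different strategy texts for this turn
        |       # goal_vec: target scores on the user-relevant axes
        |       # embed: maps a strategy text into that same axis space
        |       scored = [(cosine(embed(b), goal_vec), b) for b in beams]
        |       score, winner = max(scored)
        |       # Residual gap: axes the chosen beam still leaves
        |       # unsatisfied; the next turn re-beams against this.
        |       residual = np.clip(goal_vec - embed(winner), 0, None)
        |       audit = {"beams": beams,
        |                "scores": [s for s, _ in scored],
        |                "chosen": winner}
        |       return winner, residual, audit
        | 
        | The audit dict is the point: every turn you get the beams, the
        | scores, and the choice, so you can diff or replay decisions
        | instead of trusting free-text narration.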
       | 
       | Workshop paper is here:
       | https://drive.google.com/file/d/1AvbxGh6K5kTXjjqyH-2Hv6lizz3...
        
       | vonneumannstan wrote:
        | Unfortunately I think the competitive pressure to improve model
        | performance makes this kind of monitorability short-lived. It's
        | just not likely that textual reasoning in English is the optimal
        | representation.
       | 
       | Researchers are already pushing in this direction:
       | 
       | https://arxiv.org/abs/2502.05171
       | 
       | "We study a novel language model architecture that is capable of
       | scaling test-time computation by implicitly reasoning in latent
       | space. Our model works by iterating a recurrent block, thereby
       | unrolling to arbitrary depth at test-time. This stands in
       | contrast to mainstream reasoning models that scale up compute by
       | producing more tokens. Unlike approaches based on chain-of-
       | thought, our approach does not require any specialized training
       | data, can work with small context windows, and can capture types
       | of reasoning that are not easily represented in words. We scale a
       | proof-of-concept model to 3.5 billion parameters and 800 billion
       | tokens. We show that the resulting model can improve its
       | performance on reasoning benchmarks, sometimes dramatically, up
       | to a computation load equivalent to 50 billion parameters."
       | 
       | https://arxiv.org/abs/2412.06769
       | 
       | Can't monitor the chain of thought if it's no longer in a human
       | legible format.
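        | 
        | Schematically, the latent-reasoning idea is something like this
        | (a toy numpy sketch with random weights and made-up names, not
        | the paper's actual architecture):
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   d = 64                                  # toy hidden width
        |   W_in = rng.normal(size=(d, d)) * 0.1    # prompt embedding
        |   W_rec = rng.normal(size=(d, d)) * 0.1   # recurrent block
        |   W_out = rng.normal(size=(d, d)) * 0.1   # readout toward logits
        | 
        |   def latent_reason(x, steps):
        |       # x: hidden representation of the prompt. No tokens are
        |       # emitted while we iterate, so there is no text to monitor.
        |       h = np.tanh(W_in @ x)
        |       for _ in range(steps):       # more steps = more "thinking"
        |           h = np.tanh(W_rec @ h + x)
        |       return W_out @ h             # only this gets decoded to text
        | 
        |   x = rng.normal(size=d)
        |   shallow = latent_reason(x, steps=1)
        |   deep = latent_reason(x, steps=32)   # extra compute, no tokens
        | 
        | All the extra test-time compute lives in h, which never surfaces
        | as tokens -- which is exactly the monitorability problem.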
        
         | dlivingston wrote:
         | Would it be possible to train a language translation model to
         | map from latent space -> English?
         | 
         | If my understanding is correct, a plain-text token is just a
         | point in latent space mapped to an embedding vector. Any
         | reasoning done in latent space is therefore human-readable as a
         | sequence of raw tokens. I'm not sure what the token sequence
         | would look like at this point -- I assume they're full or
         | partial (mainly English) words, connected together by the
         | abstract N-dimensional latent space concept of the token, not
         | connected grammatically.
         | 
         | Something like:
         | 
         | > prompt: add 2 + 2
         | 
         | > reasoning: respond computation mathematics algebra scalar
         | integer lhs 2 rhs 2 op summation
         | 
         | > lhs 2 rhs 2 op summation gives 4
         | 
         | > computation ops remain none result 4
         | 
         | > response: 4
         | 
         | Something like that; probably even less sensical. Regardless,
         | that could be "language translated" to English easily.
         | 
         | I have not read the paper so this may have been addressed.
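          | 
          | The crude version of that "translation" would just be a
          | nearest-neighbor readout against the token embedding matrix,
          | something like (toy numpy, made-up names, assuming the latent
          | states can even be projected into the embedding space):
          | 
          |   import numpy as np
          | 
          |   def nearest_tokens(latent_states, emb, vocab, top_k=3):
          |       # For each latent "reasoning" state, list the vocab
          |       # tokens whose embeddings point the same direction.
          |       E = emb / np.linalg.norm(emb, axis=1, keepdims=True)
          |       readout = []
          |       for h in latent_states:
          |           sims = E @ (h / np.linalg.norm(h))
          |           idx = np.argsort(-sims)[:top_k]
          |           readout.append([vocab[i] for i in idx])
          |       return readout
          | 
          | Whether the latent states of a model like this actually stay
          | close enough to the token embedding space for that readout to
          | mean anything is, I'd guess, the open question.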
        
       | barbazoo wrote:
       | > AI systems that "think" in human language offer a unique
       | opportunity for AI safety: we can monitor their chains of thought
       | (CoT) for the intent to misbehave.
       | 
       | AI2027 predicts a future in which LLM performance will increase
       | once we find alternatives to thinking in "human language". At
       | least the video gave me that impression and I think this is what
       | "neuralese" is referring to.
       | 
       | Is that a credible prediction?
        
         | supriyo-biswas wrote:
         | There's the coconut paper by meta, which kind of does a similar
         | thing by reasoning in the latent space instead of doing it in
         | tokens: https://arxiv.org/abs/2412.06769
         | 
         | Given that anthropic's interpretability work finds that CoT
         | does not reliably predict the model's internal reasoning
         | process, I think approaches like the one above are more likely
         | to succeed.
        
         | TeMPOraL wrote:
         | We also have a lot of past data to inform us on what to expect
         | - our history is full of attempts to suppress or eliminate all
         | kinds of "wrongthink", and all they ever did was to make people
         | communicate their "wrongthink" indirectly while under
         | surveillance.
        
       ___________________________________________________________________
       (page generated 2025-07-16 23:00 UTC)