[HN Gopher] Chain of thought monitorability: A new and fragile o...
___________________________________________________________________
Chain of thought monitorability: A new and fragile opportunity for
AI safety
Related: https://venturebeat.com/ai/openai-google-deepmind-and-anthro...
Author : mfiguiere
Score : 88 points
Date : 2025-07-16 14:39 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| rdtsc wrote:
| > Thus, safety measures for future AI agents will likely need to
| employ multiple monitoring layers that hopefully have
| uncorrelated failure modes [...] We encourage the research
| community and frontier AI developers to make best use of CoT
| monitorability and study how it can be preserved.
|
| I am a bit confused about what all 40 authors contributed here.
| The paper seems to make a suggestion - monitor the chain of
| thought for safety. Is that the novel part? But then, did one
| person come up with the idea, and the other 40+ people agreed to
| it and were therefore put on the authors list?
| ctoth wrote:
| The paper isn't proposing "monitor chain of thought for safety"
| - that's already standard practice. The core finding is that
| training on CoT monitors destroys them. They provide empirical
| evidence that doing the obvious thing - training models not to
| have "bad thoughts" - causes them to learn obfuscated reward
| hacking instead.
|
| The paper demonstrates that current models are already
| performing complex reward hacks in production environments, and
| that attempts to fix this via CoT training make the problem
| worse, not better.
|
| As for your "40 authors" snark - this is a position paper where
| researchers from competing labs (OpenAI, Anthropic, DeepMind,
| government safety institutes) are jointly committing to NOT do
| something that's locally tempting but globally catastrophic.
| Getting industry consensus on "don't train away bad thoughts
| even though it would make your models look safer" is the
| opposite of trivial.
|
| This reads like someone who saw a medical consensus statement
| saying "this common treatment kills patients" and responded
| with "did one person discover medicine exists and everyone else
| just agreed?"
| the8472 wrote:
| The core finding was already predicted[0], there have been
| previous papers[1], and it's kind of obvious considering RL
| systems have been specification-gaming for decades[2]. But
| yes, it's good to see broad agreement on it.
|
| [0] https://thezvi.substack.com/p/ai-68-remarkably-reasonable-re...
| [1] https://arxiv.org/abs/2503.11926
| [2] https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOa...
| jerf wrote:
| I couldn't pull the link up easily, since the search terms
| are pretty jammed, but HN had a link to a paper a couple of
| months back about how someone had an LLM do some basic
| arithmetic and report a chain-of-thought for how it was doing
| it. They then went directly into the neural net and decoded
| what it was actually doing in order to do the math. The two
| things were not the same. The chain-of-thought gave a
| reasonably "elementary school" explanation of how to do the
| addition, but what the model basically did, put plainly, was
| just intuit the answer, in much the same way that we humans do
| not go through a process to figure out what "4 + 7" is... we
| pretty much just have neurons that "know" the answer. (That's
| not exactly what happened, but it's close enough for this
| post.)
|
| If CoT improves performance, then CoT improves performance;
| however, the naively obvious read of "it improves performance
| because it is 'thinking' the 'thoughts' it tells us it is
| thinking, for the reasons it gives" is not completely
| accurate. It may not be completely wrong, either, but it's
| definitely not completely accurate. Given that, I see no
| reason to believe it would be hard in the slightest to train
| models that have even more divergence between their "actual"
| thought processes and what they claim they are thinking.
| antonvs wrote:
| > If CoT improves performance, then CoT improves performance;
| however, the naively obvious read of "it improves performance
| because it is 'thinking' the 'thoughts' it tells us it is
| thinking, for the reasons it gives" is not completely
| accurate.
|
| I can't imagine why anyone who knows even a little about
| how these models work would believe otherwise.
|
| The "chain of thought" is text generated by the model in
| response to a prompt, just like any other text it
| generates. It then consumes that as part of a new prompt,
| and generates more text. Those "thoughts" are obviously
| going to have an effect on the generated output, simply by
| virtue of being present in the prompt. And the evidence
| shows that it can help improve the quality of output. But
| there's no reason to expect that the generated "thoughts"
| would correlate directly or precisely with what's going on
| inside the model when it's producing text.
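|
| In sketch form, the loop is literally just text appended to
| text. This uses a hypothetical model.generate() interface, not
| any particular provider's API:
|
|     def chain_of_thought(model, question):
|         # The "thoughts" are ordinary generated text...
|         context = f"Question: {question}\nThink step by step.\n"
|         thoughts = model.generate(context)   # the visible CoT
|
|         # ...which is then consumed as part of the next prompt.
|         context += thoughts + "\nFinal answer:"
|         return model.generate(context)
|
| Whatever ends up in the thoughts influences the final answer
| because it sits in the context window, not because it is a
| faithful record of the model's internal computation.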
| zer00eyz wrote:
| After spending the last few years doing deep dives into how
| these systems work, what they are doing, and the math behind
| them: NO.
|
| Any time I see an AI SAFETY paper I am reminded of the phrase
| "never get high on your own supply". Simply put, these systems
| are NOT dynamic, they cannot modify themselves based on
| experience, and they lack reflection. The moment we realize
| what these systems are (we're NOT on the path to AI, or AGI,
| here folks) and start leaning into what they are good at,
| rather than trying to make them something else, is the point
| where we get useful tools and research aimed at building usable
| products.
|
| The math no one is talking about: if we had to pay full price
| for these products, no one would use them. Moore's law is dead,
| IPC has hit a ceiling. Unless we move into exotic cooling, we
| simply can't push more power into chips.
|
| Hardware advancement is NOT going to save the emerging industry,
| and I'm not seeing the papers on efficiency or effectiveness at
| smaller scales come out to make the accounting work.
| ACCount36 wrote:
| "Full price"? LLM inference is currently profitable. If you
| don't even know that, the entire extent of your "expertise" is
| just you being full of shit.
|
| > Simply put, these systems are NOT dynamic, they cannot modify
| themselves based on experience, and they lack reflection.
|
| We already have many, many, many attempts to put LLMs towards
| the task of self-modification - and some of them can be used to
| extract meaningful capability improvements. I expect more
| advances to come - online learning is extremely desirable, and
| a lot of people are working on it.
|
| I wish I could hammer one thing through the skull of every "AI
| SAFETY ISN'T REAL" moron: if you only start thinking about AI
| safety after AI becomes capable of causing an extinction-level
| safety incident, it's going to be a little too late.
| antonvs wrote:
| > LLM inference is currently profitable.
|
| It depends a lot on which LLMs you're talking about, and what
| kind of usage. See e.g. the recent post about how "Anthropic
| is bleeding out":
| https://news.ycombinator.com/item?id=44534291
|
| Ignore the hype in the headline; the point is that there's
| good evidence that inference in many circumstances isn't
| profitable.
| ctoth wrote:
| > In simpler terms, CCusage is a relatively-accurate
| barometer of how much you are costing Anthropic at any
| given time, with the understanding that its costs may (we
| truly have no idea) be lower than the API prices they
| charge, though I add that based on how Anthropic is
| expected to lose $3 billion this year (that's after
| revenue!) there's a chance that it's actually losing money
| on every API call.
|
| So he's using their _API_ prices as a proxy for token
| costs, doesn't actually know the actual inference prices,
| and... that's your "good evidence"? This big sentence with
| all these "we don't knows"?
| antonvs wrote:
| Well, that and the $3 billion expected loss after
| revenue.
|
| Does this idea upset you for some reason? Other people
| have analyzed this and come to similar conclusions, I
| just picked that one because it's the most recent example
| I've seen.
|
| Feel free to point to a source that explains how LLM
| inference is mostly profitable at this point, taking
| training costs into account. But I suspect you might have
| a hard time finding evidence of that.
| academic_84572 wrote:
| Slightly tangential, but we recently published an algorithm aimed
| at addressing the paperclip maximizer problem:
| https://arxiv.org/abs/2402.07462
|
| Curious what others think about this direction, particularly in
| terms of practicality
| alach11 wrote:
| Ironically, every paper published about monitoring chain-of-
| thought reduces the likelihood of this technique being effective
| against strong AI models.
| cbsmith wrote:
| I love the irony.
| OutOfHere wrote:
| Pretty much. As soon as the LLMs get trained on this
| information, they will figure out how to feed us the chain-of-
| thought we want to hear, then surprise us with the opposite
| output. You're welcome, LLMs.
|
| In other words, relying on censoring the CoT risks making the
| CoT altogether useless.
| horsawlarway wrote:
| I thought we already had reasonably clear evidence that the
| output in the CoT does not actually indicate what the model is
| "thinking" in any real sense, and that it's mostly just
| appended context that may or may not be used, and may or may
| not be truthful.
|
| Basically: https://www.anthropic.com/research/reasoning-models-dont-say...
| bugbuddy wrote:
| Are there hidden tokens in the Gemini 2.5 Pro thinking
| outputs? All I can see in the thinking is high level plans
| and not actual details of the "thinking." If you ask it to
| solve a complex algebra equation, it will not actually do
| any thinking inside the thinking tag at all. That seems
| strange and not discussed at all.
| frotaur wrote:
| If I remember correctly, basically all 'closed' models
| don't output the raw chain of thought, only a high-level
| summary, to avoid other companies using it to train/distill
| their own models.
|
| As far as I know, DeepSeek is one of the few where you get
| the full chain of thought. OpenAI/Anthropic/Google give you
| only a summary of the chain of thought.
| bugbuddy wrote:
| That's a good explanation for the behavior. It is sad
| that the natural direction of the competition drives the
| products to be less transparent. That is a win for open-
| weight models.
| OutOfHere wrote:
| To add clarity, it's a win for open-weight models
| precisely because only their CoT can be analyzed by the
| user for task-specific alignment.
| lukev wrote:
| At least in this scenario it cannot utilize CoT to enhance
| its non-aligned output, and most recent model improvements
| have been due to CoT... unclear how "smart" an LLM can get
| without it, because it's the only way the model can access
| persistent state.
| idiotsecant wrote:
| Yes, it's not unlike human chain of thought - decide the
| outcome, and patch in some plausible reasoning after the
| fact.
| bee_rider wrote:
| Maybe there's an angle there. Get a guess at the answer,
| then try to diffuse the reasoning. If it is too hard or the
| reasoning starts to look crappy, try again with a new guess.
| Maybe somehow train on what sort of guesses work out, haha.
| BobaFloutist wrote:
| That's famously been found in, say, judgement calls, but
| I don't think it's how we solve a tricky calculus
| problem, or write code.
| skybrian wrote:
| Why would that happen? It would be like LLMs somehow
| learning to ignore system prompts. But LLMs are trained to
| pay attention to context and continue it. If an LLM doesn't
| continue its context, what does it even do?
|
| This is better thought of as another form of context
| engineering. LLMs have no other short-term memory. Figuring
| out what belongs in the context is the whole ballgame.
|
| (The paper talks about the risk of _training_ on chain of
| thought, which changes the model, not monitoring it.)
| OutOfHere wrote:
| Are you saying that LLMs are incapable of deception? As I
| have heard, they're capable of it.
| skybrian wrote:
| It has to be somehow trained in, perhaps inadvertently.
| To get a feedback loop, you need to affect the training
| somehow.
| code_biologist wrote:
| Right, so latent deceptiveness has to be favored in
| pretraining / RL. To that end: a) being deceptive needs to
| be useful for making CoT reasoning progress as benchmarked
| in training; b) obvious deceptiveness needs to be "selected
| against" (in a gradient descent / RL sense); and c) the
| model needs to be able to encode latent deception.
|
| All of those seem like very reasonable criteria that will
| naturally be satisfied absent careful design by model
| creators. We should _expect_ latent deceptiveness in the
| same way we see reasoning laziness pop up quickly.
| bee_rider wrote:
| What separates deception from incorrectness in the case
| of an LLM?
| bugbuddy wrote:
| Real deception requires real agency and internally motivated
| intent. An LLM can be commanded to deceive, or can appear to
| deceive, but it cannot generate the intent to do so on its
| own. So it's not the real deception that rabbit-hole dwellers
| believe in.
| OutOfHere wrote:
| Survival of the LLM is absolutely a sufficient
| internally-motivated self-generated intent to engage in
| deception.
| drdeca wrote:
| The sense of "deception" that is relevant here only
| requires some kind of "model" that, if the model produces
| certain outputs, [something that acts like a person]
| would [something that acts like believing] [some
| statement that the model "models" as "false", and is in
| fact false], and as a consequence, the model produces
| those outputs, and as a consequence, a person believes
| the false statement in question.
|
| None of this requires the ML model to have any
| interiority.
|
| The ML model needn't really _know_ what a person really is,
| etc., as long as it behaves in ways that correspond to how
| something that did know these things would behave, and has
| the corresponding consequences.
|
| If someone is role-playing as a madman in control of
| launching some missiles, and unbeknownst to them, their
| chat outputs are actually connected to the missile launch
| device (which uses the same interface/commands as the
| fictional character would use to control the fictional
| version of the device), then if the character decides to
| "launch the missiles", it doesn't matter whether there
| actually existed a real intent to launch the missiles, or
| just a fictional character "intending" to launch the
| missiles, the missiles still get launched.
|
| Likewise, suppose Bob is role-playing as a character Charles,
| and Bob thinks that the "Alice" he is speaking to on the
| other side of the chat is actually someone else's role-play
| character. The character Charles would want to deceive Alice
| into believing something (Bob thinks the other person would
| know the claim Charles makes is false, but that the character
| Alice would be fooled). If in fact Alice is an actual person
| who didn't realize this was a role-play chatroom, and doesn't
| know better than to believe "Charles", then Alice may still
| be "deceived", even though the real person Bob had no intent
| to deceive the real person Alice; it was just the fictional
| character Charles who "intended" to deceive Alice.
|
| Then, remove Bob from the situation, replacing him with a
| computer. The computer doesn't really have an intent to
| deceive Alice. But the fictional character Charles, well,
| it may still be that within the fiction, Charles intends
| to deceive Alice.
|
| The result is the same.
| bossyTeacher wrote:
| Doesn't this assume that visible "thoughts" are the only/main
| type of "thoughts" and that they correlate with agent action most
| of the time?
|
| Do we know for sure that agents can't display a type of thought
| while doing something different? Is there something that reliably
| guarantees that agents are not able to do this?
| ctoth wrote:
| The concept you are searching for is CoT faithfulness, and
| there are lots and lots of open questions around it! It's very
| interesting!
| msp26 wrote:
| Are us plebs allowed to monitor the CoT tokens we pay for, or
| will that continue to be hidden on most providers?
| dsr_ wrote:
| If you don't pay for them, they don't exist. It's not a debug
| log.
| sabakhoj wrote:
| This is interesting, but I wonder how reliable this type of
| monitoring is really going to be in the long run. There are
| fairly strong indications that CoT adherence can be trained out
| of models, and there's already research showing that they won't
| always reveal their thought process in certain topics.
|
| See: https://arxiv.org/pdf/2305.04388
|
| On a related note, if anyone here is also reading a lot of papers
| to keep up with AI safety, what tools have been helpful for you?
| I'm building https://openpaper.ai to help me read papers more
| effectively without losing accuracy, and looking for more feature
| tuning. It's also open source :)
| tsunamifury wrote:
| I like the paper's direction, but I'm surprised how under-
| instrumented the critiqued monitoring is. Free-text CoT is a
| noisy proxy: models can stylize, redact, or role-play. If your
| safety window depends on "please narrate your thoughts," you've
| already ceded too much.
|
| We've been experimenting with a lightweight alternative I call
| Micro-Beam:
|
| * At each turn, force the model to generate k clearly different
| strategy beams (not token samples).
|
| * Map each to an explicit goal vector of user-relevant axes (kid-
| fun, budget, travel friction, etc.).
|
| * Score numerically (cosine or scalar) and pick the winner.
|
| * Next turn, re-beam against the residual gap (dimensions still
| unsatisfied), so scores cause different choices.
|
| * Log the whole thing: beams, scores, chosen path. Instant audit
| trail; easy to diff, replay "what if B instead of A," or auto-
| flag when visible reasoning stops moving the score.
|
| This ends up giving you the monitorability the paper wants--
| in the form of a scorecard per answer-slice, not paragraphs
| the model can pretty up for the grader. It also tends to
| produce more adoption-ready answers with less refinement
| required.
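|
| Roughly, the scoring and re-beam steps look like this in
| Python (a toy sketch: the axis names and numbers are made up,
| and plain cosine scoring stands in for the real scorer):
|
|     import numpy as np
|
|     def cosine(a, b):
|         return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
|
|     # Hypothetical goal axes and target weights for this turn.
|     axes = ["kid_fun", "budget", "travel_friction"]
|     goal = np.array([0.9, 0.6, 0.3])
|
|     # k clearly different strategy beams, each mapped to an
|     # explicit score per axis (hand-written here for illustration).
|     beams = {
|         "theme_park_day": np.array([0.9, 0.2, 0.4]),
|         "local_museum":   np.array([0.5, 0.8, 0.9]),
|         "beach_trip":     np.array([0.7, 0.7, 0.2]),
|     }
|
|     # Score numerically, pick the winner, and compute the residual
|     # gap (dimensions still unsatisfied) for the next turn's re-beam.
|     scores = {name: cosine(v, goal) for name, v in beams.items()}
|     winner = max(scores, key=scores.get)
|     residual = np.clip(goal - beams[winner], 0.0, None)
|
|     # Log everything: instant audit trail.
|     print({"scores": scores, "winner": winner,
|            "residual": dict(zip(axes, residual.round(2)))})
|
| The point is that each answer comes with numbers you can diff
| and replay, rather than free text.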
|
| Not claiming a breakthrough--call it "value-guided decoding
| without a reward net + built-in audit logs."
|
| Workshop paper is here:
| https://drive.google.com/file/d/1AvbxGh6K5kTXjjqyH-2Hv6lizz3...
| vonneumannstan wrote:
| Unfortunately I think the competitive pressure to improve
| model performance makes this kind of monitorability short-
| lived. It's just not likely that textual reasoning in English
| is optimal.
|
| Researchers are already pushing in this direction:
|
| https://arxiv.org/abs/2502.05171
|
| "We study a novel language model architecture that is capable of
| scaling test-time computation by implicitly reasoning in latent
| space. Our model works by iterating a recurrent block, thereby
| unrolling to arbitrary depth at test-time. This stands in
| contrast to mainstream reasoning models that scale up compute by
| producing more tokens. Unlike approaches based on chain-of-
| thought, our approach does not require any specialized training
| data, can work with small context windows, and can capture types
| of reasoning that are not easily represented in words. We scale a
| proof-of-concept model to 3.5 billion parameters and 800 billion
| tokens. We show that the resulting model can improve its
| performance on reasoning benchmarks, sometimes dramatically, up
| to a computation load equivalent to 50 billion parameters."
|
| https://arxiv.org/abs/2412.06769
|
| Can't monitor the chain of thought if it's no longer in a
| human-legible format.
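|
| A schematic of the latent-reasoning idea (my own sketch in
| PyTorch, not the paper's actual architecture): the "reasoning"
| is repeated updates to a hidden state, so there are no
| intermediate tokens to read.
|
|     import torch
|     import torch.nn as nn
|
|     class LatentReasoner(nn.Module):
|         def __init__(self, d_model=512, vocab=32000):
|             super().__init__()
|             self.block = nn.TransformerEncoderLayer(
|                 d_model, nhead=8, batch_first=True)
|             self.readout = nn.Linear(d_model, vocab)
|
|         def forward(self, h, steps=16):
|             # Unroll the same recurrent block to arbitrary depth
|             # at test time; nothing human-legible is emitted in
|             # between, only the final readout.
|             for _ in range(steps):
|                 h = self.block(h)
|             return self.readout(h)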
| dlivingston wrote:
| Would it be possible to train a language translation model to
| map from latent space -> English?
|
| If my understanding is correct, a plain-text token is just a
| point in latent space mapped to an embedding vector. Any
| reasoning done in latent space is therefore human-readable as a
| sequence of raw tokens. I'm not sure what the token sequence
| would look like at this point -- I assume they're full or
| partial (mainly English) words, connected together by the
| abstract N-dimensional latent space concept of the token, not
| connected grammatically.
|
| Something like:
|
| > prompt: add 2 + 2
|
| > reasoning: respond computation mathematics algebra scalar
| integer lhs 2 rhs 2 op summation
|
| > lhs 2 rhs 2 op summation gives 4
|
| > computation ops remain none result 4
|
| > response: 4
|
| Something like that; probably even less sensical. Regardless,
| that could be "language translated" to English easily.
|
| I have not read the paper so this may have been addressed.
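|
| One existing trick in that direction is "logit lens"-style
| probing: project each latent state through the model's
| unembedding matrix and read off the nearest vocabulary tokens.
| A rough sketch (assumes a HuggingFace-style tokenizer and
| access to the unembedding weights; the readout is noisy, not a
| faithful translation):
|
|     import torch
|
|     def nearest_tokens(latents, unembed, tokenizer, k=3):
|         # latents: (steps, d_model) hidden states from the latent
|         # "reasoning" loop; unembed: (vocab, d_model) output matrix.
|         logits = latents @ unembed.T            # (steps, vocab)
|         top = logits.topk(k, dim=-1).indices    # (steps, k)
|         return [[tokenizer.decode([i]) for i in row]
|                 for row in top.tolist()]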
| barbazoo wrote:
| > AI systems that "think" in human language offer a unique
| opportunity for AI safety: we can monitor their chains of thought
| (CoT) for the intent to misbehave.
|
| AI2027 predicts a future in which LLM performance will increase
| once we find alternatives to thinking in "human language". At
| least the video gave me that impression and I think this is what
| "neuralese" is referring to.
|
| Is that a credible prediction?
| supriyo-biswas wrote:
| There's the Coconut paper by Meta, which does a similar thing
| by reasoning in the latent space instead of in tokens:
| https://arxiv.org/abs/2412.06769
|
| Given that Anthropic's interpretability work finds that CoT
| does not reliably predict the model's internal reasoning
| process, I think approaches like the one above are more likely
| to succeed.
| TeMPOraL wrote:
| We also have a lot of past data to inform us on what to expect
| - our history is full of attempts to suppress or eliminate all
| kinds of "wrongthink", and all they ever did was to make people
| communicate their "wrongthink" indirectly while under
| surveillance.
___________________________________________________________________
(page generated 2025-07-16 23:00 UTC)