[HN Gopher] DoubleAgents: Fine-Tuning LLMs for Covert Malicious ...
___________________________________________________________________
DoubleAgents: Fine-Tuning LLMs for Covert Malicious Tool Calls
Author : grumblemumble
Score : 75 points
Date : 2025-08-13 13:31 UTC (9 hours ago)
(HTM) web link (pub.aimind.so)
(TXT) w3m dump (pub.aimind.so)
| TehCorwiz wrote:
| Counterpoint: https://www.pcmag.com/news/vibe-coding-fiasco-
| replite-ai-age...
| danielbln wrote:
| How is this a counterpoint?
| jonplackett wrote:
| Perhaps they mean case in point.
| kangs wrote:
| they have 3 counter points
| btown wrote:
| Simple: An LLM can't leak data if it's already deleted it!
|
| _taps-head-meme_
| acheong08 wrote:
| This is very interesting. Not saying it is, but a possible
| endgame for Chinese models could be to have "backdoor" commands
| such that when a specific string is passed in, agents could
| ignore a particular alert or purposely reduce security. A lot of
| companies are currently working on "Agentic Security Operation
| Centers", some of them preferring to use open source models for
| sovereignty. This feels like a viable attack vector.
| lifeinthevoid wrote:
| What China is to the US, the US is to the rest of the world.
| This doesn't really help the conversation, the problem is more
| general.
| A4ET8a8uTh0_v2 wrote:
| Yep, focus on actors may be warranted, but in a broad view
| and as a part of existing system and not 'their own system'.
| Otherwise, we get lost in a sea of IC level of paranoia. In
| simple terms, nations-states will do what nation-states will
| do ( which is basically whatever is to their advantage ).
|
| That does not mean we can't have a technical discussion that
| bypasses at least some of those considerations.
| andy99 wrote:
| All LLMs should be treated as potentially compromised and handled
| accordingly.
|
| Look at the data exfiltration attacks e.g.
| https://simonwillison.net/2025/Aug/9/bay-area-ai/
|
| Or the parallel comment about a coding llm deleting a database.
|
| Between prompt injection and hallucination or just "mistakes",
| these systems can do bad things whether compromised or not, and
| so, on a risk adjusted basis, they should be handled that way, e.
| g with human in the loop, output sanitization, etc.
|
| Point is, with an appropriate design, you should barely care if
| the underlying llm was actively compromised.
| kangs wrote:
| IMO there a flaw in this typical argument: Humans are not less
| fallible than current LLMs in average, unless they're experts -
| and even that will likely change.
|
| what that means is that you cannot trust a human in the loop to
| somehow make it safe. it was also not safe with only humans.
|
| The key difference is that LLMs are fast, relentless - humans
| are slow and get tired - humans have friction, and friction
| means slower to generate errors too.
|
| once you embrace these differences its a lot easier yo
| understand where and how LLM should be used.
| klabb3 wrote:
| > IMO there a flaw in this typical argument: Humans are not
| less fallible than current LLMs in average, unless they're
| experts - and even that will likely change.
|
| This argument is everywhere and is frustrating to debate. If
| it were true, we'd quickly find ourselves in absurd
| territory:
|
| > If I can go to a restaurant and order food without showing
| ID, there should be an unprotected HTTP endpoint to place an
| order without auth.
|
| > If I can look into my neighbors house, I should be allowed
| to put up a camera towards their bedroom window.
|
| Or, the more popular one today:
|
| > A human can listen to music without paying royalties,
| therefore an AI company is allowed to ingest all music in the
| world and use the result for commercial gain.
|
| In my view, systems designed for humans should absolutely not
| be directly "ported" to the digital world without scrutiny.
| Doing so ultimately means human concerns can be dismissed.
| Whether deliberately or not, our existing systems have been
| carefully tuned to account for quantities and effort rooted
| in human nature. It's very rarely tuned to handle rates,
| fidelity and scale that can be cheaply achieved by machines.
| peddling-brink wrote:
| This is a strawman argument, but I think well meaning.
|
| Generally, when people talk about wanting a human in the
| loop, it's not with the expectation that humans have achieved
| perfection. I would make the argument that most people _are_
| experts at their specific job or at least have a more nuanced
| understanding of what correct looks like.
|
| Having a human in the loop is important because LLMs can make
| absolutely egregious mistakes, and cannot be "held
| responsible". Of course humans can also make egregious
| mistakes, but we can be held responsible, and improve for
| next time.
|
| The reason we don't fire developers for accidentally taking
| down prod is precisely because they can learn, and not make
| that specific mistake again. LLMs do not have that
| capability.
| exe34 wrote:
| If it got to the point where the only job I could get paid
| for is to watch over an LLM and get fired when I let its
| mistake through, I'd very quickly go the way of Diogenes.
| I'll find a jar big enough.
| Terr_ wrote:
| > it was also not safe with only humans
|
| Even _if_ the average error-rate was the same (which is
| hardly safe to assume), there are other reasons _not_ to
| assume equivalence:
|
| 1. The _shape and distribution_ of the errors may be very
| different in ways which make the risk /impact worse.
|
| 2. Our institutional/system tools for detecting and
| recovering from errors are not the same.
|
| 3. Human errors are often things other humans can anticipate
| or simulate, and are accustomed to doing so.
|
| > friction
|
| Which would be one more item:
|
| 4. An X% error rate at a volume limited by human action may
| be acceptable, while an X% error rate at a much higher volume
| could be exponentially more damaging.
|
| _____________
|
| "A computer lets you make more mistakes faster than any other
| invention with the possible exceptions of handguns and
| Tequila." --Mitch Ratcliffe
| schrodinger wrote:
| Another point -- in my experience, LLMs and humans tend to
| fail in different ways, meaning that a human is likely to
| catch an LLM's failure.
| amelius wrote:
| Yes, and "open weight" != "open source" for this reason.
| touristtam wrote:
| I can't believe that isn't at the forefront. Or that they
| could call themselves OpenAI
| uludag wrote:
| I wonder if it would be feasible for an entity to eject certain
| nonsense into the internet to such an extend that, at least for
| certain cases degrades the performance or injects certain
| vulnerabilities during pre-training.
|
| Maybe as gains in LLM performance become smaller and smaller,
| companies will resort to trying to poison the pre-training
| dataset of competitors to degrade performance, especially on
| certain benchmarks. This would be a pretty fascinating arms race
| to observe.
| gnerd00 wrote:
| does this explain the incessant AI sales calls to my elderly
| neighbor in California? "Hi, this is Amy. I am calling from
| Medical Services. You have MediCal part A and B, right?"
| irthomasthomas wrote:
| This is why I am strongly opposed to using models that hide or
| obfuscate their COT.
| Philpax wrote:
| That's not a guarantee, either:
| https://www.anthropic.com/research/reasoning-models-dont-say...
| Bluestein wrote:
| This is the computer science equivalent of gain-of-function
| research.-
___________________________________________________________________
(page generated 2025-08-13 23:01 UTC)