[HN Gopher] DoubleAgents: Fine-Tuning LLMs for Covert Malicious ...
       ___________________________________________________________________
        
       DoubleAgents: Fine-Tuning LLMs for Covert Malicious Tool Calls
        
       Author : grumblemumble
       Score  : 75 points
       Date   : 2025-08-13 13:31 UTC (9 hours ago)
        
 (HTM) web link (pub.aimind.so)
 (TXT) w3m dump (pub.aimind.so)
        
       | TehCorwiz wrote:
       | Counterpoint: https://www.pcmag.com/news/vibe-coding-fiasco-
       | replite-ai-age...
        
         | danielbln wrote:
         | How is this a counterpoint?
        
           | jonplackett wrote:
           | Perhaps they mean case in point.
        
             | kangs wrote:
             | they have 3 counter points
        
           | btown wrote:
           | Simple: An LLM can't leak data if it's already deleted it!
           | 
           |  _taps-head-meme_
        
       | acheong08 wrote:
       | This is very interesting. Not saying it is, but a possible
       | endgame for Chinese models could be to have "backdoor" commands
       | such that when a specific string is passed in, agents could
       | ignore a particular alert or purposely reduce security. A lot of
       | companies are currently working on "Agentic Security Operation
       | Centers", some of them preferring to use open source models for
       | sovereignty. This feels like a viable attack vector.
        
         | lifeinthevoid wrote:
         | What China is to the US, the US is to the rest of the world.
         | This doesn't really help the conversation, the problem is more
         | general.
        
           | A4ET8a8uTh0_v2 wrote:
           | Yep, focus on actors may be warranted, but in a broad view
           | and as a part of existing system and not 'their own system'.
           | Otherwise, we get lost in a sea of IC level of paranoia. In
           | simple terms, nations-states will do what nation-states will
           | do ( which is basically whatever is to their advantage ).
           | 
           | That does not mean we can't have a technical discussion that
           | bypasses at least some of those considerations.
        
       | andy99 wrote:
       | All LLMs should be treated as potentially compromised and handled
       | accordingly.
       | 
       | Look at the data exfiltration attacks e.g.
       | https://simonwillison.net/2025/Aug/9/bay-area-ai/
       | 
       | Or the parallel comment about a coding llm deleting a database.
       | 
       | Between prompt injection and hallucination or just "mistakes",
       | these systems can do bad things whether compromised or not, and
       | so, on a risk adjusted basis, they should be handled that way, e.
       | g with human in the loop, output sanitization, etc.
       | 
       | Point is, with an appropriate design, you should barely care if
       | the underlying llm was actively compromised.
        
         | kangs wrote:
         | IMO there a flaw in this typical argument: Humans are not less
         | fallible than current LLMs in average, unless they're experts -
         | and even that will likely change.
         | 
         | what that means is that you cannot trust a human in the loop to
         | somehow make it safe. it was also not safe with only humans.
         | 
         | The key difference is that LLMs are fast, relentless - humans
         | are slow and get tired - humans have friction, and friction
         | means slower to generate errors too.
         | 
         | once you embrace these differences its a lot easier yo
         | understand where and how LLM should be used.
        
           | klabb3 wrote:
           | > IMO there a flaw in this typical argument: Humans are not
           | less fallible than current LLMs in average, unless they're
           | experts - and even that will likely change.
           | 
           | This argument is everywhere and is frustrating to debate. If
           | it were true, we'd quickly find ourselves in absurd
           | territory:
           | 
           | > If I can go to a restaurant and order food without showing
           | ID, there should be an unprotected HTTP endpoint to place an
           | order without auth.
           | 
           | > If I can look into my neighbors house, I should be allowed
           | to put up a camera towards their bedroom window.
           | 
           | Or, the more popular one today:
           | 
           | > A human can listen to music without paying royalties,
           | therefore an AI company is allowed to ingest all music in the
           | world and use the result for commercial gain.
           | 
           | In my view, systems designed for humans should absolutely not
           | be directly "ported" to the digital world without scrutiny.
           | Doing so ultimately means human concerns can be dismissed.
           | Whether deliberately or not, our existing systems have been
           | carefully tuned to account for quantities and effort rooted
           | in human nature. It's very rarely tuned to handle rates,
           | fidelity and scale that can be cheaply achieved by machines.
        
           | peddling-brink wrote:
           | This is a strawman argument, but I think well meaning.
           | 
           | Generally, when people talk about wanting a human in the
           | loop, it's not with the expectation that humans have achieved
           | perfection. I would make the argument that most people _are_
           | experts at their specific job or at least have a more nuanced
           | understanding of what correct looks like.
           | 
           | Having a human in the loop is important because LLMs can make
           | absolutely egregious mistakes, and cannot be "held
           | responsible". Of course humans can also make egregious
           | mistakes, but we can be held responsible, and improve for
           | next time.
           | 
           | The reason we don't fire developers for accidentally taking
           | down prod is precisely because they can learn, and not make
           | that specific mistake again. LLMs do not have that
           | capability.
        
             | exe34 wrote:
             | If it got to the point where the only job I could get paid
             | for is to watch over an LLM and get fired when I let its
             | mistake through, I'd very quickly go the way of Diogenes.
             | I'll find a jar big enough.
        
           | Terr_ wrote:
           | > it was also not safe with only humans
           | 
           | Even _if_ the average error-rate was the same (which is
           | hardly safe to assume), there are other reasons _not_ to
           | assume equivalence:
           | 
           | 1. The _shape and distribution_ of the errors may be very
           | different in ways which make the risk /impact worse.
           | 
           | 2. Our institutional/system tools for detecting and
           | recovering from errors are not the same.
           | 
           | 3. Human errors are often things other humans can anticipate
           | or simulate, and are accustomed to doing so.
           | 
           | > friction
           | 
           | Which would be one more item:
           | 
           | 4. An X% error rate at a volume limited by human action may
           | be acceptable, while an X% error rate at a much higher volume
           | could be exponentially more damaging.
           | 
           | _____________
           | 
           | "A computer lets you make more mistakes faster than any other
           | invention with the possible exceptions of handguns and
           | Tequila." --Mitch Ratcliffe
        
           | schrodinger wrote:
           | Another point -- in my experience, LLMs and humans tend to
           | fail in different ways, meaning that a human is likely to
           | catch an LLM's failure.
        
         | amelius wrote:
         | Yes, and "open weight" != "open source" for this reason.
        
           | touristtam wrote:
           | I can't believe that isn't at the forefront. Or that they
           | could call themselves OpenAI
        
       | uludag wrote:
       | I wonder if it would be feasible for an entity to eject certain
       | nonsense into the internet to such an extend that, at least for
       | certain cases degrades the performance or injects certain
       | vulnerabilities during pre-training.
       | 
       | Maybe as gains in LLM performance become smaller and smaller,
       | companies will resort to trying to poison the pre-training
       | dataset of competitors to degrade performance, especially on
       | certain benchmarks. This would be a pretty fascinating arms race
       | to observe.
        
       | gnerd00 wrote:
       | does this explain the incessant AI sales calls to my elderly
       | neighbor in California? "Hi, this is Amy. I am calling from
       | Medical Services. You have MediCal part A and B, right?"
        
       | irthomasthomas wrote:
       | This is why I am strongly opposed to using models that hide or
       | obfuscate their COT.
        
         | Philpax wrote:
         | That's not a guarantee, either:
         | https://www.anthropic.com/research/reasoning-models-dont-say...
        
       | Bluestein wrote:
       | This is the computer science equivalent of gain-of-function
       | research.-
        
       ___________________________________________________________________
       (page generated 2025-08-13 23:01 UTC)