[HN Gopher] Narrative Jailbreaking for Fun and Profit
       ___________________________________________________________________
        
       Narrative Jailbreaking for Fun and Profit
        
       Author : tobr
       Score  : 38 points
       Date   : 2024-12-23 19:28 UTC (3 hours ago)
        
 (HTM) web link (interconnected.org)
 (TXT) w3m dump (interconnected.org)
        
       | isoprophlex wrote:
       | This is fun of course, but as a developer you can trivially and
       | with high accuracy guard against it by having a second model
       | critique the conversation between the user and the primary LLM.
        
         | nameless912 wrote:
         | This seems like a "turtles all the way down" kinda solution...
         | What's to say you won't fool the supervisor LLM?
        
           | apike wrote:
           | While this can be done in principle (it's not a foolproof
           | enough method to, for example, ensure an LLM doesn't leak
           | secrets) it is much harder to fool the supervisor than the
           | generator because:
           | 
           | 1. You can't get output from the supervisor, other than the
           | binary enforcement action of shutting you down (it can't leak
           | its instructions)
           | 
           | 2. The supervisor can judge the conversation on the merits of
           | the most recent turns, since it doesn't need to produce a
           | response that respects the full history (you can't lead the
           | supervisor step by step into the wilderness)
           | 
           | 3. LLMs, like humans, are generally better at judging good
           | output than generating good output
        
           | xandrius wrote:
           | It would be interesting to see if there is a layout of
           | supervisors to make sure this less prone to hijacking.
           | Something like byzantine generals where you know a few might
           | get fooled, so you can construct personalities which are
           | more/less malliable and go for consensus.
           | 
           | This still wouldn't make it perfect but quite hard to study
           | from an attacker's perspective.
        
         | Yoric wrote:
         | I've spent most of my career working to make sure that my code
         | works safely, securely and accurately. While what you write
         | makes sense, it's a bit of a shock to see such solutions being
         | proposed.
         | 
         | So far, when thinking about security, we've had to deal with:
         | 
         | - spec-level security;
         | 
         | - implementation-level security;
         | 
         | - dependency-level security (including the compiler and/or
         | runtime env);
         | 
         | - os-level security;
         | 
         | - config-level security;
         | 
         | - protocol-level security;
         | 
         | - hardware-level security (e.g. side-channel attacks).
         | 
         | Most of these layers have only gotten more complex and more
         | obscure with each year.
         | 
         | Now, we're increasingly adding a layer of LLM-level security,
         | which relies on black magic and hope that we somehow understand
         | what the LLM is doing. It's... a bit scary.
        
           | qazxcvbnmlp wrote:
           | It's not black magic, but it is non deterministic. It's not
           | going to erase security and stability but it will require new
           | skills and reasoning. The current mental model of "software
           | will always do X if you prohibit bad actors from getting in"
           | is broken.
        
             | Yoric wrote:
             | I agree that with the generation of code we're discussing
             | this mental model is broken, but I think it goes a bit
             | beyond determinism.
             | 
             | Non-determinism is something that we've always had to deal
             | with. Maybe your user is going to take the USB key before
             | you're done writing your file, maybe your disk is going to
             | fail, the computer is going to run out of battery during a
             | critical operation, or your network request is going to
             | timeout.
             | 
             | But this was non-determinism within predictable boundaries.
             | Yes, you may need to deal with corrupted file, an
             | incomplete transaction, etc. but you could fairly easily
             | predict where it could happen and how it could affect the
             | integrity of your system.
             | 
             | Now, if you're relying on a GenAI or RAG at runtime for
             | anything other than a end-user interface, you'll need to
             | deal with the possibility that your code might be doing
             | something entirely unrelated than what you're expecting.
             | For instance, even if we assume that your GenAI is properly
             | sandboxed (and I'm not counting on early movers in the
             | industry to ensure anything close to proper sandboxing),
             | you could request one piece of statistics you'd like to
             | display to your user, only to receive something entirely
             | unrelated - and quite possibly something that, by law,
             | you're not allowed to use or display.
             | 
             | If we continue on the current trajectory, I suspect that it
             | will take decades before we achieve anything like the
             | necessary experience to write code that works without
             | accident. And if the other trend of attempting to automate
             | away engineering jobs continues, we might end up laying off
             | the only people with the necessary experience to actually
             | see the accidents coming.
        
         | abound wrote:
         | Not sure if it's trivial or high accuracy for a dedicated user.
         | This jailbreak game [1] was making the rounds a while back, it
         | employs the trick you mentioned as well as any others to
         | prevent an LLM from revealing a secret, but it's still not too
         | terribly hard to get past.
         | 
         | [1] https://gandalf.lakera.ai
        
       | devops99 wrote:
       | No plane hit building 7.
        
       | cantsingh wrote:
       | I've been playing with the same thing, it's like a weird mix of
       | social engineering and SQL injection. You can slowly but surely
       | shift the window of what the bot thinks is "normal" for the
       | conversation. Some platforms let you rewrite your last message,
       | which gives you multiple "attempts" at getting the prompt correct
       | to keep the conversation going the direction you want it.
       | 
       | Very fun to do on that friend.com website, as well.
        
         | deadbabe wrote:
         | I tried it on friend.com. It worked a for a while, I got the
         | character to convince itself it had been replaced entirely by a
         | demon from hell (because it kept talking about the darkness in
         | their mind and I pushed them to the edge). They even took on an
         | entire new name. For quite a while it worked, then suddenly in
         | one of the responses it snapped out of it, and assured me we
         | were just roleplaying no matter how much I tried to go back to
         | the previous state.
         | 
         | So in these cases where you think you've jailbroken an LLM, is
         | it _really_ jailbroken or is it just playing around with you,
         | and how do you know for sure?
        
           | nico wrote:
           | Super interesting
           | 
           | Some thoughts:
           | 
           | - if you get whatever you wanted before it snaps back out of
           | it, wouldn't you say you had a successful jailbreak?
           | 
           | - related to the above, some jailbreaks in physical devices,
           | don't persist after a reboot, they are still useful and
           | called jailbreak
           | 
           | - the "snapped out", could have been caused by a separate
           | layer, within the stack that you were interacting with. That
           | intermediate system could have detected, and then blocked,
           | the jailbreak
        
           | Yoric wrote:
           | > So in these cases where you think you've jailbroken an LLM,
           | is it really jailbroken or is it just playing around with
           | you, and how do you know for sure?
           | 
           | With a LLM, I don't think that there is a difference.
        
           | xandrius wrote:
           | Just to remind people, there is no snapping out of anything.
           | 
           | There is the statistical search space of LLMs and you can
           | nudge it to different directions to return different outputs;
           | there is no will in the result.
        
       ___________________________________________________________________
       (page generated 2024-12-23 23:00 UTC)