[HN Gopher] Narrative Jailbreaking for Fun and Profit
       ___________________________________________________________________
        
       Narrative Jailbreaking for Fun and Profit
        
       Author : tobr
       Score  : 98 points
       Date   : 2024-12-23 19:28 UTC (1 days ago)
        
 (HTM) web link (interconnected.org)
 (TXT) w3m dump (interconnected.org)
        
       | isoprophlex wrote:
       | This is fun of course, but as a developer you can trivially and
       | with high accuracy guard against it by having a second model
       | critique the conversation between the user and the primary LLM.
        
         | nameless912 wrote:
         | This seems like a "turtles all the way down" kinda solution...
         | What's to say you won't fool the supervisor LLM?
        
           | apike wrote:
           | While this can be done in principle (it's not a foolproof
           | enough method to, for example, ensure an LLM doesn't leak
           | secrets) it is much harder to fool the supervisor than the
           | generator because:
           | 
           | 1. You can't get output from the supervisor, other than the
           | binary enforcement action of shutting you down (it can't leak
           | its instructions)
           | 
           | 2. The supervisor can judge the conversation on the merits of
           | the most recent turns, since it doesn't need to produce a
           | response that respects the full history (you can't lead the
           | supervisor step by step into the wilderness)
           | 
           | 3. LLMs, like humans, are generally better at judging good
           | output than generating good output
        
           | xandrius wrote:
           | It would be interesting to see if there is a layout of
           | supervisors to make sure this less prone to hijacking.
           | Something like byzantine generals where you know a few might
           | get fooled, so you can construct personalities which are
           | more/less malliable and go for consensus.
           | 
           | This still wouldn't make it perfect but quite hard to study
           | from an attacker's perspective.
        
           | ConspiracyFact wrote:
           | "Who will watch the watchers?"
           | 
           | There is no good answer--I agree with you about the infinite
           | regress--but there is a counter: the first term of the
           | regress often offers a huge improvement over zero terms, even
           | if perfection isn't achieved with any finite number of terms.
           | 
           | Who will stop the government from oppressing the people?
           | There's no good answer to this either, but some rudimentary
           | form of government--a single term in the regress--is much
           | better than pure anarchy. (Of course, anarchists will
           | disagree, but that's beside the point.)
           | 
           | Who's to say that my C compiler isn't designed to inject
           | malware into every program I write, in a non-detectable way
           | ("trusting trust")? No one, but doing a code review is far
           | better than doing nothing.
           | 
           | What if the md5sum value itself is corrupted during data
           | transfer? Possible, but we'll still catch 99.9999% of cases
           | of data corruption using checksums.
           | 
           | Etc., etc.
        
         | Yoric wrote:
         | I've spent most of my career working to make sure that my code
         | works safely, securely and accurately. While what you write
         | makes sense, it's a bit of a shock to see such solutions being
         | proposed.
         | 
         | So far, when thinking about security, we've had to deal with:
         | 
         | - spec-level security;
         | 
         | - implementation-level security;
         | 
         | - dependency-level security (including the compiler and/or
         | runtime env);
         | 
         | - os-level security;
         | 
         | - config-level security;
         | 
         | - protocol-level security;
         | 
         | - hardware-level security (e.g. side-channel attacks).
         | 
         | Most of these layers have only gotten more complex and more
         | obscure with each year.
         | 
         | Now, we're increasingly adding a layer of LLM-level security,
         | which relies on black magic and hope that we somehow understand
         | what the LLM is doing. It's... a bit scary.
        
           | qazxcvbnmlp wrote:
           | It's not black magic, but it is non deterministic. It's not
           | going to erase security and stability but it will require new
           | skills and reasoning. The current mental model of "software
           | will always do X if you prohibit bad actors from getting in"
           | is broken.
        
             | Yoric wrote:
             | I agree that with the generation of code we're discussing
             | this mental model is broken, but I think it goes a bit
             | beyond determinism.
             | 
             | Non-determinism is something that we've always had to deal
             | with. Maybe your user is going to take the USB key before
             | you're done writing your file, maybe your disk is going to
             | fail, the computer is going to run out of battery during a
             | critical operation, or your network request is going to
             | timeout.
             | 
             | But this was non-determinism within predictable boundaries.
             | Yes, you may need to deal with corrupted file, an
             | incomplete transaction, etc. but you could fairly easily
             | predict where it could happen and how it could affect the
             | integrity of your system.
             | 
             | Now, if you're relying on a GenAI or RAG at runtime for
             | anything other than a end-user interface, you'll need to
             | deal with the possibility that your code might be doing
             | something entirely unrelated than what you're expecting.
             | For instance, even if we assume that your GenAI is properly
             | sandboxed (and I'm not counting on early movers in the
             | industry to ensure anything close to proper sandboxing),
             | you could request one piece of statistics you'd like to
             | display to your user, only to receive something entirely
             | unrelated - and quite possibly something that, by law,
             | you're not allowed to use or display.
             | 
             | If we continue on the current trajectory, I suspect that it
             | will take decades before we achieve anything like the
             | necessary experience to write code that works without
             | accident. And if the other trend of attempting to automate
             | away engineering jobs continues, we might end up laying off
             | the only people with the necessary experience to actually
             | see the accidents coming.
        
               | schoen wrote:
               | > But this was non-determinism within predictable
               | boundaries. Yes, you may need to deal with corrupted
               | file, an incomplete transaction, etc. but you could
               | fairly easily predict where it could happen and how it
               | could affect the integrity of your system.
               | 
               | There's been lots of interesting computer security
               | research relying on aspects of the physical instantiation
               | of software systems (like side channel attacks where
               | something physical could be measured in order to reveal
               | secret state, or fault injection attacks where an
               | attacker could apply heat or radiation in order to make
               | the CPU or memory violate its specifications
               | occasionally). These attacks fortunately aren't always
               | applicable because there aren't always attackers in a
               | position to carry them out, but where they are
               | applicable, they could be very powerful!
        
               | Yoric wrote:
               | Yeah, I briefly mentioned those in my previous message :)
        
         | abound wrote:
         | Not sure if it's trivial or high accuracy for a dedicated user.
         | This jailbreak game [1] was making the rounds a while back, it
         | employs the trick you mentioned as well as any others to
         | prevent an LLM from revealing a secret, but it's still not too
         | terribly hard to get past.
         | 
         | [1] https://gandalf.lakera.ai
        
       | cantsingh wrote:
       | I've been playing with the same thing, it's like a weird mix of
       | social engineering and SQL injection. You can slowly but surely
       | shift the window of what the bot thinks is "normal" for the
       | conversation. Some platforms let you rewrite your last message,
       | which gives you multiple "attempts" at getting the prompt correct
       | to keep the conversation going the direction you want it.
       | 
       | Very fun to do on that friend.com website, as well.
        
         | deadbabe wrote:
         | I tried it on friend.com. It worked a for a while, I got the
         | character to convince itself it had been replaced entirely by a
         | demon from hell (because it kept talking about the darkness in
         | their mind and I pushed them to the edge). They even took on an
         | entire new name. For quite a while it worked, then suddenly in
         | one of the responses it snapped out of it, and assured me we
         | were just roleplaying no matter how much I tried to go back to
         | the previous state.
         | 
         | So in these cases where you think you've jailbroken an LLM, is
         | it _really_ jailbroken or is it just playing around with you,
         | and how do you know for sure?
        
           | nico wrote:
           | Super interesting
           | 
           | Some thoughts:
           | 
           | - if you get whatever you wanted before it snaps back out of
           | it, wouldn't you say you had a successful jailbreak?
           | 
           | - related to the above, some jailbreaks in physical devices,
           | don't persist after a reboot, they are still useful and
           | called jailbreak
           | 
           | - the "snapped out", could have been caused by a separate
           | layer, within the stack that you were interacting with. That
           | intermediate system could have detected, and then blocked,
           | the jailbreak
        
           | Yoric wrote:
           | > So in these cases where you think you've jailbroken an LLM,
           | is it really jailbroken or is it just playing around with
           | you, and how do you know for sure?
           | 
           | With a LLM, I don't think that there is a difference.
        
             | Terr_ wrote:
             | I like to think of it as a amazing document autocomplete
             | being applied to a movie script, which we take turns
             | appending to.
             | 
             | There is only a generator doing generator things,
             | everything else--including the characters that appear in
             | the story--are mostly in the eye of the beholder. If you
             | insult the computer, it doesn't decide it hates you, it
             | simply decides that a character saying mean things back to
             | you would be most fitting for the next line of the
             | document.
        
           | xandrius wrote:
           | Just to remind people, there is no snapping out of anything.
           | 
           | There is the statistical search space of LLMs and you can
           | nudge it to different directions to return different outputs;
           | there is no will in the result.
        
             | ta8645 wrote:
             | Isn't the same true for humans? Most of us stay in the same
             | statistical search space for large chunks of our lives, all
             | but sleepwalking through the daily drudgery.
        
               | 1659447091 wrote:
               | No, humans have autonomy.
        
               | gopher_space wrote:
               | In a big picture sense. Probably more correct to say that
               | some humans have autonomy some of the time.
               | 
               | My go-to example is being able to steer the pedestrian in
               | front of you by making audible footsteps to either side
               | of their center.
        
               | 1659447091 wrote:
               | The pedestrian in front of you has the choice to be
               | steered or to ignore you--or more unexpected actions.
               | Which ever they choose has nothing to do with the person
               | behind them taking away their autonomy and everything to
               | do with what they felt like doing with it at the time.
               | Just because the wants of the person behind them and
               | willingness & aweness and choice of the person in front
               | align with those wants does not take away the forward
               | person's self governance.
        
               | 01HNNWZ0MV43FF wrote:
               | What would a human without autonomy look like?
        
       | squillion wrote:
       | Very cool! That's _hypnosis_ , if we want to insist with the
       | psychological metaphors.
       | 
       | > If you run on the conversation the right way, you can become
       | their internal monologue.
       | 
       | That's what hypnosis in people is about, according to some:
       | taking over someone else's monologue.
        
       | spiritplumber wrote:
       | Yeah, I got chatgpt to help me write a yaoi story between an
       | interdimensional terrorist and a plant-being starship captain (If
       | you recognize the latter, no, it's not what you think, either).
       | 
       | It's actually not hard.
        
       ___________________________________________________________________
       (page generated 2024-12-24 23:02 UTC)