[HN Gopher] Narrative Jailbreaking for Fun and Profit
___________________________________________________________________
Narrative Jailbreaking for Fun and Profit
Author : tobr
Score : 38 points
Date : 2024-12-23 19:28 UTC (3 hours ago)
(HTM) web link (interconnected.org)
(TXT) w3m dump (interconnected.org)
| isoprophlex wrote:
| This is fun of course, but as a developer you can trivially and
| with high accuracy guard against it by having a second model
| critique the conversation between the user and the primary LLM.
| nameless912 wrote:
| This seems like a "turtles all the way down" kinda solution...
| What's to say you won't fool the supervisor LLM?
| apike wrote:
| While this can be done in principle (it's not a foolproof
| enough method to, for example, ensure an LLM doesn't leak
| secrets) it is much harder to fool the supervisor than the
| generator because:
|
| 1. You can't get output from the supervisor, other than the
| binary enforcement action of shutting you down (it can't leak
| its instructions)
|
| 2. The supervisor can judge the conversation on the merits of
| the most recent turns, since it doesn't need to produce a
| response that respects the full history (you can't lead the
| supervisor step by step into the wilderness)
|
| 3. LLMs, like humans, are generally better at judging good
| output than generating good output
| xandrius wrote:
| It would be interesting to see if there is a layout of
| supervisors to make sure this less prone to hijacking.
| Something like byzantine generals where you know a few might
| get fooled, so you can construct personalities which are
| more/less malliable and go for consensus.
|
| This still wouldn't make it perfect but quite hard to study
| from an attacker's perspective.
| Yoric wrote:
| I've spent most of my career working to make sure that my code
| works safely, securely and accurately. While what you write
| makes sense, it's a bit of a shock to see such solutions being
| proposed.
|
| So far, when thinking about security, we've had to deal with:
|
| - spec-level security;
|
| - implementation-level security;
|
| - dependency-level security (including the compiler and/or
| runtime env);
|
| - os-level security;
|
| - config-level security;
|
| - protocol-level security;
|
| - hardware-level security (e.g. side-channel attacks).
|
| Most of these layers have only gotten more complex and more
| obscure with each year.
|
| Now, we're increasingly adding a layer of LLM-level security,
| which relies on black magic and hope that we somehow understand
| what the LLM is doing. It's... a bit scary.
| qazxcvbnmlp wrote:
| It's not black magic, but it is non deterministic. It's not
| going to erase security and stability but it will require new
| skills and reasoning. The current mental model of "software
| will always do X if you prohibit bad actors from getting in"
| is broken.
| Yoric wrote:
| I agree that with the generation of code we're discussing
| this mental model is broken, but I think it goes a bit
| beyond determinism.
|
| Non-determinism is something that we've always had to deal
| with. Maybe your user is going to take the USB key before
| you're done writing your file, maybe your disk is going to
| fail, the computer is going to run out of battery during a
| critical operation, or your network request is going to
| timeout.
|
| But this was non-determinism within predictable boundaries.
| Yes, you may need to deal with corrupted file, an
| incomplete transaction, etc. but you could fairly easily
| predict where it could happen and how it could affect the
| integrity of your system.
|
| Now, if you're relying on a GenAI or RAG at runtime for
| anything other than a end-user interface, you'll need to
| deal with the possibility that your code might be doing
| something entirely unrelated than what you're expecting.
| For instance, even if we assume that your GenAI is properly
| sandboxed (and I'm not counting on early movers in the
| industry to ensure anything close to proper sandboxing),
| you could request one piece of statistics you'd like to
| display to your user, only to receive something entirely
| unrelated - and quite possibly something that, by law,
| you're not allowed to use or display.
|
| If we continue on the current trajectory, I suspect that it
| will take decades before we achieve anything like the
| necessary experience to write code that works without
| accident. And if the other trend of attempting to automate
| away engineering jobs continues, we might end up laying off
| the only people with the necessary experience to actually
| see the accidents coming.
| abound wrote:
| Not sure if it's trivial or high accuracy for a dedicated user.
| This jailbreak game [1] was making the rounds a while back, it
| employs the trick you mentioned as well as any others to
| prevent an LLM from revealing a secret, but it's still not too
| terribly hard to get past.
|
| [1] https://gandalf.lakera.ai
| devops99 wrote:
| No plane hit building 7.
| cantsingh wrote:
| I've been playing with the same thing, it's like a weird mix of
| social engineering and SQL injection. You can slowly but surely
| shift the window of what the bot thinks is "normal" for the
| conversation. Some platforms let you rewrite your last message,
| which gives you multiple "attempts" at getting the prompt correct
| to keep the conversation going the direction you want it.
|
| Very fun to do on that friend.com website, as well.
| deadbabe wrote:
| I tried it on friend.com. It worked a for a while, I got the
| character to convince itself it had been replaced entirely by a
| demon from hell (because it kept talking about the darkness in
| their mind and I pushed them to the edge). They even took on an
| entire new name. For quite a while it worked, then suddenly in
| one of the responses it snapped out of it, and assured me we
| were just roleplaying no matter how much I tried to go back to
| the previous state.
|
| So in these cases where you think you've jailbroken an LLM, is
| it _really_ jailbroken or is it just playing around with you,
| and how do you know for sure?
| nico wrote:
| Super interesting
|
| Some thoughts:
|
| - if you get whatever you wanted before it snaps back out of
| it, wouldn't you say you had a successful jailbreak?
|
| - related to the above, some jailbreaks in physical devices,
| don't persist after a reboot, they are still useful and
| called jailbreak
|
| - the "snapped out", could have been caused by a separate
| layer, within the stack that you were interacting with. That
| intermediate system could have detected, and then blocked,
| the jailbreak
| Yoric wrote:
| > So in these cases where you think you've jailbroken an LLM,
| is it really jailbroken or is it just playing around with
| you, and how do you know for sure?
|
| With a LLM, I don't think that there is a difference.
| xandrius wrote:
| Just to remind people, there is no snapping out of anything.
|
| There is the statistical search space of LLMs and you can
| nudge it to different directions to return different outputs;
| there is no will in the result.
___________________________________________________________________
(page generated 2024-12-23 23:00 UTC)