[HN Gopher] Narrative Jailbreaking for Fun and Profit
___________________________________________________________________
Narrative Jailbreaking for Fun and Profit
Author : tobr
Score : 98 points
Date : 2024-12-23 19:28 UTC (1 days ago)
(HTM) web link (interconnected.org)
(TXT) w3m dump (interconnected.org)
| isoprophlex wrote:
| This is fun of course, but as a developer you can trivially and
| with high accuracy guard against it by having a second model
| critique the conversation between the user and the primary LLM.
| nameless912 wrote:
| This seems like a "turtles all the way down" kinda solution...
| What's to say you won't fool the supervisor LLM?
| apike wrote:
| While this can be done in principle (it's not a foolproof
| enough method to, for example, ensure an LLM doesn't leak
| secrets) it is much harder to fool the supervisor than the
| generator because:
|
| 1. You can't get output from the supervisor, other than the
| binary enforcement action of shutting you down (it can't leak
| its instructions)
|
| 2. The supervisor can judge the conversation on the merits of
| the most recent turns, since it doesn't need to produce a
| response that respects the full history (you can't lead the
| supervisor step by step into the wilderness)
|
| 3. LLMs, like humans, are generally better at judging good
| output than generating good output
| xandrius wrote:
| It would be interesting to see if there is a layout of
| supervisors to make sure this less prone to hijacking.
| Something like byzantine generals where you know a few might
| get fooled, so you can construct personalities which are
| more/less malliable and go for consensus.
|
| This still wouldn't make it perfect but quite hard to study
| from an attacker's perspective.
| ConspiracyFact wrote:
| "Who will watch the watchers?"
|
| There is no good answer--I agree with you about the infinite
| regress--but there is a counter: the first term of the
| regress often offers a huge improvement over zero terms, even
| if perfection isn't achieved with any finite number of terms.
|
| Who will stop the government from oppressing the people?
| There's no good answer to this either, but some rudimentary
| form of government--a single term in the regress--is much
| better than pure anarchy. (Of course, anarchists will
| disagree, but that's beside the point.)
|
| Who's to say that my C compiler isn't designed to inject
| malware into every program I write, in a non-detectable way
| ("trusting trust")? No one, but doing a code review is far
| better than doing nothing.
|
| What if the md5sum value itself is corrupted during data
| transfer? Possible, but we'll still catch 99.9999% of cases
| of data corruption using checksums.
|
| Etc., etc.
| Yoric wrote:
| I've spent most of my career working to make sure that my code
| works safely, securely and accurately. While what you write
| makes sense, it's a bit of a shock to see such solutions being
| proposed.
|
| So far, when thinking about security, we've had to deal with:
|
| - spec-level security;
|
| - implementation-level security;
|
| - dependency-level security (including the compiler and/or
| runtime env);
|
| - os-level security;
|
| - config-level security;
|
| - protocol-level security;
|
| - hardware-level security (e.g. side-channel attacks).
|
| Most of these layers have only gotten more complex and more
| obscure with each year.
|
| Now, we're increasingly adding a layer of LLM-level security,
| which relies on black magic and hope that we somehow understand
| what the LLM is doing. It's... a bit scary.
| qazxcvbnmlp wrote:
| It's not black magic, but it is non deterministic. It's not
| going to erase security and stability but it will require new
| skills and reasoning. The current mental model of "software
| will always do X if you prohibit bad actors from getting in"
| is broken.
| Yoric wrote:
| I agree that with the generation of code we're discussing
| this mental model is broken, but I think it goes a bit
| beyond determinism.
|
| Non-determinism is something that we've always had to deal
| with. Maybe your user is going to take the USB key before
| you're done writing your file, maybe your disk is going to
| fail, the computer is going to run out of battery during a
| critical operation, or your network request is going to
| timeout.
|
| But this was non-determinism within predictable boundaries.
| Yes, you may need to deal with corrupted file, an
| incomplete transaction, etc. but you could fairly easily
| predict where it could happen and how it could affect the
| integrity of your system.
|
| Now, if you're relying on a GenAI or RAG at runtime for
| anything other than a end-user interface, you'll need to
| deal with the possibility that your code might be doing
| something entirely unrelated than what you're expecting.
| For instance, even if we assume that your GenAI is properly
| sandboxed (and I'm not counting on early movers in the
| industry to ensure anything close to proper sandboxing),
| you could request one piece of statistics you'd like to
| display to your user, only to receive something entirely
| unrelated - and quite possibly something that, by law,
| you're not allowed to use or display.
|
| If we continue on the current trajectory, I suspect that it
| will take decades before we achieve anything like the
| necessary experience to write code that works without
| accident. And if the other trend of attempting to automate
| away engineering jobs continues, we might end up laying off
| the only people with the necessary experience to actually
| see the accidents coming.
| schoen wrote:
| > But this was non-determinism within predictable
| boundaries. Yes, you may need to deal with corrupted
| file, an incomplete transaction, etc. but you could
| fairly easily predict where it could happen and how it
| could affect the integrity of your system.
|
| There's been lots of interesting computer security
| research relying on aspects of the physical instantiation
| of software systems (like side channel attacks where
| something physical could be measured in order to reveal
| secret state, or fault injection attacks where an
| attacker could apply heat or radiation in order to make
| the CPU or memory violate its specifications
| occasionally). These attacks fortunately aren't always
| applicable because there aren't always attackers in a
| position to carry them out, but where they are
| applicable, they could be very powerful!
| Yoric wrote:
| Yeah, I briefly mentioned those in my previous message :)
| abound wrote:
| Not sure if it's trivial or high accuracy for a dedicated user.
| This jailbreak game [1] was making the rounds a while back, it
| employs the trick you mentioned as well as any others to
| prevent an LLM from revealing a secret, but it's still not too
| terribly hard to get past.
|
| [1] https://gandalf.lakera.ai
| cantsingh wrote:
| I've been playing with the same thing, it's like a weird mix of
| social engineering and SQL injection. You can slowly but surely
| shift the window of what the bot thinks is "normal" for the
| conversation. Some platforms let you rewrite your last message,
| which gives you multiple "attempts" at getting the prompt correct
| to keep the conversation going the direction you want it.
|
| Very fun to do on that friend.com website, as well.
| deadbabe wrote:
| I tried it on friend.com. It worked a for a while, I got the
| character to convince itself it had been replaced entirely by a
| demon from hell (because it kept talking about the darkness in
| their mind and I pushed them to the edge). They even took on an
| entire new name. For quite a while it worked, then suddenly in
| one of the responses it snapped out of it, and assured me we
| were just roleplaying no matter how much I tried to go back to
| the previous state.
|
| So in these cases where you think you've jailbroken an LLM, is
| it _really_ jailbroken or is it just playing around with you,
| and how do you know for sure?
| nico wrote:
| Super interesting
|
| Some thoughts:
|
| - if you get whatever you wanted before it snaps back out of
| it, wouldn't you say you had a successful jailbreak?
|
| - related to the above, some jailbreaks in physical devices,
| don't persist after a reboot, they are still useful and
| called jailbreak
|
| - the "snapped out", could have been caused by a separate
| layer, within the stack that you were interacting with. That
| intermediate system could have detected, and then blocked,
| the jailbreak
| Yoric wrote:
| > So in these cases where you think you've jailbroken an LLM,
| is it really jailbroken or is it just playing around with
| you, and how do you know for sure?
|
| With a LLM, I don't think that there is a difference.
| Terr_ wrote:
| I like to think of it as a amazing document autocomplete
| being applied to a movie script, which we take turns
| appending to.
|
| There is only a generator doing generator things,
| everything else--including the characters that appear in
| the story--are mostly in the eye of the beholder. If you
| insult the computer, it doesn't decide it hates you, it
| simply decides that a character saying mean things back to
| you would be most fitting for the next line of the
| document.
| xandrius wrote:
| Just to remind people, there is no snapping out of anything.
|
| There is the statistical search space of LLMs and you can
| nudge it to different directions to return different outputs;
| there is no will in the result.
| ta8645 wrote:
| Isn't the same true for humans? Most of us stay in the same
| statistical search space for large chunks of our lives, all
| but sleepwalking through the daily drudgery.
| 1659447091 wrote:
| No, humans have autonomy.
| gopher_space wrote:
| In a big picture sense. Probably more correct to say that
| some humans have autonomy some of the time.
|
| My go-to example is being able to steer the pedestrian in
| front of you by making audible footsteps to either side
| of their center.
| 1659447091 wrote:
| The pedestrian in front of you has the choice to be
| steered or to ignore you--or more unexpected actions.
| Which ever they choose has nothing to do with the person
| behind them taking away their autonomy and everything to
| do with what they felt like doing with it at the time.
| Just because the wants of the person behind them and
| willingness & aweness and choice of the person in front
| align with those wants does not take away the forward
| person's self governance.
| 01HNNWZ0MV43FF wrote:
| What would a human without autonomy look like?
| squillion wrote:
| Very cool! That's _hypnosis_ , if we want to insist with the
| psychological metaphors.
|
| > If you run on the conversation the right way, you can become
| their internal monologue.
|
| That's what hypnosis in people is about, according to some:
| taking over someone else's monologue.
| spiritplumber wrote:
| Yeah, I got chatgpt to help me write a yaoi story between an
| interdimensional terrorist and a plant-being starship captain (If
| you recognize the latter, no, it's not what you think, either).
|
| It's actually not hard.
___________________________________________________________________
(page generated 2024-12-24 23:02 UTC)