[HN Gopher] The Dual LLM pattern for building AI assistants that...
___________________________________________________________________
The Dual LLM pattern for building AI assistants that can resist
prompt injection
Author : simonw
Score : 57 points
Date : 2023-05-13 05:08 UTC (1 days ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| montebicyclelo wrote:
| Here's "jailbreak detection", in the NeMo-Guardrails project from
| Nvidia:
|
| https://github.com/NVIDIA/NeMo-Guardrails/blob/327da8a42d5f8...
|
| I.e. they ask the llm if the prompt will break the llm. (I
| believe that more data /some evaluation on how well this performs
| is intended to be released. Probably fair to call this stuff "not
| battle tested".)
| fnordpiglet wrote:
| It feels like an LLM classifying the prompts without cumulative
| context as well as the prompt output from the LLM would be pretty
| effective. Like in the human mind, with its varying levels of
| judgement and thought, it may be a case of multiple LLMs watching
| the overall process.
| SeriousGamesKit wrote:
| Thanks SimonW! I've really enjoyed your series on this problem on
| HN and on your blog. I've seen suggestions elsewhere about
| tokenising fixed prompt instructions differently to user input to
| distinguish them internally, and wanted to ask for your take on
| this concept- do you think this is likely to improve the state of
| play regarding prompt injection, applied either to a one-LLM or
| two-LLM setup?
| Vanit wrote:
| I still don't believe that in the long term it will be tenable to
| bootstrap LLMs using prompts (or at least via the same vector as
| your users).
| SheinhardtWigCo wrote:
| Is it possible that all but the most exotic prompt injection
| attacks end up being mitigated automatically over time, by virtue
| of research and discussion on prompt injection being included in
| training sets for future models?
| jameshart wrote:
| By the same logic, humans should no longer fall for phishing
| scams or buy timeshares since information about them is widely
| available.
| tomohelix wrote:
| Most well-educated people won't. A well trained AI can behave
| pretty close to a well-educated person in common sense.
| SheinhardtWigCo wrote:
| I'd say it's not the same thing, because most humans don't
| have an encyclopedic knowledge of past scams, and are not
| primed to watch out for them 24/7. LLMs don't have either of
| these problems.
|
| An interesting question is whether GPT-4 would fall for a
| phishing scam or try to buy a timeshare if you gave it an
| explicit instruction to avoid being scammed.
| danpalmer wrote:
| I sort of disagree that LLMs don't have the same pitfalls.
| LLMs aren't recording everything they are trained with,
| like humans, the training data affects a general
| behavioural model. When answering, they aren't looking up
| information.
|
| As for being "primed", I think the difference between
| training, fine tuning, and prompting, is the closest
| equivalent. They may have been trained with anti-scam
| information, but they probably haven't been fine tuned to
| deal with scams, and then haven't been prompted to look out
| for them. A human who isn't expecting a scam in a given
| conversation is much less likely to notice it than one who
| is asked to find the scam.
|
| Lastly, scams often work by essentially pattern matching
| behaviour to things we want to do. Like taking advantage of
| peoples willingness to help. I suspect LLMs would be far
| more susceptible to this sort of thing because you only
| have to effectively pattern match one thing: language. If
| the language of the scam triggers the same "thought"
| patterns as the language of a legitimate conversation, then
| it'll work.
|
| To avoid all of this I think will require explicit
| instruction in fine tuning or prompts, but so does
| everything, and if we train for everything then we're back
| to square one with relative priorities.
| amelius wrote:
| The one thing that will solve this problem is when AI assistants
| will actually become intelligent.
| amrb wrote:
| So we just recreated all of the previous SQL injection security
| issues in LLM's, fun times
| inopinatus wrote:
| This is technocrat hubris. Congruent vulnerabilities exist in
| humans, for which it's a solved problem. Go re-read _The Phoenix
| Project_.
|
| The real world solution is to use auditors, and ensure that most
| operations are reversible.
| fooker wrote:
| This is avoiding the core problem (mingling control and data)
| with security through obscurity.
|
| That can be an effective solution, but it's important to
| recognize it as such.
| rst wrote:
| It's avoiding the problem by separating control and data, at
| unknown but signficant cost to functionality (the LLM which
| determines what tools get invoked doesn't see the actual data
| or results, only opaque tokens that refer to them, so it can't
| use them directly to make choices). I'm not sure how that
| qualifies as "security by obscurity".
___________________________________________________________________
(page generated 2023-05-14 23:00 UTC)