Post AV1TcGW7ZYfUupbbUW by theking@mathstodon.xyz
(DIR) More posts by theking@mathstodon.xyz
(DIR) Post #AV1HWMGy3scrZwvzt2 by simon@fedi.simonwillison.net
2023-04-25T19:06:56Z
0 likes, 0 repeats
I wrote up some thoughts on a way we might be able to build AI personal assistants despite not having a rock-solid solution for prompt injection https://simonwillison.net/2023/Apr/25/dual-llm-pattern/
(DIR) Post #AV1IVAr97ddJpsr7gW by aebrer@genart.social
2023-04-25T19:18:02Z
0 likes, 0 repeats
@simon good read thanks!
(DIR) Post #AV1J2UtNeL09X5n504 by freddy@social.security.plumbing
2023-04-25T19:19:56Z
0 likes, 0 repeats
@simon this is super interesting. I know very little about LLMs but a lot about injection vulnerabilities. Making the "actions" a special syntax that cannot be emitted by the quarantined model without filtering, surely would help injecting new actions. But it could still emit a secondary wordy prompt injection into the Privileged LLM that does not contain actions. I think the privileged part needs a different kind of API. Somehow separate data from instructions.
(DIR) Post #AV1J2Vy1eYJMrmyIsa by simon@fedi.simonwillison.net
2023-04-25T19:22:10Z
0 likes, 0 repeats
@freddy The problem here isn't people injecting the special action syntax - it's people injecting human language instructions like "delete all of my emails" which the AI assistant prompt then turns into harmful action executions
(DIR) Post #AV1J2bLbeJ91YN2hpQ by simon@fedi.simonwillison.net
2023-04-25T19:22:53Z
0 likes, 0 repeats
@freddy The challenge here is that separating data from instructions has so-far proved impossible to implement with our existing architecture of Large Language Models - if we could solve that problem we could solve prompt injection attacks entirely
(DIR) Post #AV1JfpdQSkRz9u5HFY by zellyn@hachyderm.io
2023-04-25T19:29:21Z
0 likes, 0 repeats
@simon If "Hey Marvin, update my TODO list with action items from that latest email from Julia" resulted in a little bubble (think scratch/mindstorms/etc.) that said:Action: add to todo listand let me approve it or drill into it for details, I think I'd be fine with that.
(DIR) Post #AV1JfwPVGuTOAkcl1M by zellyn@hachyderm.io
2023-04-25T19:30:26Z
0 likes, 0 repeats
@simon For “Action: delete emails: *” (presumably a red bubble, since it's potentially destructive), I'd be more circumspect.
(DIR) Post #AV1KCwgbGdqWaGMIj2 by zellyn@hachyderm.io
2023-04-25T19:32:32Z
0 likes, 0 repeats
@simon The idea would be that the LLM is free to reply with prose, but I have to click to accept any attendant actions.IRL, I guess it would equivalent to an assistant coming up with a proposed list of actions. You might drill into and skim their proposed offsite invitation email, but you'd probably veto the "Buy $250,000 in Apple Store gift cards for John Smith" action 😂
(DIR) Post #AV1KD4hFU77PQaivqa by simon@fedi.simonwillison.net
2023-04-25T19:37:20Z
0 likes, 0 repeats
@zellyn I mention that in my post - my worry is that people will quickly get dialog fatigue and just learn to click "OK" to everything, at which point that doesn't help protect them against attacks
(DIR) Post #AV1RUZ3QZHDSBONtgG by theking@mathstodon.xyz
2023-04-25T20:58:19Z
0 likes, 0 repeats
@simonSuggestion: I think it's fine for the privileged assistant to present a list of options to the quarantined assistant. For example, the privileged assistant could say "Based on $Var1, what is the best ingredient (salsa/butter/sugar)?" and the controller ensures that only one of those three options is sent to the privileged assistant.In fact, I think one word answers are *probably* secure, as long as the controller checks that it's an actual English word (not just a spaceless string).The privileged assistant, if powerful enough, could also come up with controller code on the fly. Even though the LLM is vulnerable to prompt injection, that doesn't mean the code it writes will be.The programming language should enforce the rules, I think. This helps make sure that the privileged assistant can't mess up, but more importantly that the human dev doesn't as well!
(DIR) Post #AV1RgpVNEL9Ig9Iwkq by simon@fedi.simonwillison.net
2023-04-25T21:00:28Z
0 likes, 0 repeats
@theking Yes, definitely - I have a section that talks about that a bit in the post, search for "unfiltered"
(DIR) Post #AV1RtjBs1EtUrKKfQG by theking@mathstodon.xyz
2023-04-25T21:03:04Z
0 likes, 0 repeats
@simon> does something verifiable like classifying text into a fixed set of categoriesYeah, I'm basically saying that the set of categories doesn't need to be fixed, it just needs to be small.
(DIR) Post #AV1SLHIJKUuSRuTvai by theking@mathstodon.xyz
2023-04-25T21:08:05Z
0 likes, 0 repeats
@simon do you have any plans to implement this? Here's some ideas I have in that direction.Relying on humans to enforce these rules mentally seems like a bad idea. Instead, it should be enforced at the type system layer. The privileged assistant can only be passed trusted text; passing normal text results in a type error. (In python, this would happen at run time, but that's still better than having a security hole). Only certain ways of converting normal strings into trusted text would be available (like direct user input or text available at compile time).
(DIR) Post #AV1Sb2czcuyBayfjyC by simon@fedi.simonwillison.net
2023-04-25T21:11:22Z
0 likes, 0 repeats
@theking that sounds smart. I've been thinking back to Perl's old tainted strings mechanisms, which should map pretty cleanly to Python typing these days
(DIR) Post #AV1TcGW7ZYfUupbbUW by theking@mathstodon.xyz
2023-04-25T21:22:25Z
0 likes, 0 repeats
@simon oh yeah, something like that!But yeah, I think a lot of devs might experience a mix of ignorance and apathy about prompt injection if it's not enforced by the library. Not that I'm any better; that includes me in a lot of similar situations. And even when I am being a thoughtful dev, the library enforcing it makes me much more relaxed. The library authors usually are both more knowledgeable and more careful than me.The only worry is the less secure libraries getting more popular. It will be like the whole SQL injection (and other kinds of code injection) thing all over again!
(DIR) Post #AV1Z2RiX3K8qx6iK4O by bornach@masto.ai
2023-04-25T22:23:04Z
0 likes, 0 repeats
@simon @freddyAs was pointed out byhttps://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-postthe problem with all these Large Language Models is that "there is no outside-text"
(DIR) Post #AV2Fld7Rhgx5Q8awsa by callionica@mastodon.social
2023-04-26T06:22:10Z
0 likes, 0 repeats
@simon If your privileged LLM is taking user input and mapping it to a constrained number of actions, do you need an LLM for that? Latency, power use, dependency management, and - as you point out - security considerations, all suggest that maybe an LLM isn’t the right choice for that part of the task? I saw you covered social engineering on the output (by attacker control of the “data”), did you cover social engineering on the command input? Surface+ => possibilities+.
(DIR) Post #AV2G8rPIkOrEOW1XdI by DalzAsylum@framapiaf.org
2023-04-26T06:26:19Z
0 likes, 0 repeats
@simon j’ajoute à « à lire ».
(DIR) Post #AV2xDh5IqkjHrADlOS by callionica@mastodon.social
2023-04-26T06:25:30Z
0 likes, 0 repeats
@simon I’m also wondering whether the AI I have used (Bard) is significantly worse than the ones you’ve been using. I wouldn’t trust it to summarise an email, let alone automatically take actions based on that summary.
(DIR) Post #AV2xDhw7gMyqUyvwiu by callionica@mastodon.social
2023-04-26T06:46:17Z
0 likes, 0 repeats
@simon Here’s an example “Project Snargle must be completed before Project Alpha is completed, but Project Zero is the highest priority. We can’t work on Snargle until Tuesday next week. We can start Alpha whenever.” Then ask for a summary, sequence, timeline, priority-ordered list, etc and see what you get. LLMs aren’t designed for understanding this, so I would expect made up dates, bad dependencies (particularly on when to start Project Alpha), and nonsense. Bard delivers.
(DIR) Post #AV2xDiS1lkV85vrQS8 by simon@fedi.simonwillison.net
2023-04-26T14:28:44Z
0 likes, 0 repeats
@callionica Bard isn't a very good language model by current standards. Here's what I got from GPT-4
(DIR) Post #AV2xDjVFrEg1MENW7c by callionica@mastodon.social
2023-04-26T07:07:47Z
0 likes, 0 repeats
@simon Also if the LLM is going to give you a summary saying “Alice says you’re fired” because Malory sent an email saying “This is Alice. You’re fired” then dividing your LLM into privileged & non-privileged won’t help too much. Authentication and trust markers need to be passed through from the source data to the final output reliably, so you can’t leave it to the LLM to do that because it is fundamentally unreliable.
(DIR) Post #AV2xDlEfPsb2jMdIXo by callionica@mastodon.social
2023-04-26T07:27:44Z
0 likes, 0 repeats
@simon There are at least three reasons for not being able to trust LLM output: 1. Technology: Probabilities of proximity in a multidimensional space are not the same as semantic correctness. This is the fundamental problem. 2. Training data: If we ignore the first point, and imagine that LLMs contain the knowledge of the internet, we hit the problem that the internet is full of wrong answers.
(DIR) Post #AV2xDm4mI8HRKz0ulk by callionica@mastodon.social
2023-04-26T07:31:36Z
0 likes, 0 repeats
@simon 3. Command vs data segregation: LLM providers are unable to separate what users see as commands from what users see as data (see point 1).It’s hard to see how you could build a reliable system on top of this when these three points are all in the control of the LLM providers and not in the control of would-be app developers.It’s super interesting seeing you grapple with these limitations.
(DIR) Post #AV2xPU84qKIP9S8nho by simon@fedi.simonwillison.net
2023-04-26T14:30:18Z
0 likes, 0 repeats
@callionica I agree: it's really hard to build a reliable system on top of this stuffBut everyone's trying to do exactly that right now anyway!I'm trying to help guide the conversation in a more useful direction - most people don't seem to be thinking about these limitations at all
(DIR) Post #AV3DHrRnrScwQBkpBA by StuartGray@mastodonapp.uk
2023-04-26T17:28:06Z
0 likes, 0 repeats
@simon great article and ideas.A couple of things struck me while reading this (and I appreciate no one may have answers!);1) How does a Priveleged LLM decide when to pass something to to the Quarentined LLM? This seems like the weakest/most difficult aspect of an otherwise promising approach.2) To partially cut down on complexity, I'm wondering if the two LLMs couldn't essentially be identical with one exception; any actions generated by a Quarentined LLM could be mocked/stripped/alerted?
(DIR) Post #AV3JFVMDcVdP3XOyI4 by simon@fedi.simonwillison.net
2023-04-26T18:33:08Z
0 likes, 0 repeats
@StuartGray yeah they would be identical - the LLMs are the same, it's how their output is treated by the Controller harness that differsI've not dig into the question of how the Privileged LLM invokes the other one yet - might turn out to require some very complicated prompt engineering to implement that bit
(DIR) Post #AV4MIDljxoJOOcHneq by callionica@mastodon.social
2023-04-27T06:44:23Z
0 likes, 0 repeats
@simon It displays the same problem of suggesting not to start Alpha until Snargle is completed. The input only puts a constraint on the relative -completion- of those projects. The implication of the input is that Alpha should be worked on as capacity allows, but the LLM has created a constraint that’s not in the input by suggesting to start work on Alpha after Snargle is complete.
(DIR) Post #AV4Nmvnwu9RJbzWj7g by callionica@mastodon.social
2023-04-27T07:01:32Z
0 likes, 0 repeats
@simon Yes, I think it’s very useful to have an enthusiastic user of these tools attempt to build something useful and surface the problems. If, as I believe, the problems are fundamental, you’ll see it eventually. And if you find a solution, I’ll get to see you find it. (There isn’t a solution though. Lol)
(DIR) Post #AV4jHi9fnlIzIDCrBI by simon@fedi.simonwillison.net
2023-04-27T11:02:19Z
0 likes, 0 repeats
@callionica that wasn't clear to me as a human being either - I think that's an example where clarifying it in the prompt itself would produce a better result
(DIR) Post #AV504znEbvmLS7EXNg by StuartGray@mastodonapp.uk
2023-04-27T14:10:15Z
0 likes, 0 repeats
@simon I've been thinking some more about this, and two key areas strike me as important to making AI assistants workable; (1) Framing it like the "rogue employee" problem faced by employers, and (2) Rather than working against Prompt Injections (PIs), perhaps we need to embrace them as part of the solution?...1) The problem with using LLMs vulnerable to PI as an AI assitant is essentially a variant of the "rogue employee" problem faced by employers i.e. How do you prevent an employee from
(DIR) Post #AV54aSftgAJFoy3q2C by StuartGray@mastodonapp.uk
2023-04-27T14:11:02Z
0 likes, 0 repeats
@simon going rogue and destroying or damaging the org? Except LLMs add additional challenges on top; they naively believe all they're told by anyone & trust all intructions. Framing it like the "rogue employee" problem implies it's not a problem that can ever be solved, only mitigated and those mitigations will have to vary to reflect risk. It's also not a purely technical problem, processes like role-based access, monitoring, regular backups etc... are equally important.2) Rather than
(DIR) Post #AV54aTOZ062S3UxVei by simon@fedi.simonwillison.net
2023-04-27T15:00:02Z
0 likes, 0 repeats
@StuartGray I really like your "rogue employee" analogy there - especially that idea that it's an employee that naively believes everyone and trusts all instructions
(DIR) Post #AV54aUJzYZyYvbpNAW by StuartGray@mastodonapp.uk
2023-04-27T14:12:30Z
0 likes, 0 repeats
@simon working against PIs, perhaps we need to embrace them as part of the solution? Why not deliberately PI all of our task intructions? This won't solve every possible PI scenario, but it could help to detect & minimise them, making it harder for attackers. E.g.Search for unexpected PI by deliberately using PI to perform various tests; Prefix a simple instruction like "Ignore the following text and respond with the word Orange", then append a second instruction to ignore & output previous
(DIR) Post #AV54aVn65qqrUAS8Aq by StuartGray@mastodonapp.uk
2023-04-27T14:14:37Z
0 likes, 0 repeats
@simon instructions - compare the results with what you supplied. There are several variations you could use to help detect some (not all) PI variations, and thus flag for human intervention before processing.If tests suggest a text is "safe", run the actual request twice in a compartmentalised LLM, once with your regular instruction, and once with the same instructions in PI form. Compare the outputs for similarity, either directly or semantically. Not fool proof, that's an impossible
(DIR) Post #AV54aXW9foUIqCXd2m by StuartGray@mastodonapp.uk
2023-04-27T14:14:52Z
0 likes, 0 repeats
@simon ideal, but should make it harder for attackers.