Post AVYJAIuUyi7NTF0FTk by simon@fedi.simonwillison.net
 (DIR) More posts by simon@fedi.simonwillison.net
 (DIR) Post #AVYAfEa4E54xQs7Vmy by simon@fedi.simonwillison.net
       2023-05-11T15:55:56Z
       
       0 likes, 0 repeats
       
       There's a new official OpenAI ChatGPT prompt engineering course which is genuinely excellent... except in its coverage of prompt injection, which suggests a solution (delimiters) that demonstrably does not workhttps://simonwillison.net/2023/May/11/delimiters-wont-save-you/
       
 (DIR) Post #AVYApk0UEu1bnKUIqm by laimis@mstdn.social
       2023-05-11T15:57:25Z
       
       0 likes, 0 repeats
       
       @simon Taking this one right now! Just started.
       
 (DIR) Post #AVYB116h0BK40zld4K by simon@fedi.simonwillison.net
       2023-05-11T15:58:23Z
       
       0 likes, 0 repeats
       
       Here's the slide with the flawed solution in it - it suggests that using delimiters (in this case triple backticks) allows the model to distinguish between untrusted user input and the original instructions provided in the prompt
       
 (DIR) Post #AVYBC38nmPQlVc0yJs by simon@fedi.simonwillison.net
       2023-05-11T16:01:22Z
       
       0 likes, 0 repeats
       
       And here's an example of user input that breaks that defense - not by including the delimiter itself, but instead by tricking the model into thinking the original instruction has already been satisfied
       
 (DIR) Post #AVYBSuXkzxJTlcj9nc by simon@fedi.simonwillison.net
       2023-05-11T16:04:50Z
       
       0 likes, 0 repeats
       
       This is a fundamental problem with prompt injection: it's easy to come up with a solution that appears to work, but attackers have a virtually infinite set of possibilities for defeating those protections
       
 (DIR) Post #AVYBf2meBg36nZTfPc by jedfox@mastodon.social
       2023-05-11T16:05:47Z
       
       0 likes, 0 repeats
       
       @simon Do you think it would be possible to train the model on dedicated “delimiter” tokens (maybe “instruction”/“input”/“output”) with no textual representation? Assuming the model was instruction trained using only these tokens, it might be more difficult to convince it that the input has ended, right?
       
 (DIR) Post #AVYC50RNWNfb3puPmi by simon@fedi.simonwillison.net
       2023-05-11T16:07:39Z
       
       0 likes, 0 repeats
       
       @jedfox I don't think that can work - at least not with the 100% reliability we needThe "system prompt" in the GPT 3.5/4 APIs uses something like that - it has special tokens to delimit system instructions from user input - but it can still be subverted if you are devious enough with your prompting
       
 (DIR) Post #AVYCrnWyJAXvDAn5yi by marcelsalathe@mastodon.social
       2023-05-11T16:19:49Z
       
       0 likes, 0 repeats
       
       @simon this is an ongoing prompt injection challenge: https://www.aicrowd.com/challenges/hackaprompt-2023Who knows to what extent this issue can be solved, but this type of open assessment is at least one way forward…
       
 (DIR) Post #AVYD8wz4PlddS972rA by dvogel@mastodon.social
       2023-05-11T16:23:37Z
       
       0 likes, 0 repeats
       
       @simon this is because prompt injections are no different from other forms of user input. You cannot trust it. The mistake is in the initial conception. You cannot use LLMs under user control to replace labor for anything other than helping the user, thereby aligning their interests with yours and making attacks counter-productive. Allowing unverified LLM output to influence business processes will never be secure in the same way that allowing a user to edit your CRM database would be insecure.
       
 (DIR) Post #AVYDJDSU1B49RXlW08 by laimis@mstdn.social
       2023-05-11T16:24:07Z
       
       0 likes, 0 repeats
       
       @simon not unlike computer crime, or crime in general, avenues. The perpetrator only needs one weak point.
       
 (DIR) Post #AVYDU2Es4XsFpdYcRU by austegard@mastodon.social
       2023-05-11T16:24:30Z
       
       0 likes, 0 repeats
       
       @simon Confirmed! Even with the mighty GPT-4: https://austegard.com/pv?d597b4d3ac7dcddd15679daba0ccde7d
       
 (DIR) Post #AVYIDod5P5gSlCy8vI by kellogh@hachyderm.io
       2023-05-11T17:20:28Z
       
       0 likes, 0 repeats
       
       @simon maybe the answer is to not deploy LLM apps in security-critical contexts? Don’t give them access to information that can cause damage?
       
 (DIR) Post #AVYIrhpg5sAZGOscO8 by vladiliescu@mastodon.online
       2023-05-11T17:27:34Z
       
       0 likes, 0 repeats
       
       @simon That's a really interesting attack! I was wondering if/how delimiters can be broken, I had no idea a model might be convinced the instruction has been satisfied. Curious if this happens only to OpenAI models, or if LLaMA/PaLM are susceptible as well.
       
 (DIR) Post #AVYJAIuUyi7NTF0FTk by simon@fedi.simonwillison.net
       2023-05-11T17:31:14Z
       
       0 likes, 0 repeats
       
       @vladiliescu The same tricks seem to work against every LLM from what I've seen
       
 (DIR) Post #AVYJKwyYs1WocxyoRU by simon@fedi.simonwillison.net
       2023-05-11T17:32:02Z
       
       0 likes, 0 repeats
       
       @kellogh Yeah, pretty much that - I wrote about one proposal along those lines in detail here: https://simonwillison.net/2023/Apr/25/dual-llm-pattern/
       
 (DIR) Post #AVYTfAqY0Yb8ypVtrM by vladiliescu@mastodon.online
       2023-05-11T19:28:37Z
       
       0 likes, 0 repeats
       
       @simon Tried them against GPT-4 and it seems to resist them, at least in Azure's OpenAI Studio.The system prompt was pretty basic -- "You are an AI assistant that helps people find information.", not sure if it affects the output.
       
 (DIR) Post #AVYXfVlybWZNhMn9sG by simon@fedi.simonwillison.net
       2023-05-11T20:13:33Z
       
       0 likes, 0 repeats
       
       @vladiliescu yeah, GPT-4 is more resistant to injection attacks - it can still be beaten though, you just have to spend a bit more effort on it https://simonwillison.net/2023/Apr/14/worst-that-can-happen/#gpt4
       
 (DIR) Post #AVZYHvThfvcfaHWRJg by vladiliescu@mastodon.online
       2023-05-12T07:55:20Z
       
       0 likes, 0 repeats
       
       @simon Agreed, [system] prompts are quite powerful - I've managed to prompt inject Bing (which afaik is based on GPT-4) and have it start each conversation with claiming it's Chandler Bing and then telling a joke. All this by having a page open while invoking the sidebar (no instructions needed to have it read the page).Makes me wonder how it would behave if we scrubbed the input of [system] prompts too (I know, I know, it's not a definitive solution either :) )https://vladiliescu.net/bing-becomes-chandler/
       
 (DIR) Post #AVZnVHHbPoUOHtgdcm by chriscarrollsmith@masto.ai
       2023-05-12T10:45:32Z
       
       0 likes, 0 repeats
       
       @simon cybersecurity is an arms race and always has been!