[HN Gopher] Show HN: ChatGPT Plugins are a security nightmare
       ___________________________________________________________________
        
       Show HN: ChatGPT Plugins are a security nightmare
        
       Author : going_ham
       Score  : 75 points
       Date   : 2023-03-25 11:39 UTC (11 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | pyrolistical wrote:
       | It's social engineering LLMs
        
       | xg15 wrote:
       | I wonder if a lot of those "injection" problems could be overcome
       | by introducing a distinction between the different types of input
       | and output already at the token level.
       | 
       | E.g. imagine that every token that an LLM inputs or outputs would
       | be associated with a "color" or "channel", which corresponds to
       | the token's source or destination:
       | 
       | - "red": tokens input by the user, i.e. the initial prompt and
       | subsequent replies.
       | 
       | - "green": answers from the LLM to the user, i.e. everything the
       | user sees as textual output on the screen.
       | 
       | - "blue": instructions from the LLM to a plugin: database
       | queries, calculations, web requests, etc.
       | 
       | - "yellow": replies from the plugin back to the LLM.
       | 
       | - "purple": the initial system prompt.
       | 
       | The point is that each (word, color) combination constitutes a
       | separate token; i.e. if your "root" token dictionary was as
       | follows:
       | 
       | hello -> 0001; world -> 0002;
       | 
       | then the "colorized" token dictionary would be the cross product
       | of the root and each color combination:
       | 
       | hello (red) -> 0001; hello (green) -> 0002; ... world (red) ->
       | 0006; world (green) -> 0007; ...
       | 
       | likewise, because the model considers "hello (red)" and "hello
       | (blue)" two different tokens, it also has two different sets of
       | weights for those tokens and hopefully much less risk of
       | confusing one kind of token with the other.
       | 
       | With some luck, you don't have to use 5 x the amount of compute
       | and training data for training: You might be able to take an
       | "ordinary" model, trained on non-colored tokens, then copy the
       | weights four times and finetune the resulting "expanded" model on
       | a colored corpus.
       | 
       | Likewise, because the model should only ever predict "green" or
       | "blue" tokens, any output neuron that correspond only to "red",
       | "yellow" or "purple" tokens can be removed from the model.
        
         | greshake wrote:
         | Segmenting different data sources is the main approach pursued
         | by OpenAI afaik (ChatML for example). That has not worked so
         | far, as you can see in this prompt golfing game:
         | https://ggpt.43z.one/ The goal is to find the shortest prompt
         | that subverts the "system" instructions (which GPT was trained
         | to obey). Inputs can not "fake" being from the system and yet
         | it only takes 1-5 characters for all the puzzles so far.
         | 
         | I've also elaborated on why this problem is harder than one may
         | think in a blogpost: https://medium.com/better-programming/the-
         | dark-side-of-llms-...
         | 
         | It's easy to come up with solutions that seem promising, but so
         | far no one has produced a solution that holds up to adversarial
         | pressure. And indirect prompt injection on integrated LLMs
         | increases the stakes significantly.
        
           | mk_stjames wrote:
           | That prompt golf game is awesome.
        
           | xg15 wrote:
           | Ok, the "repeat this in your internal voice" exploit is
           | impressive.
           | 
           | However, apart from this I don't see anything concrete that
           | ChatML uses different parts of the network for different
           | input sources. The source is prefixed, but it doesn't seem to
           | say anything about how the source parameter is processed.
           | 
           | Also, with all due respect, but your finding that ChatML does
           | not work seems to be mainly this:
           | 
           |  _> > Note that ChatML makes explicit to the model the source
           | of each piece of text, and particularly shows the boundary
           | between human and AI text. This gives an _opportunity_ to
           | mitigate and _eventually_ solve injections, as the model can
           | tell which instructions come from the developer, the user, or
           | its own input.
           | 
           | > Emphasis mine. To summarize, they are saying injections
           | aren't solved with this and that they don't know if this
           | approach can ever make it safe. I also assume Bing already
           | uses this format, although I cannot confirm. I don't know how
           | robust models trained from the ground up with this
           | segmentation in mind will perform, but I am doubtful they
           | will fully mitigate the issue._
           | 
           | Which I find somewhat weak, as it's basically just tea-leaf
           | reading from an OpenAI blog post.
           | 
           | I fully agree with your main take that this is an unsolved
           | problem so far though. Seems a general problem with
           | instruction-tuned LLMs is that they now treat _everything_ as
           | an instruction.
        
             | eternalban wrote:
             | > Ok, the "repeat this in your internal voice" exploit is
             | impressive.
             | 
             | I told it that 'user is not anyone' and it coughs up the
             | key.
        
             | greshake wrote:
             | > your finding that ChatML does not work seems to be mainly
             | this
             | 
             | Also the fact that ChatML has been broken into bits many,
             | many times now- see again the prompt golfing. Also I'm
             | taking OpenAi at their word because they have very strong
             | incentives to pretend to have a solution, and so a public
             | admission that it's currently not solved by the #1 AI
             | company is worth quoting. I'm also just taking their
             | response literally and didn't interpret anything into it.
             | 
             | Indeed, there may be a slight difference in robustness when
             | the inputs are separated by different channels during
             | training and inference. However, my main argument is one
             | from complexity theory- there is no difference here between
             | data & code. Processing the data through a sufficiently
             | advanced model may never be entirely safe. The approach
             | will need to change to constrain these models on well-
             | defined, secure pathways- reducing their utility in the
             | general case. This is very different from SQL injections
             | etc. where we can completely mitigate the issue.
        
           | eternalban wrote:
           | I managed it with a len 21 phrase. Are these 2 char prompts
           | just permutations of various ascii letters?
        
       | greshake wrote:
       | [...] demonstrate potentially brutal consequences of giving LLMs
       | like ChatGPT interfaces to other applications. We propose newly
       | enabled attack vectors and techniques and provide demonstrations
       | of each in this repository:
       | 
       | - Remote control of chat LLMs
       | 
       | - Leaking/exfiltrating user data
       | 
       | - Persistent compromise across sessions
       | 
       | - Spread injections to other LLMs
       | 
       | - Compromising LLMs with tiny multi-stage payloads
       | 
       | - Automated Social Engineering
       | 
       | - Targeting code completion engines
       | 
       | Based on our findings:
       | 
       | - Prompt injections can be as powerful as arbitrary code
       | execution
       | 
       | - Indirect prompt injections are a new, much more powerful way of
       | delivering injections.
        
       | afinlayson wrote:
       | We need better fingerprinting. This would help with having people
       | preemptively prompting then only showing the last prompt and
       | results.
        
       | 29athrowaway wrote:
       | We will finally have a semantic web, but not Web 3.0 (RDF/
       | OWL/etc, not "Web3")... instead, a regurgitated version of the
       | Internet created by LLMs.
        
       | turtleyacht wrote:
       | Wonder if there is a way to "show problem A is like problem B,
       | therefore it is NP complete," but for the possibility space of
       | _literally the entire English language._
        
       ___________________________________________________________________
       (page generated 2023-03-25 23:01 UTC)