Post AcwzGeSllDavstoZfc by cypherfox@mas.to
(DIR) More posts by cypherfox@mas.to
(DIR) Post #AcwzGeSllDavstoZfc by cypherfox@mas.to
2023-12-18T23:25:06Z
0 likes, 0 repeats
@simon You’re the closest thing to a ‘prompt injection expert’ I can think of.Imagine the classic representation of attention where there’s a heat-map table of attention between tokens… What if you zeroed the attention between all ‘untrusted input’ tokens and the outer ‘system/direction’ tokens?The idea is to eliminate the ’forget your prior instructions’ hole by eliminating the attention between untrusted input and the instructions.Do you think that would be viable/interesting to explore?
(DIR) Post #AcwzGfZtcCtDLI9mPw by simon@fedi.simonwillison.net
2023-12-19T00:20:40Z
0 likes, 0 repeats
@cypherfox my hunch is that if someone could get that to work they would have already, but maybe I'm just being overly pessimistic - at this point the more people trying more approaches the better!If you're operating on input tokens (to translate or summarize them for example) you have to pay them some level of attention, would a binary classification work?