[HN Gopher] Refusal in Language Models Is Mediated by a Single D...
___________________________________________________________________
Refusal in Language Models Is Mediated by a Single Direction
Author : Tomte
Score : 32 points
Date : 2024-06-18 17:09 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| wavemode wrote:
| Related recent HN submission (Uncensor any LLM with
| abliteration): https://news.ycombinator.com/item?id=40665721
| schoen wrote:
| I don't know the exact connection between the two, but that
| article cites an article which is described as a preview of
| this paper. So I guess it was working with a summary of this
| paper's contributions.
| eigenvalue wrote:
| Now that this technique is known, I wonder if there will be an
| arms race to try to "distribute" the refusal tendency across as
| many different directions in the embedding space as possible so
| that it can't be easily offset without reducing the quality of
| the inferences so much that it's not worth it.
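
The "offset" discussed above can be illustrated with a small sketch. This is not the paper's code: it assumes activations are rows of a NumPy array, and the names `ablate_direction` and `ablate_subspace` are hypothetical helpers made up for this example. Removing one direction is a rank-1 projection; if the behavior were spread over k directions, the same trick would need to project out a k-dimensional subspace, which discards more of the activation.

```python
import numpy as np

def ablate_direction(x, r):
    """Remove the component of each row of x along the direction r."""
    r_hat = r / np.linalg.norm(r)           # normalize to a unit vector
    return x - np.outer(x @ r_hat, r_hat)   # x - (x . r_hat) r_hat, per row

def ablate_subspace(x, R):
    """Remove the components of each row of x along the span of the rows of R."""
    Q, _ = np.linalg.qr(R.T)                # Q: (d, k) orthonormal basis of the span
    return x - (x @ Q) @ Q.T                # subtract the projection onto that span

# Toy data only (assumption): 4 activation vectors of dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
r = rng.normal(size=8)       # stand-in for a single "refusal direction"
R = rng.normal(size=(3, 8))  # stand-in for a behavior spread over 3 directions

x_one = ablate_direction(x, r)
x_many = ablate_subspace(x, R)
```

After either call, the ablated activations have (numerically) zero component along the removed direction or directions, while everything orthogonal to them is left unchanged.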
___________________________________________________________________
(page generated 2024-06-18 23:00 UTC)