[HN Gopher] Refusal in Language Models Is Mediated by a Single D...
       ___________________________________________________________________
        
       Refusal in Language Models Is Mediated by a Single Direction
        
       Author : Tomte
       Score  : 32 points
       Date   : 2024-06-18 17:09 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | wavemode wrote:
       | Related recent HN submission (Uncensor any LLM with
       | abliteration): https://news.ycombinator.com/item?id=40665721
        
         | schoen wrote:
         | I don't know the exact connection between the two, but that
         | article cites an article which is described as a preview of
         | this paper. So I guess it was working with a summary of this
         | paper's contributions.
        
       | eigenvalue wrote:
       | Now that this technique is known, I wonder if there will be an
       | arms race to try to "distribute" the refusal tendency across as
       | many different directions in the embedding space as possible so
       | that it can't be easily offset without reducing the quality of
       | the inferences so much that it's not worth it.
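
For readers unfamiliar with what "offsetting" a direction means here, the following is a toy sketch of directional ablation: subtracting from an activation vector its projection onto a unit direction. It is not the paper's code. The helper names (`unit`, `dot`, `ablate`, `ablate_many`) and the three-dimensional example vectors are assumptions made purely for illustration, and real models apply this to high-dimensional activations. `ablate_many` shows why spreading a behaviour over several directions would make removal costlier: every extra direction projected out discards more of the original activation.

```python
import math


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate(activation, direction):
    """Return activation minus its projection onto direction."""
    r = unit(direction)
    coeff = dot(activation, r)
    return [a - coeff * ri for a, ri in zip(activation, r)]


def ablate_many(activation, directions):
    """Project out several directions in turn.

    Assumes the directions are mutually orthogonal; otherwise a later
    projection can reintroduce a component along an earlier direction.
    """
    for d in directions:
        activation = ablate(activation, d)
    return activation


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
activation = [3.0, 4.0, 1.0]
refusal_dir = [1.0, 0.0, 0.0]

cleaned = ablate(activation, refusal_dir)
spread = ablate_many(activation, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

After the single ablation, `cleaned` has no component along `refusal_dir` while the other two coordinates are untouched. Removing two directions, as in `spread`, leaves only one coordinate of the original vector.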
        
       ___________________________________________________________________
       (page generated 2024-06-18 23:00 UTC)