[HN Gopher] LLMs know more than what they say
       ___________________________________________________________________
        
       LLMs know more than what they say
        
       Author : nqnielsen
       Score  : 42 points
       Date   : 2024-08-19 16:53 UTC (6 hours ago)
        
 (HTM) web link (arjunbansal.substack.com)
 (TXT) w3m dump (arjunbansal.substack.com)
        
       | nqnielsen wrote:
       | And how to use LLM interpretability research for applied
       | evaluation
        
        | autokad wrote:
        | If I understand correctly, they project the LLM's internal
        | activations onto meaningful linear directions derived from
        | contrasting examples. I guess this is similar to how we began
        | to derive a lot more value from embeddings by reusing the
        | embedding values for various downstream tasks.
        
          | ruby314 wrote:
          | Yes, that's correct! We project an evaluator LLM's internal
          | activations onto meaningful linear directions derived from
          | contrasting examples. The strongest connection is to LLM
          | interpretability research (the existence of meaningful
          | linear directions) and steering research (computing
          | directions from contrast pairs). This has been done with
          | base model activations to understand base model behavior,
          | but we show you can boost evaluation accuracy this way too,
          | with a small number of human feedback examples.
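          |
          | Roughly, a minimal sketch of the idea (illustrative only;
          | the model, layer, and example prompts below are placeholder
          | assumptions, not the setup described in the post), using
          | HuggingFace transformers:
          |
          |     import torch
          |     from transformers import AutoModel, AutoTokenizer
          |
          |     # Small model purely for illustration; any decoder with
          |     # accessible hidden states works the same way.
          |     tok = AutoTokenizer.from_pretrained("gpt2")
          |     model = AutoModel.from_pretrained(
          |         "gpt2", output_hidden_states=True)
          |
          |     def last_token_activation(text, layer=-1):
          |         # Hidden state of the final token at the chosen layer.
          |         inputs = tok(text, return_tensors="pt")
          |         with torch.no_grad():
          |             out = model(**inputs)
          |         return out.hidden_states[layer][0, -1]
          |
          |     # Contrasting examples, e.g. faithful vs. unfaithful
          |     # answers to the same question.
          |     positives = ["The answer is fully supported by the context."]
          |     negatives = ["The answer contradicts the context."]
          |
          |     pos_mean = torch.stack(
          |         [last_token_activation(t) for t in positives]).mean(0)
          |     neg_mean = torch.stack(
          |         [last_token_activation(t) for t in negatives]).mean(0)
          |
          |     # The direction is the normalized difference of the mean
          |     # activations of the two contrast sets.
          |     direction = pos_mean - neg_mean
          |     direction = direction / direction.norm()
          |
          |     def score(text):
          |         # Project a new example's activation onto the
          |         # direction; higher means closer to the "positive"
          |         # side of the contrast.
          |         return torch.dot(
          |             last_token_activation(text), direction).item()
          |
          | The resulting scores can then be calibrated against a small
          | set of human labels to pick a decision threshold.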
        
       ___________________________________________________________________
       (page generated 2024-08-19 23:00 UTC)