[HN Gopher] Show HN: Llama 3.2 Interpretability with Sparse Auto...
       ___________________________________________________________________
        
       Show HN: Llama 3.2 Interpretability with Sparse Autoencoders
        
       I spent a lot of time and money on this rather big side project of
       mine that attempts to replicate the mechanistic interpretability
       research on proprietary LLMs that was quite popular this year and
       produced great research papers by Anthropic [1], OpenAI [2] and
       DeepMind [3].  I am quite proud of this project, and since I
       consider myself part of the target audience of Hacker News, I
       thought that some of you might appreciate this open research
       replication as well. Happy to answer any questions or face any
       feedback.
       Cheers
        
       [1] https://transformer-circuits.pub/2024/scaling-monosemanticit...
       [2] https://arxiv.org/abs/2406.04093
       [3] https://arxiv.org/abs/2408.05147
        
       Author : PaulPauls
       Score  : 133 points
       Date   : 2024-11-21 20:37 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jaykr_ wrote:
       | This is awesome! I really appreciate the time you took to
       | document everything!
        
       | curious_cat_163 wrote:
       | Hey - Thanks for sharing!
       | 
       | Will take a closer look later, but if you are hanging around
       | now, it might be worth asking this right away. I read this blog
       | post recently:
       | 
       | https://adamkarvonen.github.io/machine_learning/2024/06/11/s...
       | 
       | And the author talks about challenges with evaluating SAEs. I
       | wonder how you tackled that, and where to look inside your repo
       | to understand your approach, if possible.
       | 
       | Thanks again!
        
       | JackYoustra wrote:
       | Very cool work! Any plans to integrate it with SAELens?
        
       ___________________________________________________________________
       (page generated 2024-11-21 23:00 UTC)