[HN Gopher] Show HN: Llama 3.2 Interpretability with Sparse Auto...
___________________________________________________________________
Show HN: Llama 3.2 Interpretability with Sparse Autoencoders
I spent a lot of time and money on this rather big side project
of mine, which attempts to replicate the mechanistic
interpretability research on proprietary LLMs that was quite
popular this year and produced great research papers by
Anthropic [1], OpenAI [2] and DeepMind [3]. I am quite proud of
this project, and since I consider myself part of the target
audience of Hacker News, I thought some of you might appreciate
this open research replication as well. Happy to answer any
questions or hear any feedback.
Cheers
[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...
[2] https://arxiv.org/abs/2406.04093
[3] https://arxiv.org/abs/2408.05147
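For anyone unfamiliar with the technique: a sparse autoencoder
(SAE) is trained on a model's internal activations to decompose
them into a much larger set of sparsely active, hopefully
interpretable features. A minimal sketch of the standard
architecture and training loss (illustrative only, not my exact
implementation; the dimensions and the L1 coefficient are
placeholder values):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=2048, d_hidden=16384):
            super().__init__()
            # Encoder maps activations into an overcomplete
            # feature basis; decoder reconstructs them.
            self.enc = nn.Linear(d_model, d_hidden)
            self.dec = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            f = torch.relu(self.enc(x))  # sparse feature activations
            return self.dec(f), f        # reconstruction, features

    def sae_loss(x, x_hat, f, l1_coeff=5e-4):
        # Reconstruction fidelity plus an L1 penalty on the
        # feature activations, which induces sparsity.
        mse = (x - x_hat).pow(2).sum(dim=-1).mean()
        return mse + l1_coeff * f.abs().sum(dim=-1).mean()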
Author : PaulPauls
Score : 133 points
Date : 2024-11-21 20:37 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jaykr_ wrote:
| This is awesome! I really appreciate the time you took to
| document everything!
| curious_cat_163 wrote:
| Hey - Thanks for sharing!
|
| Will take a closer look later, but since you are hanging around
| now, it might be worth asking right away. I read this blog post
| recently:
|
| https://adamkarvonen.github.io/machine_learning/2024/06/11/s...
|
| And the author talks about the challenges of evaluating SAEs.
| I wonder how you tackled that, and where in your repo I should
| look to understand your approach, if possible.
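|
| (For context on what "evaluating" usually means here: the
| standard proxy metrics are L0 sparsity, reconstruction error,
| and how much of the LM's loss is recovered when the
| reconstruction is spliced back into the forward pass. A rough
| sketch of the first two, assuming shapes (n_tokens, d_model)
| for x and x_hat and (n_tokens, d_hidden) for the features f:
|
|     import torch
|
|     def sae_proxy_metrics(x, x_hat, f, eps=1e-8):
|         # L0: average number of features active per token.
|         l0 = (f > 0).float().sum(dim=-1).mean()
|         # Fraction of variance unexplained by the reconstruction.
|         mse = (x - x_hat).pow(2).sum()
|         var = (x - x.mean(dim=0)).pow(2).sum()
|         fvu = mse / (var + eps)
|         return l0.item(), fvu.item()
|
| The harder part, which that post gets at, is judging whether
| individual features are actually interpretable; that has no
| clean automatic metric.)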
|
| Thanks again!
| JackYoustra wrote:
| Very cool work! Any plans to integrate it with SAELens?
___________________________________________________________________
(page generated 2024-11-21 23:00 UTC)