[HN Gopher] Extracting concepts from GPT-4
       ___________________________________________________________________
        
       Extracting concepts from GPT-4
        
       Author : davidbarker
       Score  : 132 points
       Date   : 2024-06-06 17:01 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | OmarShehata wrote:
        | This is super cool; it feels like a step toward the "deep"/high-
        | level kind of semantic search I've been waiting for. I like
        | their examples of basically filtering documents for the
        | "concept" of price increases, or even something as high-level as
        | a rhetorical question.
       | 
        | I wonder how this compares to training/fine-tuning a model on
        | examples of rhetorical questions and asking it to find them in a
        | given document. Is this maybe faster/more accurate, since it
        | only involves looking at neural network activations, vs. running
        | the model on the input and having it generate an answer?
        
       | riku_iki wrote:
        | The worrying part is that the first concept they show/found in
        | the doc is "human imperfection". Hope this is just a
        | coincidence...
        
         | jackphilson wrote:
          | I think it's because of the safety implications.
        
         | thelastparadise wrote:
         | Spooky!
        
         | calibas wrote:
          | I think it's done on purpose; it's related to a very important
          | point in understanding AI.
         | 
         | Humans aren't perfect, AI is trained by humans, therefore...
        
           | riku_iki wrote:
            | I think the point of the doc is that "human imperfection" is
            | a very prominent concept in the trained LLM...
        
       | yismail wrote:
       | Interesting, reminds me of similar work Anthropic did on Claude 3
       | Sonnet [0].
       | 
       | [0] https://transformer-circuits.pub/2024/scaling-
       | monosemanticit...
        
         | ranman wrote:
         | Someone mentioned that this took almost as much compute to
         | train as the original model.
        
           | swyx wrote:
           | source please!
        
         | Legend2440 wrote:
          | The methods are the same; this is just OpenAI applying
          | Anthropic's research to their own model.
        
         | longdog wrote:
          | I feel the webpage strongly hints that sparse autoencoders were
          | invented by OpenAI for this project.
          | 
          | Very weird that they don't cite the prior work on their webpage
          | and instead bury the source in their paper.
        
       | obiefernandez wrote:
       | Can someone ELI5 the significance of this? (okay maybe not 5, but
       | in basic language)
        
         | 93po wrote:
         | from chatgpt itself: The article discusses how researchers use
         | sparse autoencoders to identify and interpret key features
         | within complex language models like GPT-4, making their inner
         | workings more understandable. This advancement helps improve AI
         | safety and reliability by breaking down the models' decision-
         | making processes into simpler, human-interpretable parts.
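          | 
          | For intuition, here is a toy sketch of what such a sparse
          | autoencoder looks like (my own illustration in PyTorch, not
          | OpenAI's actual code; the sizes are made up, and OpenAI's
          | paper uses a TopK activation rather than the L1 penalty shown
          | here):
          | 
          |   import torch
          |   import torch.nn as nn
          |   
          |   class SparseAutoencoder(nn.Module):
          |       # Reconstruct model activations through a wide, mostly-
          |       # zero feature layer; each feature is one "concept".
          |       def __init__(self, d_model=768, n_features=16384):
          |           super().__init__()
          |           self.encoder = nn.Linear(d_model, n_features)
          |           self.decoder = nn.Linear(n_features, d_model)
          |   
          |       def forward(self, acts):
          |           feats = torch.relu(self.encoder(acts))  # sparse
          |           recon = self.decoder(feats)             # rebuild
          |           return recon, feats
          |   
          |   sae = SparseAutoencoder()
          |   acts = torch.randn(32, 768)  # stand-in for real activations
          |   recon, feats = sae(acts)
          |   # Train to reconstruct well while keeping features sparse:
          |   sparsity = 1e-3 * feats.abs().mean()
          |   loss = ((recon - acts) ** 2).mean() + sparsity
          |   loss.backward()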
        
         | OtherShrezzing wrote:
         | LLM based AIs have lots of "features" which are kind of
         | synonymous with "concepts" - these can be anything from `the
         | concept of an apostrophe in the word don't`, to `"George Wash"
         | is usually followed by "ington" in the context of early
          | American History`. Inside of the LLM's neural network, these are
         | mapped to some circuitry-in-software-esque paths.
         | 
         | We don't really have a good way of understanding how these
         | features are generated inside of the LLMs or how their
         | circuitry is activated when outputting them, or why the LLMs
         | are following those circuits. Because of this, we don't have
         | any way to debug this component of an LLM - which makes them
         | harder to improve. Similarly, if LLMs/AIs ever get advanced
         | enough, we'll want to be able to identify if they're being
         | wilfully deceptive towards us, which we can't currently do. For
         | these reasons, we'd like to understand what is actually
         | happening in the neural network to produce & output concepts.
         | This domain of research is usually referred to as
         | "interpretability".
         | 
         | OpenAI (and also DeepMind and Anthropic) have found a few ways
         | to inspect the inner circuitry of the LLMs, and reveal a
          | handful of these features. They do this by asking questions of
          | the model, and then inspecting which parts of the LLM's inner
          | circuitry "light up". They then ablate (turn off) circuitry to
          | see if those features become less frequently used in the AI's
          | response, as a verification step.
         | 
         | The graphs and highlighted words are visual representations of
         | concepts that they are reasonably certain about - for example,
         | the concept of the word "AND" linking two parts of a sentence
         | together highlights the word "AND".
         | 
         | Neel Nanda is the best source for this info if you're
         | interested in interpretability (IMO it's the most interesting
         | software problem out there at the moment), but note that his
         | approach is different to OpenAI's methodology discussed in the
         | post: https://www.neelnanda.io/mechanistic-interpretability
        
           | localfirst wrote:
           | hallucination solution?
        
             | OtherShrezzing wrote:
             | Solving this problem would be a step on the way to
             | debugging (and then resolving, or at least highlighting)
             | hallucinations.
        
               | skywhopper wrote:
               | I'm skeptical that it could ever be possible to tell the
               | difference between a hallucination and a "fact" in terms
               | of what's going on inside the model. Because
               | hallucinations aren't really a bug in the usual sense.
               | Ie, there's not some logic wrong or something misfiring.
               | 
               | Instead, it's more appropriate to think of LLMs as
               | _always_ hallucinating. And sometimes that comes really
               | close to reality because there's a lot of reinforcement
               | in the training data. And sometimes we humans infer
               | meaning that isn't there because that's how humans work.
               | And sometimes the leaps show clearly as "hallucinations"
               | because the patterns the model is expressing don't match
               | the patterns that are meaningful to us. (Eg when they
               | hallucinate strongly patterned things like URLs or
               | academic citations, which don't actually point to
               | anything real. The model picked up the pattern of what
               | such citations look like really well, but it didn't and
               | can't make the leap to linking those patterns to
               | reality.)
               | 
               | Not to mention that a lot of use cases for LLMs we
               | actually _want_ "hallucination". Eg when we ask it to do
               | any creative task or make up stories or jokes or songs or
               | pictures. It's only a hallucination in the wrong context.
               | But context is the main thing LLMs just don't have.
        
         | orbital-decay wrote:
          | High-level concepts stored inside large models (diffusion
          | models, transformers, etc.) are normally hard to separate from
          | each other, and the model is more or less a black box. A lot of
          | research goes into gaining insight into what the model knows.
          | This is another advancement in that direction; it allows for
          | easier separation of the concepts.
         | 
         | This can be used to analyze the knowledge inside the model, and
         | potentially modify (add, erase, change the importance) certain
         | concepts without affecting unrelated ones. The precision
         | achievable with the particular technique is always in question
         | though, and some concepts are just too close to separate from
         | each other, so it's probably not perfect.
        
         | HarHarVeryFunny wrote:
         | In general this is just copying work done by Anthropic, so
         | there's nothing fundamentally new here.
         | 
          | What they have done here is to identify patterns internal to
          | GPT-4 that correspond to specific identifiable concepts. The
          | work was done by OpenAI's mostly dismantled safety team (it has
          | the names of the team's recently departed co-leads, Ilya and
          | Jan Leike, on it), so this is nominally being done for safety
          | reasons: to be able to boost or suppress specific concepts from
          | being activated when the model is running, such as Anthropic's
          | demonstration of boosting their model's fixation on the Golden
          | Gate Bridge:
         | 
         | https://www.anthropic.com/news/golden-gate-claude
         | 
          | This kind of work would also seem to have potential functional
          | uses as well as safety ones, given that it allows you to
          | control the model in specific ways.
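          | 
          | Roughly, the "boost or suppress" step amounts to nudging the
          | model's hidden states along a feature's decoder direction
          | during the forward pass. A hand-wavy sketch (names, scales and
          | shapes are made up; this is not OpenAI's or Anthropic's actual
          | code):
          | 
          |   import torch
          |   
          |   d_model = 768
          |   feature_dir = torch.randn(d_model)  # pretend decoder column
          |   feature_dir /= feature_dir.norm()
          |   
          |   def steer(hidden, direction=feature_dir, alpha=5.0):
          |       # alpha > 0 boosts the concept, alpha < 0 suppresses it.
          |       return hidden + alpha * direction
          |   
          |   hidden = torch.randn(1, 10, d_model)  # (batch, tokens, dim)
          |   steered = steer(hidden)
          |   # In practice this would be applied inside the model, e.g.
          |   # via a PyTorch forward hook on a residual-stream layer.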
        
       | andreyk wrote:
       | Exciting to see this so soon after Anthropic's "Mapping the Mind
       | of a Large Language Model" (under 3 weeks). I find these efforts
       | really exciting; it is still common to hear people say "we have
       | no idea how LLMs / Deep Learning works", but that is really a
       | gross generalization as stuff like this shows.
       | 
       | Wonder if this was a bit rushed out in response to Anthropic's
       | release (as well as the departure of Jan Leike from OpenAI)...
       | the paper link doesn't even go to Arxiv, and the analysis is not
       | nearly as deep. Though who knows, might be unrelated.
        
         | jerrygenser wrote:
         | > but that is really a gross generalization as stuff like this
         | shows.
         | 
          | I think this research actually reinforces that we still have
          | very little understanding of the internals. The blog post also
          | reiterates that this is early work with many limitations.
        
         | thegrim33 wrote:
         | From the article:
         | 
         | "We currently don't understand how to make sense of the neural
         | activity within language models."
         | 
         | "Unlike with most human creations, we don't really understand
         | the inner workings of neural networks."
         | 
         | "The [..] networks are not well understood and cannot be easily
         | decomposed into identifiable parts"
         | 
         | "[..] the neural activations inside a language model activate
         | with unpredictable patterns, seemingly representing many
         | concepts simultaneously"
         | 
         | "Learning a large number of sparse features is challenging, and
         | past work has not been shown to scale well."
         | 
         | etc., etc., etc.
         | 
          | People say we don't (currently) know why they output what they
          | output because, as the article clearly states, we don't.
        
           | surfingdino wrote:
           | Not holding my breath for that hallucinated cure for cancer
           | then.
        
             | ben_w wrote:
             | LLMs aren't the only kind of AI, just one of the two
             | current shiny kinds.
             | 
             | If a "cure for cancer" (cancer is not just one disease so,
             | unfortunately, that's not even as coherent a request as
             | we'd all like it to be) is what you're hoping for, look
             | instead at the stuff like AlphaFold etc.:
             | https://en.wikipedia.org/wiki/AlphaFold
             | 
             | I don't know how to tell where real science ends and PR
             | bluster begins in such models, though I can say that the
             | closest I've heard to a word against it is "sure, but we've
             | got other things besides protein folding to solve", which
             | is a good sign.
             | 
             | (I assume AlphaFold is also a mysterious black box, and
             | that tools such as the one under discussion may help us
             | demystify it too).
        
               | surfingdino wrote:
                | It's bullshit sold in a very convincing way. Whenever you
                | ask the guys selling AI, they always say "sure, it can't
                | do that, but there are other things it can do..." or
                | "that's not the right question" or "you are asking it the
                | wrong way"... all while selling it as the solution to the
                | problems they say it cannot solve.
        
               | wg0 wrote:
               | Downvotes for Truth.
               | 
               | "Hold on to your papers, What a time to be alive!"
        
           | TrainedMonkey wrote:
           | I read this as "we have not built up tools / math to
           | understand neural networks as they are new and exciting" and
           | not as "neural networks are magical and complex and not
           | understandable because we are meddling with something we
           | cannot control".
           | 
            | A good example would be planes - it took a long while to
            | develop mathematical models that could be used to model
            | their behavior. Meanwhile, practical experimentation produced
            | decent rules of thumb for what worked / did not work.
            | 
            | So I don't think it's fair to say that "we don't" (know how
            | neural networks work); rather, we don't yet have the math /
            | models that can explain their behavior...
        
         | imjonse wrote:
         | Both Leike and Sutskever are still credited in the post.
        
         | swyx wrote:
         | > Wonder if this was a bit rushed out in response to
         | Anthropic's release
         | 
         | too lazy to dig up source but some twitter sleuth found that
         | the first commit to the project was 6 months ago
         | 
         | likely all these guys went to the same metaphorical SF bars, it
         | was in the water
        
           | nicce wrote:
           | Visualizer was added 18 hours ago:
           | 
           | https://github.com/openai/sparse_autoencoder/commit/764586ae.
           | ..
        
           | szvsw wrote:
           | > likely all these guys went to the same metaphorical SF
           | bars, it was in the water
           | 
            | It also comes from a long lineage of thought, no? For
            | instance, one of the things often taught early in an ML
            | course is the notion that "early layers respond to/generate
            | general information/patterns, and deeper layers respond
            | to/generate more detailed/complex patterns/information." That
            | is obviously an overly broad and vague statement, but it is a
            | useful intuition and can be backed up by inspecting, e.g.,
            | what maximally activates some convolution filters. So there
            | is already a notion that there is some sort of spatial
            | structure to how semantics are processed and represented in
            | a neural network (even if in a totally different context, as
            | in the image processing mentioned above), where "spatial"
            | here refers to different regions of the network.
           | 
            | Even more simply, in fact as simple as you can get: with
            | linear regression, the most interpretable model there is, you
            | have a clear notion that different parameter groups of the
            | model respond to different "concepts" (where a concept is
            | taken to be whatever the variables associated with a given
            | subset of coefficients represent).
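            | 
            | As a trivial illustration of that last point (toy data,
            | invented on the spot):
            | 
            |   import numpy as np
            |   from sklearn.linear_model import LinearRegression
            |   
            |   rng = np.random.default_rng(0)
            |   # Two "concepts" predicting house price: size and bedrooms.
            |   sqft = rng.uniform(500, 3000, size=200)
            |   beds = rng.integers(1, 6, size=200)
            |   noise = rng.normal(0, 5_000, size=200)
            |   price = 200 * sqft + 10_000 * beds + noise
            |   
            |   X = np.column_stack([sqft, beds])
            |   model = LinearRegression().fit(X, price)
            |   # Each coefficient is directly readable as "how much this
            |   # concept matters": roughly [200, 10000].
            |   print(model.coef_)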
           | 
           | In some sense, at least in a high-level/intuitive reading of
           | the new research coming out of Anthropic and OpenAI, I think
           | the current research is just a natural extension of these
            | ideas, albeit in a much more complicated context and at a
            | massive scale.
           | 
           | Somebody else, please correct me if you think my reading is
           | incorrect!!
        
       | svieira wrote:
       | When one of the first examples is:
       | 
       | > GPT-4 feature: ends of phrases related to price increases
       | 
        | and 2 of the 5 responses don't have any relation to _increase_
        | at all:
       | 
       | > Brent crude, fell 38 cents to $118.29 a barrel on the ICE
       | Futures Exchange in London. The U.S. benchmark, West Texas
       | Intermediate crude, was down 53 cents to $99.34 a barrel on the
       | New York Mercantile Exchange. -- Ronald D. White Graphic: The AAA
       | 
       | and
       | 
       | > ,115.18. The record reflects that appellant also included
       | several hand-prepared invoices and employee pay slips, including
       | an allegedly un-invoiced laundry ticket dated 29 June 2013 for 53
       | bags oflaundry weighing 478 pounds, which, at the contract price
       | of $
       | 
        | I think I must be misunderstanding something. Why would this
       | example (out of all the potential examples) be picked?
        
         | Metus wrote:
         | Notice that most of the examples have none of the green
         | highlight counter, which is shown for
         | 
         | > small losses. KEEPING SCORE: The Dow Jones industrial average
         | rose 32 points, or 0.2 percent, to 18,156 as of 3:15 p.m.
         | Eastern time. The Standard & Poor's ... OMAHA, Neb. (AP) --
         | Warren Buffett's company has bought nearly
         | 
          | The other sentences are there as a contrast, to show how
          | specific this neuron is.
        
           | svieira wrote:
           | Ah, that makes a lot of sense, thank you!
        
           | yorwba wrote:
            | The highlights are more visible in this visualisation:
           | https://openaipublic.blob.core.windows.net/sparse-
           | autoencode...
           | 
           | There are also many top activations not showing increases,
           | e.g.
           | 
           | > 0.06 of a cent to 90.01 cents US.||U.S. indexes were mainly
           | lower as the Dow Jones industrials lost 21.72 points to
           | 16,329.53, the Nasdaq was up 11.71 points at 4,318.9 and the
           | S&P 500
           | 
           | (Highlight on the first comma.)
        
       | Shoop wrote:
       | Can anyone summarize the major differences between this and
       | Scaling Monosemanticity?
        
       | calibas wrote:
       | I want to be able to view exactly how my input is translated into
       | tokens, as well as the embeddings for the tokens.
        
         | franzb wrote:
         | For your first question: https://platform.openai.com/tokenizer
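          | 
          | Or, locally, the tiktoken library is supposed to match what
          | the API uses (I believe the GPT-4 models use the cl100k_base
          | encoding, but treat that as my assumption):
          | 
          |   import tiktoken
          |   
          |   enc = tiktoken.encoding_for_model("gpt-4")
          |   tokens = enc.encode("How is this text split into tokens?")
          |   print(tokens)                             # token ids
          |   print([enc.decode([t]) for t in tokens])  # text per token
          | 
          | For your second question: as far as I know the internal token
          | embeddings of the hosted models aren't exposed; the embeddings
          | API serves separate embedding models.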
        
           | calibas wrote:
           | I saw that, but the language makes me think it's not quite
           | the same as what's really being used?
           | 
           | "how a piece of text _might_ be tokenized by a language model
           | "
           | 
           | "It's important to note that the exact tokenization process
           | varies between models."
        
             | yorwba wrote:
             | That's why they have buttons to choose which model's
             | tokenizer to use.
        
               | calibas wrote:
               | Yes, thank you, I understand that part.
               | 
                | It's the _might_ condition in the description that makes
                | me think the results might not be exactly the same as
                | what's used in the live models.
        
       | andy12_ wrote:
        | A.k.a. the same work as Anthropic's, but with less interpretable
        | and interesting features. I guess there won't be a Golden
        | Gate[0] GPT anytime soon.
       | 
       | I mean, you just have to compare the couple of interesting
       | features of the OpenAI feature browser [1] and the features of
       | the Anthropic feature browser [2].
       | 
       | [0] https://twitter.com/AnthropicAI/status/1793741051867615494
       | 
       | [1] https://openaipublic.blob.core.windows.net/sparse-
       | autoencode...
       | 
       | [2] https://transformer-circuits.pub/2024/scaling-
       | monosemanticit...
        
       | itissid wrote:
        | Does this mean that it could be good practice to release the
        | autoencoder that was trained on a neural network to explain its
        | outputs? Like, all open models on Hugging Face could have this
        | as a useful accompaniment?
        
       | mlsu wrote:
       | This is interesting:
       | 
       | > Autoencoder family
       | 
       | > Note: Only 65536 features available. Activations shown on The
       | Pile (uncopyrighted) instead of our internal training dataset.
       | 
       | So, the Pile is uncopyrighted, but the internal training dataset
       | is copyrighted? Copyrighted by whom?
       | 
       | Huh?
        
         | Arcsech wrote:
         | > Copyrighted by whom?
         | 
         | By people who would get angry if they could definitively prove
         | their stuff was in OpenAI's training set.
        
         | immibis wrote:
         | Basically everyone. You, and me, and Elon Musk, and EMPRESS,
         | and my uncle who works for Nintendo. They're just hoping that
         | AI training legally ignores copyright.
        
         | Der_Einzige wrote:
          | Hehe, related to this, _someone_ created a "book4" dataset and
          | put it on torrent websites. I don't think it's being used in
          | any major LLMs, but the future intersection of the "piracy"
          | community with AI is going to be exciting.
          | 
          | Watching the cyberpunk world that all of my favorite literature
          | predicted slowly come into being is fun indeed.
        
           | swyx wrote:
           | i think you mean @sillysaurus' books3? not books4?
        
       | russellbeattie wrote:
       | One feature I expect we'll get from this sort of research is
       | identifying "hot spots" that are used during inference. Like
       | virtual machines, these could be cached in whole or in part and
       | used to both speed up the response time and reduce computation
       | cycles needed.
        
       | aeonik wrote:
       | In their other examples, they have what looks to be a scientific
       | explanation of reproductive anatomy classified as erotic
       | content...
       | 
       | Here is the link to the concept [content warning]:
       | https://openaipublic.blob.core.windows.net/sparse-autoencode...
       | 
       | DocID: 191632
        
       | adamiscool8 wrote:
        | How does this compare to or improve on applying something like
        | SHAP[0][1] to a model? The claim in the first line that "we
        | currently don't understand how to make sense of the neural
        | activity within language models" is... straight up false?
       | 
       | [0] https://github.com/shap/shap
       | 
       | [1]
       | https://en.wikipedia.org/wiki/Shapley_value#In_machine_learn...
        
         | szvsw wrote:
          | SHAP is pretty separate IMO. Shapley analysis is really a game-
          | theoretic methodology that is model-agnostic and is only about
          | determining how individual sections of the input contribute to
          | a given prediction, not about how the model actually works
          | internally to produce an output.
         | 
         | As long as you have a callable black box, you can compute
         | Shapley values (or approximations); it does not speak to how or
         | why the model actually works internally.
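          | 
          | For example, with the shap package you never have to look
          | inside the model at all (a rough sketch; the dataset and model
          | are arbitrary choices on my part):
          | 
          |   import shap
          |   from sklearn.datasets import fetch_california_housing
          |   from sklearn.ensemble import RandomForestRegressor
          |   
          |   X, y = fetch_california_housing(return_X_y=True,
          |                                   as_frame=True)
          |   model = RandomForestRegressor(n_estimators=50).fit(X, y)
          |   
          |   # SHAP treats the fitted model as a black box (or just uses
          |   # its predict function); it never inspects the internals.
          |   explainer = shap.Explainer(model, X)
          |   shap_values = explainer(X.iloc[:100])
          |   # shap_values[i] attributes prediction i across the input
          |   # features, saying nothing about the model's inner circuitry.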
        
       ___________________________________________________________________
       (page generated 2024-06-06 23:00 UTC)