[HN Gopher] Extracting concepts from GPT-4
___________________________________________________________________
Extracting concepts from GPT-4
Author : davidbarker
Score : 132 points
Date : 2024-06-06 17:01 UTC (5 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| OmarShehata wrote:
| This is super cool, it feels like going in the direction of the
| "deep"/high level type of semantic searching I've been waiting
| for. I like their examples of basically filtering documents for
| the "concept" of price increases, or even something as high level
| as a rhetorical question
|
| I wonder how this compares to training/fine-tuning a model on
| examples of rhetorical questions and asking it to find them in
| a given document. Is this maybe faster/more accurate, since it
| involves just looking at neural network activations, vs running
| the model on the input and having it generate an answer...?
| riku_iki wrote:
| The worrying part is that the first concept they show/found in
| the doc is "human imperfection". Hope this is just a
| coincidence...
| jackphilson wrote:
| I think it's the safety implications
| thelastparadise wrote:
| Spooky!
| calibas wrote:
| I think it's done on purpose; it's related to a very important
| point for understanding AI.
|
| Humans aren't perfect, AI is trained by humans, therefore...
| riku_iki wrote:
| I think the point of the doc is that "human imperfection" is a
| very prominent concept in the trained LLM...
| yismail wrote:
| Interesting, reminds me of similar work Anthropic did on Claude 3
| Sonnet [0].
|
| [0] https://transformer-circuits.pub/2024/scaling-
| monosemanticit...
| ranman wrote:
| Someone mentioned that this took almost as much compute to
| train as the original model.
| swyx wrote:
| source please!
| Legend2440 wrote:
| The methods are the same, this is just OpenAI applying
| Anthropic's research to their own model.
| longdog wrote:
| I feel the webpage strongly hints that sparse autoencoders were
| invented by OpenAI for this project.
|
| Very weird that they don't cite that prior work on the webpage
| and instead bury the citation in their paper.
| obiefernandez wrote:
| Can someone ELI5 the significance of this? (okay maybe not 5, but
| in basic language)
| 93po wrote:
| From ChatGPT itself: The article discusses how researchers use
| sparse autoencoders to identify and interpret key features
| within complex language models like GPT-4, making their inner
| workings more understandable. This advancement helps improve AI
| safety and reliability by breaking down the models' decision-
| making processes into simpler, human-interpretable parts.
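|
| If it helps, here's a rough sketch of what a sparse autoencoder
| is doing here (toy dimensions and a plain L1 penalty; the
| paper's exact setup differs):
|
|   import torch
|   import torch.nn as nn
|
|   class SparseAutoencoder(nn.Module):
|       # Maps dense model activations to many sparse, more
|       # interpretable features, then reconstructs them.
|       def __init__(self, d_model=512, n_features=4096):
|           super().__init__()
|           self.encoder = nn.Linear(d_model, n_features)
|           self.decoder = nn.Linear(n_features, d_model)
|
|       def forward(self, acts):
|           feats = torch.relu(self.encoder(acts))
|           return feats, self.decoder(feats)
|
|   sae = SparseAutoencoder()
|   acts = torch.randn(8, 512)  # stand-in for GPT-4 activations
|   feats, recon = sae(acts)
|   # Objective: reconstruct the activations while keeping the
|   # feature vector sparse (most entries near zero).
|   mse = ((recon - acts) ** 2).mean()
|   loss = mse + 1e-3 * feats.abs().mean()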
| OtherShrezzing wrote:
| LLM-based AIs have lots of "features", which are roughly
| synonymous with "concepts" - these can be anything from `the
| concept of an apostrophe in the word don't`, to `"George Wash"
| is usually followed by "ington" in the context of early
| American history`. Inside the LLM's neural network, these are
| mapped to some circuitry-in-software-esque paths.
|
| We don't really have a good way of understanding how these
| features are generated inside the LLMs, how their circuitry is
| activated when outputting them, or why the LLMs follow those
| circuits. Because of this, we don't have any way to debug this
| component of an LLM - which makes them harder to improve.
| Similarly, if LLMs/AIs ever get advanced enough, we'll want to
| be able to identify if they're being wilfully deceptive towards
| us, which we can't currently do. For these reasons, we'd like
| to understand what is actually happening in the neural network
| to produce & output concepts. This domain of research is
| usually referred to as "interpretability".
|
| OpenAI (and also DeepMind and Anthropic) have found a few ways
| to inspect the inner circuitry of the LLMs and reveal a
| handful of these features. They do this by asking questions of
| the model and then inspecting which parts of the LLM's inner
| circuitry "light up". As a verification step, they then ablate
| (turn off) that circuitry to see if those features become less
| frequently used in the AI's response.
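|
| Very roughly (a toy sketch, not their actual method: a single
| linear layer stands in for the model and a random vector for a
| learned feature direction):
|
|   import torch
|   import torch.nn as nn
|
|   layer = nn.Linear(16, 16)   # stand-in for part of the LLM
|   concept = torch.randn(16)   # hypothetical feature direction
|   concept /= concept.norm()
|
|   def inspect(module, args, output):
|       # "Which circuitry lights up": how strongly it fires
|       print("activation:", (output @ concept).tolist())
|
|   def ablate(module, args, output):
|       # Verification: project the feature out and see whether
|       # the downstream behaviour changes
|       return output - torch.outer(output @ concept, concept)
|
|   layer.register_forward_hook(inspect)
|   layer.register_forward_hook(ablate)
|   _ = layer(torch.randn(2, 16))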
|
| The graphs and highlighted words are visual representations of
| concepts that they are reasonably certain about - for example,
| the concept of the word "AND" linking two parts of a sentence
| together highlights the word "AND".
|
| Neel Nanda is the best source for this info if you're
| interested in interpretability (IMO it's the most interesting
| software problem out there at the moment), but note that his
| approach is different to OpenAI's methodology discussed in the
| post: https://www.neelnanda.io/mechanistic-interpretability
| localfirst wrote:
| hallucination solution?
| OtherShrezzing wrote:
| Solving this problem would be a step on the way to
| debugging (and then resolving, or at least highlighting)
| hallucinations.
| skywhopper wrote:
| I'm skeptical that it could ever be possible to tell the
| difference between a hallucination and a "fact" in terms
| of what's going on inside the model. Because
| hallucinations aren't really a bug in the usual sense.
| Ie, there's not some logic wrong or something misfiring.
|
| Instead, it's more appropriate to think of LLMs as
| _always_ hallucinating. And sometimes that comes really
| close to reality because there's a lot of reinforcement
| in the training data. And sometimes we humans infer
| meaning that isn't there because that's how humans work.
| And sometimes the leaps show clearly as "hallucinations"
| because the patterns the model is expressing don't match
| the patterns that are meaningful to us. (Eg when they
| hallucinate strongly patterned things like URLs or
| academic citations, which don't actually point to
| anything real. The model picked up the pattern of what
| such citations look like really well, but it didn't and
| can't make the leap to linking those patterns to
| reality.)
|
| Not to mention that in a lot of use cases for LLMs we
| actually _want_ "hallucination". Eg when we ask it to do
| any creative task or make up stories or jokes or songs or
| pictures. It's only a hallucination in the wrong context.
| But context is the main thing LLMs just don't have.
| orbital-decay wrote:
| High-level concepts stored inside the large models (diffusion
| models, transformers etc) are normally hard to separate from
| each other, and the model is more or less a black box. A lot of
| research goes into gaining insight into what the model knows.
| This is another advancement in that direction; it allows the
| concepts to be separated more easily.
|
| This can be used to analyze the knowledge inside the model, and
| potentially modify (add, erase, change the importance) certain
| concepts without affecting unrelated ones. The precision
| achievable with any particular technique is always in question,
| though, and some concepts are just too close to separate from
| each other, so it's probably not perfect.
| HarHarVeryFunny wrote:
| In general this is just copying work done by Anthropic, so
| there's nothing fundamentally new here.
|
| What they have done here is to identify patterns internal to
| GPT-4 that correspond to specific identifiable concepts. The
| work was done by OpenAI's mostly dismantled safety team (it has
| the names of this team's recently departed co-leads, Ilya &
| Jan Leike, on it), so this is nominally being done for safety
| reasons: to be able to boost or suppress specific concepts
| from being activated when the model is running, such as
| Anthropic's demonstration of boosting their model's fixation
| on the Golden Gate Bridge:
|
| https://www.anthropic.com/news/golden-gate-claude
|
| This kind of work would also seem to have potential functional
| uses as well as safety ones, given that it allows you to control
| the model in specific ways.
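|
| The Golden Gate Claude style of intervention amounts to
| something like this (a toy sketch with a made-up direction,
| not Anthropic's or OpenAI's actual code):
|
|   import torch
|   import torch.nn as nn
|
|   block = nn.Linear(16, 16)  # stand-in for a transformer block
|   concept = torch.randn(16)  # a learned feature direction
|   concept /= concept.norm()
|
|   def steer(module, args, output, alpha=5.0):
|       # alpha > 0 boosts the concept, alpha < 0 suppresses it
|       return output + alpha * concept
|
|   block.register_forward_hook(steer)
|   _ = block(torch.randn(2, 16))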
| andreyk wrote:
| Exciting to see this so soon after Anthropic's "Mapping the Mind
| of a Large Language Model" (under 3 weeks). I find these efforts
| really exciting; it is still common to hear people say "we have
| no idea how LLMs / Deep Learning works", but that is really a
| gross generalization as stuff like this shows.
|
| Wonder if this was a bit rushed out in response to Anthropic's
| release (as well as the departure of Jan Leike from OpenAI)...
| the paper link doesn't even go to Arxiv, and the analysis is not
| nearly as deep. Though who knows, might be unrelated.
| jerrygenser wrote:
| > but that is really a gross generalization as stuff like this
| shows.
|
| I think this research actually reinforces that we still have
| very little understanding of the internals. The blog post
| also reiterates that this is early work with many limitations.
| thegrim33 wrote:
| From the article:
|
| "We currently don't understand how to make sense of the neural
| activity within language models."
|
| "Unlike with most human creations, we don't really understand
| the inner workings of neural networks."
|
| "The [..] networks are not well understood and cannot be easily
| decomposed into identifiable parts"
|
| "[..] the neural activations inside a language model activate
| with unpredictable patterns, seemingly representing many
| concepts simultaneously"
|
| "Learning a large number of sparse features is challenging, and
| past work has not been shown to scale well."
|
| etc., etc., etc.
|
| People say we don't (currently) know why they output what they
| output because... as the article clearly states, we don't.
| surfingdino wrote:
| Not holding my breath for that hallucinated cure for cancer
| then.
| ben_w wrote:
| LLMs aren't the only kind of AI, just one of the two
| current shiny kinds.
|
| If a "cure for cancer" (cancer is not just one disease so,
| unfortunately, that's not even as coherent a request as
| we'd all like it to be) is what you're hoping for, look
| instead at the stuff like AlphaFold etc.:
| https://en.wikipedia.org/wiki/AlphaFold
|
| I don't know how to tell where real science ends and PR
| bluster begins in such models, though I can say that the
| closest I've heard to a word against it is "sure, but we've
| got other things besides protein folding to solve", which
| is a good sign.
|
| (I assume AlphaFold is also a mysterious black box, and
| that tools such as the one under discussion may help us
| demystify it too).
| surfingdino wrote:
| It's bullshit sold in a very convincing way. Whenever you
| ask guys selling AI they always say "sure, it can't do
| that, but there are other things it can do..." or "it's
| not the right questions" or "you are asking it the wrong
| way" ... all while selling it as the solution to the
| problems they say it cannot solve.
| wg0 wrote:
| Downvotes for Truth.
|
| "Hold on to your papers, What a time to be alive!"
| TrainedMonkey wrote:
| I read this as "we have not built up tools / math to
| understand neural networks as they are new and exciting" and
| not as "neural networks are magical and complex and not
| understandable because we are meddling with something we
| cannot control".
|
| A good example would be planes - it took a long while to
| develop mathematical models that could be used to model
| behavior. Meanwhile, practical experimentation developed decent
| rules of thumb for what worked / did not work.
|
| So I don't think it's fair to say that "we don't" (know how
| neural networks work); rather, we don't yet have the math /
| models that can explain their behavior...
| imjonse wrote:
| Both Leike and Sutskever are still credited in the post.
| swyx wrote:
| > Wonder if this was a bit rushed out in response to
| Anthropic's release
|
| too lazy to dig up source but some twitter sleuth found that
| the first commit to the project was 6 months ago
|
| likely all these guys went to the same metaphorical SF bars, it
| was in the water
| nicce wrote:
| Visualizer was added 18 hours ago:
|
| https://github.com/openai/sparse_autoencoder/commit/764586ae.
| ..
| szvsw wrote:
| > likely all these guys went to the same metaphorical SF
| bars, it was in the water
|
| It also comes from a long lineage of thought, no? For
| instance, one of the things often taught early in an ML
| course is the notion that "early layers respond to/generate
| general information/patterns, and deeper layers respond
| to/generate more detailed/complex patterns/information." That
| is obviously an overly broad and vague statement, but it is a
| useful intuition and can be backed up by inspecting, e.g.,
| what maximally activates some convolution filters. So already
| there is a notion that there is some sort of spatial structure
| to how semantics are processed and represented in a neural
| network (even if in a totally different context, as in the
| image processing mentioned above), where "spatial" here refers
| to different regions of the network.
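|
| (The "what maximally activates a filter" trick is basically
| gradient ascent on the input; a toy sketch with an untrained
| conv layer, where you'd normally use a trained vision model:)
|
|   import torch
|   import torch.nn as nn
|
|   conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
|   img = torch.zeros(1, 3, 32, 32, requires_grad=True)
|   opt = torch.optim.Adam([img], lr=0.1)
|
|   for _ in range(100):
|       opt.zero_grad()
|       act = conv(img)[0, 0].mean()  # activation of filter 0
|       (-act).backward()             # gradient ascent on input
|       opt.step()
|   # img now approximates the pattern filter 0 responds to most.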
|
| Even more simply - in fact, as simple as you can get - with
| linear regression, the most interpretable model there is, you
| have a clear notion that different parameter groups of the
| model respond to different "concepts" (where a concept is
| taken to be whatever the variables associated with a given
| subset of coefficients represent).
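|
| E.g. (toy data; the point is just that each coefficient is
| directly readable as one concept's effect):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   X = rng.normal(size=(200, 3))        # three "concept" vars
|   true_w = np.array([2.0, -1.0, 0.0])  # the third is irrelevant
|   y = X @ true_w + 0.1 * rng.normal(size=200)
|
|   w, *_ = np.linalg.lstsq(X, y, rcond=None)
|   print(w)  # roughly [2, -1, 0]: readable by construction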
|
| In some sense, at least in a high-level/intuitive reading of
| the new research coming out of Anthropic and OpenAI, I think
| the current research is just a natural extension of these
| ideas, albeit in a much more complicated context and at a
| massive scale.
|
| Somebody else, please correct me if you think my reading is
| incorrect!!
| svieira wrote:
| When one of the first examples is:
|
| > GPT-4 feature: ends of phrases related to price increases
|
| and 2 of the 5 responses don't have any relation to
| _increase_ at all:
|
| > Brent crude, fell 38 cents to $118.29 a barrel on the ICE
| Futures Exchange in London. The U.S. benchmark, West Texas
| Intermediate crude, was down 53 cents to $99.34 a barrel on the
| New York Mercantile Exchange. -- Ronald D. White Graphic: The AAA
|
| and
|
| > ,115.18. The record reflects that appellant also included
| several hand-prepared invoices and employee pay slips, including
| an allegedly un-invoiced laundry ticket dated 29 June 2013 for 53
| bags oflaundry weighing 478 pounds, which, at the contract price
| of $
|
| I think I must be misunderstanding something. Why would this
| example (out of all the potential examples) be picked?
| Metus wrote:
| Notice that most of the examples have none of the green
| highlighting, which is shown for
|
| > small losses. KEEPING SCORE: The Dow Jones industrial average
| rose 32 points, or 0.2 percent, to 18,156 as of 3:15 p.m.
| Eastern time. The Standard & Poor's ... OMAHA, Neb. (AP) --
| Warren Buffett's company has bought nearly
|
| The other sentences are there as a contrast, to show how
| specific this neuron is.
| svieira wrote:
| Ah, that makes a lot of sense, thank you!
| yorwba wrote:
| The highlights are easier to see in this visualisation:
| https://openaipublic.blob.core.windows.net/sparse-
| autoencode...
|
| There are also many top activations not showing increases,
| e.g.
|
| > 0.06 of a cent to 90.01 cents US.||U.S. indexes were mainly
| lower as the Dow Jones industrials lost 21.72 points to
| 16,329.53, the Nasdaq was up 11.71 points at 4,318.9 and the
| S&P 500
|
| (Highlight on the first comma.)
| Shoop wrote:
| Can anyone summarize the major differences between this and
| Scaling Monosemanticity?
| calibas wrote:
| I want to be able to view exactly how my input is translated into
| tokens, as well as the embeddings for the tokens.
| franzb wrote:
| For your first question: https://platform.openai.com/tokenizer
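|
| If you want the exact token IDs programmatically, the
| open-source tiktoken library is meant to match what the API
| uses (the model name here is just an example):
|
|   import tiktoken  # pip install tiktoken
|
|   enc = tiktoken.encoding_for_model("gpt-4")
|   ids = enc.encode("How exactly is this tokenized?")
|   print(ids)                              # token IDs
|   print([enc.decode([t]) for t in ids])   # each token as text
|
| For your second question: as far as I know the token embeddings
| of the hosted models aren't published, so you can only inspect
| those for open-weight models.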
| calibas wrote:
| I saw that, but the language makes me think it's not quite
| the same as what's really being used?
|
| "how a piece of text _might_ be tokenized by a language model
| "
|
| "It's important to note that the exact tokenization process
| varies between models."
| yorwba wrote:
| That's why they have buttons to choose which model's
| tokenizer to use.
| calibas wrote:
| Yes, thank you, I understand that part.
|
| It's the _might_ condition in the description that makes me
| think the results might not be exactly the same as what's
| used in the live models.
| andy12_ wrote:
| A.k.a. the same work as Anthropic's, but with less interpretable
| and interesting features. I guess there won't be Golden Gate[0]
| GPT anytime soon.
|
| I mean, you just have to compare the couple of interesting
| features in the OpenAI feature browser [1] with the features in
| the Anthropic feature browser [2].
|
| [0] https://twitter.com/AnthropicAI/status/1793741051867615494
|
| [1] https://openaipublic.blob.core.windows.net/sparse-
| autoencode...
|
| [2] https://transformer-circuits.pub/2024/scaling-
| monosemanticit...
| itissid wrote:
| Does this mean that it could be good practice to release the
| autoencoder that was trained on a neural network to explain its
| outputs? Like, all open models on Hugging Face could have this
| as a useful accompaniment?
| mlsu wrote:
| This is interesting:
|
| > Autoencoder family
|
| > Note: Only 65536 features available. Activations shown on The
| Pile (uncopyrighted) instead of our internal training dataset.
|
| So, the Pile is uncopyrighted, but the internal training dataset
| is copyrighted? Copyrighted by whom?
|
| Huh?
| Arcsech wrote:
| > Copyrighted by whom?
|
| By people who would get angry if they could definitively prove
| their stuff was in OpenAI's training set.
| immibis wrote:
| Basically everyone. You, and me, and Elon Musk, and EMPRESS,
| and my uncle who works for Nintendo. They're just hoping that
| AI training legally ignores copyright.
| Der_Einzige wrote:
| Hehe, related to this, _someone_ created a "book4" dataset and
| put it on torrent websites. I don't think it's being used in
| any major LLMs, but the future intersection of the "piracy"
| community with AI is going to be exciting.
|
| Watching the cyberpunk world that all of my favorite literature
| predicted slowly come to our world is fun indeed.
| swyx wrote:
| i think you mean @sillysaurus' books3? not books4?
| russellbeattie wrote:
| One feature I expect we'll get from this sort of research is
| identifying "hot spots" that are used during inference. Like
| virtual machines, these could be cached in whole or in part and
| used to both speed up the response time and reduce computation
| cycles needed.
| aeonik wrote:
| In their other examples, they have what looks to be a scientific
| explanation of reproductive anatomy classified as erotic
| content...
|
| Here is the link to the concept [content warning]:
| https://openaipublic.blob.core.windows.net/sparse-autoencode...
|
| DocID: 191632
| adamiscool8 wrote:
| How does this compare to or improve on applying something like
| SHAP[0][1] to a model? The claim in the first line, that "we
| currently don't understand how to make sense of the neural
| activity within language models", is... straight up false?
|
| [0] https://github.com/shap/shap
|
| [1]
| https://en.wikipedia.org/wiki/Shapley_value#In_machine_learn...
| szvsw wrote:
| SHAP is pretty separate IMO. Shapley analysis is really a game
| theoretical methodology that is model agnostic and is only
| about determining how individual sections of the input
| contribute to a given prediction, not about how the model
| actually works internally to produce an output.
|
| As long as you have a callable black box, you can compute
| Shapley values (or approximations); it does not speak to how or
| why the model actually works internally.
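|
| To make that concrete, here's a rough Monte-Carlo sketch of
| Shapley values for an arbitrary black-box f ("removing" a
| feature by swapping in a baseline value is just one common
| convention):
|
|   import numpy as np
|
|   def shapley(f, x, baseline, n_samples=2000, seed=0):
|       rng = np.random.default_rng(seed)
|       phi = np.zeros(len(x))
|       for _ in range(n_samples):
|           z = baseline.copy()
|           prev = f(z)
|           for j in rng.permutation(len(x)):
|               z[j] = x[j]           # add feature j to coalition
|               cur = f(z)
|               phi[j] += cur - prev  # its marginal contribution
|               prev = cur
|       return phi / n_samples
|
|   # Works for any callable black box, no internals needed:
|   f = lambda v: 3 * v[0] + v[1] * v[2]
|   print(shapley(f, np.array([1.0, 2.0, 0.5]), np.zeros(3)))
|   # roughly [3.0, 0.5, 0.5]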
___________________________________________________________________
(page generated 2024-06-06 23:00 UTC)