[HN Gopher] LLMs use a surprisingly simple mechanism to retrieve...
___________________________________________________________________
LLMs use a surprisingly simple mechanism to retrieve some stored
knowledge
Author : CharlesW
Score : 276 points
Date : 2024-03-28 14:37 UTC (8 hours ago)
(HTM) web link (news.mit.edu)
(TXT) w3m dump (news.mit.edu)
| leobg wrote:
| > In one experiment, they started with the prompt "Bill Bradley
| was a" and used the decoding functions for "plays sports" and
| "attended university" to see if the model knows that Sen. Bradley
| was a basketball player who attended Princeton.
|
| Why not just change the prompt?
|
|     Name, University attended, Sport played
|     Bill Bradley,
| numeri wrote:
| This is research, trying to understand the fundamentals of how
| these models work. They weren't actually trying to find out
| where Bill Bradley went to university.
| leobg wrote:
| Of course. But weren't they trying to find out whether or not
| that fact was represented in the model's parameters?
| wnoise wrote:
| No, they were trying to figure out if they had isolated
| where facts like that were represented.
| vsnf wrote:
| > Linear functions, equations with only two variables and no
| exponents, capture the straightforward, straight-line
| relationship between two variables
|
| Is this definition considering the output to be included in the
| set of variables? What a strange way to phrase it. Under this
| definition, I wonder what an equation with one variable is. Is a
| single constant an equation?
| 01HNNWZ0MV43FF wrote:
| Yeah I guess they mean one independent variable and one
| dependent variable
|
| It rarely matters because if you had 2 dependent variables, you
| can just express that as 2 equations, so you might as well
| assume there's exactly 1 dependent and then only discuss the
| number of independent variables.
| olejorgenb wrote:
| I would think `x = 4` is considered an equation, yes?
| pessimizer wrote:
| And linear at that: x = 0y + 4
| ksenzee wrote:
| I think they're trying to say "equations in the form y = mx +
| b" without getting too technical.
| hansvm wrote:
| It's just a change in perspective. Consider a vertical line. To
| have an "output" variable you have to switch the ordinary
| `y=mx+b` formulation to `x=c`. The generalization `ax+by=c`
| accommodates any shifted line you can draw. Adding more
| variables increases the dimension of the space in consideration
| (`ax+by+cz=d` could potentially define a plane). Adding more
| equations potentially reduces the size of the space in
| consideration (e.g., if `x+y=1` then also knowing `2x+2y=2`
| wouldn't reduce the solution space, but `x-y=0` would, and
| would imply `x=y=1/2`, and further adding `x+2y=12` would imply
| a lack of solutions).
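|
| A quick numpy check of that worked example (purely
| illustrative):
|
|     import numpy as np
|     A = np.array([[1., 1.],    # x + y = 1
|                   [1., -1.]])  # x - y = 0
|     b = np.array([1., 0.])
|     print(np.linalg.solve(A, b))  # [0.5 0.5], i.e. x = y = 1/2
|
|     # adding x + 2y = 12 makes the system inconsistent; least
|     # squares still returns a best fit, with nonzero residual
|     A3 = np.vstack([A, [1., 2.]])
|     b3 = np.append(b, 12.)
|     sol, res, *_ = np.linalg.lstsq(A3, b3, rcond=None)
|     print(sol, res)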
|
| Mind you, the "two variable" statement in this news piece is a
| red-herring. The paper describes higher-dimension linear
| relationships, of the form `Mv=c` for some constant matrix `M`,
| some constant vector `c`, and some variable vector `v`.
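|
| Caricatured in code with random stand-in data (the paper
| estimates such maps from the LM's own hidden states; this is
| just the shape of the idea):
|
|     import numpy as np
|     rng = np.random.default_rng(0)
|     d = 32
|     subj = rng.normal(size=(40, d))   # "subject" hidden states
|     M_true = rng.normal(size=(d, d))
|     c_true = rng.normal(size=d)
|     obj = subj @ M_true.T + c_true    # attributes affine in them
|
|     # fit M and c jointly by least squares (append a 1s column)
|     X = np.hstack([subj, np.ones((40, 1))])
|     coef, *_ = np.linalg.lstsq(X, obj, rcond=None)
|     M_hat, c_hat = coef[:d].T, coef[d]
|     print(np.allclose(M_hat, M_true), np.allclose(c_hat, c_true))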
|
| On some level, the result isn't _that_ surprising. The paper
| only examines one layer (not the whole network), after the
| network has done a huge amount of embedding work. In that
| layer, they find that under half the time they're able to get
| over 60% of the way there with a linear approximation. Another
| interpretation is that the single layer does some linear work
| and shoves it through some nonlinear transformations, and more
| than half the time that nonlinearity does something very
| meaningful (and even in that under half the time where the
| linear approximation is "okay", the metrics are still bad).
|
| I'm not super impressed, but I don't have time to fully parse
| the thing right now. It is a bit surprising; if memory serves,
| one of the authors on this paper had a much better result in
| terms of neural network fact editing in the last year or two.
| This looks like a solid research idea, solid work, it didn't
| pan out, and to get it published they heavily overstated the
| conclusions (and then the university press release obviously
| bragged as much as it could).
| pb060 wrote:
| Aren't functions and equations two different things?
| whatever1 wrote:
| LLMs seem like a good compression mechanism.
|
| It blows my mind that I can have a copy of llama locally on my PC
| and have access to virtually the entire internet
| Culonavirus wrote:
| Yeah, except it's lossy compression, with the lost parts being
| hallucinated back in at inference time.
| Kuinox wrote:
| If you've read the article, the LLM hallucinations aren't due
| to the model not knowing the information, but to a function
| that chooses to remember the wrong thing.
| sinemetu11 wrote:
| From the paper:
|
| > Finally, we use our dataset and LRE-estimating method to
| build a visualization tool we call an attribute lens.
| Instead of showing the next token distribution like Logit
| Lens (nostalgebraist, 2020) the attribute lens shows the
| object-token distribution at each layer for a given
| relation. This lets us visualize where and when the LM
| finishes retrieving knowledge about a specific relation,
| and can reveal the presence of knowledge about attributes
| even when that knowledge does not reach the output.
|
| They're just looking at what lights up in the embedding
| when they feed something in, and whatever lights up is
| "knowing" about that topic. The function is an
| approximation they added on top of the model. It's
| important to not conflate this with the actual weights of
| the model.
|
| You can't separate the hallucinations from the model --
| they exist precisely because of the lossy compression.
| ewild wrote:
| even this place has people not reading the articles. we are
| doomed
| krainboltgreene wrote:
| > have access to virtually the entire internet
|
| It isn't even close to 1% of the internet, much less virtually
| the entire internet. According to the latest dump, Common Crawl
| has 4.3B pages, but Google in 2016 estimated there are 130T
| pages. The difference between 130T and 4.3B is about 130T. Even
| if you narrow it down to Google's searchable text index it's
| "100's of billions of pages" and roughly 100P compared to
| CommonCrawl's 400T.
| fspeech wrote:
| 130T unique pages? That seems highly unlikely as that
| averages to over 10000 pages for each human being alive. If
| GP merely wants texts of interest to themselves, as opposed to
| an accurate snapshot, it seems LLMs should be quite capable, one
| day.
| aia24Q1 wrote:
| I thought "fact" means truth.
| MuffinFlavored wrote:
| I don't understand how a "CSV file/database/model" of
| 70,000,000,000 (70B) "parameters" of 4-bit weights (a 4 bit value
| can be 1 of 16 unique numbers) gets us an interactive LLM/GPT
| that is near-all-knowledgeable on all topics/subjects.
|
| edit: did research, the 4-bit is just a "compression method", the
| model ends up seeing f32?
|
| > Quantization is the process of mapping 32-bit floating-point
| numbers (which are the weights in the neural network) to a much
| smaller bit representation, like 4-bit values, for storage and
| memory efficiency.
|
| > Dequantization happens when the model is used (during inference
| or even training, if applicable). The 4-bit quantized weights are
| converted back into floating-point numbers that the model's
| computations are actually performed with. This is done using the
| scale and zero-point determined during the initial quantization,
| or through more sophisticated mapping functions that aim to
| preserve as much information as possible despite the reduced
| precision.
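|
| Roughly, in code (a toy affine-quantization sketch, not any
| particular library's scheme):
|
|     import numpy as np
|
|     def quantize_4bit(w):
|         # map float32 weights to ints 0..15 plus scale/zero-point
|         lo, hi = w.min(), w.max()
|         scale = (hi - lo) / 15.0
|         q = np.round((w - lo) / scale).astype(np.uint8)
|         return q, scale, lo
|
|     def dequantize(q, scale, zero_point):
|         # recover approximate float32 weights for computation
|         return q.astype(np.float32) * scale + zero_point
|
|     w = np.random.randn(8).astype(np.float32)
|     q, s, z = quantize_4bit(w)
|     print(w)
|     print(dequantize(q, s, z))  # close to w, within ~scale/2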
|
| so what is the relationship to "parameters" and "# of unique
| tokens the model knows about (vocabulary size)"?
|
| > At first glance, LLAMa only has a 32,000 vocabulary size and
| 65B parameters as compared to GPT-3,
|
| > The 65 billion parameters in a model like LLAMA (or any large
| language model) essentially function as a highly intricate
| mapping system that determines how to respond to a given input
| based on the learned relationships between tokens in its training
| data.
| Filligree wrote:
| It doesn't, is the simple answer.
|
| The slightly more complicated one is that a compressed text
| dump of Wikipedia isn't even 70GB, and this is _lossy_
| compression of the internet.
| MuffinFlavored wrote:
| say the average LLM these days has a unique token
| (vocabulary) size of ~32,000 (not its context size, # of
| unique tokens it can pick between in a response. English
| words, punctuation, math, code, etc.)
|
| the 60-70B parameters of models is basically like... just
| stored patterns of "if these 10 tokens in a row input, then
| these 10 tokens in a row output score the highest"
|
| Is that a good summary?
|
| > The model uses its learned statistical patterns to predict
| the probability of what comes next in a sequence of text.
|
| based on what inputs?
|
| 1. previous tokens in the sequence from immediate context
|
| 2. tokens summarizing the overall topic/subject matter from
| the extended context
|
| 3. scoring of learned patterns from training
|
| 4. what else?
| numeri wrote:
| Your suggested scheme (assuming a mapping from 10 tokens to
| 10 tokens, with each token taking 2 bytes to store) would
| take (32000^20) * 2 bytes = 2.3e78 TiB of storage, or
| about 250 MiB per atom in the observable universe (1e82),
| prior to compression.
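|
| As a quick Python sanity check of that arithmetic (taking
| ~1e82 atoms in the observable universe):
|
|     table_bytes = 32000**20 * 2        # 2 bytes per entry
|     print(table_bytes / 2**40)         # ~2.3e78 TiB
|     print(table_bytes / 1e82 / 2**20)  # ~240 MiB per atom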
|
| I think it's more likely that LLMs are actually learning
| and understanding concepts as well as memorizing useful
| facts, than that LLMs have discovered a compression method
| with that high of a compression ratio, haha.
| mjburgess wrote:
| LLMs cannot determine the physical location of any atoms.
| They cannot plan movement, and so on.
|
| LLMs are just completing patterns of text that they have
| been given before. 'Everything ever written' is both a lot
| for any individual person to read, but also almost nothing,
| in that to properly describe a table requires more
| information.
|
| Text is itself an extremely compressed medium which lacks
| almost any information about the world; generating it is
| useful only because we have that information and are able
| to map the text back to it.
| numeri wrote:
| I didn't imply that they know anything about where atoms
| are, I was just pointing out the sheer absurdity of that
| volume of data.
|
| I should make it clear that my comparison there is unfair
| and mostly just funny - you don't need to store every
| possible combination of 10 tokens, because most of them
| will be nonsense, so you wouldn't actually need that much
| storage. That being said, it's been fairly solidly proven
| that LLMs aren't just lookup tables/stochastic parrots.
| mjburgess wrote:
| > fairly solidly proven that LLMs aren't just lookup
| tables/stochastic parrots
|
| Well, I'd strongly disagree. I see no evidence of this;
| I am quite well acquainted with the literature.
|
| All _empirical_ statistical AI is just a means of
| approximating an empirical distribution. The problem with
| NLP is that there is no empirical function from text
| tokens to meanings; just as there is no function from
| sets of 2D images to a 3D structure.
|
| We know before we start that the distributions of text
| tokens are only coincidentally related to the
| distributions of meanings. The question is just how much
| value that coincidence has in any given task.
|
| (Consider, e.g., that if I ask "do you like what I'm
| wearing?", there is no distribution of responses which is
| correct. I do not want you to say "yes" 99/100 or even
| 100/100 times, etc. What I want you to say is a response
| caused by a mental state you have: that of (dis)liking
| what I'm wearing.
|
| Since no statistical AI systems generate outputs based on
| causal features of reality, we know a priori that almost
| all possible questions that can be asked cannot be
| answered by LLMs.
|
| They are only useful where questions have canonical
| answers; and only because "canonical" means that a
| text->text function is likely to be coincidentally
| indistinguishable from the meaning->meaning function
| we're interested in).
| pk-protect-ai wrote:
| There is something wrong with this arithmetic: "(32000^20) *
| 2 bytes = 2.3e78 TiB of storage" ... The factorial is
| missing somewhere in there ...
| wongarsu wrote:
| That would be equivalent to a Markov chain. Those have been
| around for decades, but we have only managed to make them
| coherent for very short outputs. Even GPT-2 beats any Markov
| chain, so there has to be more going on.
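|
| For contrast, a toy Markov chain over tokens looks like this,
| which is also why it loses the thread so quickly (illustrative
| only):
|
|     import random
|     from collections import defaultdict
|
|     text = "the cat sat on the mat and the dog sat on the rug"
|     corpus = text.split()
|     nxt = defaultdict(list)
|     for a, b in zip(corpus, corpus[1:]):
|         nxt[a].append(b)    # continuations seen after each token
|
|     tok, out = "the", ["the"]
|     for _ in range(10):
|         if not nxt[tok]:
|             break           # dead end: no observed continuation
|         tok = random.choice(nxt[tok])
|         out.append(tok)
|     print(" ".join(out))    # locally plausible, soon incoherent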
|
| Modern LLMs are able to transfer knowledge between
| different languages, so it's fair to assume that some
| mapping between human language and a more abstract internal
| representation happens at the input and output, instead of
| the model "operating" on English or Chinese or whatever
| language you talk with it. And once this exists, an
| internal "world model" (as in: a collection of facts and
| implications) isn't far, and seems to indeed be something
| most LLMs do. The reasoning on top of that world model is
| still very spotty though
| HarHarVeryFunny wrote:
| > the 60-70B parameters of models is basically like... just
| stored patterns of "if these 10 tokens in a row input, then
| these 10 tokens in a row output score the highest"
|
| > Is that a good summary?
|
| No - there's a lot more going on. It's not just mapping
| input patterns to output patterns.
|
| A good starting point for understanding it is linguists'
| sentence-structure trees (and these were the inspiration
| for the "transformer" design of these LLMs).
|
| https://www.nltk.org/book/ch08.html
|
| Note how there are multiple levels of nodes/branches to
| these trees, from the top node representing the sentence as
| a whole, to the words themselves which are all the way at
| the bottom.
|
| An LLM like ChatGPT is made out of multiple layers (e.g. 96
| layers for GPT-3) of transformer blocks, stacked on top of
| each other. When you feed an input sentence into an LLM,
| the sentence will first be turned into a sequence of token
| embeddings, then passed through each of these 96 layers in
| turn, each of which changes ("transforms") it a little bit,
| until it comes out the top of the stack as the predicted
| output sentence (or something that can be decoded into the
| output sentence). We only use the last word of the output
| sentence which is the "next word" it has predicted.
|
| You can think of these 96 transformer layers as a bit like
| the levels in one of those linguistic sentence-structure
| trees. At the bottom level/layer are the words themselves,
| and at each successive higher level/layer are higher-and-
| higher level representations of the sentence structure.
|
| In order to understand this a little better, you need to
| understand what these token "embeddings" are, which is the
| form in which the sentence is passed through, and
| transformed by, these stacked transformer layers.
|
| To keep it simple, think of a token as a word, and say the
| model has a vocabulary of 32,000 words. You might perhaps
| expect that each word is represented by a number in the
| range 1-32000, but that is not the way it works! Instead,
| each word is mapped (aka "embedded") to a point in a high
| dimensional space (e.g. 4096-D for LLaMA 7B), meaning that
| it is represented by a vector of 4096 numbers (cf a point
| in 3-D space represented as (x,y,z)).
|
| These 4096 element "embeddings" are what actually pass thru
| the LLM and get transformed by it. Having so many
| dimensions gives the LLM a huge space in which it can
| represent a very rich variety of concepts, not just words.
| At the _first_ layer of the transformer stack these
| embeddings do just represent words, the same as the nodes
| do at the bottom layer of the sentence-structure tree, but
| more information is gradually added to the embeddings by
| each layer, augmenting and transforming what they mean. For
| example, maybe the first transformer layer adds "part of
| speech" information so that each embedded word is now also
| tagged as a noun or verb, etc. At the next layer up, the
| words comprising a noun phrase or verb phrase may get
| additionally tagged as such, and so on as each transformer
| layer adds more information.
|
| This just gives a flavor of what is happening, but
| basically, by the time the sentence has reached the top
| layer of the transformer, the model has been able to see
| the entire tree structure of the sentence, and only then
| does it "understand" it well enough to predict a
| grammatically and semantically "correct" continuation.
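|
| A shape-level sketch of that stack, with toy sizes and random
| weights standing in for the real thing (a real model is more
| like a 32,000-token vocab, 4096+ dims, 96 layers, and each
| layer is a full attention + MLP block rather than one matrix):
|
|     import numpy as np
|     vocab, dim, n_layers = 1000, 64, 4
|     rng = np.random.default_rng(0)
|
|     embed = rng.normal(size=(vocab, dim)) * 0.02    # id -> vector
|     unembed = rng.normal(size=(dim, vocab)) * 0.02  # vector -> logits
|     layers = [rng.normal(size=(dim, dim)) * 0.05
|               for _ in range(n_layers)]
|
|     tokens = [17, 42, 999]         # a 3-token input "sentence"
|     x = embed[tokens]              # (3, 64) embeddings enter the stack
|     for W in layers:               # each layer transforms them a bit
|         x = x + np.tanh(x @ W)     # stand-in for attention + MLP
|     logits = x[-1] @ unembed       # last position predicts next token
|     print(int(np.argmax(logits)))  # id of the predicted next token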
| MichaelZuo wrote:
| Thanks for the explanation.
|
| Since Unicode has well over 64000 symbols, does that
| imply models, trained on a large corpus, must necessarily
| have at least 64000 'branches' at the bottom layer?
| ramses0 wrote:
| Is there some sort of "LLM-on-Wikipedia" competition?
|
| ie: given "just wikipedia" what's the best score people can
| get on however these models are evaluated.
|
| I know that all the commercial ventures have a voracious
| data-input set, but it seems like there's room for
| dictionary.llm + wikipedia.llm + linux-kernel.llm and some
| sort of judging / bake-off for their different performance
| capabilities.
|
| Or does the training truly _NEED_ every book ever written +
| the entire internet + all knowledge ever known by mankind to
| have an effective outcome?
| CraigJPerry wrote:
| >> Or does the training truly _NEED_ every book ever
| written + the entire internet + all knowledge ever known by
| mankind to have an effective outcome?
|
| I have the same question.
|
| Peter Norvig's GOFAI Shakespeare generator example[1]
| (which is not an LLM) gets impressive results with little
| input data to go on. Does the leap to LLM preclude that
| kind of small input approach?
|
| [1] link should be here because I assumed as I wrote the
| above that I would just turn it up with a quick google.
| Alas t'was not to be. Take my word for it, somewhere on
| t'internet is an excellent write up by Peter Norvig on LLM
| vs GOFAI (good old fashioned artificial intelligence)
| bionhoward wrote:
| Yes, that's known as the Hutter Prize
| http://prize.hutter1.net/
| ramses0 wrote:
| Not exactly, because LLMs seem to be exhibiting value
| via "lossy knowledge response" vs. "exact reproduction
| measured in bytes", but close.
| Acumen321 wrote:
| Quantization in this context is the precision of each value in
| the vector or matrix/tensor.
|
| If the model in question has a token embedding length of 1024,
| even with 1-bit quantization each token embedding has 2^1024
| possible values.
|
| If the context length is 32,000 tokens, there are
| (2^1024)^32,000 possible inputs.
| estebarb wrote:
| I find this similar to what relation vectors do in word2vec: you
| can add a vector of "X of" and often get the correct answer. It
| could be that the principle is still the same, and transformers
| "just" build a better mapping of entities into the embedding
| space?
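|
| For example, with pretrained GloVe vectors via gensim
| (assuming the "glove-wiki-gigaword-100" download is
| available; ~130 MB on first use):
|
|     import gensim.downloader as api
|
|     vecs = api.load("glove-wiki-gigaword-100")
|     # "capital of" as arithmetic: paris - france + japan ~= tokyo
|     print(vecs.most_similar(positive=["paris", "japan"],
|                             negative=["france"], topn=3))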
| PaulHoule wrote:
| I think so. It's hard for me to believe that the decision
| surfaces inside those models are really curved enough (like the
| folds of your brain) to really take advantage of FP32 numbers
| inside vectors: that is I just don't believe it is
|
|     x = 0    means "fly"
|     x = 0.01 means "drive"
|     x = 0.02 means "purple"
|
| but rather more like
|
|     x < 1.5 means "cold"
|     x > 1.5 means "hot"
|
| which is one reason why quantization (often 1 bit) works. Also
| it is a reason why you can often get great results feeding text
| or images through a BERT or CLIP-type model and then applying
| classical ML models that frequently involve linear decision
| surfaces.
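|
| That recipe, sketched with fake embeddings standing in for
| BERT/CLIP outputs (constructed here to be linearly separable,
| which is the point):
|
|     import numpy as np
|     from sklearn.linear_model import LogisticRegression
|
|     rng = np.random.default_rng(0)
|     emb = rng.normal(size=(2000, 768))  # stand-in encoder outputs
|     labels = (emb @ rng.normal(size=768) > 0).astype(int)
|
|     clf = LogisticRegression(max_iter=2000)
|     clf.fit(emb[:1500], labels[:1500])
|     print(clf.score(emb[1500:], labels[1500:]))  # high accuracy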
| taneq wrote:
| Are you conflating nonlinear embedding spaces with the
| physical curvature of the cerebellum? I don't think there's a
| direct mapping.
| PaulHoule wrote:
| My mental picture is that violently curved decision
| surfaces could _look_ like the convolutions of the brain
| even though they have nothing to do with how the brain
| actually works.
|
| I think of how tSNE and other algorithms sometimes produce
| projections that look like that (maybe that's just what you
| get when you have to bend something complicated to fit into
| a 2-d space) and frequently show cusps that to me look like
| a sign of trouble (it took me a while in my PhD work to
| realize how Poincaré sections from 4 or 6 dimensions can
| look messed up when a part of the energy surface tilts
| perpendicularly to the projection surface).
|
| I still find it hard to believe that dense vectors are the
| right way to deal with text despite the fact that they work
| so well. For images it is one thing because changing one
| pixel a little doesn't change the meaning of an image, but
| changing a single character of a text can completely change
| the meaning of the text. Also there's the reality that if
| you randomly stick together tokens you get something
| meaningless, so it seems almost all of the representation
| space covers ill-formed texts and only a low-dimensional
| manifold holds the well-formed texts. Now the decision
| surfaces really have to be nonlinear and crumpled all over,
| but I think there's definitely a limit on how crumpled
| those surfaces can be.
| Y_Y wrote:
| This is interesting. It makes me think of an
| "immersion"[0], as in a generalization of the concept of
| "embedding" in differential geometry.
|
| I share your uneasiness about mapping words to vectors
| and agree that it feels as if we're shoehorning some more
| complex space into a computationally convenient one.
|
| [0] https://en.wikipedia.org/wiki/Immersion_(mathematics)
| derefr wrote:
| Help me understand: when they say that the facts are stored as a
| linear function... are they saying that the LLM has a sort of
| N-dimensional "fact space" encoded into the model in some manner,
| where facts are embedded into the space as (points / hyperspheres
| / Voronoi manifolds / etc); and where recalling a fact is -- at
| least in an abstract sense -- the NN computing / remembering a
| key to use, and then doing a key-value lookup in this space?
|
| If so: how _do_ you embed a KV-store into an edge-propagated
| graphical model? Are there even any well-known techniques for
| doing that "by hand" right now?
|
| (Also, fun tangent: isn't the "memory palace" memory technique,
| an example of _human brains_ embedding facts into a linear
| function for easier retrieval?)
| bionhoward wrote:
| [Layer] Normalization constrains huge vectors representing
| tokens (input fragments) to positions on a unit ball (I think),
| and the attention mechanism operates by rotating the
| unconstrained ones based on the sum of their angles relative to
| all the others.
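|
| Roughly, for the normalization part (ignoring the learned
| gain/bias; note it fixes the variance rather than strictly
| projecting onto a unit ball):
|
|     import numpy as np
|
|     def layer_norm(x, eps=1e-5):
|         return (x - x.mean()) / np.sqrt(x.var() + eps)
|
|     v = np.random.randn(8) * 50 + 3       # a "huge" token vector
|     print(np.linalg.norm(layer_norm(v)))  # ~sqrt(len(v)), fixed scale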
|
| I only skimmed the paper but believe the point here is that
| there are relatively simple functions hiding in or recoverable
| from the bigger network which specifically address certain
| categories of relationships between concepts.
|
| Since it would, in theory, be possible to optimize such
| functions more directly if they are possible to isolate, could
| this enable advances in the way such models are trained?
| Absolutely.
|
| After all, one of the best criticisms of "modern" AI is the
| notion we're just mixing around a soup of linear algebra.
| Allowing some sense of modularity (reductionism) could make
| them less of a black box and more of a component driven
| approach (in the lagging concept space and not just the leading
| layer space)
| jacobn wrote:
| The fundamental operation done by the transformer,
| softmax(Q.K^T).V, is essentially a KV-store lookup.
|
| The Query is dotted with the Key, then you take the softmax to
| pick mostly one winning Key (the Key closest to the Query
| basically), and then use the corresponding Value.
|
| That is really, really close to a KV lookup, except it's a
| little soft (i.e. can hit multiple Keys), and it can be
| optimized using gradient descent style methods to find the
| suitable QKV mappings.
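|
| In numpy, the "soft lookup" looks like this (single head,
| ignoring the learned Q/K/V projections and masking):
|
|     import numpy as np
|
|     def softmax(x):
|         e = np.exp(x - x.max())
|         return e / e.sum()
|
|     d = 8
|     K = np.random.randn(5, d)            # 5 stored "keys"
|     V = np.random.randn(5, d)            # their associated "values"
|     q = K[2] + 0.1 * np.random.randn(d)  # a query near key #2
|
|     w = softmax(q @ K.T / np.sqrt(d))    # weights, mostly on key #2
|     out = w @ V                          # ~ V[2]: a soft KV lookup
|     print(w.round(2))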
| naveen99 wrote:
| Not sure there is any real lookup happening. Q,K are the same
| and sometimes even v is the same...
| toxik wrote:
| Q, K, V are not the same. In self-attention, they are all
| computed by separate linear transformation of the same
| input (ie the previous layer's output). In cross-attention
| even this is not true, then K and V are computed by linear
| transformation of whatever is cross-attended, and Q is
| computed by linear transformation of the input as before.
| ewild wrote:
| Yeah, a common misconception: because the input is the same,
| people forget that there is a pre-attention linear
| transformation for Q, K, and V (using the decoder-only
| version; obviously V is different with encoder-decoder,
| BERT style).
| thfuran wrote:
| >isn't the "memory palace" memory technique, an example of
| human brains embedding facts into a linear function for easier
| retrieval?
|
| I'm not sure I see how that's a linear function.
| wslh wrote:
| Can we roughly say that LLMs produce (in training mode) a lot of
| IF-THENs in an automatic way from a vast quantity of information
| (not techniques) that was not available before?
| mike_hearn wrote:
| This is really cool. My mind goes immediately to what sort of
| functions are being used to encode programming knowledge, and if
| they are also simple linear functions whether the standard
| library or other libraries can be directly uploaded into an
| LLM's brain as it evolves, without needing to go through a
| costly training run or a performance-destroying fine-tune.
| That's still a sci-fi ability today, but it seems to be
| getting closer.
| Animats wrote:
| That's a good point. It may be possible to directly upload
| predicate-type info into a LLM. This could be especially useful
| if you need to encode tabular data. Somewhere, someone probably
| read this and is thinking about how to export Excel or
| databases to an LLM.
|
| It's encouraging to see people looking inside the black box
| successfully. The other big result in this area was that paper
| which found a representation of a game board inside an LLM
| after the LLM had been trained to play a game. Any other
| good results in
| that area?
|
| The authors point out that LLMs are doing more than encoding
| predicate-type info. That's just part of what they are doing.
| AaronFriel wrote:
| It indeed is. An attention mechanism's key and value matrices
| grow linearly with context length. With PagedAttention[1], we
| could imagine an external service providing context. The hard
| part is the how, of course. We can't load our entire database
| in every conversation, and I suspect there are also
| challenges around training (perhaps addressed via
| LandmarkAttention[2]) and building a service to efficiently
| retrieve additional key-value matrices.
|
| The external vector-database service would need tight
| timings to avoid stalling the LLM: at 20-50 tokens/sec,
| answers must arrive within 50-20 ms.
|
| And we cannot do this in real time: pausing the transformer
| when a layer produces a query vector stalls the batch, so we
| need a way to predict queries (or embeddings) several tokens
| ahead of where they'd be useful, inject the context when
| it's needed, and know when to page it out.
|
| [1] https://arxiv.org/abs/2309.06180
|
| [2] https://arxiv.org/abs/2305.16300
| wongarsu wrote:
| The opposite is also exciting: build a loss function that
| punishes models for storing knowledge. One of the issues of
| current models is that they seem to favor lookup over
| reasoning. If we can punish models (during training) for
| remembering that might cause them to become better at
| inference and logic instead.
| kossTKR wrote:
| Interesting. Reminds me of a sci-fi short I read years ago
| where AIs "went insane" when they had too much knowledge,
| because they'd spent too much time looking through data and
| got a buffer overflow.
|
| I know some of the smaller models like Phi-2 are trained
| specifically for reasoning by training on question-answer
| sets, though this seems like the opposite to me.
| politician wrote:
| Hah! Maybe Neo was an LLM. "I know kung-fu."
| i5heu wrote:
| So it is entirely possible to decouple the reasoning part from
| the information part?
|
| This is like absolutely mind blowing if this is true.
| learned wrote:
| A big caveat mentioned in the article is that this experiment
| was done with a small set (N=47) of specific relations that
| they expected to have relatively simple answers:
|
| > The researchers developed a method to estimate these simple
| functions, and then computed functions for 47 different
| relations, such as "capital city of a country" and "lead singer
| of a band." While there could be an infinite number of possible
| relations, the researchers chose to study this specific subset
| because they are representative of the kinds of facts that can
| be written in this way.
|
| About 60% of these relations were retrieved using a linear
| function in the model. The remainder appeared to have nonlinear
| retrieval and are still a subject of investigation:
|
| > Functions retrieved the correct information more than 60
| percent of the time, showing that some information in a
| transformer is encoded and retrieved in this way. "But not
| everything is linearly encoded. For some facts, even though the
| model knows them and will predict text that is consistent with
| these facts, we can't find linear functions for them. This
| suggests that the model is doing something more intricate to
| store that information," he says.
| retrofrost wrote:
| This is amazing work, but to me it highlights some of the biggest
| problems in the current AI zeitgeist: we are not really trying to
| work on any neuron or ruleset that isn't much different from the
| perceptron thats just a sumnation function. Is it really that
| surprising that we just see this same structure repeated in the
| models? Just because feedforward topologies with single-neuron
| steps are the easiest to train and run on graphics cards, does
| that really make them the actual best at accomplishing tasks? We
| have all sorts of unique training methods and encoding schemes
| that don't ever get used because the big libraries don't support
| them. Until we start seeing real variation in the fundamental
| rulesets of neural nets, we are always just going to be fighting
| against the fact that these are just perceptrons with extra steps.
| visarga wrote:
| > Just because feedforward topologies with single neuron steps
| are the easiest to train and run on graphics cards does that
| really make them the actual best at accomplishing tasks?
|
| You are ignoring a mountain of papers trying all conceivable
| approaches to create models. It is evolution by selection, in
| the end transformers won.
| dartos wrote:
| I mean RWKV seems promising and isn't a transformer model.
|
| Transformers have first mover advantage. They were the first
| models that scaled to large parameter counts.
|
| That doesn't mean they're the best or that they've won, just
| that they were the first to get big (literally and
| metaphorically)
| tkellogg wrote:
| Yeah, I'd argue that transformers created such capital
| saturation that there's a ton of opportunity for
| alternative approaches to emerge.
| dartos wrote:
| Speak of the devil. Jamba just hit the front page.
| refulgentis wrote:
| It doesn't seem promising, a one man band has been doing a
| quixotic quest based on intuition and it's gotten ~nowhere,
| and it's not for lack of interest in alternatives. There's
| never been a better time to have a different approach - is
| your metric "times I've seen it on HN with a convincing
| argument for it being promising?" -- I'm not embarrassed to
| admit that is/was mine, but alternatively, you're aware of
| recent breakthroughs I haven't seen.
| retrofrost wrote:
| Just because papers are getting published doesn't mean it's
| actually gaining any traction. I mean, we have known that the
| time series of signals a neuron receives plays a huge role in
| how biological neurons functionally operate, and yet we have
| nearly no examples of spiking networks being pushed beyond
| basic academic exploration. We have known glial cells play a
| critical role in biological neural systems, and yet you can
| probably count the number of papers that examine using an
| abstraction of that activity in a neural net on both your
| hands and toes. Neuroevolution using genetic algorithms has
| been basically looking for a big break since NEAT. It's the
| height of hubris to say that we have peaked with transformers
| when the entire field is based on not getting trapped in
| local maxima's. Sorry to be snippy, but there is so much
| uncovered ground it's not even funny.
| gwervc wrote:
| "We" are not forbidding you to open a computer, start
| experimenting and publishing some new method. If you're so
| convinced that "we" are stuck in a local maxima, you can do
| some of the work you are advocating for, instead of asking
| others to do it for you.
| Kerb_ wrote:
| You can think chemotherapy is a local maxima for cancer
| treatment and hope medical research seeks out other
| options without having the resources to do it yourself.
| Not all of us have access to the tools and resources to
| start experimenting as casually as we wish we could.
| erisinger wrote:
| Not a single one of you bigbrains used the word "maxima"
| correctly and it's driving me crazy.
| vlovich123 wrote:
| As I understand it a local maxima means you're at a local
| peak but there may be higher maximums elsewhere. As I
| read it, transformers are a local maximum in the sense of
| outperforming all other ML techniques as the AI technique
| that gets the closest to human intelligence.
|
| Can you help my little brain understand the problem by
| elaborating?
|
| Also you may want to chill with the personal attacks.
| erisinger wrote:
| Not a personal attack. These posters are smarter than I
| am, just ribbing them about misusing the terminology.
|
| "Maxima" is plural, "maximum" is singular. So you would
| say "a local maximum," or "several local maxima." Not "a
| local maxima" or, the one that really got me, "getting
| trapped in local maxima's."
|
| As for the rest of it, carry on. Good discussion.
| FeepingCreature wrote:
| A local maxima, that is, /usr/bin/wxmaxima...
| erisinger wrote:
| Touche...
| gyrovagueGeist wrote:
| While "local maximas" is wrong, I think "a local maxima"
| is a valid way to say "a member of the set of local
| maxima" regardless of the number of elements in the set.
| It could even be a singleton.
| Tijdreiziger wrote:
| You can't have one maxima in the same way you can't have
| one pencils. That's just how English works.
| tschwimmer wrote:
| yeah, not a Nissan in sight
| mikewarot wrote:
| MNIST and other small and easy to train against datasets
| are widely available. You can try out anything you like
| even with a cheap laptop these days thanks to a few
| decades of Moore's law.
|
| It is definitely NOT out of your reach to try any ideas
| you have. Kaggle and other sites exist to make it easy.
|
| Good luck! 8)
| retrofrost wrote:
| My pet project has been trying to use Elixir with NEAT or
| HyperNEAT to try and make a spiking network, then when
| that's working decently drop in some glial interactions I saw
| in a paper. It would be kinda bad at purely functional
| stuff, but idk seems fun. The biggest problems are time
| and having to do a lot of both the evolutionary stuff and
| the network stuff. But yeah the ubiquity of free datasets
| does make it easy to train.
| haltIncomplete wrote:
| All we're doing is engineering new data compression and
| retrieval techniques: https://arxiv.org/abs/2309.10668
|
| Are we sure there's anything "net new" to find within the
| same old x86 machines, within the same old axiomatic
| systems of the past?
|
| Math is a few operations applied to carving up stuff and
| we believe we can do that infinitely in theory. So "all
| math that abides by our axiomatic underpinnings" is valid
| regardless of whether we "prove it" or not.
|
| Physical space we can exist in, a middle ground of
| reality we evolved _just so_ to exist in, seems to be
| finite; I can't just up and move to Titan or Mars. So our
| computers are coupled to the same constraints of
| observation and understanding as us.
|
| What about daily life will be upended reconfirming
| decades old experiment? How is this not living in sunk
| cost fallacy?
|
| When all you have is a hammer...
|
| I'm reminded of Einstein's quote about insanity.
| typon wrote:
| Do you really think that transformers came to us from God?
| They're built on the corpses of millions of models that
| never went anywhere. I spent an entire year trying to scale
| up a stupid RNN back in 2014. Never went anywhere, because
| it didn't work. I am sure we are stuck in a local minima
| now - but it's able to solve problems that were previously
| impossible. So we will use it until we are impossibly stuck
| again. Currently, however, we have barely begun to scratch
| the surface of what's possible with these models.
| leoc wrote:
| (The singulars are 'maximum' and 'minimum', 'maxima' and
| 'minima' are the plurals.)
| nicklecompte wrote:
| His point is that "evolution by selection" also includes that
| transformers are easy to implement with modern linear algebra
| libraries and cheap to scale on current silicon, both of
| which are engineering details with no direct relationship to
| their innate efficacy at learning (though indirectly it means
| you can compensate for less efficient learning by scaling up
| the training data).
| wanderingbort wrote:
| I think it is correct to include practical implementation
| costs in the selection.
|
| Theoretical efficacy doesn't guarantee real world efficacy.
|
| I accept that this is self reinforcing but I favor real
| gains today over potentially larger gains in a potentially
| achievable future.
|
| I also think we are learning practical lessons on the
| periphery of any application of AI that will apply if a
| mold-breaking solution becomes compelling.
| foobiekr wrote:
| "won"
|
| They barely work for a lot of cases (i.e., anything where
| accuracy matters, despite the bubble's wishful thinking).
| It's likely that something will sunset them in the next few
| years.
| victorbjorklund wrote:
| That is how evolution works. Something wins until something
| else comes along and wins. And so on forever.
| refulgentis wrote:
| It seems cloyingly performative grumpy old man once you're
| at "it barely works and it's a bubble and blah blah" in
| response to a discussion about _their comparative
| advantage_ (yeah, they won, and absolutely convincingly so)
| ldjkfkdsjnv wrote:
| Cannot understand people claiming we are in a local maxima,
| when we literally had an AI scientific breakthrough only in
| last two years.
| xanderlewis wrote:
| Which breakthrough in the last two years are you referring
| to?
| ldjkfkdsjnv wrote:
| the LLM scaling law
| ikkiew wrote:
| > the perceptron thats just a sumnation[sic] function
|
| What would you suggest?
|
| My understanding of part of the whole NP-Complete thing is that
| any algorithm in the complexity class can be reduced to, among
| other things, a 'summation function'.
| blueboo wrote:
| The bitter lesson, my dude.
| http://www.incompleteideas.net/IncIdeas/BitterLesson.html
|
| If you find a simpler, trainable structure you might be onto
| something
|
| Attempts to get fancy tried and died
| posix86 wrote:
| I don't understand enough about the subject to say, but to me
| it seemed like yes, other models have better metrics at equal
| model size in terms of number of neurons or asymptotic runtime,
| but the most important metric will always be
| accuracy/precision/etc. for the money spent... or in other
| words, if GPT requires 10x the number of neurons to reach the
| same performance, but buying compute & memory for those neurons
| is cheaper, then GPT is a better means to an end.
| zyklonix wrote:
| This reminds me of the famous "King - Man + Woman = Queen"
| embedding example. The fact that embeddings have semantic
| properties in them explains why simple linear functions would
| work as well.
| robertclaus wrote:
| I think this paper is cool and I love that they ran these
| experiments to validate these ideas. However, I'm having trouble
| reconciling the novelty of the ideas themselves. Isn't this
| result expected given that LLM's naturally learn simple
| statistical trends between words? To me it's way cooler that they
| clearly demonstrated not all LLM behavior can be explained this
| simply.
| mikewarot wrote:
| I wonder if this relation still holds with newer models that
| have even more compute thrown at them?
|
| My intuition is that the structure inherent to language makes
| Word2Vec possible. Then training on terabytes of human text
| encoded with Word2Vec + Positional Encoding makes it possible to
| then have the ability to predict the next encoding at superhuman
| levels of cognition (while training!).
|
| It's my sense that the bag of words (as input/output method)
| combined with limited context windows (to make Positional
| Encoding work) is a huge impedance mismatch to the internal
| cognitive structure.
|
| Thus I think that given the orders of magnitude more compute
| thrown at GPT-4 et al, it's entirely possible new forms of
| representation evolved and remain to be discovered by humans
| probing through all the weights.
|
| I also think that MemGPT could, eventually, become an AGI because
| of the unlimited long term memory. More likely, though, I think
| it would be like the protagonist in Memento[1].
|
| [1] https://en.wikipedia.org/wiki/Memento_(film)
|
| [edit - revise to address question]
| autokad wrote:
| Sorry if I misread your comment, but you seem to be indicating
| that LLMs such as ChatGPT (which use GPT-3+) are bag-of-words
| models? They are sequence models.
| mikewarot wrote:
| I edited my response... I hope it helps... my understanding
| is that the output gives probabilities for all the words,
| then one is chosen with some randomness thrown in (via the
| temperature), then fed back in... which to me seems to equate
| to bag of words. Perhaps I misunderstood the term.
| smaddox wrote:
| Bag of words models use a context that is a "bag" (i.e. an
| unordered map from elements to their counts) of words/tokens.
| GPTs use a context that is a sequence (i.e. an ordered
| list) of words/tokens.
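|
| The difference in one line each (illustrative):
|
|     from collections import Counter
|
|     text = "the dog bit the man"
|     bag = Counter(text.split())  # {'the': 2, 'dog': 1, ...}, order lost
|     seq = text.split()           # ['the', 'dog', 'bit', 'the', 'man']
|     # "the dog bit the man" vs "the man bit the dog": same bag,
|     # different sequences, which is why word order matters to GPTs
|     print(bag, seq)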
| uoaei wrote:
| This is the "random linear projections as memorization technique"
| perspective on Transformers. It's not a new idea per se, but nice
| to see it fleshed out.
|
| If you dig into this perspective, it does temper any claims of
| "cognitive behavior" quite strongly, if only because Transformers
| have such a large capacity for these kinds of "memories".
| seydor wrote:
| Does this point to a way to compress entire LLMs by selecting a
| set of relations?
___________________________________________________________________
(page generated 2024-03-28 23:00 UTC)