[HN Gopher] LLMs use a surprisingly simple mechanism to retrieve...
       ___________________________________________________________________
        
       LLMs use a surprisingly simple mechanism to retrieve some stored
       knowledge
        
       Author : CharlesW
       Score  : 276 points
       Date   : 2024-03-28 14:37 UTC (8 hours ago)
        
 (HTM) web link (news.mit.edu)
 (TXT) w3m dump (news.mit.edu)
        
       | leobg wrote:
       | > In one experiment, they started with the prompt "Bill Bradley
       | was a" and used the decoding functions for "plays sports" and
       | "attended university" to see if the model knows that Sen. Bradley
       | was a basketball player who attended Princeton.
       | 
        | Why not just change the prompt?
        | 
        |     Name, University attended, Sport played
        |     Bill Bradley,
        
         | numeri wrote:
         | This is research, trying to understand the fundamentals of how
         | these models work. They weren't actually trying to find out
         | where Bill Bradley went to university.
        
           | leobg wrote:
           | Of course. But weren't they trying to find out whether or not
           | that fact was represented in the model's parameters?
        
             | wnoise wrote:
             | No, they were trying to figure out if they had isolated
             | where facts like that were represented.
        
       | vsnf wrote:
       | > Linear functions, equations with only two variables and no
       | exponents, capture the straightforward, straight-line
       | relationship between two variables
       | 
       | Is this definition considering the output to be included in the
       | set of variables? What a strange way to phrase it. Under this
       | definition, I wonder what an equation with one variable is. Is a
       | single constant an equation?
        
         | 01HNNWZ0MV43FF wrote:
         | Yeah I guess they mean one independent variable and one
         | dependent variable
         | 
         | It rarely matters because if you had 2 dependent variables, you
         | can just express that as 2 equations, so you might as well
         | assume there's exactly 1 dependent and then only discuss the
         | number of independent variables.
        
         | olejorgenb wrote:
         | I would think `x = 4` is considered an equation, yes?
        
           | pessimizer wrote:
           | And linear at that: x = 0y + 4
        
         | ksenzee wrote:
         | I think they're trying to say "equations in the form y = mx +
         | b" without getting too technical.
        
         | hansvm wrote:
         | It's just a change in perspective. Consider a vertical line. To
         | have an "output" variable you have to switch the ordinary
         | `y=mx+b` formulation to `x=c`. The generalization `ax+by=c`
         | accommodates any shifted line you can draw. Adding more
         | variables increases the dimension of the space in consideration
         | (`ax+by+cz=d` could potentially define a plane). Adding more
         | equations potentially reduces the size of the space in
         | consideration (e.g., if `x+y=1` then also knowing `2x+2y=2`
         | wouldn't reduce the solution space, but `x-y=0` would, and
         | would imply `x=y=1/2`, and further adding `x+2y=12` would imply
         | a lack of solutions).
         | 
         | Mind you, the "two variable" statement in this news piece is a
         | red-herring. The paper describes higher-dimension linear
         | relationships, of the form `Mv=c` for some constant matrix `M`,
         | some constant vector `c`, and some variable vector `v`.
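          | 
          | A toy numpy sketch of that kind of relation (made-up data, not
          | the paper's code): estimate M by least squares, then apply it
          | to a new subject representation:
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     d = 8                          # toy hidden size
          |     S = rng.normal(size=(100, d))  # subject representations
          |     M_true = rng.normal(size=(d, d))
          |     O = S @ M_true + 0.1 * rng.normal(size=(100, d))
          | 
          |     # fit the linear relation O ~= S @ M
          |     M, *_ = np.linalg.lstsq(S, O, rcond=None)
          | 
          |     s_new = rng.normal(size=d)
          |     o_pred = s_new @ M             # "retrieve" the attribute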
         | 
         | On some level, the result isn't _that_ surprising. The paper
         | only examines one layer (not the whole network), after the
         | network has done a huge amount of embedding work. In that
         | layer, they find that under half the time they're able to get
         | over 60% of the way there with a linear approximation. Another
         | interpretation is that the single layer does some linear work
         | and shoves it through some nonlinear transformations, and more
         | than half the time that nonlinearity does something very
         | meaningful (and even in that under half the time where the
         | linear approximation is "okay", the metrics are still bad).
         | 
         | I'm not super impressed, but I don't have time to full parse
         | the thing right now. It is a bit surprising; if memory serves,
         | one of the authors on this paper had a much better result in
         | terms of neural network fact editing in the last year or two.
         | This looks like a solid research idea, solid work, it didn't
         | pan out, and to get it published they heavily overstated the
         | conclusions (and then the university press release obviously
         | bragged as much as it could).
        
         | pb060 wrote:
         | Aren't functions and equations two different things?
        
       | whatever1 wrote:
        | LLMs seem like a good compression mechanism.
       | 
       | It blows my mind that I can have a copy of llama locally on my PC
       | and have access to virtually the entire internet
        
         | Culonavirus wrote:
         | Yea except it's a lossy compression. With the lost part being
         | hallucinated in at inference time.
        
           | Kuinox wrote:
            | If you've read the article, the LLM hallucinations aren't due
            | to the model not knowing the information, but to a function
            | that chooses to retrieve the wrong thing.
        
             | sinemetu11 wrote:
             | From the paper:
             | 
             | > Finally, we use our dataset and LRE-estimating method to
             | build a visualization tool we call an attribute lens.
             | Instead of showing the next token distribution like Logit
             | Lens (nostalgebraist, 2020) the attribute lens shows the
             | object-token distribution at each layer for a given
             | relation. This lets us visualize where and when the LM
             | finishes retrieving knowledge about a specific relation,
             | and can reveal the presence of knowledge about attributes
             | even when that knowledge does not reach the output.
             | 
             | They're just looking at what lights up in the embedding
             | when they feed something in, and whatever lights up is
             | "knowing" about that topic. The function is an
             | approximation they added on top of the model. It's
             | important to not conflate this with the actual weights of
             | the model.
             | 
             | You can't separate the hallucinations from the model --
             | they exist precisely because of the lossy compression.
        
             | ewild wrote:
             | even this place has people not reading the articles. we are
             | doomed
        
         | krainboltgreene wrote:
         | > have access to virtually the entire internet
         | 
         | It isn't even close to 1% of the internet, much less virtually
         | the entire internet. According to the latest dump, Common Crawl
         | has 4.3B pages, but Google in 2016 estimated there are 130T
         | pages. The difference between 130T and 4.3B is about 130T. Even
         | if you narrow it down to Google's searchable text index it's
         | "100's of billions of pages" and roughly 100P compared to
         | CommonCrawl's 400T.
        
           | fspeech wrote:
           | 130T unique pages? That seems highly unlikely as that
           | averages to over 10000 pages for each human being alive. If
           | gp merely wants texts of interest to self as opposed to an
           | accurate snapshot it seems LLMs should be quite capable, one
           | day.
        
       | aia24Q1 wrote:
       | I thought "fact" means truth.
        
       | MuffinFlavored wrote:
       | I don't understand how a "CSV file/database/model" of
       | 70,000,000,000 (70B) "parameters" of 4-bit weights (a 4 bit value
       | can be 1 of 16 unique numbers) gets us an interactive LLM/GPT
        | that is near-all-knowledgeable on all topics/subjects.
       | 
       | edit: did research, the 4-bit is just a "compression method", the
       | model ends up seeing f32?
       | 
       | > Quantization is the process of mapping 32-bit floating-point
       | numbers (which are the weights in the neural network) to a much
       | smaller bit representation, like 4-bit values, for storage and
       | memory efficiency.
       | 
       | > Dequantization happens when the model is used (during inference
       | or even training, if applicable). The 4-bit quantized weights are
       | converted back into floating-point numbers that the model's
       | computations are actually performed with. This is done using the
       | scale and zero-point determined during the initial quantization,
       | or through more sophisticated mapping functions that aim to
       | preserve as much information as possible despite the reduced
       | precision.
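        | 
        | e.g. a toy numpy version of that quantize/dequantize round trip
        | (my own sketch, not any particular library's code):
        | 
        |     import numpy as np
        | 
        |     w = np.random.randn(8).astype(np.float32)  # f32 weights
        | 
        |     # quantize: 4 bits = 16 levels, via a scale and zero-point
        |     scale = (w.max() - w.min()) / 15
        |     zero_point = w.min()
        |     q = np.round((w - zero_point) / scale).astype(np.uint8)
        | 
        |     # dequantize back to (approximate) f32 for the matmuls
        |     w_hat = q.astype(np.float32) * scale + zero_point
        |     print(np.abs(w - w_hat).max())  # small but nonzero error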
       | 
        | so what is the relationship between "parameters" and "# of unique
       | tokens the model knows about (vocabulary size)"?
       | 
       | > At first glance, LLAMa only has a 32,000 vocabulary size and
       | 65B parameters as compared to GPT-3,
       | 
       | > The 65 billion parameters in a model like LLAMA (or any large
       | language model) essentially function as a highly intricate
       | mapping system that determines how to respond to a given input
       | based on the learned relationships between tokens in its training
       | data.
        
         | Filligree wrote:
         | It doesn't, is the simple answer.
         | 
         | The slightly more complicated one is that a compressed text
         | dump of Wikipedia isn't even 70GB, and this is _lossy_
         | compression of the internet.
        
           | MuffinFlavored wrote:
            | say the average LLM these days has a unique token
            | (vocabulary) size of ~32,000 (not its context size; the # of
            | unique tokens it can pick between in a response: English
            | words, punctuation, math, code, etc.)
           | 
           | the 60-70B parameters of models is basically like... just
           | stored patterns of "if these 10 tokens in a row input, then
           | these 10 tokens in a row output score the highest"
           | 
           | Is that a good summary?
           | 
           | > The model uses its learned statistical patterns to predict
           | the probability of what comes next in a sequence of text.
           | 
           | based on what inputs?
           | 
           | 1. previous tokens in the sequence from immediate context
           | 
           | 2. tokens summarizing the overall topic/subject matter from
           | the extended context
           | 
           | 3. scoring of learned patterns from training
           | 
           | 4. what else?
        
             | numeri wrote:
             | Your suggested scheme (assuming a mapping from 10 tokens to
             | 10 tokens, with each token taking 2 bytes to store) would
             | take (32000 * 20) * 2 bytes = 2.3e78 TiB of storage, or
             | about 250 MiB per atom in the observable universe (1e82),
             | prior to compression.
             | 
             | I think it's more likely that LLMs are actually learning
             | and understanding concepts as well as memorizing useful
             | facts, than that LLMs have discovered a compression method
             | with that high of a compression ratio, haha.
        
               | mjburgess wrote:
                | LLMs cannot determine the physical location of any atoms.
                | They cannot plan movement, and so on.
                | 
                | LLMs are just completing patterns of text that have been
                | given before. 'Everything ever written' is both a lot for
                | any individual person to read, but also almost nothing, in
                | that even properly describing a table requires more
                | information than that.
                | 
                | Text is itself an extremely compressed medium which lacks
                | almost any information about the world; it succeeds in
                | being useful to generate because we have that information
                | and are able to map the text back to it.
        
               | numeri wrote:
               | I didn't imply that they know anything about where atoms
               | are, I was just pointing out the sheer absurdity of that
               | volume of data.
               | 
               | I should make it clear that my comparison there is unfair
               | and mostly just funny - you don't need to store every
               | possible combination of 10 tokens, because most of them
               | will be nonsense, so you wouldn't actually need that much
               | storage. That being said, it's been fairly solidly proven
               | that LLMs aren't just lookup tables/stochastic parrots.
        
               | mjburgess wrote:
               | > fairly solidly proven that LLMs aren't just lookup
               | tables/stochastic parrots
               | 
                | Well, I'd strongly disagree. I see no evidence of this,
                | and I am quite well acquainted with the literature.
               | 
               | All _empirical_ statistical AI is just a means of
               | approximating an empirical distribution. The problem with
               | NLP is that there is no empirical function from text
               | tokens to meanings; just as there is no function from
               | sets of 2D images to a 3D structure.
               | 
               | We know before we start that the distributions of text
               | tokens are only coincidentally related to the
               | distributions of meanings. The question is just how much
               | value that coincidence has in any given task.
               | 
                | (Consider, e.g., that if I ask "do you like what I'm
                | wearing?", there is no distribution of responses which is
                | correct. I do not want you to say "yes" 99/100, or even
                | 100/100, times. What I want you to say is a word caused
                | by a mental state you have: that of (dis)liking what I'm
                | wearing.
               | 
               | Since no statistical AI systems generate outputs based on
               | causal features of reality, we know a priori that almost
               | all possible questions that can be asked cannot be
               | answered by LLMs.
               | 
                | They are only useful where questions have canonical
                | answers, and only because "canonical" means that a
                | text->text function is likely to be coincidentally
                | indistinguishable from the meaning->meaning function
                | we're interested in.)
        
               | pk-protect-ai wrote:
                | There is something wrong with this arithmetic: "(32000 *
               | 20) * 2 bytes = 2.3e78 TiB of storage" ... The factorial
               | is missing somewhere in there ...
        
             | wongarsu wrote:
             | That would be equivalent to a hidden markov chain. Those
             | have been around for decades, but we have only managed to
             | make them coherent for very short outputs. Even GPT2 beats
             | any Markov chain, so there has to be more going on
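              | 
              | (A toy word-level Markov chain, my own sketch, shows the
              | limitation: the next word depends only on the previous
              | one, with no wider context:)
              | 
              |     import random
              |     from collections import defaultdict
              | 
              |     def train_bigram(text):
              |         chain = defaultdict(list)
              |         w = text.split()
              |         for a, b in zip(w, w[1:]):
              |             chain[a].append(b)   # observed next words
              |         return chain
              | 
              |     def generate(chain, start, n=20):
              |         out = [start]
              |         for _ in range(n):
              |             nxt = chain.get(out[-1])
              |             if not nxt:
              |                 break
              |             out.append(random.choice(nxt))
              |         return " ".join(out)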
             | 
             | Modern LLMs are able to transfer knowledge between
             | different languages, so it's fair to assume that some
             | mapping between human language and a more abstract internal
             | representation happens at the input and output, instead of
             | the model "operating" on English or Chinese or whatever
             | language you talk with it. And once this exists, an
             | internal "world model" (as in: a collection of facts and
             | implications) isn't far, and seems to indeed be something
             | most LLMs do. The reasoning on top of that world model is
             | still very spotty though
        
             | HarHarVeryFunny wrote:
             | > the 60-70B parameters of models is basically like... just
             | stored patterns of "if these 10 tokens in a row input, then
             | these 10 tokens in a row output score the highest"
             | 
             | > Is that a good summary?
             | 
             | No - there's a lot more going on. It's not just mapping
             | input patterns to output patterns.
             | 
             | A good starting point to understand it are linguist's
             | sentence-structure trees (and these were the inspiration
             | for the "transformer" design of these LLMs).
             | 
             | https://www.nltk.org/book/ch08.html
             | 
             | Note how there are multiple levels of nodes/branches to
             | these trees, from the top node representing the sentence as
             | a whole, to the words themselves which are all the way at
             | the bottom.
             | 
             | An LLM like ChatGPT is made out of multiple layers (e.g. 96
             | layers for GPT-3) of transformer blocks, stacked on top of
             | each other. When you feed an input sentence into an LLM,
             | the sentence will first be turned into a sequence of token
             | embeddings, then passed through each of these 96 layers in
             | turn, each of which changes ("transforms") it a little bit,
             | until it comes out the top of the stack as the predicted
             | output sentence (or something that can be decoded into the
             | output sentence). We only use the last word of the output
             | sentence which is the "next word" it has predicted.
             | 
             | You can think of these 96 transformer layers as a bit like
             | the levels in one of those linguistic sentence-structure
             | trees. At the bottom level/layer are the words themselves,
             | and at each successive higher level/layer are higher-and-
             | higher level representations of the sentence structure.
             | 
             | In order to understand this a little better, you need to
             | understand what these token "embeddings" are, which is the
             | form in which the sentence is passed through, and
             | transformed by, these stacked transformer layers.
             | 
             | To keep it simple, think of a token as a word, and say the
             | model has a vocabulary of 32,000 words. You might perhaps
             | expect that each word is represented by a number in the
             | range 1-32000, but that is not the way it works! Instead,
             | each word is mapped (aka "embedded") to a point in a high
             | dimensional space (e.g. 4096-D for LLaMA 7B), meaning that
             | it is represented by a vector of 4096 numbers (cf a point
             | in 3-D space represented as (x,y,z)).
             | 
             | These 4096 element "embeddings" are what actually pass thru
             | the LLM and get transformed by it. Having so many
             | dimensions gives the LLM a huge space in which it can
             | represent a very rich variety of concepts, not just words.
             | At the _first_ layer of the transformer stack these
             | embeddings do just represent words, the same as the nodes
             | do at the bottom layer of the sentence-structure tree, but
             | more information is gradually added to the embeddings by
             | each layer, augmenting and transforming what they mean. For
             | example, maybe the first transformer layer adds  "part of
             | speech" information so that each embedded word is now also
             | tagged as a noun or verb, etc. At the next layer up, the
             | words comprising a noun phase or verb phrase may get
             | additionally tagged as such, and so-on as each transformer
             | layer adds more information.
             | 
              | This just gives a flavor of what is happening, but
              | basically by the time the sentence has reached the top
              | layer of the transformer the model has been able to see the
              | entire tree structure of the sentence, and only then does it
              | "understand" the sentence well enough to predict a
              | grammatically and semantically "correct" continuation.
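              | 
              | In (very) simplified pseudo-PyTorch, here is my own toy
              | sketch of that flow, with tiny made-up sizes (and ignoring
              | the causal masking a real decoder uses):
              | 
              |     import torch
              |     import torch.nn as nn
              | 
              |     # tiny illustrative sizes; real LLMs use e.g. a 32k
              |     # vocab, 4096-d embeddings and ~96 layers
              |     vocab, d_model, n_layers = 32000, 64, 4
              | 
              |     embed = nn.Embedding(vocab, d_model)
              |     layers = nn.ModuleList(
              |         nn.TransformerEncoderLayer(
              |             d_model, nhead=4, batch_first=True)
              |         for _ in range(n_layers))
              |     unembed = nn.Linear(d_model, vocab)
              | 
              |     tokens = torch.randint(0, vocab, (1, 12))
              |     x = embed(tokens)             # (1, 12, d_model)
              |     for layer in layers:          # each layer transforms x
              |         x = layer(x)
              |     logits = unembed(x[:, -1])    # next-token logits
              |     next_token = logits.argmax(-1)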
        
               | MichaelZuo wrote:
               | Thanks for the explanation.
               | 
               | Since unicode has well over 64000 symbols, does that
               | imply models, trained on a large corpus, must necessarily
               | have at least 64000 'branches' at the bottom layer?
        
           | ramses0 wrote:
           | Is there some sort of "LLM-on-Wikipedia" competition?
           | 
           | ie: given "just wikipedia" what's the best score people can
           | get on however these models are evaluated.
           | 
           | I know that all the commercial ventures have a voracious
           | data-input set, but it seems like there's room for
           | dictionary.llm + wikipedia.llm + linux-kernel.llm and some
           | sort of judging / bake-off for their different performance
           | capabilities.
           | 
            | Or does the training truly _NEED_ every book ever written +
           | the entire internet + all knowledge ever known by mankind to
           | have an effective outcome?
        
             | CraigJPerry wrote:
              | >> Or does the training truly _NEED_ every book ever
             | written + the entire internet + all knowledge ever known by
             | mankind to have an effective outcome?
             | 
             | I have the same question.
             | 
             | Peter Norvig's GOFAI Shakespeare generator example[1]
             | (which is not an LLM) gets impressive results with little
             | input data to go on. Does the leap to LLM preclude that
             | kind of small input approach?
             | 
             | [1] link should be here because I assumed as I wrote the
             | above that I would just turn it up with a quick google.
             | Alas t'was not to be. Take my word for it, somewhere on
             | t'internet is an excellent write up by Peter Norvig on LLM
             | vs GOFAI (good old fashioned artificial intelligence)
        
             | bionhoward wrote:
             | Yes, that's known as the Hutter Prize
             | http://prize.hutter1.net/
        
               | ramses0 wrote:
               | Not exactly, because LLM's seem to be exhibiting value
               | via "lossy knowledge response" vs. "exact reproduction
               | measured in bytes", but close.
        
         | Acumen321 wrote:
          | Quantization in this context refers to the precision of each
          | value in the vector or matrix/tensor.
          | 
          | If the model in question has a token embedding length of 1024,
          | even with 1-bit quantization each token embedding has 2^1024
          | possible values.
          | 
          | If the context length is 32,000 tokens, there are
          | (2^1024)^32,000 possible inputs.
        
       | estebarb wrote:
       | I find this similar to what relation vectors do in word2vec: you
       | can add a vector of "X of" and often get the correct answer. It
       | could be that the principle is still the same, and transformers
       | "just" build a better mapping of entities into the embedding
       | space?
        
         | PaulHoule wrote:
         | I think so. It's hard for me to believe that the decision
         | surfaces inside those models are really curved enough (like the
         | folds of your brain) to really take advantage of FP32 numbers
         | inside vectors: that is I just don't believe it is
          |     x = 0    means "fly"
          |     x = 0.01 means "drive"
          |     x = 0.02 means "purple"
          | 
          | but rather more like
          | 
          |     x < 1.5 means "cold"
          |     x > 1.5 means "hot"
         | 
         | which is one reason why quantization (often 1 bit) works. Also
         | it is a reason why you can often get great results feeding text
         | or images through a BERT or CLIP-type model and then applying
         | classical ML models that frequently involve linear decision
         | surfaces.
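          | 
          | e.g., a sketch of that recipe (the model name and the toy data
          | here are my own assumptions):
          | 
          |     from sentence_transformers import SentenceTransformer
          |     from sklearn.linear_model import LogisticRegression
          | 
          |     texts = ["it is freezing out", "bring a warm coat",
          |              "what a scorching day", "the sun is blazing"]
          |     labels = [0, 0, 1, 1]   # 0 = "cold", 1 = "hot"
          | 
          |     enc = SentenceTransformer("all-MiniLM-L6-v2")
          |     X = enc.encode(texts)   # fixed-size embeddings
          |     # a purely linear decision surface on top of them
          |     clf = LogisticRegression().fit(X, labels)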
        
           | taneq wrote:
           | Are you conflating nonlinear embedding spaces with the
           | physical curvature of the cerebellum? I don't think there's a
           | direct mapping.
        
             | PaulHoule wrote:
             | My mental picture is that violently curved decision
             | surfaces could _look_ like the convolutions of the brain
             | even though they have nothing to do with how the brain
             | actually works.
             | 
             | I think of how tSNE and other algorithms sometimes produce
              | projections that look like that (maybe that's
             | just what you get when you have to bend something
             | complicated to fit into a 2-d space) and frequently show
             | cusps that to me look like a sign of trouble (took me a
             | while in my PhD work to realize how Poincare sections from
             | 4 or 6 dimensions can look messed up when a part of the
             | energy surface tilts perpendicularly to the projection
             | surface.)
             | 
             | I still find it hard to believe that dense vectors are the
             | right way to deal with text despite the fact that they work
             | so well. For images it is one thing because changing one
             | pixel a little doesn't change the meaning of an image, but
             | changing a single character of a text can completely change
             | the meaning of the text. Also there's the reality that if
             | you randomly stick together tokens you get something
             | meaningless, so it seems almost all of the representation
             | space covers ill formed texts and only a low dimensional
             | manifold holds the well formed texts. Now the decision
              | surfaces really have to be nonlinear and crumpled all over,
              | but I think there's definitely a limit on how crumpled
              | those surfaces can be.
        
               | Y_Y wrote:
               | This is interesting. It makes me think of an
               | "immersion"[0], as in a generalization of the concept of
               | "embedding" in differential geometry.
               | 
               | I share your uneasiness about mapping words to vectors
               | and agree that it feels as if we're shoehorning some more
               | complex space into a computationally convenient one.
               | 
               | [0] https://en.wikipedia.org/wiki/Immersion_(mathematics)
        
       | derefr wrote:
       | Help me understand: when they say that the facts are stored as a
       | linear function... are they saying that the LLM has a sort of
       | N-dimensional "fact space" encoded into the model in some manner,
       | where facts are embedded into the space as (points / hyperspheres
       | / Voronoi manifolds / etc); and where recalling a fact is -- at
       | least in an abstract sense -- the NN computing / remembering a
       | key to use, and then doing a key-value lookup in this space?
       | 
       | If so: how _do_ you embed a KV-store into an edge-propagated
       | graphical model? Are there even any well-known techniques for
       | doing that "by hand" right now?
       | 
       | (Also, fun tangent: isn't the "memory palace" memory technique,
       | an example of _human brains_ embedding facts into a linear
       | function for easier retrieval?)
        
         | bionhoward wrote:
         | [Layer] Normalization constrains huge vectors representing
         | tokens (input fragments) to positions on a unit ball (I think),
         | and the attention mechanism operates by rotating the
         | unconstrained ones based on the sum of their angles relative to
         | all the others.
         | 
         | I only skimmed the paper but believe the point here is that
         | there are relatively simple functions hiding in or recoverable
         | from the bigger network which specifically address certain
         | categories of relationships between concepts.
         | 
         | Since it would, in theory, be possible to optimize such
         | functions more directly if they are possible to isolate, could
         | this enable advances in the way such models are trained?
         | Absolutely.
         | 
         | After all, one of the best criticisms of "modern" AI is the
         | notion we're just mixing around a soup of linear algebra.
         | Allowing some sense of modularity (reductionism) could make
         | them less of a black box and more of a component driven
         | approach (in the lagging concept space and not just the leading
         | layer space)
        
         | jacobn wrote:
         | The fundamental operation done by the transformer,
         | softmax(Q.K^T).V, is essentially a KV-store lookup.
         | 
         | The Query is dotted with the Key, then you take the softmax to
         | pick mostly one winning Key (the Key closest to the Query
         | basically), and then use the corresponding Value.
         | 
         | That is really, really close to a KV lookup, except it's a
         | little soft (i.e. can hit multiple Keys), and it can be
         | optimized using gradient descent style methods to find the
         | suitable QKV mappings.
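          | 
          | A bare-bones numpy sketch of that (single head, ignoring the
          | 1/sqrt(d) scaling and masking):
          | 
          |     import numpy as np
          | 
          |     def softmax(z):
          |         e = np.exp(z - z.max())
          |         return e / e.sum()
          | 
          |     d = 4
          |     K = np.random.randn(5, d)   # 5 stored "keys"
          |     V = np.random.randn(5, d)   # their "values"
          |     q = K[2] + 0.05 * np.random.randn(d)  # query near key #2
          | 
          |     w = softmax(q @ K.T)        # most weight lands on key #2
          |     out = w @ V                 # ~= V[2]: a soft KV lookup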
        
           | naveen99 wrote:
           | Not sure there is any real lookup happening. Q,K are the same
           | and sometimes even v is the same...
        
             | toxik wrote:
             | Q, K, V are not the same. In self-attention, they are all
             | computed by separate linear transformation of the same
             | input (ie the previous layer's output). In cross-attention
             | even this is not true, then K and V are computed by linear
             | transformation of whatever is cross-attended, and Q is
             | computed by linear transformation of the input as before.
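              | 
              | Roughly, in PyTorch-style pseudocode (sizes and names are
              | my own):
              | 
              |     import torch
              |     import torch.nn as nn
              | 
              |     d = 64
              |     x = torch.randn(1, 10, d)      # this layer's input
              |     memory = torch.randn(1, 7, d)  # e.g. encoder output
              |     W_q, W_k, W_v = (nn.Linear(d, d), nn.Linear(d, d),
              |                      nn.Linear(d, d))
              | 
              |     # self-attention: Q, K, V all come from x,
              |     # but through different learned projections
              |     q, k, v = W_q(x), W_k(x), W_v(x)
              | 
              |     # cross-attention: Q from x, K and V from the
              |     # attended source
              |     q, k, v = W_q(x), W_k(memory), W_v(memory)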
        
               | ewild wrote:
                | yeah, a common misconception: because the input is the
                | same, people forget that there is a pre-attention linear
                | transformation for q, k and v (using the decoder-only
                | version; obviously v is different with encoder-decoder,
                | BERT style)
        
         | thfuran wrote:
         | >isn't the "memory palace" memory technique, an example of
         | human brains embedding facts into a linear function for easier
         | retrieval?
         | 
         | I'm not sure I see how that's a linear function.
        
       | wslh wrote:
        | Can we roughly say that LLMs produce (during training) a lot of
        | IF-THENs in an automatic way from a vast quantity of information
        | (or techniques) that was not available before?
        
       | mike_hearn wrote:
       | This is really cool. My mind goes immediately to what sort of
       | functions are being used to encode programming knowledge, and if
       | they are also simple linear functions whether the standard
        | library or other libraries can be directly uploaded into an LLM's
       | brain as it evolves, without needing to go through a costly
       | training or performance-destroying fine-tune. That's still a sci-
       | fi ability today but it seems to be getting closer.
        
         | Animats wrote:
         | That's a good point. It may be possible to directly upload
          | predicate-type info into an LLM. This could be especially useful
         | if you need to encode tabular data. Somewhere, someone probably
         | read this and is thinking about how to export Excel or
         | databases to an LLM.
         | 
         | It's encouraging to see people looking inside the black box
         | successfully. The other big result in this area was that paper
          | which found a representation of a game board inside an LLM after
         | the LLM had trained to play a game. Any other good results in
         | that area?
         | 
         | The authors point out that LLMs are doing more than encoding
         | predicate-type info. That's just part of what they are doing.
        
           | AaronFriel wrote:
           | It indeed is. An attention mechanism's key and value matrices
           | grow linearly with context length. With PagedAttention[1], we
           | could imagine an external service providing context. The hard
           | part is the how, of course. We can't load our entire database
           | in every conversation, and I suspect there are also
           | challenges around training (perhaps addressed via
            | LandmarkAttention[2]) and building a service to efficiently
            | retrieve additional key-value matrices.
           | 
            | The external vector-database service may require tight
            | timings to avoid stalling the LLM. To sustain 20-50
            | tokens/sec, answers must arrive within 20-50 ms.
           | 
            | And we cannot do this in real time: pausing the transformer
            | when a layer produces a query vector stalls the batch, so we
            | need a way to predict queries (or embeddings) several tokens
            | ahead of where they'd be useful, inject the context when it's
            | needed, and know when to page it out.
           | 
           | [1] https://arxiv.org/abs/2309.06180
           | 
           | [2] https://arxiv.org/abs/2305.16300
        
           | wongarsu wrote:
           | The opposite is also exciting: build a loss function that
           | punishes models for storing knowledge. One of the issues of
           | current models is that they seem to favor lookup over
           | reasoning. If we can punish models (during training) for
           | remembering that might cause them to become better at
           | inference and logic instead.
        
             | kossTKR wrote:
              | Interesting. Reminds me of a sci-fi short I read years ago
              | where AIs "went insane" when they had too much knowledge,
              | because they'd spend too much time looking through data and
              | get a buffer overflow.
              | 
              | I know some of the smaller models like Phi-2 specifically
              | train for reasoning by training on question-answer sets,
              | though this seems like the opposite of that to me.
        
         | politician wrote:
         | Hah! Maybe Neo was an LLM. "I know kung-fu."
        
       | i5heu wrote:
       | So it is entirely possible to decouple the reasoning part from
       | the information part?
       | 
       | This is like absolutely mind blowing if this is true.
        
         | learned wrote:
         | A big caveat mentioned in the article is that this experiment
         | was done with a small set (N=47) of specific questions that
         | they expected to have relatively simple relational answers:
         | 
         | > The researchers developed a method to estimate these simple
         | functions, and then computed functions for 47 different
         | relations, such as "capital city of a country" and "lead singer
         | of a band." While there could be an infinite number of possible
         | relations, the researchers chose to study this specific subset
         | because they are representative of the kinds of facts that can
         | be written in this way.
         | 
         | About 60% of these relations were retrieved using a linear
          | function in the model. The remainder appeared to have nonlinear
          | retrieval and are still a subject of investigation:
         | 
         | > Functions retrieved the correct information more than 60
         | percent of the time, showing that some information in a
         | transformer is encoded and retrieved in this way. "But not
         | everything is linearly encoded. For some facts, even though the
         | model knows them and will predict text that is consistent with
         | these facts, we can't find linear functions for them. This
         | suggests that the model is doing something more intricate to
         | store that information," he says.
        
       | retrofrost wrote:
        | This is amazing work, but to me it highlights some of the biggest
        | problems in the current AI zeitgeist: we are not really trying to
        | work on any neuron or ruleset that isn't much different from the
        | perceptron thats just a sumnation function. Is it really that
        | surprising that we just see this same structure repeated in the
        | models? Just because feedforward topologies with single-neuron
        | steps are the easiest to train and run on graphics cards, does
        | that really make them the actual best at accomplishing tasks? We
        | have all sorts of unique training methods and encoding schemes
        | that don't ever get used because the big libraries don't support
        | them. Until we start seeing real variation in the fundamental
        | rulesets of neural nets, we are always just going to be fighting
        | against the fact that these are just perceptrons with extra steps.
        
         | visarga wrote:
         | > Just because feedforward topologies with single neuron steps
         | are the easiest to train and run on graphics cards does that
         | really make them the actual best at accomplishing tasks?
         | 
         | You are ignoring a mountain of papers trying all conceivable
          | approaches to create models. It is evolution by selection; in
          | the end, transformers won.
        
           | dartos wrote:
           | I mean RWKV seems promising and isn't a transformer model.
           | 
           | Transformers have first mover advantage. They were the first
           | models that scaled to large parameter counts.
           | 
           | That doesn't mean they're the best or that they've won, just
           | that they were the first to get big (literally and
           | metaphorically)
        
             | tkellogg wrote:
             | Yeah, I'd argue that transformers created such capital
             | saturation that there's a ton of opportunity for
             | alternative approaches to emerge.
        
               | dartos wrote:
               | Speak of the devil. Jamba just hit the front page.
        
             | refulgentis wrote:
              | It doesn't seem promising; a one-man band has been on a
              | quixotic quest based on intuition and it's gotten ~nowhere,
             | and it's not for lack of interest in alternatives. There's
             | never been a better time to have a different approach - is
             | your metric "times I've seen it on HN with a convincing
             | argument for it being promising?" -- I'm not embarrassed to
             | admit that is/was mine, but alternatively, you're aware of
             | recent breakthroughs I haven't seen.
        
           | retrofrost wrote:
            | Just because papers are getting published doesn't mean it's
            | actually gaining any traction. I mean, we have known that the
            | time series of signals a neuron receives plays a huge role in
            | how bio neurons functionally operate, and yet we have nearly
            | no examples of spiking networks being pushed beyond basic
            | academic exploration. We have known glial cells play a
            | critical role in biological neural systems, and yet you can
            | probably count the number of papers that examine using an
            | abstraction of that activity in a neural net on both your
            | hands and toes. Neuroevolution using genetic algorithms has
            | been basically looking for a big break since NEAT. It's the
            | height of hubris to say that we have peaked with transformers
            | when the entire field is based on not getting trapped in
            | local maxima's. Sorry to be snippy, but there is so much
            | uncovered ground it's not even funny.
        
             | gwervc wrote:
             | "We" are not forbidding you to open a computer, start
             | experimenting and publishing some new method. If you're so
             | convinced that "we" are stuck in a local maxima, you can do
                | some of the work you are advocating instead of asking
                | others to do it for you.
        
               | Kerb_ wrote:
               | You can think chemotherapy is a local maxima for cancer
               | treatment and hope medical research seeks out other
               | options without having the resources to do it yourself.
               | Not all of us have access to the tools and resources to
               | start experimenting as casually as we wish we could.
        
               | erisinger wrote:
               | Not a single one of you bigbrains used the word "maxima"
               | correctly and it's driving me crazy.
        
               | vlovich123 wrote:
               | As I understand it a local maxima means you're at a local
               | peak but there may be higher maximums elsewhere. As I
               | read it, transformers are a local maximum in the sense of
               | outperforming all other ML techniques as the AI technique
               | that gets the closest to human intelligence.
               | 
               | Can you help my little brain understand the problem by
               | elaborating?
               | 
               | Also you may want to chill with the personal attacks.
        
               | erisinger wrote:
               | Not a personal attack. These posters are smarter than I
               | am, just ribbing them about misusing the terminology.
               | 
               | "Maxima" is plural, "maximum" is singular. So you would
               | say "a local maximum," or "several local maxima." Not "a
               | local maxima" or, the one that really got me, "getting
               | trapped in local maxima's."
               | 
               | As for the rest of it, carry on. Good discussion.
        
               | FeepingCreature wrote:
               | A local maxima, that is, /usr/bin/wxmaxima...
        
               | erisinger wrote:
               | Touche...
        
               | gyrovagueGeist wrote:
               | While "local maximas" is wrong, I think "a local maxima"
               | is a valid way to say "a member of the set of local
               | maxima" regardless of the number of elements in the set.
               | It could even be a singleton.
        
               | Tijdreiziger wrote:
               | You can't have one maxima in the same way you can't have
               | one pencils. That's just how English works.
        
               | tschwimmer wrote:
               | yeah, not a Nissan in sight
        
               | mikewarot wrote:
               | MNIST and other small and easy to train against datasets
               | are widely available. You can try out anything you like
               | even with a cheap laptop these days thanks to a few
               | decades of Moore's law.
               | 
               | It is definitely NOT out of your reach to try any ideas
               | you have. Kaggle and other sites exist to make it easy.
               | 
               | Good luck! 8)
        
               | retrofrost wrote:
               | My pet project has been trying to use elixir with NEAT or
               | HyperNEAT to try and make a spiking network, then when
               | thats working decently drop some glial interactions I saw
               | in a paper. It would be kinda bad at purely functional
               | stuff, but idk seems fun. The biggest problems are time
               | and having to do a lot of both the evolutionary stuff and
               | the network stuff. But yeah the ubiquity of free datasets
               | does make it easy to train.
        
               | haltIncomplete wrote:
               | All we're doing is engineering new data compression and
               | retrieval techniques: https://arxiv.org/abs/2309.10668
               | 
               | Are we sure there's anything "net new" to find within the
               | same old x86 machines, within the same old axiomatic
               | systems of the past?
               | 
               | Math is a few operations applied to carving up stuff and
               | we believe we can do that infinitely in theory. So "all
               | math that abides our axiomatic underpinnings" is valid
               | regardless if we "prove it" or not.
               | 
               | Physical space we can exist in, a middle ground of
               | reality we evolved _just so_ to exist in, seems to be
               | finite; I can't just up and move to Titan or Mars. So our
               | computers are coupled to the same constraints of
               | observation and understanding as us.
               | 
               | What about daily life will be upended reconfirming
               | decades old experiment? How is this not living in sunk
               | cost fallacy?
               | 
               | When all you have is a hammer...
               | 
               | I'm reminded of Einstein's quote about insanity.
        
             | typon wrote:
             | Do you really think that transformers came to us from God?
             | They're built on the corpses of millions of models that
             | never went anywhere. I spent an entire year trying to scale
             | up a stupid RNN back in 2014. Never went anywhere, because
             | it didn't work. I am sure we are stuck in a local minima
             | now - but it's able to solve problems that were previously
             | impossible. So we will use it until we are impossibly stuck
             | again. Currently, however, we have barely begun to scratch
             | the surface of what's possible with these models.
        
             | leoc wrote:
             | (The singulars are 'maximum' and 'minimum', 'maxima' and
             | 'minima' are the plurals.)
        
           | nicklecompte wrote:
           | His point is that "evolution by selection" also includes that
           | transformers are easy to implement with modern linear algebra
           | libraries and cheap to scale on current silicon, both of
           | which are engineering details with no direct relationship to
            | their innate efficacy at learning (though indirectly it means
            | you can scale up the training data to compensate for less
            | efficient learning).
        
             | wanderingbort wrote:
             | I think it is correct to include practical implementation
             | costs in the selection.
             | 
             | Theoretical efficacy doesn't guarantee real world efficacy.
             | 
             | I accept that this is self reinforcing but I favor real
             | gains today over potentially larger gains in a potentially
             | achievable future.
             | 
             | I also think we are learning practical lessons on the
             | periphery of any application of AI that will apply if a
             | mold-breaking solution becomes compelling.
        
           | foobiekr wrote:
           | "won"
           | 
           | They barely work for a lot of cases (i.e., anything where
           | accuracy matters, despite the bubble's wishful thinking).
           | It's likely that something will sunset them in the next few
           | years.
        
             | victorbjorklund wrote:
             | That is how evolution works. Something wins until something
              | else comes along and wins. And so on forever.
        
             | refulgentis wrote:
             | It seems cloyingly performative grumpy old man once you're
             | at "it barely works and it's a bubble and blah blah" in
             | response to a discussion about _their comparative
             | advantage_ (yeah, they won, and absolutely convincingly so)
        
         | ldjkfkdsjnv wrote:
         | Cannot understand people claiming we are in a local maxima,
          | when we literally had an AI scientific breakthrough only in the
         | last two years.
        
           | xanderlewis wrote:
           | Which breakthrough in the last two years are you referring
           | to?
        
             | ldjkfkdsjnv wrote:
             | the LLM scaling law
        
         | ikkiew wrote:
         | > the perceptron thats just a sumnation[sic] function
         | 
         | What would you suggest?
         | 
         | My understanding of part of the whole NP-Complete thing is that
         | any algorithm in the complexity class can be reduced to, among
         | other things, a 'summation function'.
        
         | blueboo wrote:
         | The bitter lesson, my dude.
         | http://www.incompleteideas.net/IncIdeas/BitterLesson.html
         | 
         | If you find a simpler, trainable structure you might be onto
         | something
         | 
         | Attempts to get fancy tried and died
        
         | posix86 wrote:
         | I don't understand enough about the subject to say, but to me
         | it seemed like yes, other models have better metrics with equal
         | model size i.t.o. number of neurons or asymptotic runtime, but
         | the most important metric will always be accuracy/precision/etc
         | for money spent... or in other words, if GPT requires 10x
         | number of neurons to reach the same performance, but buying
          | compute & memory for these neurons is cheaper, then GPT is a
         | better means to an end.
        
       | zyklonix wrote:
       | This reminds me of the famous "King - Man + Woman = Queen"
       | embedding example. The fact that embeddings have semantic
       | properties in them explains why simple linear functions would
       | work as well.
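        | 
        | (For the curious, that falls out of plain vector arithmetic on
        | the embeddings; a quick gensim check, assuming you're willing to
        | download the pretrained vectors:)
        | 
        |     import gensim.downloader as api
        | 
        |     wv = api.load("word2vec-google-news-300")  # ~1.6 GB
        |     print(wv.most_similar(positive=["king", "woman"],
        |                           negative=["man"], topn=1))
        |     # -> [('queen', ...)]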
        
       | robertclaus wrote:
       | I think this paper is cool and I love that they ran these
       | experiments to validate these ideas. However, I'm having trouble
       | reconciling the novelty of the ideas themselves. Isn't this
       | result expected given that LLM's naturally learn simple
       | statistical trends between words? To me it's way cooler that they
       | clearly demonstrated not all LLM behavior can be explained this
       | simply.
        
       | mikewarot wrote:
       | I wonder if this relation still holds with newer models that have
        | even more compute thrown at them?
       | 
       | My intuition is that the structure inherent to language makes
       | Word2Vec possible. Then training on terabytes of human text
       | encoded with Word2Vec + Positional Encoding makes it possible to
       | then have the ability to predict the next encoding at superhuman
       | levels of cognition (while training!).
       | 
       | It's my sense that the bag of words (as input/output method)
       | combined with limited context windows (to make Positional
       | Encoding work) is a huge impedance mismatch to the internal
       | cognitive structure.
       | 
       | Thus I think that given the orders of magnitude more compute
       | thrown at GPT-4 et al, it's entirely possible new forms of
       | representation evolved and remain to be discovered by humans
       | probing through all the weights.
       | 
       | I also think that MemGPT could, eventually, become an AGI because
       | of the unlimited long term memory. More likely, though, I think
       | it would be like the protagonist in Memento[1].
       | 
       | [1] https://en.wikipedia.org/wiki/Memento_(film)
       | 
       | [edit - revise to address question]
        
         | autokad wrote:
         | sorry if I misread your comment, but you seem to be indicating
         | that LLMs such as chat gpt (which use gpt 3+) are bag of words
         | models? they are sequence models.
        
           | mikewarot wrote:
           | I edited my response... I hope it helps... my understanding
           | is that the output gives probabilities for all the words,
            | then one is chosen with some randomness thrown in (via the
            | temperature), then fed back in... which to me seems to equate
           | to bag of words. Perhaps I mis-understood the term.
        
             | smaddox wrote:
             | Bag of words models use a context that is a "bag" (i.e. an
              | unordered map from elements to their counts) of words/tokens.
             | GPT's use a context that is a sequence (i.e. an ordered
             | list) of words/tokens.
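              | 
              | Concretely (a toy Python illustration):
              | 
              |     from collections import Counter
              | 
              |     a = "the dog bit the man".split()
              |     b = "the man bit the dog".split()
              |     Counter(a) == Counter(b)  # True: same bag of words
              |     a == b                    # False: different sequence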
        
       | uoaei wrote:
       | This is the "random linear projections as memorization technique"
       | perspective on Transformers. It's not a new idea per se, but nice
       | to see it fleshed out.
       | 
       | If you dig into this perspective, it does temper any claims of
       | "cognitive behavior" quite strongly, if only because Transformers
       | have such a large capacity for these kinds of "memories".
        
       | seydor wrote:
       | Does this point to a way to compress entire LLMs by selecting a
       | set of relations?
        
       ___________________________________________________________________
       (page generated 2024-03-28 23:00 UTC)