[HN Gopher] Infini-Gram: Scaling unbounded n-gram language model...
___________________________________________________________________
Infini-Gram: Scaling unbounded n-gram language models to a trillion
tokens
Author : nsagent
Score : 65 points
Date : 2024-05-05 17:58 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| Genbox wrote:
| A web interface for the Infini-gram engine can be found here:
| https://huggingface.co/spaces/liujch1998/infini-gram
| novaomnidev wrote:
| The Hugging Face demo doesn't seem to work with any of the
| examples; it returns an error in every case.
| kmeisthax wrote:
| My current mental model of LLMs is that attention refines
| information from the input (e.g. "red balloon" becomes some
| vector that means red + balloon), and that the feed-forward
| layers are lossy compressed indexes into the training set.
|
| From the paper, having a better n-gram index gives as much
| perplexity improvement as a 10x increase in parameters. I
| wonder if it would make sense to train new foundation models
| with more attention heads and use infini-gram in lieu of the
| feed-forward layer.
| sigmoid10 wrote:
| Attention just allows the model to attend to elements across
| the input sequence. The feed-forward layer is what gives it
| (and basically all other architectures) its universal function
| approximation abilities. The only reason we don't use dense
| layers directly across the input sequence, and instead go for
| things like convolution, recurrence, or transformers, is that
| doing so is prohibitively expensive computationally. But
| replacing the dense layer with n-grams would make LLMs exactly
| what so many people falsely believe them to be right now: pure
| stochastic parrots instead of functions that can actually learn
| to generalize from examples.
| novaRom wrote:
| TL;DR: we've trained an n-gram model, implemented as a suffix
| array, on a very large dataset, and it shows good perplexity.
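|
| Roughly, the trick (my own toy sketch in Python, not the
| authors' code; the function names are made up) is that a suffix
| array over the token stream lets you count any n-gram with two
| binary searches, so n is effectively unbounded:
|
|     import bisect
|     from collections import Counter
|
|     # Toy corpus as a token stream (stand-in for the
|     # trillion-token one).
|     tokens = "a b c d b c e b c e b c d".split()
|
|     # Suffix array: start positions, sorted by the suffix
|     # beginning there.
|     sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])
|
|     def find(pattern):
|         # Occurrences of `pattern` form one contiguous block of
|         # the suffix array; locate it with two binary searches.
|         key = lambda i: tokens[i:i + len(pattern)]
|         lo = bisect.bisect_left(sa, pattern, key=key)   # 3.10+
|         hi = bisect.bisect_right(sa, pattern, key=key)
|         return lo, hi
|
|     def next_token_counts(context):
|         # "Unbounded n": back off to the longest suffix of
|         # `context` that appears at all, then count what
|         # follows each of its occurrences.
|         lo, hi = find(context)
|         while context and hi == lo:
|             context = context[1:]
|             lo, hi = find(context)
|         follow = (tokens[i + len(context)] for i in sa[lo:hi]
|                   if i + len(context) < len(tokens))
|         return Counter(follow)
|
|     print(next_token_counts("x y b c".split()))
|     # e.g. Counter({'d': 2, 'e': 2})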
| throwaway81523 wrote:
| Is this sort of like Dissociated Press in Emacs?
| bglazer wrote:
| Somewhat off-topic, but has anyone tried training a random forest
| at LLM scale? Like an RF with millions of trees and billions of
| branches? My intuition says it could be much more efficient on
| CPUs, provided it works at all.
| skyde wrote:
| Forests are good at classification, but they cannot leverage
| pre-training on unlabeled data.
| tgv wrote:
| What this shows, IMO, is that quite a lot of the expected output
| is almost literally present in the training data.
| flawsofar wrote:
| Phrases and ideas repeat, because the universe has patterns and
| structure.
| matt4711 wrote:
| A paper [1] we wrote in 2015 (cited by the authors) uses some
| more sophisticated data structures (compressed suffix trees) and
| Kneser-Ney smoothing to get the same "unlimited" context. I
| imagine that with better smoothing and the same large corpus
| sizes the authors use, this could improve on some of the
| results they report.
|
| Back then neural LMs were just beginning to emerge and we briefly
| experimented with using one of these unlimited N-gram models for
| pre-training but never got any results.
|
| [1] https://aclanthology.org/D15-1288.pdf
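|
| For reference, the bigram version of interpolated Kneser-Ney
| looks roughly like this (a toy Python sketch of the textbook
| formula, not the compressed-suffix-tree implementation from
| [1]):
|
|     from collections import Counter
|
|     tokens = "a b c d b c e b c e b c d".split()
|     bigrams = Counter(zip(tokens, tokens[1:]))
|     # c(v): how often v occurs as a bigram context.
|     context_total = Counter(tokens[:-1])
|     # Continuation counts: distinct contexts each w follows,
|     # and distinct continuations each v has.
|     continuation = Counter(w for (_, w) in bigrams)
|     types_after = Counter(v for (v, _) in bigrams)
|     d = 0.75  # absolute discount
|
|     def p_kn(w, v):
|         # P(w|v) = max(c(v,w) - d, 0) / c(v)
|         #          + d * N1+(v,.) / c(v) * Pcont(w)
|         # (assumes v has been seen as a context)
|         p_cont = continuation[w] / len(bigrams)
|         lam = d * types_after[v] / context_total[v]
|         disc = max(bigrams[(v, w)] - d, 0) / context_total[v]
|         return disc + lam * p_cont
|
|     print(p_kn("e", "c"))  # 0.375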
| inciampati wrote:
| The formulation of a succinct full-text index as an "infinite
| n-gram model" is genius in connecting with the ML literature,
| but it also turns a blind eye to all the work on using exactly
| this data structure (well, the FM-index) over DNA. bwa mem is
| the OG ∞-gram aligner.
| drdeca wrote:
| A few thoughts this brings to mind:
|
| 1) this reminds me of how I was thinking about "what if you tried
| to modify an n-gram model to incorporate a poor-man's
| approximation of the copying heads found in transformer models?
| (one which is also based just on counting)"
|
| 2) What if you trained a model to, given a sequence of tokens,
| predict how many times that sequence of tokens appeared in the
| training set? And/or trained a model to, given a sequence of
| tokens, predict the length of the longest suffix of it that
| appears in the training set? Could this be done in a way that
| gave a good
| approximation to the actual counts, while being substantially
| smaller? (Though, of course, this would fail to have the
| attribution property that they mentioned.)
|
| 3) It surprised me that they just always use the largest n such
| that there is a sample in the training set. My thought would have
| been to combine the different choices of n with some weighting.
|
| Like, if you have one example in the training set of "a b c d",
| and a million examples of "b c e", and only one or two other
| examples of "b c d" other than the single instance of "a b c d",
| does increasing the number of samples of "b c e" (which aren't
| preceded by "a") really give no weight towards "e" rather than
| "d" for the next token?
| hiddencost wrote:
| Unfortunately, single-digit milliseconds is quite slow for
| n-gram language models.
|
| Weighted Finite State Transducers are where they really shine;
| you tend to want to do massive parallel operations, and you
| want to do that without being bottlenecked on memory access. I
| think these constraints make this framework hard to adopt.
| snats wrote:
| I recently wrote a writeup on bigrams and the infini-gram
| outputs [1]. I genuinely believe that n-grams are making a
| comeback; for query searching they're really good.
|
| [1]
| https://snats.xyz/pages/articles/from_bigram_to_infinigram.h...
| omeze wrote:
| This is a really cool paper; it reminds me of the simple
| exercise Karpathy goes through in his NN video series with a
| bigram predictor. It looks like in practice there are still
| some grounding issues when trying to use these models for
| instruction-tuned applications, but it's a clever direction to
| explore!
| mfornet wrote:
| As I see it, this model will be able to predict tokens that are
| "easy" to derive, but it will have no chance on "hard" tokens.
|
| For example, summing random numbers: if the token you are
| trying to predict is not in the training data, even if similar
| patterns exist, this model defaults to the neural model.
|
| I guess it is then an aid to the neural model for filling in
| the easy patterns.
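|
| Something like this, maybe (just a sketch of my reading; the
| names and the weighting below are placeholders, not the paper's
| exact scheme):
|
|     def next_token_prob(token, context, p_neural, p_infgram,
|                         evidence):
|         # `evidence` stands for some confidence signal from the
|         # n-gram side, e.g. how often the longest matching
|         # suffix of `context` was seen. With zero evidence,
|         # fall back entirely on the neural model; with lots of
|         # evidence, trust the counts more.
|         lam = evidence / (evidence + 10.0)  # hypothetical
|         return (lam * p_infgram(token, context)
|                 + (1.0 - lam) * p_neural(token, context))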
___________________________________________________________________
(page generated 2024-05-05 23:00 UTC)