[HN Gopher] Infini-Gram: Scaling unbounded n-gram language model...
       ___________________________________________________________________
        
       Infini-Gram: Scaling unbounded n-gram language models to a trillion
       tokens
        
       Author : nsagent
       Score  : 65 points
       Date   : 2024-05-05 17:58 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | Genbox wrote:
       | A web interface for the Infini-gram engine can be found here:
       | https://huggingface.co/spaces/liujch1998/infini-gram
        
       | novaomnidev wrote:
       | The Hugging Face demo site doesn't seem to work with any of
       | the examples. It returns an error in every case.
        
       | kmeisthax wrote:
       | My current mental model of LLMs is that attention refines
       | information from the input (e.g. "red balloon" becomes some
       | vector that means red + balloon), and that the feed-forward
       | layers are lossy compressed indexes into the training set.
       | 
       | From the paper, a better n-gram index gives as much perplexity
       | improvement as a 10x increase in parameters. I wonder
       | if it would make sense to train new foundation models with more
       | attention heads and use infinigram in lieu of the feed-forward
       | layer.
        
         | sigmoid10 wrote:
         | Attention just allows the model to attend to elements across
         | the input sequence. The feed-forward layer is what gives it
         | (and basically every other architecture) its universal
         | function approximation ability. The only reason we don't use
         | dense layers directly across the input sequence, and instead
         | go for things like convolution, recurrence, or transformers,
         | is that doing so is prohibitively expensive computationally.
         | But replacing the dense layer with n-grams would make LLMs
         | exactly what so many people falsely believe them to be right
         | now: pure stochastic parrots instead of functions that can
         | actually learn to generalize from examples.
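         | 
         | For reference, a toy PyTorch sketch of that split (purely
         | illustrative, nothing to do with the paper's setup): the
         | attention step mixes information across positions, and the
         | per-position MLP does the function approximation.
         | 
         |     import torch
         |     import torch.nn as nn
         | 
         |     class Block(nn.Module):
         |         def __init__(self, d=64, heads=4):
         |             super().__init__()
         |             self.attn = nn.MultiheadAttention(
         |                 d, heads, batch_first=True)
         |             self.ff = nn.Sequential(
         |                 nn.Linear(d, 4 * d), nn.GELU(),
         |                 nn.Linear(4 * d, d))
         |             self.n1 = nn.LayerNorm(d)
         |             self.n2 = nn.LayerNorm(d)
         | 
         |         def forward(self, x):
         |             # attention: tokens exchange information
         |             a, _ = self.attn(x, x, x, need_weights=False)
         |             x = self.n1(x + a)
         |             # feed-forward: per-token nonlinear transform
         |             return self.n2(x + self.ff(x))
         | 
         |     print(Block()(torch.randn(2, 10, 64)).shape)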
        
       | novaRom wrote:
       | TL;DR: we've trained an n-gram model, implemented as a suffix
       | array, on a very large data set, and it shows good perplexity.
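       | 
       | The core trick in miniature (a toy sketch, nothing like the
       | real engine's scale or compression):
       | 
       |     import bisect
       | 
       |     tokens = "a b c d b c e b c e b c d".split()
       | 
       |     # suffix array in miniature: every suffix, sorted
       |     # (the real engine stores offsets, not whole suffixes)
       |     suffixes = sorted(tokens[i:] for i in range(len(tokens)))
       |     HIGH = chr(0x10FFFF)   # sorts after any real token
       | 
       |     def count(ngram):
       |         lo = bisect.bisect_left(suffixes, ngram)
       |         hi = bisect.bisect_left(suffixes, ngram + [HIGH])
       |         return hi - lo
       | 
       |     # P(e | b c) = count("b c e") / count("b c")
       |     print(count(["b", "c", "e"]) / count(["b", "c"]))  # 0.5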
        
       | throwaway81523 wrote:
       | Is this sort of like Dissociated Press in Emacs?
        
       | bglazer wrote:
       | Somewhat off-topic, but has anyone tried training a random forest
       | at LLM scale? Like an RF with millions of trees and billions of
       | branches? My intuition says it could be much more efficient on
       | CPUs, provided it works at all.
        
         | skyde wrote:
         | Forests are good at classification, but they cannot leverage
         | pre-training on unlabeled data.
        
       | tgv wrote:
       | What this shows, IMO, is that quite a lot of the expected output
       | is almost literally present in the training data.
        
         | flawsofar wrote:
         | Phrases and ideas repeat, because the universe has patterns and
         | structure.
        
       | matt4711 wrote:
       | A paper [1] we wrote in 2015 (cited by the authors) uses some
       | more sophisticated data structures (compressed suffix trees) and
       | Kneser-Ney smoothing to get the same "unlimited" context. I
       | imagine that with better smoothing and the same larger corpora
       | the authors use, this could improve on some of the results
       | they report.
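       | 
       | For concreteness, interpolated Kneser-Ney backs off with an
       | absolute discount d (roughly, and from memory; this may not be
       | the exact variant in [1]):
       | 
       |     P_{KN}(w \mid h) = \frac{\max(c(hw) - d, 0)}{c(h)}
       |                        + \lambda(h) \, P_{KN}(w \mid h')
       | 
       |     \lambda(h) = d \, N_{1+}(h \cdot) / c(h)
       | 
       | where h' drops the oldest token from the context h and
       | N_{1+}(h \cdot) counts the distinct token types seen after h.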
       | 
       | Back then neural LMs were just beginning to emerge and we briefly
       | experimented with using one of these unlimited N-gram models for
       | pre-training but never got any results.
       | 
       | [1] https://aclanthology.org/D15-1288.pdf
        
         | inciampati wrote:
         | The formulation of a succinct full-text index as an "infinite
         | n-gram model" is genius in connecting with the ML literature,
         | but also turns a blind eye to all the work on using exactly
         | this data structure (well, the FM-index) over DNA. bwa mem is
         | the OG ∞-gram aligner.
        
       | drdeca wrote:
       | A few thoughts this brings to mind:
       | 
       | 1) this reminds me of how I was thinking about "what if you tried
       | to modify an n-gram model to incorporate a poor-man's
       | approximation of the copying heads found in transformer models?
       | (one which is also based just on counting)"
       | 
       | 2) What if you trained a model to, given a sequence of tokens,
       | predict how many times that sequence of tokens appeared in the
       | training set? And/or trained a model to, given a sequence of
       | tokens, predict the length of the longest suffix of it that
       | appears in the training set. Could this be done in a way that
       | gave a good
       | approximation to the actual counts, while being substantially
       | smaller? (Though, of course, this would fail to have the
       | attribution property that they mentioned.)
       | 
       | 3) It surprised me that they just always use the largest n such
       | that there is a sample in the training set. My thought would have
       | been to combine the different choices of n with some weighting.
       | 
       | Like, if you have one example in the training set of "a b c d",
       | and a million examples of "b c e", and only one or two other
       | examples of "b c d" other than the single instance of "a b c d",
       | does increasing the number of samples of "b c e" (which aren't
       | preceded by "a") really give no weight towards "e" rather than
       | "d" for the next token?
        
       | hiddencost wrote:
       | Unfortunately, single-digit milliseconds is quite slow for
       | n-gram language models.
       | 
       | Weighted Finite State Transducers are where they really shine;
       | you tend to want to do massive parallel operations, and you
       | want to do them without being bottlenecked on memory access. I
       | think these constraints make this framework hard to adopt.
        
       | snats wrote:
       | I recently wrote a writeup on bigrams and the infinigram
       | outputs [1]. I genuinely believe that n-grams are making a
       | comeback. For query searching they're really good.
       | 
       | [1]
       | https://snats.xyz/pages/articles/from_bigram_to_infinigram.h...
        
       | omeze wrote:
       | This is a really cool paper; it reminds me of the simple
       | exercise Karpathy goes through in his NN video series with a
       | bigram predictor. It looks like in practice there are still
       | some grounding issues when attempting to use these models for
       | instruction-tuned applications, but it's a clever direction to
       | explore!
        
       | mfornet wrote:
       | As I see it, this model will be able to predict
       | "easy"-to-derive tokens but will have no chance on "hard"
       | tokens.
       | 
       | For example, summing random numbers: if the token you are
       | trying to predict is not in the training data, even if similar
       | patterns exist, this model defaults to the neural model.
       | 
       | I guess it is then an aid to the neural model for filling in
       | the easy patterns.
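       | 
       | Rough sketch of how I read the combination step (this may not
       | match the paper's exact scheme):
       | 
       |     def combine(p_neural, p_infinigram, ngram_count, lam=0.5):
       |         # no n-gram evidence at all: fall back to the
       |         # neural model's distribution
       |         if ngram_count == 0:
       |             return dict(p_neural)
       |         vocab = sorted(set(p_neural) | set(p_infinigram))
       |         return {w: lam * p_infinigram.get(w, 0.0)
       |                    + (1 - lam) * p_neural.get(w, 0.0)
       |                 for w in vocab}
       | 
       |     print(combine({"7": 0.6, "8": 0.4}, {"7": 1.0}, 3))
       |     # {'7': 0.8, '8': 0.2}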
        
       ___________________________________________________________________
       (page generated 2024-05-05 23:00 UTC)