[HN Gopher] Infini-Gram: Scaling unbounded n-gram language model...
       ___________________________________________________________________
        
       Infini-Gram: Scaling unbounded n-gram language models to a trillion
       tokens
        
       Author : nsagent
       Score  : 132 points
       Date   : 2024-05-05 17:58 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | Genbox wrote:
       | A web interface for the Infini-gram engine can be found here:
       | https://huggingface.co/spaces/liujch1998/infini-gram
        
       | novaomnidev wrote:
       | The Hugging Face demo site doesn't seem to work with any of the
       | examples. It returns an error in every case.
        
       | kmeisthax wrote:
       | My current mental model of LLMs is that attention refines
       | information from the input (e.g. "red balloon" becomes some
       | vector that means red + balloon), and that the feed-forward
       | layers are lossy compressed indexes into the training set.
       | 
       | From the paper, having a better n-gram index is giving as much
       | perplexity improvement as a 10x increase in parameters. I wonder
       | if it would make sense to train new foundation models with more
       | attention heads and use infinigram in lieu of the feed-forward
       | layer.
        
         | sigmoid10 wrote:
         | Attention just allows the model to attend to elements across
         | the input sequence. The feed-forward layers are what give it
         | (and basically every other architecture) its universal
         | function-approximation abilities. The only reason we don't use
         | dense layers directly across the input sequence, and instead
         | go for things like convolution, recurrence, or transformers,
         | is that doing so is prohibitively expensive computationally.
         | But replacing the dense layers with n-grams would make LLMs
         | exactly what so many people falsely believe them to be right
         | now: pure stochastic parrots instead of functions that can
         | actually learn to generalize from examples.
        
           | Chrupiter wrote:
           | Aren't humans stochastic parrots in the end? I mean, when we
           | "learn", aren't we just fitting our internal stochastic
           | functions? Whether it's walking, learning a language, or
           | anything else.
        
             | fnordpiglet wrote:
             | No, because we are able to extrapolate from our experience.
             | The ability to synthesize something coherent that doesn't
             | map directly into our training set is a major difference
             | between human intelligence and what we call AI today.
        
               | alexeldeib wrote:
               | Isn't there an argument that our brains are simply
               | better at statistics and modeling than current AI?
               | Forget architectural limitations. What is the nature of
               | the extrapolation? How do individuals weigh their
               | experiences and determine likely outcomes?
        
               | fnordpiglet wrote:
               | Maybe! But even so, there are faculties AI lacks that
               | are more capability-based than model-based. For
               | instance, we demonstrate agency, and we can simulate
               | things in our minds alone - arriving at Maxwell's
               | equations, general relativity, or any number of other
               | profound insights that aren't based on our training data
               | but are extrapolations of our minds into domains we've
               | had no experience with, insights never conceived of
               | before. Statistical models generally aren't able to do
               | this - they're reflections of their training set, even
               | if very complex ones. The human mind can create its own
               | training set, and that's a remarkable capability.
        
               | strangescript wrote:
               | "extrapolate from our experience" "synthesize something
               | coherent"
               | 
               | These are non-scientific concepts. You are basically
               | saying "humans are doing something more, but we can't
               | really explain it".
               | 
               | That assumption is getting weaker by the day. Our entire
               | existence is a single, linear, time sequence data set. Am
               | I "extrapolating from my experience" when I decide to
               | scratch my head? No, I got a sequential data point of an
               | "itch" and my reward programming has learned to output
               | "scratch".
        
               | fnordpiglet wrote:
               | Are you saying the discovery of relativity happened
               | because Einstein was reacting to some reward / stimulus
               | in his environment? Were Galois' discoveries a
               | stochastic parrot regurgitating stimuli from his life?
               | 
               | There are known faculties humans have that LLMs
               | especially do not, such as actual memory, the ability to
               | simulate the world independently via imagination and
               | structured thought, as well as faculties we don't really
               | understand but that AIs definitely don't have, which are
               | the source of our fundamental agency. We are absolutely
               | able to create thought and reasoning without direct
               | stimulus or a response to something in the environment -
               | and it's frankly bizarre that a human being can believe
               | they've never done something as a reaction to their
               | internal state rather than an extrinsic one.
               | 
               | LLMs literally cannot "do" anything that isn't
               | predicated on their training set. This means, more or
               | less, they can only interpolate within their populated
               | vector space. The emergent properties are astounding,
               | and they absolutely demonstrate what appears to be some
               | form of pseudo-abductive reasoning, which is powerful. I
               | think it's probably the most important advance in
               | computing in the last 30 years. But people have confused
               | a remarkable capability for a human-like capability, and
               | have simultaneously missed the importance of the advance
               | as well as inexplicably diminished the remarkable
               | capabilities of the human mind. It's possible that with
               | more research we will bridge the gaps, and I'm not
               | appealing to magic of the soul here.
               | 
               | But the human mind has a remarkable ability to reason,
               | synthesize, and extrapolate beyond its experience, and
               | those are all things LLMs fundamentally - on a rigorous
               | mathematical basis - cannot do and will never do alone.
               | Anything that bridges that gap will need an ensemble of
               | AI and classical computing techniques - and maybe LLMs
               | will be a core part of something even more amazing. But
               | we aren't there yet, and I've not seen a roadmap that
               | takes us there.
        
               | Der_Einzige wrote:
               | The overwhelming majority of human advancements are
               | interpolation. Extrapolation is rare, and we tend to
               | realize something was extrapolation only after the fact.
        
               | fnordpiglet wrote:
               | Some achievements are very clearly extrapolation: Galois
               | field theory, general relativity, Green's theorem, to
               | name a few I'm familiar with. These are the leaps of
               | extrapolation that change the world - and the human mind
               | has capabilities LLMs simply don't: agency, faculties
               | for speculative simulation of approximations (dreaming
               | and imagination), random-access recall memory, among
               | others. These aren't woo-woo magic human ideas; they're
               | measurable and identifiable capabilities that, by
               | construction, LLMs can't have - even if they can
               | approximate them in a compelling way. That doesn't mean
               | we can't build something with these and other
               | intelligence capabilities that today's AI lacks, or that
               | LLMs aren't profoundly useful. But they literally can't
               | do anything but "walk" their vector space - and nothing
               | but their vector space.
        
             | imtringued wrote:
             | If humans were just stochastic parrots and I asked you
             | what 5+3+9 is, you wouldn't be able to calculate the
             | intermediate values inside your head. Is it really that
             | hard to believe that humans have internal thoughts and
             | that they think before they speak? Is it really such a
             | revelation that I have to remind you of it?
        
               | knome wrote:
               | Creating a small group of bot 'personalities' that have
               | an internal dialog, generating and sharing intermediate
               | values before coming to a consensus and issuing a
               | response to a user is trivial. I did it in my earliest
               | experiments with GPT-3.
               | 
               | You could use the same framework to generate an internal
               | dialog for a bot.
               | 
               | A lot of people don't think before they speak. If you
               | tell me you have a small conversation with yourself
               | before each thing you say out loud during a conversation,
               | I will have doubts. Quick wit and fast paced conversation
               | do not leave time for any real internal narration, just
               | "stream of consciousness".
               | 
               | There is a time for carefully choosing and reflecting on
               | your words, surely, but there are many times staying in
               | tune with a real time conversation takes precedence.
        
               | littlestymaar wrote:
               | > Creating a small group of bot 'personalities' that have
               | an internal dialog, generating and sharing intermediate
               | values before coming to a consensus and issuing a
               | response to a user is trivial. I did it in my earliest
               | experiments with GPT-3
               | 
               | > You could use the same framework to generate an
               | internal dialog for a bot.
               | 
               | We can, for sure. But will it work? Given my (admittedly
               | limited) experience with feeding LLM-generated stuff
               | back into the LLM, I'd suspect it may actually lower the
               | output quality. But maybe fine-tuning for this specific
               | use case could solve the problem, as I suspect the
               | instruction-tuning is a culprit in the poor behavior
               | I've witnessed (the bots have been instruction-tuned to
               | believe the human and to apologize if you tell them
               | they've made mistakes, for instance, even when they were
               | right in the first place, so this blind trust is likely
               | polluting the results).
        
               | CuriouslyC wrote:
               | You make it sound binary. We have a thought process
               | ongoing at all times - sometimes we wait to respond for
               | that process to accumulate more information, and
               | sometimes we fire off right away, but the information
               | processing engine is running in the background
               | regardless.
        
               | Workaccount2 wrote:
               | Here is the rub; even when you "stop and think" it's
               | still just a stream of consciousness. The internal
               | decisions about what to say arise out of the same mental
               | abyss as "stream of consciousness" thoughts.
               | 
               | If you pay attention, you can catch that it is all just
               | an illusion.
        
             | jillesvangurp wrote:
             | If you've ever had a conversation with a toddler, they do
             | sound a bit like stochastic parrots. It takes us a while
             | to be able to talk coherently. The learning process in
             | schools involves a lot of repetition, from learning the
             | ABCs to mastering calculus.
        
               | sigmoid10 wrote:
               | Toddlers are just learning the building blocks of
               | language. You could make the same statement about any new
               | skill. However, at some point, most humans gain the
               | ability to take two concepts they have heard about before
               | and create a third concept that they have never
               | encountered. You can also get that with artificial neural
               | networks, but it is fundamentally impossible with
               | n-grams.
        
               | PaulHoule wrote:
               | n-grams will be able to figure out
               | 
               |     4 + 5 = 9
               | 
               | or
               | 
               |     1729 is a taxicab number
               | 
               | if those phrases are in the database but not
               | 
               |     4 + 05 = 9
               |     5 + 4 = 9
               |     11 + 3 = 14
               |     48988659276962496 is a taxicab number
               | 
               | if those are not in the database.
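               | 
               | A minimal sketch of that exact-match behavior (toy
               | corpus and a plain Python count table, purely
               | illustrative; the paper's engine uses a suffix array
               | rather than a table like this):
               | 
               |     from collections import Counter
               | 
               |     corpus = "4 + 5 = 9 . 1729 is a taxicab number".split()
               | 
               |     def ngram_counts(tokens, n):
               |         # Count every contiguous n-token window.
               |         return Counter(
               |             tuple(tokens[i:i + n])
               |             for i in range(len(tokens) - n + 1))
               | 
               |     counts = ngram_counts(corpus, 5)
               |     # Seen verbatim -> 1; never seen -> 0.
               |     print(counts["4", "+", "5", "=", "9"])
               |     print(counts["5", "+", "4", "=", "9"])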
        
             | root_axis wrote:
             | No, that's absurdly reductive. You might as well say
             | "aren't humans just calculators made of meat in the end?".
             | If you append "in the end" to any analogy you'll find some
             | people that are willing to stretch to fit the analogy
             | because they like it.
        
               | margalabargala wrote:
               | Everything is entropy in the end.
        
             | krainboltgreene wrote:
             | There's at least one of these in every LLM critical thread.
        
       | novaRom wrote:
       | TL;DR: an n-gram model implemented as a suffix array over a
       | very large data set; it shows good perplexity.
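       | 
       | For anyone unfamiliar with the trick: a suffix array over the
       | corpus lets you count an n-gram of any length with two binary
       | searches, which is roughly what the engine does at trillion-
       | token scale. A toy sketch (word tokens and in-memory lists,
       | purely illustrative; the real index works over token ids on
       | disk):
       | 
       |     tokens = "the cat sat on the mat the cat slept".split()
       | 
       |     # Suffix array: all suffix start positions, sorted.
       |     sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])
       | 
       |     def bound(query, strict):
       |         # Leftmost suffix whose prefix is >= query
       |         # (or > query if strict).
       |         lo, hi = 0, len(sa)
       |         while lo < hi:
       |             mid = (lo + hi) // 2
       |             p = tokens[sa[mid]:sa[mid] + len(query)]
       |             if (p > query) if strict else (p >= query):
       |                 hi = mid
       |             else:
       |                 lo = mid + 1
       |         return lo
       | 
       |     def count(query):
       |         # Occurrences = width of the suffix-array range.
       |         return bound(query, True) - bound(query, False)
       | 
       |     print(count(["the", "cat"]))         # 2
       |     print(count(["the", "cat", "sat"]))  # 1
       |     print(count(["the", "dog"]))         # 0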
        
       | throwaway81523 wrote:
       | Is this sort of like Dissociated Press in Emacs?
        
       | bglazer wrote:
       | Somewhat off-topic, but has anyone tried training a random forest
       | at LLM scale? Like an RF with millions of trees and billions of
       | branches? My intuition says it could be much more efficient on
       | CPUs, provided it works at all.
        
         | skyde wrote:
         | Forests are good at classification, but they cannot leverage
         | pre-training on unlabeled data.
        
           | ColonelPhantom wrote:
           | Aren't LLMs a classification problem in a sense? "Given this
           | text, classify it based on the next token" seems like a
           | viable interpretation of the problem. Although there are a
           | couple thousand or so classes, which might be a lot (but I
           | know very little about this field).
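           | 
           | That framing can be tried directly at toy scale. A hedged
           | sketch using scikit-learn's RandomForestClassifier (my
           | choice of library, purely for illustration; raw token ids
           | are a crude feature encoding, and a real LLM vocabulary
           | runs to tens of thousands of classes):
           | 
           |     from sklearn.ensemble import RandomForestClassifier
           | 
           |     text = ("the cat sat on the mat the cat slept "
           |             "on the mat the dog sat on the rug").split()
           |     vocab = {w: i for i, w in enumerate(sorted(set(text)))}
           |     ids = [vocab[w] for w in text]
           | 
           |     n = 3  # context length
           |     X = [ids[i:i + n] for i in range(len(ids) - n)]
           |     y = [ids[i + n] for i in range(len(ids) - n)]
           | 
           |     clf = RandomForestClassifier(n_estimators=50,
           |                                  random_state=0)
           |     clf.fit(X, y)
           | 
           |     ctx = [vocab[w] for w in ["sat", "on", "the"]]
           |     pred = clf.predict([ctx])[0]
           |     # Most likely "mat" or "rug", the continuations seen
           |     # after "sat on the" in the toy corpus.
           |     print([w for w, i in vocab.items() if i == pred])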
        
           | thesz wrote:
           | https://gradientdescending.com/unsupervised-random-forest-
           | ex...
           | 
           | You can cluster data using unsupervised random forests and
           | then use these cluster indices as features.
        
         | solresol wrote:
         | I'm working on it, but with a loss function that penalises
         | hypernyms/hyponyms less than other kinds of mistakes. I'm
         | vaguely optimistic that this is going to be more efficient.
        
         | Der_Einzige wrote:
         | I've been talking about this for years! It's fully possible!
        
       | tgv wrote:
       | What this shows, IMO, is that quite a lot of the expected output
       | is almost literally present in the training data.
        
         | flawsofar wrote:
         | Phrases and ideas repeat, because the universe has patterns and
         | structure.
        
           | tgv wrote:
           | But it also suggests that neural networks are not the
           | miracle some proclaim them to be. They get their knowledge
           | from compressing a very large amount of text, but add little
           | in terms of inference. Their success is then a function of
           | their size, but that size cannot grow much further. It also
           | ties in with the idea that current ANNs don't discriminate
           | between texts, and repeat bad input as readily as anything
           | else.
        
       | matt4711 wrote:
       | A paper [1] we wrote in 2015 (cited by the authors) uses some
       | more sophisticated data structures (compressed suffix trees) and
       | Kneser-Ney smoothing to get the same "unlimited" context. I
       | imagine that with better smoothing, and the same larger corpus
       | sizes the authors use, this could improve on some of the results
       | they report.
       | 
       | Back then, neural LMs were just beginning to emerge, and we
       | briefly experimented with using one of these unlimited n-gram
       | models for pre-training, but never got any results.
       | 
       | [1] https://aclanthology.org/D15-1288.pdf
        
         | inciampati wrote:
         | The formulation of a succinct full-text index as an "infinite
         | n-gram model" is genius in connecting with the ML literature,
         | but also turns a blind eye to all the work on using exactly
         | this data structure (well, the FM-index) over DNA. bwa mem is
         | the OG ∞-gram aligner.
        
           | jltsiren wrote:
           | The idea is far older than BWA. The entire point of the
           | Burrows-Wheeler transform is to create a compact order-k
           | "language model" for every value of k simultaneously. And
           | then use that for data compression.
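           | 
           | A tiny sketch of that idea (illustrative only, not an
           | FM-index): sorting all rotations of a string groups equal
           | contexts together, so the last column - the BWT - clusters
           | the symbols that precede each context, in effect an order-k
           | view for every k at once.
           | 
           |     def bwt(s):
           |         s = s + "$"  # unique end-of-string sentinel
           |         rots = sorted(s[i:] + s[:i]
           |                       for i in range(len(s)))
           |         return "".join(r[-1] for r in rots)
           | 
           |     # "annb$aa": symbols preceding similar contexts cluster.
           |     print(bwt("banana"))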
           | 
           | David Wheeler reportedly had the idea in the early 80s, but
           | he rejected it as impractical. Then he tried publishing it
           | with Michael Burrows in the early 90s, but it was rejected
           | from the Data Compression Conference. Algorithms researchers
           | found the BWT more interesting than data compression
           | researchers, especially after the discovery of the FM-index
           | in 2000. There was a lot of theoretical work done in the
           | 2000s, and the paper mentioned by GP cites some of that.
           | 
           | The primary application imagined for the FM-index was usually
           | information retrieval, mostly due to the people involved. But
           | some people considered using it for DNA sequences and started
           | focusing on efficient construction and approximate pattern
           | matching instead of better compression. And when sequencing
           | technology advanced to the point where BWT-based aligners
           | became relevant, a few of them were published almost
           | simultaneously.
           | 
           | And if you ask Heng Li, he gives credit to Tak-Wah Lam:
           | https://lh3.github.io/2024/04/12/where-did-bwa-come-from
        
       | drdeca wrote:
       | A few thoughts this brings to mind:
       | 
       | 1) this reminds me of how I was thinking about "what if you tried
       | to modify an n-gram model to incorporate a poor-man's
       | approximation of the copying heads found in transformer models?
       | (one which is also based just on counting)"
       | 
       | 2) What if you trained a model to, given a sequence of tokens,
       | predict how many times that sequence of tokens appeared in the
       | training set? And/or trained a model to, given a sequence of
       | tokens, predict the length of the longest suffix of it that appears
       | in the training set. Could this be done in a way that gave a good
       | approximation to the actual counts, while being substantially
       | smaller? (Though, of course, this would fail to have the
       | attribution property that they mentioned.)
       | 
       | 3) It surprised me that they just always use the largest n such
       | that there is a sample in the training set. My thought would have
       | been to combine the different choices of n with some weighting.
       | 
       | Like, if you have one example in the training set of "a b c d",
       | and a million examples of "b c e", and only one or two other
       | examples of "b c d" other than the single instance of "a b c d",
       | does increasing the number of samples of "b c e" (which aren't
       | preceded by "a") really give no weight towards "e" rather than
       | "d" for the next token?
        
       | hiddencost wrote:
       | Unfortunately, single-digit milliseconds is quite slow for
       | n-gram language models.
       | 
       | Weighted Finite State Transducers are where they really shine;
       | you tend to want to do massive parallel operations, and you want
       | to do them without being bottlenecked on memory access. I think
       | those constraints make this framework hard to adopt.
        
       | snats wrote:
       | I recently wrote a writeup on bigrams and the infini-gram
       | outputs[1]. I genuinely believe that n-grams are making a
       | comeback. For query searching they're really good.
       | 
       | [1]
       | https://snats.xyz/pages/articles/from_bigram_to_infinigram.h...
        
         | tarasglek wrote:
         | Would love to follow your blog, but there's no RSS.
        
       | omeze wrote:
       | This is a really cool paper. It reminds me of the simple
       | exercise Karpathy goes through in his NN video series with a
       | bigram predictor. Looks like in practice there are still some
       | grounding issues when attempting to use them for
       | instruction-tuned applications, but it's a clever direction to
       | explore!
        
       | mfornet wrote:
       | As I see it, this model will be able to predict "easy"-to-derive
       | tokens but has no chance on "hard" tokens.
       | 
       | For example, summing random numbers. If the token you are trying
       | to predict is not in the training data, even if similar patterns
       | exist, this model defaults to the neural model.
       | 
       | I guess it is then an aid to the neural model, filling in the
       | easy patterns.
        
       | om8 wrote:
       | Perplexity results are impressive. I wonder how the combined
       | models perform on MMLU and other problem-solving benchmarks. It
       | could be that this infini-gram method is exactly the thing that
       | "hacks" perplexity without adding any "understanding" to the
       | model.
        
         | om8 wrote:
         | P.S. They acknowledge this: "... our preliminary experiments
         | show that such method might not be helpful, and even harmful,
         | to open-ended text generation tasks. During generation,
         | ∞-gram can make odd mistakes (e.g., predicting totally
         | irrelevant tokens) which makes the model to digress. Thus this
         | combined model is not ready to replace neural LMs. Additional
         | investigation is required to make inf-gram best contribute to
         | text generation."
        
       | FloatArtifact wrote:
       | I can't help but want complex grammars for speech recognition
       | :-). There really needs to be a hybrid between grammar-based
       | speech recognition commands and natural-language commands.
        
         | bmc7505 wrote:
         | It's coming! CMUSphinx used to have something like this, and
         | there are some [1] solutions [2] on the horizon.
         | 
         | [1]: https://github.com/alphacep/vosk-api/issues/55
         | 
         | [2]: https://github.com/outlines-dev/outlines?tab=readme-ov-
         | file#...
        
       | mirekrusin wrote:
       | Could be used to dedup training data.
        
       | thesz wrote:
       | The paper does not cite [1], which describes a prediction-by-
       | partial-match (PPM) algorithm with unbounded context length.
       | 
       | [1]
       | https://www.researchgate.net/publication/2473004_Unbounded_L...
       | 
       | That paper is from 2003 and is actually quite interesting. It
       | shows, for example, that these models were applied to rather
       | small texts, because of the presence of a context of order -1
       | (the uniform distribution). When the text is large, the need for
       | such a context vanishes, but for small texts it can give an
       | advantage.
        
       | renonce wrote:
       | I think this leaves room to explore language models
       | _regularized_ (or maybe augmented?) by an n-gram model: instead
       | of predicting the next token without any external knowledge, the
       | n-gram predictions could be added to the softmax head as a
       | "default" prediction. The language model's job would then be to
       | improve accuracy on top of the n-gram prediction. This shifts
       | some of the language model's work to the n-gram predictor, which
       | relies on traditional methods and not GPUs, saving a lot of
       | computation.
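       | 
       | A hedged sketch of one way "added to the softmax head" could
       | look, using NumPy for the arithmetic (toy vocabulary, made-up
       | counts, add-one smoothing; not necessarily how the paper
       | combines the two models):
       | 
       |     import numpy as np
       | 
       |     vocab_size = 5
       |     lm_logits = np.random.randn(vocab_size)
       |     ngram_counts = np.array([8., 1., 0., 0., 1.])
       | 
       |     # Smooth counts into a "default" distribution and add
       |     # its log as a bias on the logits.
       |     prior = (ngram_counts + 1) / (ngram_counts + 1).sum()
       |     logits = lm_logits + np.log(prior)
       | 
       |     probs = np.exp(logits - logits.max())
       |     probs /= probs.sum()
       |     print(probs)  # neural head only corrects the n-gram prior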
       | 
       | EDIT: Oh, so this thing takes 10TB of disk space for the index,
       | while an LLM takes... 350GB (GPT-3's 175B parameters in fp16). A
       | huge resource requirement that cannot be ignored.
        
       ___________________________________________________________________
       (page generated 2024-05-06 23:02 UTC)