[HN Gopher] Infini-Gram: Scaling unbounded n-gram language model...
___________________________________________________________________
Infini-Gram: Scaling unbounded n-gram language models to a trillion
tokens
Author : nsagent
Score : 132 points
Date : 2024-05-05 17:58 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| Genbox wrote:
| A web interface for the Infini-gram engine can be found here:
| https://huggingface.co/spaces/liujch1998/infini-gram
| novaomnidev wrote:
| The Hugging Face demo doesn't seem to work with any of the
| examples. It returns an error in every case.
| kmeisthax wrote:
| My current mental model of LLMs is that attention refines
| information from the input (e.g. "red balloon" becomes some
| vector that means red + balloon), and that the feed-forward
| layers are lossy compressed indexes into the training set.
|
| From the paper, having a better n-gram index is giving as much
| perplexity improvement as a 10x increase in parameters. I wonder
| if it would make sense to train new foundation models with more
| attention heads and use infinigram in lieu of the feed-forward
| layer.
| sigmoid10 wrote:
| Attention just allows the model to attend to elements across
| the input sequence. The feed-forward layer is what gives it
| (and basically every other architecture) its universal
| function-approximation ability. The only reason we don't apply
| dense layers directly across the whole input sequence, and
| instead reach for things like convolution, recursion, or
| transformers, is that doing so is prohibitively expensive
| computationally. But replacing the dense layer with n-grams
| would make LLMs exactly what so many people falsely believe
| them to be right now: pure stochastic parrots instead of
| functions that can actually learn to generalize from examples.
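|
| A minimal sketch of that division of labor in PyTorch
| (illustrative only): attention mixes information across
| positions, while the feed-forward part is a plain per-token
| MLP.
|
|     # Minimal transformer block: attention mixes across positions,
|     # the feed-forward MLP transforms each position independently.
|     import torch
|     import torch.nn as nn
|
|     class MiniBlock(nn.Module):
|         def __init__(self, d_model=64, n_heads=4):
|             super().__init__()
|             self.attn = nn.MultiheadAttention(d_model, n_heads,
|                                               batch_first=True)
|             self.ff = nn.Sequential(         # per-token "universal function"
|                 nn.Linear(d_model, 4 * d_model),
|                 nn.GELU(),
|                 nn.Linear(4 * d_model, d_model),
|             )
|             self.ln1 = nn.LayerNorm(d_model)
|             self.ln2 = nn.LayerNorm(d_model)
|
|         def forward(self, x):                # x: (batch, seq, d_model)
|             a, _ = self.attn(x, x, x)        # tokens attend to each other
|             x = self.ln1(x + a)
|             return self.ln2(x + self.ff(x))  # position-wise transformation
|
|     x = torch.randn(1, 8, 64)
|     print(MiniBlock()(x).shape)              # torch.Size([1, 8, 64])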
| Chrupiter wrote:
| Aren't humans stochastic parrots in the end? I mean, when we
| "learn", don't we model our internal stochastic functions?
| Whether it is walking, learning a language, or anything else.
| fnordpiglet wrote:
| No, because we are able to extrapolate from our experience.
| The ability to synthesize something coherent that doesn't
| map directly into our training set is a major difference
| between human intelligence and what we call AI today.
| alexeldeib wrote:
| Isn't there an argument that our brains are simply better at
| statistics and modeling than current AI? Forget
| architectural limitations. What is the nature of the
| extrapolation? How do individuals balance their
| experiences and determine likely outcomes?
| fnordpiglet wrote:
| Maybe! But even so, there are facilities AI lacks that are
| more capability-based than model-based. For instance, we
| demonstrate agency, and we can simulate things in our minds
| alone: arriving at Maxwell's equations, or general relativity,
| or any number of other profound insights that aren't based on
| our training data but are extrapolations through our minds
| into domains we have no experience with. Statistical models
| generally aren't able to do this - they're reflections of
| their training set, even if very complex ones. The human mind
| can create its own training set, and that's a remarkable
| capability.
| strangescript wrote:
| "extrapolate from our experience" "synthesize something
| coherent"
|
| These are non-scientific concepts. You are basically
| saying "humans are doing something more, but we can't
| really explain it".
|
| That assumption is getting weaker by the day. Our entire
| existence is a single, linear, time sequence data set. Am
| I "extrapolating from my experience" when I decide to
| scratch my head? No, I got a sequential data point of an
| "itch" and my reward programming has learned to output
| "scratch".
| fnordpiglet wrote:
| Are you saying the discovery of relativity happened
| because Einstein was reacting to some reward / stimulus
| in his environment? Galois' discoveries were a stochastic
| parrot regurgitating stimulus from his life?
|
| There are known faculties humans have that LLMs in particular
| do not: actual memory, the ability to simulate the world
| independently via imagination and structured thought, as well
| as facilities we don't really understand - but which AIs
| definitely don't have - that are the source of our fundamental
| agency. We are absolutely able to create thought and reasoning
| without direct stimulus or a response to something in the
| environment, and it's frankly bizarre that a human being can
| believe they've never done something as a reaction to their
| internal state rather than an extrinsic one.
|
| LLMs literally can not "do" anything that isn't
| predicated on their training set. This means, more or
| less, they can only interpolate within their populated
| vector space. The emergent properties are astounding and
| they absolutely demonstrate what appears to be some form
| of pseudo abductive reasoning which is powerful. I think
| it's probably the most important advance of computing in
| the last 30 years. But people have confused a remarkable
| capability for a human like capability, and have
| simultaneously missed the importance of the advance as
| well as inexplicably diminished the remarkable
| capabilities of the human mind. It's possible with more
| research we will bridge the gaps, and I'm not appealing
| to magic of the soul here.
|
| But the human mind has a remarkable ability to reason,
| synthesize, and extrapolate beyond its experience, and those
| are all things LLMs fundamentally - on a rigorous mathematical
| basis - cannot do and will never do alone. Anything that
| bridges that gap will need an ensemble of AI and classical
| computing techniques - and maybe LLMs will be a core part of
| something even more amazing. But we aren't there yet, and I've
| not seen a roadmap that takes us there.
| Der_Einzige wrote:
| The overwhelming majority of human advancements is
| interpolation. Extrapolation is rare and we tend to only
| realize something was extrapolation after the fact.
| fnordpiglet wrote:
| Some achievements are very clearly extrapolation: Galois field
| theory, general relativity, Green's theorem, to name a few I'm
| familiar with. These are the leaps of extrapolation that
| change the world - but the human mind has capabilities LLMs
| simply don't: agency, a facility for speculative simulation of
| approximations (dreaming and imagination), random-access
| recall memory, among others. These aren't woo-woo magic human
| ideas; they're measurable and identifiable capabilities that,
| by construction, LLMs can't have - even if they can
| approximate them in a compelling way. That doesn't mean we
| can't build something with these and other intelligence
| capabilities today's AI lacks, or that LLMs aren't profoundly
| useful. But they literally can't do anything but "walk" their
| vector space - and nothing but their vector space.
| imtringued wrote:
| Imagine I asked you what 5+3+9 is, but you weren't allowed to
| calculate the intermediate values inside your head. Is it
| really that hard to believe that humans have internal thoughts
| and that they think before they speak? Is it really such a
| revelation that I have to remind you of it?
| knome wrote:
| Creating a small group of bot 'personalities' that have
| an internal dialog, generating and sharing intermediate
| values before coming to a consensus and issuing a
| response to a user is trivial. I did it in my earliest
| experiments with GPT-3.
|
| You could use the same framework to generate an internal
| dialog for a bot.
|
| A lot of people don't think before they speak. If you
| tell me you have a small conversation with yourself
| before each thing you say out loud during a conversation,
| I will have doubts. Quick wit and fast paced conversation
| do not leave time for any real internal narration, just
| "stream of consciousness".
|
| There is a time for carefully choosing and reflecting on
| your words, surely, but there are many times staying in
| tune with a real time conversation takes precedence.
| littlestymaar wrote:
| > Creating a small group of bot 'personalities' that have
| an internal dialog, generating and sharing intermediate
| values before coming to a consensus and issuing a
| response to a user is trivial. I did it in my earliest
| experiments with GPT-3
|
| > You could use the same framework to generate an
| internal dialog for a bot.
|
| We can, for sure. But will it work? Given my (admittedly
| limited) experience with feeding LLM-generated output back
| into an LLM, I'd suspect it may actually lower the output
| quality. But maybe fine-tuning for this specific use case
| could be a solution to this problem, as I suspect instruction
| tuning is a culprit in the poor behavior I've witnessed (the
| bots have been instruction-tuned to believe the human, and to
| apologize if you tell them they've made mistakes, for
| instance, even if they were right in the first place, so this
| blind trust is likely polluting the results).
| CuriouslyC wrote:
| You make it sound binary. We have a thought process
| ongoing at all times - sometimes we wait to respond for
| that process to accumulate more information, and
| sometimes we fire off right away, but the information
| processing engine is running in the background
| regardless.
| Workaccount2 wrote:
| Here is the rub; even when you "stop and think" it's
| still just a stream of consciousness. The internal
| decisions about what to say arise out of the same mental
| abyss as "stream of consciousness" thoughts.
|
| If you pay attention, you can catch that it is all just
| an illusion.
| jillesvangurp wrote:
| If you've ever had a conversation with a toddler, they do
| sound a bit like stochastic parrots. It takes us a while to
| be able to talk coherently. The learning process in schools
| involves a lot of repetition, from learning the ABCs to
| mastering calculus.
| sigmoid10 wrote:
| Toddlers are just learning the building blocks of
| language. You could make the same statement about any new
| skill. However, at some point, most humans gain the
| ability to take two concepts they have heard about before
| and create a third concept that they have never
| encountered. You can also get that with artificial neural
| networks, but it is fundamentally impossible with
| n-grams.
| PaulHoule wrote:
| n-grams will be able to figure out
|
|     4 + 5 = 9
|
| or
|
|     1729 is a taxicab number
|
| if those phrases are in the database, but not
|
|     4 + 05 = 9
|     5 + 4 = 9
|     11 + 3 = 14
|     48988659276962496 is a taxicab number
|
| if those are not in the database.
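|
| Concretely (toy counts, just to make the point):
|
|     # An n-gram table only "knows" strings it has literally counted;
|     # trivially equivalent rephrasings get zero counts.
|     from collections import Counter
|
|     training_text = [
|         "4 + 5 = 9",
|         "1729 is a taxicab number",
|     ]
|     table = Counter(training_text)        # stand-in for the n-gram counts
|
|     for query in ["4 + 5 = 9", "5 + 4 = 9", "11 + 3 = 14"]:
|         print(query, "->", table[query])  # 1, 0, 0: no generalization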
| root_axis wrote:
| No, that's absurdly reductive. You might as well say
| "aren't humans just calculators made of meat in the end?".
| If you append "in the end" to any analogy you'll find some
| people that are willing to stretch to fit the analogy
| because they like it.
| margalabargala wrote:
| Everything is entropy in the end.
| krainboltgreene wrote:
| There's at least one of these in every LLM critical thread.
| novaRom wrote:
| TL;DR: we've trained an n-gram model, implemented in the form
| of a suffix array, on a very large dataset, and it shows good
| perplexity.
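|
| The core trick, as a toy sketch (token-level, tiny corpus;
| nothing here is the actual engine, which memory-maps a suffix
| array over trillions of tokens): sort all corpus suffixes once,
| then count any n-gram, of any length, with two binary searches.
|
|     # Counting arbitrary-length n-grams with a suffix array (toy version).
|     # Requires Python 3.10+ for bisect's key= argument.
|     from bisect import bisect_left, bisect_right
|
|     corpus = "the cat sat on the mat and the cat ran".split()
|
|     # suffix array: positions sorted by the token sequence starting there
|     sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])
|
|     def count(ngram):
|         """Occurrences of `ngram` (a list of tokens) in the corpus."""
|         key = lambda i: corpus[i:i + len(ngram)]
|         lo = bisect_left(sa, ngram, key=key)
|         hi = bisect_right(sa, ngram, key=key)
|         return hi - lo
|
|     print(count(["the", "cat"]))   # 2
|     print(count(["cat", "sat"]))   # 1
|     print(count(["the", "dog"]))   # 0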
| throwaway81523 wrote:
| Is this sort of like Dissociated Press in Emacs?
| bglazer wrote:
| Somewhat off-topic, but has anyone tried training a random forest
| at LLM scale? Like an RF with millions of trees and billions of
| branches? My intuition says it could be much more efficient on
| CPUs, provided it works at all.
| skyde wrote:
| Forests are good at classification, but they cannot leverage
| pre-training on unlabeled data.
| ColonelPhantom wrote:
| Aren't LLMs a classification problem in a sense? "Given this
| text, classify it based on the next token" seems like a
| viable interpretation of the problem. Although there are a
| couple thousand or so classes, which might be a lot (but I
| know very little about this field).
| thesz wrote:
| https://gradientdescending.com/unsupervised-random-forest-
| ex...
|
| You can cluster data using unsupervised random forests and
| then use these cluster indices as features.
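|
| For example, one way to do it with scikit-learn (a sketch of
| the general idea, not the exact recipe in the linked post):
|
|     # Embed rows with a fully random forest, cluster the embedding,
|     # and use the cluster index as a new feature.
|     import numpy as np
|     from sklearn.datasets import make_blobs
|     from sklearn.ensemble import RandomTreesEmbedding
|     from sklearn.cluster import KMeans
|
|     X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
|
|     # unsupervised forest: each sample is encoded by the leaves it hits
|     embedding = RandomTreesEmbedding(n_estimators=50, max_depth=3,
|                                      random_state=0).fit_transform(X)
|
|     cluster_ids = KMeans(n_clusters=3, n_init=10,
|                          random_state=0).fit_predict(embedding)
|
|     X_augmented = np.column_stack([X, cluster_ids])
|     print(X_augmented.shape)              # (200, 3)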
| solresol wrote:
| I'm working on it, but with a loss function that penalises
| hypernyms/hyponyms less than other kinds of mistakes. I'm
| vaguely optimistic that this is going to be more efficient.
| Der_Einzige wrote:
| I've been talking about this for years! It's fully possible!
| tgv wrote:
| What this shows, IMO, is that quite a lot of the expected output
| is almost literally present in the training data.
| flawsofar wrote:
| Phrases and ideas repeat, because the universe has patterns and
| structure.
| tgv wrote:
| But it also suggests that the neural networks are not the
| miracle some proclaim. They get their knowledge from
| compressing a very large amount of text, but add little in
| terms of inference. Their success is then a function of their
| size, but that size cannot grow much further. It also ties in
| with the idea that current ANNs don't discriminate between
| texts, and repeat bad input as readily as anything else.
| matt4711 wrote:
| A paper [1] we wrote in 2015 (cited by the authors) uses some
| more sophisticated data structures (compressed suffix trees) and
| Kneser-Ney smoothing to get the same "unlimited" context. I
| imagine that with better smoothing and the same larger corpus
| sizes the authors use, this could improve on some of the
| results they report.
|
| Back then neural LMs were just beginning to emerge and we briefly
| experimented with using one of these unlimited N-gram models for
| pre-training but never got any results.
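|
| For readers unfamiliar with the smoothing, here is a toy sketch
| of interpolated Kneser-Ney at the bigram level (illustrative
| only; nothing like the compressed-suffix-tree implementation in
| [1]):
|
|     # Interpolated Kneser-Ney for bigrams on a toy corpus.
|     from collections import Counter
|
|     tokens = "the cat sat on the mat the cat ran".split()
|     bigrams = list(zip(tokens, tokens[1:]))
|
|     c_bigram = Counter(bigrams)           # c(v, w)
|     c_history = Counter(tokens[:-1])      # c(v) as a bigram history
|     followers, preceders = {}, {}         # N1+(v, .), N1+(., w)
|     for v, w in c_bigram:
|         followers.setdefault(v, set()).add(w)
|         preceders.setdefault(w, set()).add(v)
|
|     d = 0.75                              # absolute discount
|
|     def p_continuation(w):
|         # distinct contexts w follows, over all bigram types
|         return len(preceders.get(w, ())) / len(c_bigram)
|
|     def p_kn(w, v):
|         discounted = max(c_bigram[(v, w)] - d, 0) / c_history[v]
|         backoff_weight = d * len(followers[v]) / c_history[v]
|         return discounted + backoff_weight * p_continuation(w)
|
|     print(p_kn("cat", "the"))             # seen bigram: high probability
|     print(p_kn("ran", "on"))              # unseen: falls back to continuation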
|
| [1] https://aclanthology.org/D15-1288.pdf
| inciampati wrote:
| The formulation of a succinct full-text index as an "infinite
| n-gram model" is genius in connecting with the ML literature,
| but it also turns a blind eye to all the work on using exactly
| this data structure (well, the FM-index) over DNA. bwa mem is
| the OG ∞-gram aligner.
| jltsiren wrote:
| The idea is far older than BWA. The entire point of the
| Burrows-Wheeler transform is to create a compact order-k
| "language model" for every value of k simultaneously. And
| then use that for data compression.
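|
| A toy illustration of that point (naive rotation sort, nothing
| like a real construction algorithm):
|
|     # Naive BWT: sort all rotations of the text, take the last column.
|     # Rows are sorted by their right context, so each output position
|     # holds the character preceding that context -- the counts needed
|     # for an order-k model, for every k at once.
|     def bwt(text: str) -> str:
|         s = text + "\x00"                  # unique end-of-string sentinel
|         rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
|         return "".join(rot[-1] for rot in rotations)
|
|     print(repr(bwt("banana")))             # 'annb\x00aa'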
|
| David Wheeler reportedly had the idea in the early 80s, but
| he rejected it as impractical. Then he tried publishing it
| with Michael Burrows in the early 90s, but it was rejected
| from the Data Compression Conference. Algorithms researchers
| found the BWT more interesting than data compression
| researchers, especially after the discovery of the FM-index
| in 2000. There was a lot of theoretical work done in the
| 2000s, and the paper mentioned by GP cites some of that.
|
| The primary application imagined for the FM-index was usually
| information retrieval, mostly due to the people involved. But
| some people considered using it for DNA sequences and started
| focusing on efficient construction and approximate pattern
| matching instead of better compression. And when sequencing
| technology advanced to the point where BWT-based aligners
| became relevant, a few of them were published almost
| simultaneously.
|
| And if you ask Heng Li, he gives credit to Tak-Wah Lam:
| https://lh3.github.io/2024/04/12/where-did-bwa-come-from
| drdeca wrote:
| A few thoughts this brings to mind:
|
| 1) this reminds me of how I was thinking about "what if you tried
| to modify an n-gram model to incorporate a poor-man's
| approximation of the copying heads found in transformer models?
| (one which is also based just on counting)"
|
| 2) What if you trained a model to, given a sequence of tokens,
| predict how many times that sequence of tokens appeared in the
| training set? And/or trained a model to, given a sequence of
| tokens, predict the length of the longest suffix of it that
| appears in the training set? Could this be done in a way that
| gave a good
| approximation to the actual counts, while being substantially
| smaller? (Though, of course, this would fail to have the
| attribution property that they mentioned.)
|
| 3) It surprised me that they just always use the largest n such
| that there is a sample in the training set. My thought would have
| been to combine the different choices of n with some weighting.
|
| Like, if you have one example in the training set of "a b c d",
| and a million examples of "b c e", and only one or two other
| examples of "b c d" other than the single instance of "a b c d",
| does increasing the number of samples of "b c e" (which aren't
| preceded by "a") really give no weight towards "e" rather than
| "d" for the next token?
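|
| Roughly what I have in mind (the weighting scheme here is made
| up, just to illustrate):
|
|     # Mix distributions from several suffix lengths instead of taking
|     # only the longest match; weights are illustrative.
|     from collections import Counter, defaultdict
|
|     tokens = ("a b c d " + "b c e " * 5 + "x b c d").split()
|
|     counts = defaultdict(Counter)            # counts[context][next_token]
|     for i in range(len(tokens)):
|         for n in range(1, 4):                # contexts of length 1..3
|             if i - n >= 0:
|                 counts[tuple(tokens[i - n:i])][tokens[i]] += 1
|
|     def mixed_dist(history, lam=0.6):
|         """Weight lam on the longest matching suffix, lam*(1-lam) on
|         the next shorter one, and so on."""
|         dist, weight = Counter(), 1.0
|         for n in range(min(3, len(history)), 0, -1):
|             ctx = tuple(history[-n:])
|             if ctx in counts:
|                 total = sum(counts[ctx].values())
|                 for tok, c in counts[ctx].items():
|                     dist[tok] += weight * lam * c / total
|                 weight *= 1 - lam
|         return dist
|
|     print(mixed_dist(["a", "b", "c"]))   # "d" favored, but "e" gets weight
|     print(mixed_dist(["b", "c"]))        # dominated by "e"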
| hiddencost wrote:
| Unfortunately, single-digit milliseconds is quite slow for
| n-gram language models.
|
| Weighted finite-state transducers are where they really shine;
| you tend to want to do massive parallel operations, and you
| want to do that without being bottlenecked on memory access. I
| think these constraints make this framework hard to adopt.
| snats wrote:
| I recently wrote a writeup on bigrams and the infinigram
| outputs[1]. I genuinely believe that ngrams are making a
| comeback. For query searching it's really good.
|
| [1]
| https://snats.xyz/pages/articles/from_bigram_to_infinigram.h...
| tarasglek wrote:
| Would love to follow your blog, but there's no RSS.
| omeze wrote:
| This is a really cool paper, reminds me of the simple exercise
| Karpathy goes through in his NN vid series with a bigram
| predictor. Looks like in practice there's still some grounding
| issues when attempting to use them for instruction-tuned
| applications, but clever direction to explore!
| mfornet wrote:
| As I see it, this model will be able to predict "easy" tokens
| but will have no chance on "hard" tokens.
|
| For example, doing a sum of random numbers: if the token you
| are trying to predict is not in the training data, even if
| similar patterns exist, this model defaults to the neural
| model.
|
| I guess it is then an aid to the neural model for filling in
| the easy patterns.
| om8 wrote:
| Perplexity results are impressive. I wonder how the combined
| models perform on MMLU and other problem-solving benchmarks.
| It could be that this infini-gram method is exactly the thing
| that "hacks" perplexity without adding any "understanding" to
| the model.
| om8 wrote:
| P.S. They acknowledge this: "... our preliminary experiments
| show that such method might not be helpful, and even harmful,
| to open-ended text generation tasks. During generation,
| ∞-gram can make odd mistakes (e.g., predicting totally
| irrelevant tokens) which makes the model to digress. Thus this
| combined model is not ready to replace neural LMs. Additional
| investigation is required to make inf-gram best contribute to
| text generation."
| FloatArtifact wrote:
| I can't help but want complex grammars for speech recognition
| :-). There really needs to be a hybrid between grammar-based
| speech recognition commands and natural language commands.
| bmc7505 wrote:
| It's coming! CMUSphinx used to have something like this, and
| there are some [1] solutions [2] on the horizon.
|
| [1]: https://github.com/alphacep/vosk-api/issues/55
|
| [2]: https://github.com/outlines-dev/outlines?tab=readme-ov-
| file#...
| mirekrusin wrote:
| Could be used to dedup training data.
| thesz wrote:
| The paper does not cite [1], which describes a prediction-by-
| partial-match (PPM) algorithm with unbounded context length.
|
| [1]
| https://www.researchgate.net/publication/2473004_Unbounded_L...
|
| That is from 2003 and actually quite interesting. It shows,
| for example, that these models were applied to rather small
| texts, because of the presence of a context of length -1 (the
| uniform distribution). When the text is large, the need for
| such a context vanishes, but for small texts it can give an
| advantage.
| renonce wrote:
| I think this leaves room to explore language models
| _regularized_ (or maybe augmented?) by an n-gram model:
| instead of predicting the next token without any external
| knowledge, the n-gram
| predictions can be added to the softmax head as a "default"
| prediction. The language model's job would then be to improve the
| accuracy on top of the n-gram prediction. This shifts some of the
| language model's job to the n-gram predictor, which relies on
| traditional methods and not GPUs, saving a lot of computation.
|
| EDIT: Oh, so this thing takes 10TB of disk space to keep the
| index, while an LLM takes... 350GB (assuming GPT-3's 175B
| parameters in fp16). A huge resource requirement that cannot
| be ignored.
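|
| Something like this, conceptually (the mixing weight is made
| up; this is not the paper's exact combination scheme):
|
|     # Interpolate an n-gram "default" distribution with a neural LM's
|     # softmax output; lam and both distributions are placeholders.
|     import numpy as np
|
|     vocab_size = 8
|     rng = np.random.default_rng(0)
|
|     p_neural = rng.dirichlet(np.ones(vocab_size))  # from the LLM's softmax
|     p_ngram = np.zeros(vocab_size)
|     p_ngram[3] = 1.0                               # suffix-array match: token 3
|
|     lam = 0.3                                      # weight on the n-gram side
|     p_combined = lam * p_ngram + (1 - lam) * p_neural
|
|     print(p_combined)   # still a valid distribution, extra mass on token 3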
___________________________________________________________________
(page generated 2024-05-06 23:02 UTC)