[HN Gopher] Better and Faster Large Language Models via Multi-To...
       ___________________________________________________________________
        
       Better and Faster Large Language Models via Multi-Token Prediction
        
       Author : jasondavies
       Score  : 277 points
       Date   : 2024-05-01 08:28 UTC (14 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | Havoc wrote:
       | How does that still end up making grammatical sense?
       | 
       | If token/word +1 and +2 are predicted independently then surely
       | often it won't ?
        
         | vletal wrote:
         | The n+1-th token is discarded if it is unlikely given the n-th
         | token.
        
         | wongarsu wrote:
         | They just throw the predictions for +1 and +2 away, and only
         | generate them for more efficient training.
         | 
         | The abstract doesn't make that clear, but from the description
         | of figure 1: "During inference, we employ only the next-token
         | output head. Optionally, the other three heads may be used to
         | speed-up inference time"
         | 
         | Maybe you can use all three heads if you take the top
         | prediction from all of them, but that prevents you from doing
         | any of the common sampling strategies. I'm not sure how many
         | people actually run an LLM with temperature 0 outside of
         | benchmarks, unless they do something even better than applying
          | a temperature.
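          | 
          | As a rough illustration (my own simplification in PyTorch, not
          | the paper's code, which uses transformer layers for the trunk
          | and heads), the training setup is a shared trunk with n output
          | heads, each trained on the target shifted by a different
          | offset:
          | 
          |     # Minimal sketch of multi-token prediction training
          |     # (illustrative only; model names and sizes are made up).
          |     import torch
          |     import torch.nn as nn
          |     import torch.nn.functional as F
          | 
          |     class MultiTokenLM(nn.Module):
          |         def __init__(self, vocab=1000, dim=64, n_heads=4):
          |             super().__init__()
          |             self.embed = nn.Embedding(vocab, dim)
          |             # stand-in for a causal transformer trunk
          |             self.trunk = nn.GRU(dim, dim, batch_first=True)
          |             self.heads = nn.ModuleList(
          |                 nn.Linear(dim, vocab) for _ in range(n_heads))
          | 
          |         def forward(self, tokens):      # tokens: (batch, seq)
          |             h, _ = self.trunk(self.embed(tokens))
          |             return [head(h) for head in self.heads]
          | 
          |     def multi_token_loss(model, tokens):
          |         total = 0.0
          |         # head k predicts the token at offset +k
          |         for k, logits in enumerate(model(tokens), start=1):
          |             if tokens.size(1) <= k:
          |                 break
          |             pred = logits[:, :-k, :]    # positions with a +k target
          |             tgt = tokens[:, k:]
          |             total = total + F.cross_entropy(
          |                 pred.reshape(-1, pred.size(-1)),
          |                 tgt.reshape(-1))
          |         return total
          | 
          |     model = MultiTokenLM()
          |     batch = torch.randint(0, 1000, (2, 16))
          |     print(multi_token_loss(model, batch))
          | 
          | At inference you would simply ignore every head but the first,
          | exactly as the figure caption describes.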
        
           | faabian wrote:
           | Exactly, but there is also a rejection sampling based method
           | for speculative sampling: https://arxiv.org/abs/2302.01318
        
           | Havoc wrote:
           | Thanks for explaining
        
       | throw310822 wrote:
       | Apologies in advance for the super naive question; but assuming
       | that we can create vectors that encode for the meaning of entire
       | sentences, what prevents us from training llms to predict those
       | vectors instead of single words?
        
         | yorwba wrote:
         | You still need to convert between words and sentence vectors
         | somehow. You could try using a faster model for that, but I
         | suspect that the output quality will suffer.
        
           | magicalhippo wrote:
           | LLMs somehow need to do this anyway, implicitly or not. Has
           | anyone tried to do it more explicitly?
           | 
           | That is to break out the idea that characters are formed into
           | words, and words into sentences, and a sentence is a sequence
           | of "concepts" for the lack of a better description.
           | 
           | So have one NN which takes a sequence of tokens and predicts
            | a moderately-dimensional "word vector", which is fed into
           | another which predicts a high-dimensional "concept vector".
           | 
           | Then the "thinking layer" would map a sequence of "concept
           | vectors" to "concept vectors", and then you'd have some
            | layers which do the reverse of the input layers to output
           | tokens which can be printed.
           | 
           | Thought being that by splitting it up like this you could
           | swap out the decode and encode layers independently to
           | translate, for example, and so on.
           | 
           | Just a shower thought.
        
             | flawsofar wrote:
             | A sentence is both a sequence and a tree of phrases.
             | Phrases have a head and one or more valents; they're
             | relations.
             | 
             | If you wanted to create an embedding algorithm for phrases,
             | you could and you could throw a transformer at it.
             | 
             | I don't know how you get the output of higher levels to
             | diffuse to phrase and word levels.
        
         | marcyb5st wrote:
         | Not a stupid question in my opinion.
         | 
         | The problem is that once you have the vectors representing the
         | answers you need something like another model that goes back to
         | a word representation of said answers. Something like a
         | diffusion model but for text. Additionally, the function that
         | this diffusion model will approximate won't be injective, but
         | at best surjective and at worst not even a function (in the
          | mathematical meaning) since many textual representations are
          | possible given an embedding, and most of those won't be valid
          | (not grammatically valid, nonsense sentences, ...).
         | 
         | Finally remember that the embeddings are a "lossy"
         | representation of some datum and so the inverse function will
         | lose a lot of the nuances/context/... .
         | 
          | LLMs avoid the problems above by predicting the next token
          | (now the next n tokens) in a way that is self-consistent with
          | the query and
         | the previous n tokens, so the function they approximate should
         | be mostly surjective.
        
           | throw310822 wrote:
           | > The problem is that once you have the vectors representing
           | the answers you need something like another model that goes
           | back to a word representation of said answers. Something like
           | a diffusion model but for text.
           | 
           | Could it be just a smaller llm that takes as input both the
           | semantic vector and the prompt, and is trained to predict the
           | output tokens based on those? A model with high linguistic
           | abilities and very little reasoning skills.
        
             | marcyb5st wrote:
             | I think what you suggest would be very similar to a
             | encoder-decoder architecture, which has been abandoned in
             | favor of decoder-only architectures
             | (https://cameronrwolfe.substack.com/p/decoder-only-
             | transforme...). So I am guessing that what you suggest has
              | already been tried and didn't work out, but I'm not sure why
             | (the problems I mentioned above or something else).
             | 
             | Sorry, that's where the limit of my knowledge is. I work on
             | ML stuff, but mostly on "traditional" deep learning and so
             | I am not up to speed with the genAI field (also, the sheer
              | amount of papers coming out makes it basically impossible
              | to stay up to date if you're not in the field).
        
           | HanClinto wrote:
           | > Something like a diffusion model but for text.
           | 
           | This actually sounds amazingly useful.
        
             | soulofmischief wrote:
             | I've played around with the concept in the past but there
             | are some issues yet to be solved to make them more
             | practical than current generation LLMs.
             | 
             | People are working on it though:
             | https://arxiv.org/pdf/2305.09515
        
         | faabian wrote:
          | Author here -- that's a very good point and, as I understand
          | it, a work in progress in different teams. Training
          | autoencoders for
         | language is actually super easy given the small amount of
         | information contained in text (compared to vision/video), the
         | hard part is making the model focus on the semantic part if all
         | signal we have comes from exact match in token space. Hence
         | Yann LeCun's ideas on joint embedding predictive architectures.
         | Note also that there is always a trade-off between auxiliary
         | tasks giving more signal but shifting the focus. In our case,
         | we noticed degradation if the number of predicted tokens is too
         | high. So latent prediction methods need to sort out what is
         | useful.
        
           | mike_hearn wrote:
           | Aren't the models already doing this, in a way? We know they
           | can do things like write rhyming poems and song lyrics that
           | do make perfect sense, so at some point the activations must
           | be encoding some sort of overall plan for the upcoming
           | sentences, even if maybe every word isn't predicted yet.
        
             | faabian wrote:
             | Yes. Otherwise next-token models wouldn't be nearly as good
             | as they are. But the question is how to train these
             | capabilities most efficiently! We had some interesting
             | findings on how with increasing model/dataset scale/data
             | quality, capabilities can move from "only learnable with
             | multi-token prediction" to "indifferent" and "multi-token
             | prediction actually hurts". This depends on the capability
             | itself, induction e.g. matures way earlier in this sense
             | than code generation capabilities.
        
               | mike_hearn wrote:
               | Is it possible that anti-scaling effect occurs because
               | you are removing some middle layers to free up space for
               | the extra output heads? I only scanned the paper quickly
               | but what happens if you treat the technique as strictly
               | additive and don't keep parameter sizes fixed?
        
             | mjburgess wrote:
             | > so at some point the activations must be encoding some
             | sort of overall plan for the upcoming sentences
             | 
             | This isn't obviously the case, compare this "intelligent
             | designer" view with evolution: there was no prior plan for
              | rabbits. It's sufficient to create the appearance of design
             | that sequential steps are simply probabilistically
             | modulated by prior ones.
             | 
              | Consider a continuation of "the cat...": merely a
              | distribution over all possible words suffices to create the
              | illusion of a plan. Suppose "the cat sat..." then "on...",
              | "the..." etc. follow from the training data.
             | 
             | I think there's a strong argument against trying to model
             | entire sentences exactly because the system isn't modelling
             | semantics: one _should_ expect accuracy to drop off a cliff
             | if there is no actual plan. ie., predicting  "sat on the
             | mat" from "cat" _shouldnt_ be a valid prediction, because
             | of the infinite number of possible continuations that _as a
             | whole_ is terrible (eg., what about  "chased the mouse"
             | etc.). The space of all possible _sentences_ to continue
              | from "the cat" is infinite, with much of that space
              | actually useful; whereas the number of words is very small,
              | very finite, and many of them not useful.
             | 
             | The only reason that "the cat sat..", "the cat sat on..."
             | is reasonable is because each sequential word can be
             | modulated by the prompt to seem as if planned.
        
               | edmara wrote:
               | The modelling is advanced enough that you can't
               | fundamentally distinguish it from (lossy, limited)
               | planning in the way you're describing.
               | 
               | If the KQV doesn't encode information about likely future
               | token sequences then a transformer empirically couldn't
               | outperform Markov text generators.
        
               | mjburgess wrote:
               | No one is spending $10-50mil building a markov text model
               | of everything ever digitised; if they did so, their
               | performance would approach a basic LLM.
               | 
               | Though, more simply, you can just take any LLM and
               | rephrase it as a markov model. All algorithms which model
               | conditional probability are equivalent; you can even
               | unpack a NN as a kNN model or a decision tree.
               | 
               | They all model 'planning' in the same way: P(C|A, B) is a
               | 'plan' for C following A, B. There is no model of P("A B
               | C" | "A B"). Literally, at inference time, no computation
               | whatsoever is performed to anticipate any future
                | prediction -- this follows both trivially from the
               | mathematical formalism (which no one seems to want to
               | understand); or you can also see this empirically:
               | inference time is constant _regardless_ of prompt
               | /continuation.
               | 
               | The reason 'the cat sat...' is completed by 'on the mat'
               | is that it's maximal that P(on|the cat sat...), P(the|the
               | cat sat on...), P(mat|the cat sat on the...)
               | 
                |  _Why_ it's maximal is not in the model at all, nor in the
                | data. It's in the data generating process, ie., us. It
               | is we who arranged text by these frequencies and we did
               | so because the phrase is a popular one for academic
               | demonstrations (and so on).
               | 
                | As ever, people attribute "to the data" or, worse, "to the
                | LLM" properties it does not have... rather, it replays the
                | data to us and we suppose the LLM must have the property
                | that generated this data originally. Nope.
               | 
               | Why did the tape recorder say, "the cat sat on the mat"?
               | What, on the tape or in the recorder made "mat" the right
               | word? Surely, the tape must have planned the word...
        
             | flawsofar wrote:
             | In case you're thinking that rhyming requires planning,
             | that's just as silly as a rabbit tanning.
             | 
             | You can make things up as you go, and the constraints
             | emerge from the flow.
        
               | gbasin wrote:
               | great comment
        
         | jerpint wrote:
         | My understanding is that tokenization is part of the
          | bottleneck. When you break up a sentence into tokens, each token
         | gets a vector representation. The dictionary of all tokens
         | would be infinite if it was at the sentence level
        
           | faabian wrote:
           | Vectors can do what one-hot vectors cannot do -- no one said
            | inputs need to be rows from a token_id -> vector embeddings
           | map. Basically, we are doing this already by moving from one-
           | hot vectors to n-tuples of one-hot vectors, increasing the
           | effective vocabulary size from V to V^n.
        
         | bjourne wrote:
         | That's hierarchical prediction. On one level you predict the
         | style of the paragraph, on another the tone and form of the
         | sentence, and on the third the next word. However, this form of
         | prediction is quite difficult since predictions from the layers
         | affect each other.
        
         | wangii wrote:
          | the problem is then that the total number of computations drops
          | dramatically, which leads to much less "thinking" power. i
         | think the idea originated from an understanding that when we
         | write/speak, we have an overall idea. my current hypothesis is
         | it's probably an illusion.
         | 
         | you may want to search for "filler" papers to read.
        
         | everforward wrote:
         | Also a noob here, if we encoded, trained on and synthesized
          | sentence vectors, wouldn't that move the AI's ability to create
         | novel things up from sentences to words?
         | 
         | I.e. we currently operate on words (roughly) so the AI can only
         | use words it knows but can synthesize unique sentences from
         | words. If the AI operates on sentences, wouldn't it only be
         | able to regurgitate sentences it has seen before? So it could
         | synthesize novel paragraphs, but not sentences?
         | 
         | I'm not convinced that sentences are a useful abstraction for
         | AI (in English, anyways). They're barely useful to humans.
         | Check out your average chat conversation, email, YouTube
         | comment, etc. There's a very good chance the sentences aren't
         | actually sentences, or that they haven't even bothered to use
         | punctuation.
         | 
         | I just don't think sentences map to a semantic device. A
         | sentence could be two words or half an English paper depending
         | on the writer. It could traverse a half dozen ideas or a single
         | one. Where a sentence ends generally is more about the writer
         | than the semantics.
        
       | bradley13 wrote:
       | I read an article that pointed out that LLMs literally have a
       | one-dimensional window onto the world. Everything is just a
       | sequence of tokens.
       | 
        | Maybe this sort of multi-token prediction takes their view into
        | 1.1 dimensions? In any case, there is a real argument for
        | expanding that window, somehow, into two or more dimensions.
        
         | mike_hearn wrote:
         | Well, it feels like architecturally there's a lot of scope to
         | do better for coding tasks specifically. Like, if you had FAIR
         | level resources and wanted to train a really great Java coding
         | model for example it would make sense to train the model to
         | predict ASTs rather than tokens. You'd still need some kind of
         | joint normal LLM for predicting comments, identifier names and
         | so on, but you wouldn't model the program itself as a stream of
         | tokens. Instead it would predict things like "add an if block",
         | "add a method call block with 4 parameters" and so on.
         | 
         | You could also train the model to expect certain context window
         | positions to be reserved for things like "type members at the
         | current cursor" and then integrate the inferencing loop with
         | IDE/LSP-style static analysis. This would allow the model to
         | see more information than is actually contained in the text.
         | 
         | I think the reason we're not seeing models like this right now
         | is the cost of doing such research combined with the fact that
          | AI people are all Python-heads, and Python doesn't benefit
          | much from IDEs.
        
           | bradley13 wrote:
           | That sounds right. My vague idea of a "second dimension"
           | could well be some sort of structure - be it ASTs for
           | programming languages or for natural language.
           | 
           | Another possibility would be some sort of fixed knowledge
           | base, which could be program language documentation or
           | "common sense" like CYC wants to provide.
        
       | albertzeyer wrote:
       | For those who know speculative decoding: This is basically self-
       | speculative decoding. It still auto-regressively feeds the
       | predicted label sequence through the network again, and only
       | keeps the prediction up to the point where it matches. So it will
       | not get worse in performance but only faster (here up to 3 times,
       | which is normal for speculative decoding).
       | 
       | Due to the multi-task training, it will however also get better.
       | (This idea is already quite old, to predict multiple targets into
       | the future as an auxiliary loss.)
       | 
       | Nice work.
        
         | imtringued wrote:
         | The problem with speculative decoding is that there are hardly
         | any models that support it and adding support takes extra GPU
         | time. If speculative decoding also improves planning
         | performance, then it will be more readily adopted.
        
           | albertzeyer wrote:
           | What do you mean? Speculative decoding can be done with any
           | auto-regressive model. Normally you use another much faster
           | model to predict the next N subwords, and then you use the
           | big model to verify whether it gets the same output, or maybe
           | just reranked. Evaluating N subwords in one go is much faster
           | compared to doing it subword by subword. That's why this is
           | faster. Not all N words might match, so then you might need
           | to redo the prediction for M < N subwords, but there are many
           | simple cases where a faster and weaker model is still
           | accurate enough. In the very extreme case, where N-1 subwords
           | are always wrongly predicted, it would be slightly slower,
           | but usually you get quite a big speedup, e.g. 3x faster or
           | so.
           | 
           | The nice thing here is that you actually don't need another
           | smaller model but the model itself already predicts the next
           | N subwords.
           | 
           | Or maybe you mean it's not implemented in some of the common
           | software? I'm not sure about that, but I thought it's a quite
           | popular feature now.
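            | 
            | A toy sketch of the greedy verify step, in case it helps
            | (this is my own simplification with fake "models" as plain
            | functions, not the rejection-sampling scheme from the linked
            | paper):
            | 
            |     # Draft k tokens cheaply, verify them with one pass of
            |     # the big model, keep the longest agreeing prefix.
            |     def spec_step(big_greedy, draft_greedy, prefix, k=4):
            |         # big_greedy(seq)[j] = big model's greedy choice for
            |         # the token following seq[:j+1] (one batched pass)
            |         # draft_greedy(seq) = draft model's greedy next token
            |         seq, draft = list(prefix), []
            |         for _ in range(k):          # cheap autoregressive draft
            |             t = draft_greedy(seq)
            |             draft.append(t)
            |             seq.append(t)
            |         verified = big_greedy(seq)  # single big-model pass
            |         out = list(prefix)
            |         for i, t in enumerate(draft):
            |             expected = verified[len(prefix) + i - 1]
            |             if t != expected:        # disagreement: take the
            |                 out.append(expected) # big model's token, stop
            |                 return out
            |             out.append(t)            # draft token accepted
            |         out.append(verified[-1])     # bonus big-model token
            |         return out
            | 
            |     # toy models: "big" predicts (last+1) % 10, the draft
            |     # agrees except after a 7, where it guesses wrong
            |     big = lambda seq: [(t + 1) % 10 for t in seq]
            |     draft = lambda seq: 0 if seq[-1] == 7 else (seq[-1] + 1) % 10
            |     print(spec_step(big, draft, [5]))  # -> [5, 6, 7, 8]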
        
             | HanClinto wrote:
             | For anyone interested in exploring this, llama.cpp has an
             | example implementation here:
             | 
             | https://github.com/ggerganov/llama.cpp/tree/master/examples
             | /...
        
         | techbruv wrote:
         | > So it will not get worse in performance but only faster
         | 
         | A bit confused by this statement. Speculative decoding does not
         | decrease the performance of the model in terms of "accuracy" or
         | "quality" of output. Mathematically, the altered distribution
         | being sampled from is identical to the original distribution if
         | you had just used regular autoregressive decoding. The only
         | reason you get variability between autoregressive vs
         | speculative is simply due to randomness.
         | 
         | Unless you meant performance as in "speed", in which case it's
         | possible that speculative decoding could degrade speed (but on
         | most inputs, and with a good selection of the draft model, this
         | shouldn't be the case).
        
           | jasonjmcghee wrote:
           | I think parent is saying the same thing as you. Pointing out
           | to folks unfamiliar, speculative decoding doesn't trade
           | quality for speed.
        
           | albertzeyer wrote:
           | Yes that's what I mean, speculative decoding does not
           | decrease the performance in terms of quality. I guess my
           | wording was confusing on this.
        
       | mg wrote:
       | Currently, LLMs start from scratch for each output token, right?
       | 
        | Let's say you ask an LLM
        | 
        |     What makes bananas yellow?
        | 
        | And it replies
        | 
        |     Bananas are yellow due to a pigment called bromelain.
       | 
       | I would think that the concept of "pigment" and "bromelain" are
       | already somehow activated in the neural net when it outputs "a".
       | Because now it can't change its mind anymore and follow up with
       | "an optical illusion that makes humans perceive every bent object
       | as yellow". So it seems to have already planned ahead to talk
       | about the pigment called bromelain.
       | 
       | Would it be possible to capitalize on the work that has already
       | been done when the LLM outputs "a"? Could the state of the neural
       | net be somehow preserved for the next answer?
        
         | pbh101 wrote:
         | The alternative theory is that any word starting with a vowel
         | sound is exceedingly uncommon after 'a' in its training set, so
         | it doesn't need to plan ahead, just predict the distribution of
         | the most likely next words and choose.
         | 
         | Which is my understanding of how they work and the dynamic at
         | play.
        
         | faabian wrote:
         | To some degree, attention is already a mechanism to make
         | computations from previous tokens useful later. (You can think
          | of the KV cache as a representation of the text so far and all
          | the model's thoughts on it.) And since language models are
         | trained on sequences end-to-end, I think this is likely to
         | happen. Multi-token prediction encourages this behavior
         | explicitly but only for the small n token window you define.
         | 
         | That said, there are many works attempting to increase the
         | compute utilization of transformer language models (early exit,
         | mixture of depths) and novel architectures (SSMs etc.).
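          | 
          | For intuition, a stripped-down sketch of the caching idea (my
          | illustration, not any particular implementation): keys and
          | values for past tokens are computed once and reused, so each
          | new token only appends one entry.
          | 
          |     import torch
          | 
          |     class CachedSelfAttention(torch.nn.Module):
          |         def __init__(self, dim=16):
          |             super().__init__()
          |             self.q = torch.nn.Linear(dim, dim)
          |             self.k = torch.nn.Linear(dim, dim)
          |             self.v = torch.nn.Linear(dim, dim)
          |             self.cache_k, self.cache_v = [], []
          | 
          |         def step(self, x):  # x: (dim,) new token embedding
          |             self.cache_k.append(self.k(x))
          |             self.cache_v.append(self.v(x))
          |             K = torch.stack(self.cache_k)  # (seq, dim), reused
          |             V = torch.stack(self.cache_v)
          |             scores = self.q(x) @ K.T / K.size(1) ** 0.5
          |             att = torch.softmax(scores, dim=-1)
          |             return att @ V  # attends over all past tokens
          | 
          |     attn = CachedSelfAttention()
          |     for _ in range(5):      # feed tokens one at a time
          |         out = attn.step(torch.randn(16))
          |     print(out.shape)        # torch.Size([16])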
        
           | jacobsimon wrote:
           | Thanks for highlighting the KV cache, I've been wondering the
           | same thing and hadn't come across that or didn't remember.
        
             | edmara wrote:
             | Transformers are still stateless, KV cache is just a
             | compute-saving measure (but otherwise correctly described)
        
               | jacobsimon wrote:
               | Oh huh. Why not make it stateful, like re-use and compute
               | just the "diff" when you add a new token? Assuming it's
               | not that easy because each token can affect attention
               | globally.
               | 
               | I think I've read something about this but I wonder if
               | you could abstract attention to sentence/page levels and
               | then only recalculate the parts that are relevant.
        
         | avianlyric wrote:
          | The output of most LLMs is stochastic. The core LLM is given
          | tokens, and outputs a set of ranked tokens, with a
          | "confidence", for what should come next. Then there's normally
          | a filtering and search stage, where those ranked tokens are fed
          | back into the LLM to get more ranked tokens and used to form a
          | short probability tree. I.e. if we pick the top N-ranked tokens
          | and put them back in, each of those tokens results in a new set
          | of N-ranked tokens.
         | 
         | By looking at that tree some basic filtering is done. Such as
         | picking the branch that has the highest summed confidence, or
         | the branch that has the fewest repeated tokens, or the fewest
         | tokens that match with input tokens, or more often some
          | combination of the above, plus a random choice weighted by
          | summed confidences.
         | 
          | That's how you can give an LLM with completely fixed weights
          | (which is all LLMs) the same input multiple times, but get
          | different outputs.
         | 
         | So to answer your specific question, it can "change its mind".
         | Every token produced creates a new opportunity for the
         | stochastic output filters to pick a new path through all the
         | possible outputs.
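          | 
          | The simplest version of that sampling step looks something
          | like this (illustrative sketch, numbers made up):
          | 
          |     import math, random
          | 
          |     def sample_top_n(logits, n=5, temperature=1.0):
          |         # logits: token -> raw score from the model
          |         top = sorted(logits.items(),
          |                      key=lambda kv: kv[1], reverse=True)[:n]
          |         weights = [math.exp(s / temperature) for _, s in top]
          |         return random.choices(
          |             [tok for tok, _ in top], weights=weights, k=1)[0]
          | 
          |     fake = {"mat": 2.1, "rug": 1.7, "sofa": 1.2, "moon": -3.0}
          |     print([sample_top_n(fake) for _ in range(5)])  # varies
          | 
          | Same fixed weights, same input, different outputs, purely
          | because of that final weighted random choice.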
        
         | berkes wrote:
         | I always presumed that "a pigment" is one token.
         | 
         | I'm a total amateur in this field though.
        
           | ralusek wrote:
           | Tokens are not multiple words, but are actually usually parts
           | of words. The rule of thumb OpenAI uses is that there will be
           | 100 tokens for every 75 words.
           | 
           | If you want to just see the tokens for yourself, though, just
           | enter some text here:
           | 
           | https://platform.openai.com/tokenizer
        
           | sanxiyn wrote:
           | No, it isn't, at least on OpenAI tokenizer.
        
           | jasonjmcghee wrote:
           | what constitutes a token is super unintuitive.
           | 
           | my gut said "pig" and "ment" on this one, which happens to be
           | right, but my gut would also say "para" and "graph" but no,
           | "paragraph" is a single token which falls way outside the
           | "normal" length I see of 3-4 characters
           | 
           | In either case, I do consistently see spaces between
           | characters included as part of the token following the space.
           | 
           | " paragraph" (10 characters) is the longest token I've seen-
           | and now I wonder what the longest token is
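            | 
            | If you want to poke at this programmatically rather than via
            | the web page, the tiktoken library exposes the OpenAI BPE
            | vocabularies (assuming you have it installed; the splits
            | depend on which encoding you pick):
            | 
            |     import tiktoken
            | 
            |     enc = tiktoken.get_encoding("cl100k_base")
            |     for text in ["pigment", " paragraph", " communication"]:
            |         ids = enc.encode(text)
            |         pieces = [enc.decode([i]) for i in ids]
            |         print(repr(text), ids, pieces)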
        
             | jasonjmcghee wrote:
             | New winner " communication" at 14
        
         | matthewdgreen wrote:
         | This post is interesting:
         | https://clementneo.com/posts/2023/02/11/we-found-an-neuron
        
         | nicklecompte wrote:
          | Maybe look at it another way: ask GPT to complete the following
          | 
          |     Bananas are yellow due to a
          |     Bananas are yellow due to an
          | 
          | In the first case it might respond
          | 
          |     Bananas are yellow due to a pigment called bromelain.
          | 
          | In the second case it might respond
          | 
          |     Bananas are yellow due to an organic compound called
          |     bromelain, which is a yellow pigment.
          | 
          | So in either case GPT could have picked "a" or "an" without any
          | impact on the semantic meaning of its response. In the extreme
          | case, you could see the LLM operating according to a dumb
          | heuristic:
          | 
          |     The token following "due to" is "a" with 55% probability,
          |     "an" with 45% probability.
         | 
         | In reality it is of course more sophisticated than this. But
         | this dumb heuristic would explain the behavior.
         | 
         | And if you didn't actually include any facts about bromelain in
         | the pretraining data, LLMs absolutely could autocomplete this
         | with something about "an optical illusion." GPT-3 made factual
         | mistakes like that pretty routinely, but I recall it figured
         | out the grammatical rules of "a" and "an."
         | 
         | I don't think the concept actually needs to be pre-activated as
         | you said, though I agree with faabian that this "preactivation"
         | probably does happen in some implicit/emergent sense.
        
         | HarHarVeryFunny wrote:
         | 1. Bananas are yellow due to a biochemical process known as
         | carotenogenesis, which involves the synthesis and accumulation
         | of carotenoid pigments.
         | 
         | 2. Bananas are yellow due to a specific carotenoid called beta-
         | cryptoxanthin, which gives the fruit its characteristic yellow
         | hue.
         | 
         | 3. Bananas are yellow due to a gradual increase in the
         | concentration of carotenoid pigments as the fruit ripens and
         | chlorophyll levels decrease.
         | 
         | 4. Bananas are yellow due to a series of enzymatic reactions
         | that convert starch into sugars and break down the green
         | chloroplasts, revealing the underlying yellow carotenoids.
         | 
         | 5. Bananas are yellow due to a change in the pH levels within
         | the fruit cells during ripening, which triggers the production
         | of yellow carotenoid pigments.
         | 
         | 6. Bananas are yellow due to a genetic trait inherited from
         | their wild ancestors, which enabled the development of
         | carotenoid pigments as a way to attract seed dispersers.
         | 
         | 7. Bananas are yellow due to a complex interplay between
         | various plant hormones, such as ethylene and abscisic acid,
         | which regulate the ripening process and pigment formation.
         | 
         | 8. Bananas are yellow due to a metabolic shift from chlorophyll
         | synthesis to carotenoid synthesis as the fruit reaches
         | maturity.
         | 
         | 9. Bananas are yellow due to a natural defense mechanism that
         | involves the production of carotenoid pigments, which protect
         | the fruit from oxidative stress during ripening.
         | 
         | 10. Bananas are yellow due to a evolutionary adaptation that
         | helps the fruit stand out against the green foliage, making it
         | more visible to potential seed dispersers.
         | 
         | The output of an LLM is usually randomly sampled from the top
         | few highest probability next token/word predictions, but the
         | model itself has no idea which word the sampler will pick. It
         | presumably has some conceptual plan of what could follow "a",
          | or any of its other suggestions, but any such plan (high level
         | prediction) is then rethought from scratch once "a" is
         | generated.
         | 
          | The model not only can, but has to, change its mind after each
         | word generated, so this "planning ahead" is very ephemeral -
         | more like a freestyle rapper making it up on the fly than
         | someone thinking deeply about how best to reply and how to
         | express it.
        
           | scoot wrote:
           | > 10. Bananas are yellow due to _a evolutionary_ adaptation
           | 
           | Did an LLM really make this basic grammatical error?
        
             | HarHarVeryFunny wrote:
             | Yes it did (free Claude Sonnet), but presumably only
             | because it was trained on examples of us making the same
             | mistake!
        
               | scoot wrote:
               | > presumably only because it was trained on examples of
               | us making the same mistake
               | 
               | That's what makes it surprising.
        
           | Workaccount2 wrote:
           | There has to be more going on somewhere in the system. You
           | can ask GPT4 to describe something (Vermont) and ask it to
            | end the description with a chosen word (house). It will then
           | usually be able to write out a descriptive paragraph that
           | ultimately lands on your chosen word pretty seamlessly -
           | "...embodying a serene and rural charm that culminates in the
           | warm, welcoming feel of a cozy house."
           | 
           | Models do struggle with these tests, for sure, but from an
           | analytical standpoint, a "next token predictor" should not be
           | able to ever correctly land on the right token 100 tokens in
           | the future.
           | 
           | Edit: Thinking about it, I suppose it is possible that the
           | model can encode a "destination" in the first token. Like a
           | pool shot that is artfully bounced off many bumpers to hit a
           | ball, perhaps the LLM can encode a "path" to a destination
           | token in the first token generated. Which might be even
           | crazier as it suggests that the model is playing a meta-game
           | with being able to precisely manipulate the individual layers
           | of output, even though those layers are disparate from token
           | to token.
        
             | HarHarVeryFunny wrote:
             | The "next token predictor" description is a bit too
             | literal, and anyways incomplete. A transformer has a lot of
             | layers (e.g. 96 for GPT-2) and it's only at the input that
             | it has pure token embeddings (ignoring positional
             | encoding). At the output it's a built-up embedding that's
             | decodable into a token.
             | 
             | One way to think of what all the intermediate layers are
             | doing is to consider them as levels of a linguistic parse
             | tree with the leaves (words) at the bottom and trunk
             | ("sentence") at the top, except in the transformer the
             | evolving embeddings at each level contain semantic as well
             | as syntactic information. This largely hierarchical view of
             | language was the motivation for the transformer design.
             | 
             | It seems we should really think of each layer of the
             | transformer as an independent predictor, with increasingly
             | abstract and more semantically complete information
             | available as we ascend the transformer layers towards the
             | output. Predict next token is only what the transformer is
             | being trained to do at the output layer. At the inner
             | layers (i.e. the bulk of what the transformer is doing), it
             | will be predicting at these higher levels of representation
             | held at those layers.
             | 
             | These models do struggle (although getting better) at
             | ending on a given word rather than starting on it, and
             | understandably so since random sampling and continual
             | resetting after each output token (= new input sequence)
              | means that planning ahead at the level of word specificity is
             | simply not an option. They have to continually adapt to
             | next sampled token, and take it from there.
             | 
             | I'm guessing that ability to end on a chosen word is due to
             | continued salience of that word during generation,
             | prediction of sentence fragments using/ending with that
             | word, and opportunistic stopping when it has been emitted
             | and the sentence is complete. Kind of the same way you
             | might do it yourself if you just started talking
             | immediately without planning, while trying to end on a
             | given word.
        
         | wewtyflakes wrote:
         | I wonder if this means that these types of models would perform
         | better (or worse?) for languages that do not have this sort of
         | forward-looking grammar.
        
       | hhcoder wrote:
       | I am curious what happens if the multiple tokens predicted
       | interfere with one another. Say I ask "What are the colors of the
       | rainbow?", if one of the tokens is a repeated color, how do we
       | resolve that?
        
       | bjornsing wrote:
       | I've been thinking about this, but I'm leaning more towards
       | letting the LLM output a small PixelCNN or similar model over the
       | next N tokens. That way the LLM can describe conditional
       | probabilities over the coming tokens.
        
       | deskamess wrote:
        | Side track: There is so much going on in this space. I wish there
       | was a chronological flow of a machine learning scenario/story
       | with all the terms being introduced as we meet them (data, pre-
       | training, training, inference, mixture of experts, RAG). Like
       | someone walking me through a factory explaining what happens at
       | each stage (like Mr Rogers used to do). Most of the time I do not
       | know where the terms fit in the big picture. When I first came
       | across pre-training I thought it was something done to the data
       | before training happened but it was actually another training.
        
         | berkes wrote:
         | > Most of the time I do not know where the terms fit in the big
         | picture.
         | 
         | Nor do the majority of "AI" experts and consultants that I see
         | on LinkedIn, Twitter or in podcasts.
         | 
         | The S/N ratio is very low in this field. Just pick some
         | documentation from "industry leaders" like Langchain and see
         | that not only is it already and always outdated, it sometimes
         | simply contradicts itself.
         | 
         | In the "blockchain hype" this was similar, so I guess it's a
         | trait of the hype train.
        
           | highwaylights wrote:
           | Totally agree with the above, although I'm not sure that
           | documentation on tools like Langchain is a reflection of the
           | hype in the way social media is. I think in that case it's
           | just a reflection of the pace things are moving at.
        
           | pixl97 wrote:
           | >Nor do the majority of "AI" experts...
           | 
           | I mean yes, this is what a rapidly expanding field looks like
           | that's probing the boundaries of its problem space. Kind of
           | like following physics in the early to mid 1900s. Different
           | classes of problems have barely been tested against each
           | other, much less fully explored themselves.
           | 
           | In some ways it reminds me of the earlier days of the
           | internet when progress was still very rapid.
        
         | highwaylights wrote:
         | I feel your pain and excitement in equal measure.
         | 
         | It can be hard to know where to start with some of these
         | concepts, especially so given that a lot of recent developments
         | (e.g. RAG) are developing so rapidly that there's unlikely to
         | be a reference book you could refer to anytime soon that would
         | be current.
         | 
         | That said, I do find that documentation is getting better
         | depending on where you look. The documentation for higher level
         | tools like LlamaIndex is a good starting point for
         | understanding the concepts (not so much in terms of
         | _explaining_ the concepts, but showing where they fit into the
         | overall picture, then you can deep-dive elsewhere on the
         | different parts).
         | 
         | YouTube has always been a mixed bag of very little solid
         | information in a sea of non-experts trying to attract clicks
         | for the latest trends, so it's not a great starting point IMHO.
        
           | phkahler wrote:
           | >> YouTube has always been a mixed bag
           | 
           | As an outsider but avid reader of this stuff linked from HN,
           | I would recommend the channel 3blue1brown. He's got several
           | NN and AI related videos, and the couple I've seen were
           | pretty good.
           | 
           | https://www.youtube.com/watch?v=aircAruvnKk
        
             | renonce wrote:
             | Yeah but the other side of the coin is that they only
             | explain the very basic concepts that are already settled
             | for several years, not any of these "latest trends"
        
               | objektif wrote:
               | Which latest trends you mean?
        
           | objektif wrote:
           | Llamaindex docs are absolutely terrible IMO. I have gone
           | through it so many times but still do not understand the
           | terms and organization. Router for querying router query
           | engine?
        
           | mercer wrote:
           | how would you rate Yannic Kilcher, if you're aware of him?
        
         | saddabbas wrote:
         | I recommend checking out Machine Learning Q and AI by Sebastian
         | Raschka
        
         | jstummbillig wrote:
         | People waste too much time building out stuff that is really
         | bad in AI right now.
         | 
         | Of course, everything is, but instead of taking on the task of
         | patching that up, the better approach would be to pretend there
         | will be something that is a lot better than GPT-4 in the near
         | future (because there will be) and design a differentiated
         | product under that premise.
        
           | sunir wrote:
           | That's an interesting idea. What do you mean?
           | 
           | I understand the prompts as a service is short term... but
           | what is a long term product you see?
        
             | redblacktree wrote:
             | I may be misunderstanding your meaning, but I'm not
             | convinced that "prompts as a service" is short term. I
             | think we'll see a number of apps pop up that will be
             | essentially that, i.e. powered by a generative AI, but with
             | a great UX. Not everyone is good at prompting, and although
             | it is a skill many will develop, packaging up great prompts
             | in niche problem areas still looks like an area of
             | opportunity to me. I'm not talking necessarily about chat
             | experiences, but apps that can, as an example, maintain
             | task lists for you after consuming your incoming
             | communications.
        
               | objektif wrote:
               | How many times did you have issues communicating with
               | your spouse because of prompting issues? And how did you
               | resolve it?
               | 
               | Why would you need an extra layer here?
        
               | malnourish wrote:
               | Why do PR firms and copy editors exist?
        
             | jstummbillig wrote:
             | I think a general way to answer this is by considering for
             | any domain you know: What would you pay a human to do right
             | now, that LLMs frustratingly can't, but should in theory,
             | if only they were a bit better and more consistent?
             | 
             | This could mean: Instead of diving into langchain and
             | trying to program your way out of a bad model, or trying to
             | do weird prompts, just write a super clear set of
             | instructions and wait for a model that is capable of
             | understanding clear instructions, because that is an
             | obvious goal of everyone working on models right now and
             | they are going to solve this better than your custom
             | workaround can.
             | 
             | This is not a rigid rule, just a matter of proportions. For
             | example, you should probably be willing to try a few weird
             | intermediary prompt hacks, if you want to get going with AI
             | dev right now. But if most of what most people do will
             | probably be solved by a somewhat better model, that's
             | probably a cause for pause.
        
               | mercer wrote:
               | I suppose with an eye on open-source, an interesting
               | 'rule' would be to set a cut-off point for models that
               | can run locally, and/or are considered to be feasible
               | locally soon.
        
           | JKCalhoun wrote:
           | Can I assume AI are continuing their training as they
           | interact with people when deployed? Are ChatGPT, Claude,
           | learning from my interactions with them? I do, BTW, correct
           | them when they unknowingly (I assume) steer me wrong.
           | 
           | One wonders, if that's the case, how quickly an AI might
           | improve if it has something close to Google's search site
           | throughput. I mean fielding several billion queries a day,
           | for a year -- that would be some pretty stellar training
           | right there I would think.
        
             | jstummbillig wrote:
             | > Can I assume AI are continuing their training as they
             | interact with people when deployed?
             | 
             | Yes, you can. Some of the big providers are fairly clear on
             | where in their products this happens, and all offer a way
             | out (mostly when paying for api access)
             | 
             | > One wonders, if that's the case, how quickly an AI might
             | improve if it has something close to Google's search site
             | throughput
             | 
             | Indeed. Another possibility is that user input will turn
             | out to be increasingly less important for upcoming state of
             | the art models.
        
             | astrange wrote:
             | They don't train as they go. Training is incredibly
             | expensive.
             | 
             | They do take your feedback and presumably do something with
             | it. Your actual queries are only indirectly useful since
             | they might have private info in them.
        
         | reece_omahoney wrote:
         | Check out Lilian Weng's blog
         | https://lilianweng.github.io/posts/2023-01-27-the-transforme...
        
         | snorkel wrote:
         | Strongly recommend watching Andrej Karpathy's "Lets build
         | GPT-2" videos on YouTube which dives into an actual PyTorch
         | implementation, then download the code and study it carefully.
         | Then study "Spreadsheets is all you need" to see what the
         | internal data structures look like.
        
       | nicklecompte wrote:
       | I haven't read the paper in full detail yet, but I do have a
       | minor editorial comment: while the appendix L.2 was satisfactory,
       | I thought the condensed argument in 5.2 was a bit too sloppy. In
        | particular,
        | 
        |     H(X) + H(Y) = H(X | Y) + 2I(X ; Y) + H(Y | X)
        | 
        |     By discarding H(Y | X) - which appears again when predicting
        |     at the following position - we observe that 2-token
        |     prediction increases the importance of I(X ; Y) by a factor
        |     of 2.
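        | 
        | (For reference, the identity itself is just the standard
        | decomposition of each marginal entropy,
        | 
        |     H(X) = H(X | Y) + I(X ; Y)
        |     H(Y) = H(Y | X) + I(X ; Y)
        | 
        | summed together.)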
       | 
       | The argument about "discarding" was not clear to me - if you're
       | predicting the third token Z, then shouldn't H(Y | X) be
       | contained in the implicit context C, and therefore can't be
       | freely discarded? I don't think this argument was clarified in
       | the appendix. But this is mostly about presentation, I wasn't so
       | confused as to doubt the gist of the argument.
        
         | faabian wrote:
         | Thanks for the feedback! Let me try to state it better:
         | 
         | In the end, we only use the next-token head for generating. So
         | which parts of the 2-token target H(X) + H(Y) are "auxiliary"
         | in the sense that they help learning and which are "wasted"?
         | H(X | Y) and I(X; Y) are useful for next-token generation
         | while, by definition, H(Y | X) is the information quantity not
         | related to the next token X. So we could say: "multi-token
         | prediction trades the useful information I(X; Y) from H(Y) for
         | the wasted computations on H(Y | X)". However, note that H(Y |
         | X) is a next-token entropy for predicting Y from the prefix (C,
         | X). If the attention mechanism allows to transfer computations
         | already made for predicting Y|X to the next step, these
         | computations may actually _not_ have been wasted -- it was just
         | pre-computations.
        
           | stealthcat wrote:
           | Did you have some small toy experiments to prove this?
        
       | Xcelerate wrote:
       | Do LLMs not consider the probability distribution over all
       | combinations of tokens up to a certain output length with regard
       | to sequence prediction? I assumed they did that already.
       | 
        | If they don't, I'm amazed they work as well as they do. Consider
        | 2-bit sequence prediction with the following possible outcomes
        | and associated probabilities:
        | 
        |     00: p=0.36
        |     01: p=0.04
        |     10: p=0.30
        |     11: p=0.30
        | 
        | So the most likely 2-bit sequence is 00. But on the basis of
        | predicting the next token (bit) alone, we have:
        | 
        |     0: p=0.40
        |     1: p=0.60
       | 
       | which suggests that 1 is the next bit and leads to a suboptimal
       | starting point for predicting the bit after that. The error is
       | even more prominent with longer sequences as the joint
       | probability distribution becomes more unfactorizable into
       | marginal distributions (as I would expect any minimal algorithmic
       | description of real-world data to be).
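        | 
        | A quick numerical check of the example, for anyone following
        | along (plain arithmetic, no model involved):
        | 
        |     joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}
        | 
        |     p_first = {b: round(sum(p for s, p in joint.items()
        |                             if s[0] == b), 2) for b in "01"}
        |     print(p_first)      # {'0': 0.4, '1': 0.6} -> greedy picks 1
        |     print(max(joint, key=joint.get))  # 00 is most likely overall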
       | 
       | Edit: now that I think about this a bit more, a cool research
       | project that would be really simple to carry out might be to
       | modify the cross-entropy loss function to consider _only_ the nth
       | future token in the text training data, and then plot LLM
       | performance vs n, assuming that for all current LLM models we
       | just have n=1.
       | 
       | My hypothesis is that you can mostly bypass all of the resource
       | blow-up involved in predicting the joint probability distribution
       | over the next 1 through n tokens (which scales as x^n) by just
       | predicting the nth token directly, since doing so would
       | implicitly require a better data model (at least for human-
       | generated text; this wouldn't be the case for all types of data).
        
         | faabian wrote:
         | Language models factor the joint probability p(y, x) as p(y, x)
         | = p(y|x) p(x) which is exact. I.e. if you train a language
         | model on your distribution _and sample with temperature 1_ ,
         | you will get the exact same distribution out. If you sample at
         | lower temperature or even greedily, evidently, you will get
         | other distributions.
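          | 
          | On the 2-bit example upthread, a back-of-the-envelope check of
          | that claim (my sketch, nothing from the paper):
          | 
          |     import random
          |     from collections import Counter
          | 
          |     joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}
          |     p1 = {b: sum(p for s, p in joint.items() if s[0] == b)
          |           for b in "01"}
          | 
          |     def sample():        # first bit, then conditional bit
          |         x = random.choices("01", [p1["0"], p1["1"]])[0]
          |         cond = [joint[x + b] / p1[x] for b in "01"]
          |         return x + random.choices("01", cond)[0]
          | 
          |     n = 100_000
          |     counts = Counter(sample() for _ in range(n))
          |     print({s: round(counts[s] / n, 2) for s in joint})
          |     # comes out close to 0.36 / 0.04 / 0.30 / 0.30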
        
         | puttycat wrote:
         | You're mixing training loss (cross-entropy/surprisal of next
         | token) and post-training prediction decoding (done e.g. with
         | beam search)
        
           | Xcelerate wrote:
           | Training loss considers only the next single token, right?
           | (I'm not up-to-date on the SOTA.)
           | 
           | I thought post-training prediction still only directly
           | predicts the next token and beam search is sort of a meta-
           | model applied over that (i.e., it is a model on top of the
           | output of the model that performs next-token prediction--beam
           | search considers at each iteration a subset of the current
           | next-token predictions ranked by their probability to use as
           | multiple starting points for predicting the next token, while
           | keeping track of the joint probabilities to prune the set of
           | candidate sequences at each step).
           | 
           | Seems like beam search would fail drastically in cases where
           | the true (unknown) probability distribution over all
           | sequences of tokens of length n has very low conditional
           | probabilities for the first few tokens, each given the
           | computed joint probability of the prior predicted tokens.
           | That is, the true values of p(t2|t1), p(t3|t2,t1),
           | p(t4|t3,t2,t1), ... as derived from the unknown
           | p(t1,t2,...,tn) are very small, but very high when computed
           | via a next-token prediction model.
           | 
           | I'm suggesting to modify both. Use cross-entropy of the nth
           | token for training loss. Use cross-entropy of nth token for
           | post-training prediction and then work backward from there to
           | the beginning of your sequence prediction.
        
             | namibj wrote:
             | The problem is that a position's probability output is
             | conditioned via attention on all previous positions.
             | 
             | If you want to be better you need to switch to DDPMs for
             | example (e.g. an encoder-only transformer to predict
             | diffusion transition probabilities in parallel, then apply
             | steps of denoising).
             | 
             | The problem is just that these don't work so well from auto
             | regressive decoder transformers, and encoder-decoder
             | architectures like e.g. Google's T5 have fallen out of
             | favor since about LLAMA dropped.
        
         | sebzim4500 wrote:
         | This is how they work and it's a real problem when doing
         | prediction with low temperatures.
         | 
          | IIRC you see weird patterns in LLM outputs: since "an" is often
          | less likely than "a", you end up with fewer nouns beginning
          | with vowels than you would expect.
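         | 
         | For example (made-up numbers): if p("an") = 0.4 and p("a") =
         | 0.6 at some position, greedy decoding emits "a" every single
         | time, so nouns that would naturally follow "an" are
         | systematically undersampled.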
        
         | hiddencost wrote:
         | It's called the Markov assumption. It was basically the single
         | most important piece of mathematics in the field for decades.
         | It allowed us to solve otherwise intractable problems given the
         | limited compute budgets of the time.
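         | 
         | Concretely (the standard k-th order form, not specific to this
         | paper): instead of modeling p(t_n | t_1, ..., t_{n-1}), you
         | assume p(t_n | t_{n-k}, ..., t_{n-1}), i.e. only the last k
         | tokens matter.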
        
           | Xcelerate wrote:
           | Sure, and it's probably the wrong assumption to make in this
           | case if our eventual goal is to capture general reasoning
           | ability via LLMs.
        
         | HanClinto wrote:
         | This is a fascinating point.
         | 
         | If I'm reading you right, you're saying that a simple way to do
         | this would be to calculate logits for not just the next token,
         | but also n+1 -- all at the same time. If one of the n+1 logits
         | is chosen, then do an infill on the skipped token for the next
         | step, then resume.
         | 
          | This could get us around the example that you gave with only a
          | linear increase in the size of the output space -- looking one
          | extra token ahead only doubles it, and adding a third token
          | only triples it.
         | 
         | This seems really promising!
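         | 
         | E.g., purely in terms of shapes (made-up numbers, just to
         | illustrate the linear-vs-exponential point):
         | 
         |     import numpy as np
         | 
         |     V = 32_000  # illustrative vocab size
         |     logits_n = np.random.randn(V)   # head for position n
         |     logits_n1 = np.random.randn(V)  # head for position n+1
         |     # Joint choice over both heads: 2*V outcomes, not V**2.
         |     combined = np.concatenate([logits_n, logits_n1])
         |     head, token = divmod(int(np.argmax(combined)), V)
         |     # head == 1 means position n+1 was chosen first, so the
         |     # skipped position n gets infilled on the next step.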
        
         | ljlolel wrote:
         | https://arxiv.org/pdf/2404.19737
        
         | elcomet wrote:
         | I think you're not looking at this from the right perspective.
          | An LLM is designed to sample text that follows the training
          | distribution. It is not designed to tell you the "most likely"
         | text that follows, and we don't actually want that. This would
         | mean you have no diversity in your outputs.
         | 
         | In your example, sampling a 0 in 40% of cases and a 1 in 60% of
         | cases does make sense for chat applications.
         | 
         | For applications where we _do_ care about the most likely
          | sentence (e.g. question answering), beam search helps, as
         | others have mentioned.
         | 
         | Another thing to consider is that the model can "look ahead"
         | and precompute what the future tokens might be. And it can then
          | use this to predict the current token. In fact, some work has
          | investigated this, such as [1].
         | 
          | And a final note: predicting one token at a time is what we do
          | as humans when we speak, so clearly it is not a _wrong_
          | approach. We do this "look ahead" in our minds before speaking.
         | 
         | [1] https://arxiv.org/abs/2404.00859
        
           | tmoertel wrote:
           | > And a final note, predicting one token at a time is what we
           | are doing as humans when we speak...
           | 
           | I wouldn't be surprised if we could predict token groups.
           | When speaking off the cuff, people often rely on well-worn
           | phrases and cliches.
        
             | ivalm wrote:
              | I think that corresponds to well-worn phrases being a
              | "single token".
        
             | abakker wrote:
             | Indeed. An interesting reference to this is the work
              | Milman Parry did to describe the key phrases in the
             | Odyssey and the queues they gave to help someone memorize
             | the poem.
             | 
              | Also, this is maybe a semantic point, but I am not
              | predicting any words I speak, not in a statistical sense. I
              | have intent behind my words, which means I have an
              | abstraction of meaning that I want to convey, and I
              | assemble the correct words to do that. No part of that is
              | "predictive".
        
               | thwarted wrote:
               | > _describe the key phrases in the Odyssey and the queues
               | they gave to help someone memorize the poem._
               | 
               | Queues of words give cues to help memorize.
        
               | abakker wrote:
               | lol. Hurray for speech to text, I guess.
        
           | wantsanagent wrote:
           | > "It is not designed to tell you the "most likely" text that
           | follows,"
           | 
            | It is exactly designed to do that; a temperature of 0 is what
            | approximates it. The crucial point, though, is that it is the
            | most likely next word given the preceding multi-token
            | context, not just the previous token.
        
           | Xcelerate wrote:
           | > It is not designed to tell you the "most likely" text that
           | follows, and we don't actually want that. This would mean you
           | have no diversity in your outputs.
           | 
           | No, we specifically _do_ want  "most likely" to follow; the
           | goal is to approximate Solomonoff induction as well as
           | possible. See this recent paper by Hutter's team:
           | https://arxiv.org/pdf/2401.14953
           | 
           | Quote from the paper:
           | 
           | "LLMs pretrained on long-range coherent documents can learn
           | new tasks from a few examples by inferring a shared latent
           | concept. They can do so because in-context learning does
           | implicit Bayesian inference (in line with our CTW
           | experiments) and builds world representations and algorithms
           | (necessary to perform SI [Solomonoff Induction]). In fact,
           | one could argue that the impressive in-context generalization
           | capabilities of LLMs is a sign of a rough approximation of
           | Solomonoff induction."
           | 
           | > In your example, sampling a 0 in 40% of cases and a 1 in
           | 60% of cases does[n't] make sense for chat applications.
           | 
           | I didn't say anything about sampling. A sequence prediction
           | model represents a mapping between an input sequence and a
           | probability distribution over all possible output sequences
           | up to a certain length.
           | 
           | My example uses a binary alphabet, but LLMs use an alphabet
           | of tokens. Any chat application that expresses its output as
           | a string of concatenated symbols from a given alphabet has a
           | probability distribution defined over all possible output
           | sequences. I'm simply comparing the fundamental limitations
           | of _any_ approach to inference that restricts its outcome
           | space to sequences consisting of one symbol (and then layers
           | on a meta-model to generate longer sequences by repeatedly
           | calling the core inference capability) vs an approach that
           | performs inference over an outcome space consisting of
           | sequences longer than one symbol.
        
         | BoiledCabbage wrote:
         | > 0: p=0.40 1: p=0.60 which suggests that 1 is the next bit and
         | leads to a suboptimal starting point for predicting the bit
         | after that. The error is even more prominent with longer
         | sequences as the joint probability distribution becomes more
         | unfactorizable into marginal distributions (as I would expect
         | any minimal algorithmic description of real-world data to be).
         | 
         | Can someone explain this part a bit more? I'm not seeing the
         | issue. From what I see, if the first token (t1) output is a
          | zero, then the next token (t2) would have probabilities
          | 0: p=.90 and 1: p=.10 (and t2 would be .50/.50 if t1=1).
         | 
         | Mathematically, those line up with the initial distribution, so
         | what's the concern? That's how conditional probability works.
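         | 
         | A quick sanity check of those numbers (my own arithmetic; the
         | joint below is just the product of the stated marginal and
         | conditionals):
         | 
         |     p_t1 = {0: 0.40, 1: 0.60}
         |     p_t2 = {0: {0: 0.90, 1: 0.10},   # given t1 = 0
         |             1: {0: 0.50, 1: 0.50}}   # given t1 = 1
         |     joint = {(a, b): p_t1[a] * p_t2[a][b]
         |              for a in (0, 1) for b in (0, 1)}
         |     # e.g. joint[(0, 0)] == 0.40 * 0.90 == 0.36
         |     assert abs(sum(joint.values()) - 1.0) < 1e-9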
        
         | cgearhart wrote:
         | What you've described is basically the problem with greedy
         | sampling in the decoder. Many other local optimization sampling
         | strategies exist (e.g., beam search) and there's been a lot of
         | work on more global sampling (e.g., speculative decoding).
        
       | ralusek wrote:
        | Given that LLMs appear to, in large part, "think" by virtue of
        | feeding their own output back in as input, people have
        | consistently noticed that insisting that the model "think out
        | loud" results in higher-quality reasoning. I.e., "chain of
        | thought" prompting contrasts simply having the model answer a
        | question directly with first having it write out things like:
       | 
       | - restating what it thinks is being asked of it
       | 
       | - expressing a high level strategy over what sort of information
       | it might need in order to answer that question
       | 
       | - stating the information it knows
       | 
       | - describing how that information might inform its initial
       | reasoning
       | 
       | etc...
       | 
       | I'd be concerned that going about this by having the model
       | predict the next multiple tokens at any given time would
       | essentially have the opposite effect.
       | 
       | Chain of thought prompting appears to indicate that a model is
       | "smarter" when it has n + m tokens than when it just has n tokens
       | as input. As such, getting the next 5 tokens for a given n might
       | net worse results than getting the next 1 token at n, then the
       | next 1 token at n + 1, and so on.
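       | 
       | For concreteness, a toy version of the two prompting styles I
       | mean (illustrative strings only, not from the paper):
       | 
       |     direct = "Q: {question}\nA:"
       |     cot = ("Q: {question}\n"
       |            "Restate the question, list what information is\n"
       |            "needed, state what you know, explain how it\n"
       |            "applies, and only then give the final answer.\n"
       |            "A:")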
        
         | imtringued wrote:
         | If the LLM had an affordable model it would always generate
         | enough tokens for the task at hand. The fact that this
         | particular method would require more tokens would be
         | irrelevant. If you don't have an affordable model, then you
         | would always be at the mercy of the LLM being biased towards
         | answering with an estimate instead of the actual answer.
         | 
         | Also, most speculative decoding strategies produce identical
         | output compared to running the model sequentially. If the
         | prediction is wrong, the token gets discarded and the speedup
         | is lost.
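         | 
         | For reference, a rough sketch of that accept/discard rule
         | (simplified to a greedy check; draft_next and target_next are
         | assumed stand-ins for the small and large models):
         | 
         |     def speculative_step(draft_next, target_next, seq, k=4):
         |         # draft_next / target_next: fn(seq) -> next token.
         |         # Real schemes verify all drafted tokens in one
         |         # batched pass and use rejection sampling.
         |         s = list(seq)
         |         draft = []
         |         for _ in range(k):
         |             t = draft_next(s)      # cheap model proposes
         |             draft.append(t)
         |             s.append(t)
         |         out = list(seq)
         |         for t in draft:
         |             if target_next(out) == t:  # target agrees: keep
         |                 out.append(t)
         |             else:  # mismatch: drop the rest of the draft
         |                 out.append(target_next(out))
         |                 break
         |         return out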
        
       | bravura wrote:
        | I wonder if, instead of just predicting the next n tokens, it
        | could also predict, say, 128, 512, or 2048 tokens ahead, thus
        | learning long-term discourse structure.
        
         | HanClinto wrote:
         | Might be good to have some flexibility in where those
         | particular tokens are placed, but yeah -- I could see value in
         | creating a "pool" of tokens that should be used at some point
         | in the future in the answer.
        
       | lucidrains wrote:
        | wow, so ProphetNet does work! i spent so much time experimenting
       | with it back in the day, but just lacked the scale to see a
       | positive result.
        
       | riku_iki wrote:
        | It's interesting that they got good results with the 200B,
        | 0.8-epoch training set, but once they scaled it to 1T and 4
        | epochs, they got degradation in the vast majority of benchmarks
        | (Table 1).
        
       | jmount wrote:
        | After inventing multi-token prediction, one then invents a useful
        | language-oriented hierarchy (such as sections, paragraphs,
        | sentences, and words).
        
       ___________________________________________________________________
       (page generated 2024-05-01 23:01 UTC)