[HN Gopher] Better and Faster Large Language Models via Multi-To...
___________________________________________________________________
Better and Faster Large Language Models via Multi-Token Prediction
Author : jasondavies
Score : 277 points
Date : 2024-05-01 08:28 UTC (14 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| Havoc wrote:
| How does that still end up making grammatical sense?
|
| If token/word +1 and +2 are predicted independently, then surely
| it often won't?
| vletal wrote:
| The n+1-th token is discarded if it is unlikely given the n-th
| token.
| wongarsu wrote:
| They just throw the predictions for +1 and +2 away, and only
| generate them for more efficient training.
|
| The abstract doesn't make that clear, but from the description
| of figure 1: "During inference, we employ only the next-token
| output head. Optionally, the other three heads may be used to
| speed-up inference time"
|
| Maybe you can use all three heads if you take the top
| prediction from all of them, but that prevents you from doing
| any of the common sampling strategies. I'm not sure how many
| people actually run an LLM with temperature 0 outside of
| benchmarks, unless they do something even better than applying
| a temperature
| faabian wrote:
| Exactly, but there is also a rejection sampling based method
| for speculative sampling: https://arxiv.org/abs/2302.01318
| Havoc wrote:
| Thanks for explaining
| throw310822 wrote:
| Apologies in advance for the super naive question; but assuming
| that we can create vectors that encode for the meaning of entire
| sentences, what prevents us from training llms to predict those
| vectors instead of single words?
| yorwba wrote:
| You still need to convert between words and sentence vectors
| somehow. You could try using a faster model for that, but I
| suspect that the output quality will suffer.
| magicalhippo wrote:
| LLMs somehow need to do this anyway, implicitly or not. Has
| anyone tried to do it more explicitly?
|
| That is, to break out the idea that characters are formed into
| words, and words into sentences, and a sentence is a sequence
| of "concepts", for lack of a better description.
|
| So have one NN which takes a sequence of tokens and predicts
| a moderately-dimensional "word vector", which is fed into
| another which predicts a high-dimensional "concept vector".
|
| Then the "thinking layer" would map a sequence of "concept
| vectors" to "concept vectors", and then you'd have some
| layers which do the reverse of the input layers to output
| tokens which can be printed.
|
| Thought being that by splitting it up like this you could
| swap out the decode and encode layers independently to
| translate, for example, and so on.
|
| Just a shower thought.
| flawsofar wrote:
| A sentence is both a sequence and a tree of phrases.
| Phrases have a head and one or more valents; they're
| relations.
|
| If you wanted to create an embedding algorithm for phrases,
| you could and you could throw a transformer at it.
|
| I don't know how you get the output of higher levels to
| diffuse to phrase and word levels.
| marcyb5st wrote:
| Not a stupid question in my opinion.
|
| The problem is that once you have the vectors representing the
| answers you need something like another model that goes back to
| a word representation of said answers. Something like a
| diffusion model but for text. Additionally, the function that
| this diffusion model will approximate won't be injective, but
| at best surjective and at worst not even a function (in the
| mathematical meaning) since many textual representations are
| possible given an embedding, and most of those won't be valid
| (not grammatically valid, nonsense sentences, ...).
|
| Finally remember that the embeddings are a "lossy"
| representation of some datum and so the inverse function will
| lose a lot of the nuances/context/... .
|
| LLMs avoid the problems above by predicting the next (now next
| n tokens) in a way that is self consistent with the query and
| the previous n tokens, so the function they approximate should
| be mostly surjective.
| throw310822 wrote:
| > The problem is that once you have the vectors representing
| the answers you need something like another model that goes
| back to a word representation of said answers. Something like
| a diffusion model but for text.
|
| Could it be just a smaller llm that takes as input both the
| semantic vector and the prompt, and is trained to predict the
| output tokens based on those? A model with high linguistic
| abilities and very little reasoning skills.
| marcyb5st wrote:
| I think what you suggest would be very similar to a
| encoder-decoder architecture, which has been abandoned in
| favor of decoder-only architectures
| (https://cameronrwolfe.substack.com/p/decoder-only-
| transforme...). So I am guessing that what you suggest has
| already been tried and didn't work out, but not sure why
| (the problems I mentioned above or something else).
|
| Sorry, that's where the limit of my knowledge is. I work on
| ML stuff, but mostly on "traditional" deep learning and so
| I am not up to speed with the genAI field (also, the sheer
| amount of papers coming out makes it basically impossible to
| stay up to date if you're not in the field).
| HanClinto wrote:
| > Something like a diffusion model but for text.
|
| This actually sounds amazingly useful.
| soulofmischief wrote:
| I've played around with the concept in the past but there
| are some issues yet to be solved to make them more
| practical than current generation LLMs.
|
| People are working on it though:
| https://arxiv.org/pdf/2305.09515
| faabian wrote:
| Author here -- that's a very good point and, as I understand it,
| work in progress in different teams. Training autoencoders for
| language is actually super easy given the small amount of
| information contained in text (compared to vision/video); the
| hard part is making the model focus on the semantic part if all
| the signal we have comes from exact matches in token space. Hence
| Yann LeCun's ideas on joint embedding predictive architectures.
| Note also that there is always a trade-off between auxiliary
| tasks giving more signal but shifting the focus. In our case,
| we noticed degradation if the number of predicted tokens is too
| high. So latent prediction methods need to sort out what is
| useful.
| mike_hearn wrote:
| Aren't the models already doing this, in a way? We know they
| can do things like write rhyming poems and song lyrics that
| do make perfect sense, so at some point the activations must
| be encoding some sort of overall plan for the upcoming
| sentences, even if maybe every word isn't predicted yet.
| faabian wrote:
| Yes. Otherwise next-token models wouldn't be nearly as good
| as they are. But the question is how to train these
| capabilities most efficiently! We had some interesting
| findings on how with increasing model/dataset scale/data
| quality, capabilities can move from "only learnable with
| multi-token prediction" to "indifferent" and "multi-token
| prediction actually hurts". This depends on the capability
| itself, induction e.g. matures way earlier in this sense
| than code generation capabilities.
| mike_hearn wrote:
| Is it possible that the anti-scaling effect occurs because
| you are removing some middle layers to free up space for
| the extra output heads? I only scanned the paper quickly
| but what happens if you treat the technique as strictly
| additive and don't keep parameter sizes fixed?
| mjburgess wrote:
| > so at some point the activations must be encoding some
| sort of overall plan for the upcoming sentences
|
| This isn't obviously the case; compare this "intelligent
| designer" view with evolution: there was no prior plan for
| rabbits. It's sufficient to create the appearance of design
| that sequential steps are simply probabilistically
| modulated by prior ones.
|
| Consider a continuation of "the cat...": merely a
| distribution over all possible words suffices to create the
| illusion of a plan, suppose: "the cat sat..." then, "on..,
| the..." etc. follow from the training data.
|
| I think there's a strong argument against trying to model
| entire sentences exactly because the system isn't modelling
| semantics: one _should_ expect accuracy to drop off a cliff
| if there is no actual plan. I.e., predicting "sat on the
| mat" from "cat" _shouldn't_ be a valid prediction, because
| of the infinite number of possible continuations that _as a
| whole_ are terrible (e.g., what about "chased the mouse"
| etc.). The space of all possible _sentences_ to continue
| from "the cat" is infinite, with much of that space not
| actually useful; whereas the number of words is very small,
| very finite, and many of them not useful.
|
| The only reason that "the cat sat..", "the cat sat on..."
| is reasonable is because each sequential word can be
| modulated by the prompt to seem as if planned.
| edmara wrote:
| The modelling is advanced enough that you can't
| fundamentally distinguish it from (lossy, limited)
| planning in the way you're describing.
|
| If the KQV doesn't encode information about likely future
| token sequences then a transformer empirically couldn't
| outperform Markov text generators.
| mjburgess wrote:
| No one is spending $10-50mil building a markov text model
| of everything ever digitised; if they did so, their
| performance would approach a basic LLM.
|
| Though, more simply, you can just take any LLM and
| rephrase it as a Markov model. All algorithms which model
| conditional probability are equivalent; you can even
| unpack a NN as a kNN model or a decision tree.
|
| They all model 'planning' in the same way: P(C|A, B) is a
| 'plan' for C following A, B. There is no model of P("A B
| C" | "A B"). Literally, at inference time, no computation
| whatsoever is performed to anticipate any future
| prediction -- this follows trivially from the
| mathematical formalism (which no one seems to want to
| understand); you can also see it empirically:
| inference time is constant _regardless_ of the
| prompt/continuation.
|
| The reason 'the cat sat...' is completed by 'on the mat'
| is that P(on|the cat sat...), P(the|the cat sat on...),
| P(mat|the cat sat on the...) are each maximal.
|
| _Why_ they're maximal is not in the model at all, nor in
| the data. It's in the data generating process, i.e., us. It
| is we who arranged text by these frequencies and we did
| so because the phrase is a popular one for academic
| demonstrations (and so on).
|
| As ever, people attribute to "the data" or, worse, "the
| LLM" properties it doesn't have... rather it replays the
| data to us and we suppose the LLM must have the property
| that generated this data originally. Nope.
|
| Why did the tape recorder say, "the cat sat on the mat"?
| What, on the tape or in the recorder made "mat" the right
| word? Surely, the tape must have planned the word...
| flawsofar wrote:
| In case you're thinking that rhyming requires planning,
| that's just as silly as a rabbit tanning.
|
| You can make things up as you go, and the constraints
| emerge from the flow.
| gbasin wrote:
| great comment
| jerpint wrote:
| My understanding is that tokenization is part of the
| bottleneck. When you break up a sentence into tokens, each token
| gets a vector representation. The dictionary of all tokens
| would be infinite if it was at the sentence level
| faabian wrote:
| Vectors can do what one-hot vectors cannot do -- no one said
| inputs need to be rows from a token_id -> vector embeddings
| map. Basically, we are doing this already by moving from one-
| hot vectors to n-tuples of one-hot vectors, increasing the
| effective vocabulary size from V to V^n.
| bjourne wrote:
| That's hierarchical prediction. On one level you predict the
| style of the paragraph, on another the tone and form of the
| sentence, and on the third the next word. However, this form of
| prediction is quite difficult since predictions from the layers
| affect each other.
| wangii wrote:
| the problem is then the total amount of computation drops
| dramatically, which leads to much less "thinking" power. I
| think the idea originated from an understanding that when we
| write/speak, we have an overall idea. My current hypothesis is
| it's probably an illusion.
|
| you may want to search for "filler" papers to read.
| everforward wrote:
| Also a noob here, if we encoded, trained on and synthesized
| sentence vectors, wouldn't that move the AI's ability to create
| novel things up from sentences to words?
|
| I.e. we currently operate on words (roughly) so the AI can only
| use words it knows but can synthesize unique sentences from
| words. If the AI operates on sentences, wouldn't it only be
| able to regurgitate sentences it has seen before? So it could
| synthesize novel paragraphs, but not sentences?
|
| I'm not convinced that sentences are a useful abstraction for
| AI (in English, anyways). They're barely useful to humans.
| Check out your average chat conversation, email, YouTube
| comment, etc. There's a very good chance the sentences aren't
| actually sentences, or that they haven't even bothered to use
| punctuation.
|
| I just don't think sentences map to a semantic device. A
| sentence could be two words or half an English paper depending
| on the writer. It could traverse a half dozen ideas or a single
| one. Where a sentence ends generally is more about the writer
| than the semantics.
| bradley13 wrote:
| I read an article that pointed out that LLMs literally have a
| one-dimensional window onto the world. Everything is just a
| sequence of tokens.
|
| Maybe this sort of multi-token prediction takes their view into
| 1.1 dimensions? In any case, there is a real argument for
| expanding that window, somehow, into two or more dimensions.
| mike_hearn wrote:
| Well, it feels like architecturally there's a lot of scope to
| do better for coding tasks specifically. Like, if you had FAIR
| level resources and wanted to train a really great Java coding
| model for example it would make sense to train the model to
| predict ASTs rather than tokens. You'd still need some kind of
| joint normal LLM for predicting comments, identifier names and
| so on, but you wouldn't model the program itself as a stream of
| tokens. Instead it would predict things like "add an if block",
| "add a method call block with 4 parameters" and so on.
|
| You could also train the model to expect certain context window
| positions to be reserved for things like "type members at the
| current cursor" and then integrate the inferencing loop with
| IDE/LSP-style static analysis. This would allow the model to
| see more information than is actually contained in the text.
|
| I think the reason we're not seeing models like this right now
| is the cost of doing such research combined with the fact that
| AI people are all Python-heads, and Python doesn't benefit
| much from IDEs.
| bradley13 wrote:
| That sounds right. My vague idea of a "second dimension"
| could well be some sort of structure - be it ASTs for
| programming languages or for natural language.
|
| Another possibility would be some sort of fixed knowledge
| base, which could be program language documentation or
| "common sense" like CYC wants to provide.
| albertzeyer wrote:
| For those who know speculative decoding: This is basically self-
| speculative decoding. It still auto-regressively feeds the
| predicted label sequence through the network again, and only
| keeps the prediction up to the point where it matches. So it will
| not get worse in performance but only faster (here up to 3 times,
| which is normal for speculative decoding).
|
| Due to the multi-task training, it will however also get better.
| (This idea is already quite old, to predict multiple targets into
| the future as an auxiliary loss.)
|
| Nice work.
| imtringued wrote:
| The problem with speculative decoding is that there are hardly
| any models that support it and adding support takes extra GPU
| time. If speculative decoding also improves planning
| performance, then it will be more readily adopted.
| albertzeyer wrote:
| What do you mean? Speculative decoding can be done with any
| auto-regressive model. Normally you use another much faster
| model to predict the next N subwords, and then you use the
| big model to verify whether it gets the same output, or maybe
| just reranked. Evaluating N subwords in one go is much faster
| compared to doing it subword by subword. That's why this is
| faster. Not all N words might match, so then you might need
| to redo the prediction for M < N subwords, but there are many
| simple cases where a faster and weaker model is still
| accurate enough. In the very extreme case, where N-1 subwords
| are always wrongly predicted, it would be slightly slower,
| but usually you get quite a big speedup, e.g. 3x faster or
| so.
|
| The nice thing here is that you actually don't need another
| smaller model but the model itself already predicts the next
| N subwords.
|
| Or maybe you mean it's not implemented in some of the common
| software? I'm not sure about that, but I thought it's a quite
| popular feature now.
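|
| In pseudocode the greedy variant is roughly this (a sketch with
| made-up model helpers, not any particular library's API):
|
|     def speculative_step(big, draft, prefix, n_draft=4):
|         # the draft model proposes n_draft tokens cheaply
|         guess, ctx = [], list(prefix)
|         for _ in range(n_draft):
|             t = draft.next_token(ctx)    # hypothetical helper
|             guess.append(t)
|             ctx.append(t)
|
|         # one big-model pass scores every proposed position,
|         # plus the position just after the last proposal
|         checked = big.next_tokens(prefix, guess)
|
|         out = []
|         for g, c in zip(guess, checked):
|             out.append(c)                # big model's choice
|             if g != c:                   # first mismatch: the
|                 break                    # rest of the draft dies
|         else:
|             out.append(checked[n_draft]) # all matched: free token
|         return out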
| HanClinto wrote:
| For anyone interested in exploring this, llama.cpp has an
| example implementation here:
|
| https://github.com/ggerganov/llama.cpp/tree/master/examples
| /...
| techbruv wrote:
| > So it will not get worse in performance but only faster
|
| A bit confused by this statement. Speculative decoding does not
| decrease the performance of the model in terms of "accuracy" or
| "quality" of output. Mathematically, the altered distribution
| being sampled from is identical to the original distribution if
| you had just used regular autoregressive decoding. The only
| reason you get variability between autoregressive vs
| speculative is simply due to randomness.
|
| Unless you meant performance as in "speed", in which case it's
| possible that speculative decoding could degrade speed (but on
| most inputs, and with a good selection of the draft model, this
| shouldn't be the case).
| jasonjmcghee wrote:
| I think parent is saying the same thing as you. Pointing out
| to folks unfamiliar, speculative decoding doesn't trade
| quality for speed.
| albertzeyer wrote:
| Yes that's what I mean, speculative decoding does not
| decrease the performance in terms of quality. I guess my
| wording was confusing on this.
| mg wrote:
| Currently, LLMs start from scratch for each output token, right?
|
| Let's say you ask an LLM:
|
|     What makes bananas yellow?
|
| And it replies:
|
|     Bananas are yellow due to a pigment called bromelain.
|
| I would think that the concept of "pigment" and "bromelain" are
| already somehow activated in the neural net when it outputs "a".
| Because now it can't change its mind anymore and follow up with
| "an optical illusion that makes humans perceive every bent object
| as yellow". So it seems to have already planned ahead to talk
| about the pigment called bromelain.
|
| Would it be possible to capitalize on the work that has already
| been done when the LLM outputs "a"? Could the state of the neural
| net be somehow preserved for the next answer?
| pbh101 wrote:
| The alternative theory is that any word starting with a vowel
| sound is exceedingly uncommon after 'a' in its training set, so
| it doesn't need to plan ahead, just predict the distribution of
| the most likely next words and choose.
|
| Which is my understanding of how they work and the dynamic at
| play.
| faabian wrote:
| To some degree, attention is already a mechanism to make
| computations from previous tokens useful later. (You can think
| of the KV cache as a representation of the text so far and all
| the models thoughts on it.) And since language models are
| trained on sequences end-to-end, I think this is likely to
| happen. Multi-token prediction encourages this behavior
| explicitly but only for the small n token window you define.
|
| That said, there are many works attempting to increase the
| compute utilization of transformer language models (early exit,
| mixture of depths) and novel architectures (SSMs etc.).
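|
| To make the KV cache point concrete, a very stripped-down sketch
| (single head, no batching, toy shapes -- not how any real
| implementation is organised):
|
|     import torch
|
|     d = 16
|     Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
|     cache = {"K": torch.zeros(0, d), "V": torch.zeros(0, d)}
|
|     def step(x_new):
|         # only the new token's key/value get computed; all the
|         # "thoughts" about earlier tokens are reused from cache
|         cache["K"] = torch.cat([cache["K"], (Wk @ x_new)[None]])
|         cache["V"] = torch.cat([cache["V"], (Wv @ x_new)[None]])
|         q = Wq @ x_new
|         w = torch.softmax(cache["K"] @ q / d ** 0.5, dim=0)
|         return w @ cache["V"]            # attention output
|
|     for _ in range(5):                   # five decode steps
|         out = step(torch.randn(d))       # stand-in embeddings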
| jacobsimon wrote:
| Thanks for highlighting the KV cache, I've been wondering the
| same thing and hadn't come across that or didn't remember.
| edmara wrote:
| Transformers are still stateless, KV cache is just a
| compute-saving measure (but otherwise correctly described)
| jacobsimon wrote:
| Oh huh. Why not make it stateful, like re-use and compute
| just the "diff" when you add a new token? Assuming it's
| not that easy because each token can affect attention
| globally.
|
| I think I've read something about this but I wonder if
| you could abstract attention to sentence/page levels and
| then only recalculate the parts that are relevant.
| avianlyric wrote:
| The output of most LLMs is stochastic. The core LLM is given
| tokens, and outputs a set of ranked tokens, each with a
| "confidence", to go next. Then there's normally a filtering and
| search stage, where those ranked tokens are fed back into the
| LLM to get more ranked tokens, forming a short probability
| tree. I.e. if we pick the top N-ranked tokens and put them back
| in, each of those tokens results in a new set of N-ranked
| tokens.
|
| By looking at that tree some basic filtering is done. Such as
| picking the branch that has the highest summed confidence, or
| the branch that has the fewest repeated tokens, or the fewest
| tokens that match with input tokens, or more often some
| combination of the above, plus a random choice weighed by
| summed confidences.
|
| That's how you can give an LLM with completely fixed weights,
| which is all LLMs, the same input multiple times but get
| different outputs.
|
| So to answer your specific question, it can "change its mind".
| Every token produced creates a new opportunity for the
| stochastic output filters to pick a new path through all the
| possible outputs.
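|
| The per-token building block of that is roughly this sketch
| (temperature plus top-k, one common combination of filters):
|
|     import torch
|
|     def sample_next(logits, temperature=0.8, top_k=40):
|         # same weights, same input -> different outputs, because
|         # the final pick is a weighted random draw, not an argmax
|         logits = logits / max(temperature, 1e-6)
|         vals, idx = torch.topk(logits, top_k)
|         probs = torch.softmax(vals, dim=-1)
|         pick = torch.multinomial(probs, num_samples=1)
|         return idx[pick].item()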
| berkes wrote:
| I always presumed that "a pigment" is one token.
|
| I'm a total amateur in this field though.
| ralusek wrote:
| Tokens are not multiple words, but are actually usually parts
| of words. The rule of thumb OpenAI uses is that there will be
| 100 tokens for every 75 words.
|
| If you want to just see the tokens for yourself, though, just
| enter some text here:
|
| https://platform.openai.com/tokenizer
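|
| You can also poke at it locally, e.g. with the tiktoken package
| (exact splits depend on which tokenizer a given model uses):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     for s in ["a pigment", "paragraph", " paragraph"]:
|         ids = enc.encode(s)
|         print(repr(s), len(ids), [enc.decode([i]) for i in ids])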
| sanxiyn wrote:
| No, it isn't, at least on OpenAI tokenizer.
| jasonjmcghee wrote:
| what constitutes a token is super unintuitive.
|
| my gut said "pig" and "ment" on this one, which happens to be
| right, but my gut would also say "para" and "graph" but no,
| "paragraph" is a single token which falls way outside the
| "normal" length I see of 3-4 characters
|
| In either case, I do consistently see spaces between
| characters included as part of the token following the space.
|
| " paragraph" (10 characters) is the longest token I've seen-
| and now I wonder what the longest token is
| jasonjmcghee wrote:
| New winner " communication" at 14
| matthewdgreen wrote:
| This post is interesting:
| https://clementneo.com/posts/2023/02/11/we-found-an-neuron
| nicklecompte wrote:
| Maybe look at it another way: ask GPT to complete the following:
|
|     Bananas are yellow due to a
|     Bananas are yellow due to an
|
| In the first case it might respond:
|
|     Bananas are yellow due to a pigment called bromelain.
|
| In the second case it might respond:
|
|     Bananas are yellow due to an organic compound called
|     bromelain, which is a yellow pigment.
|
| So in either case GPT could have picked "a" or "an" without any
| impact on the semantic meaning of its response. In the extreme
| case, you could see the LLM operating according to a dumb
| heuristic:
|
|     The token following "due to" is "a" with 55% probability,
|     "an" with 45% probability.
|
| In reality it is of course more sophisticated than this. But
| this dumb heuristic would explain the behavior.
|
| And if you didn't actually include any facts about bromelain in
| the pretraining data, LLMs absolutely could autocomplete this
| with something about "an optical illusion." GPT-3 made factual
| mistakes like that pretty routinely, but I recall it figured
| out the grammatical rules of "a" and "an."
|
| I don't think the concept actually needs to be pre-activated as
| you said, though I agree with faabian that this "preactivation"
| probably does happen in some implicit/emergent sense.
| HarHarVeryFunny wrote:
| 1. Bananas are yellow due to a biochemical process known as
| carotenogenesis, which involves the synthesis and accumulation
| of carotenoid pigments.
|
| 2. Bananas are yellow due to a specific carotenoid called beta-
| cryptoxanthin, which gives the fruit its characteristic yellow
| hue.
|
| 3. Bananas are yellow due to a gradual increase in the
| concentration of carotenoid pigments as the fruit ripens and
| chlorophyll levels decrease.
|
| 4. Bananas are yellow due to a series of enzymatic reactions
| that convert starch into sugars and break down the green
| chloroplasts, revealing the underlying yellow carotenoids.
|
| 5. Bananas are yellow due to a change in the pH levels within
| the fruit cells during ripening, which triggers the production
| of yellow carotenoid pigments.
|
| 6. Bananas are yellow due to a genetic trait inherited from
| their wild ancestors, which enabled the development of
| carotenoid pigments as a way to attract seed dispersers.
|
| 7. Bananas are yellow due to a complex interplay between
| various plant hormones, such as ethylene and abscisic acid,
| which regulate the ripening process and pigment formation.
|
| 8. Bananas are yellow due to a metabolic shift from chlorophyll
| synthesis to carotenoid synthesis as the fruit reaches
| maturity.
|
| 9. Bananas are yellow due to a natural defense mechanism that
| involves the production of carotenoid pigments, which protect
| the fruit from oxidative stress during ripening.
|
| 10. Bananas are yellow due to a evolutionary adaptation that
| helps the fruit stand out against the green foliage, making it
| more visible to potential seed dispersers.
|
| The output of an LLM is usually randomly sampled from the top
| few highest probability next token/word predictions, but the
| model itself has no idea which word the sampler will pick. It
| presumably has some conceptual plan of what could follow "a",
| or any of its other suggestions, but any such plan (high level
| prediction) is then rethought from scratch once "a" is
| generated.
|
| The model not only can, but has to, change its mind after each
| word generated, so this "planning ahead" is very ephemeral -
| more like a freestyle rapper making it up on the fly than
| someone thinking deeply about how best to reply and how to
| express it.
| scoot wrote:
| > 10. Bananas are yellow due to _a evolutionary_ adaptation
|
| Did an LLM really make this basic grammatical error?
| HarHarVeryFunny wrote:
| Yes it did (free Claude Sonnet), but presumably only
| because it was trained on examples of us making the same
| mistake!
| scoot wrote:
| > presumably only because it was trained on examples of
| us making the same mistake
|
| That's what makes it surprising.
| Workaccount2 wrote:
| There has to be more going on somewhere in the system. You
| can ask GPT4 to describe something (Vermont) and ask it to
| end to description with a chosen word (house). It will then
| usually be able to write out a descriptive paragraph that
| ultimately lands on your chosen word pretty seamlessly -
| "...embodying a serene and rural charm that culminates in the
| warm, welcoming feel of a cozy house."
|
| Models do struggle with these tests, for sure, but from an
| analytical standpoint, a "next token predictor" should not be
| able to ever correctly land on the right token 100 tokens in
| the future.
|
| Edit: Thinking about it, I suppose it is possible that the
| model can encode a "destination" in the first token. Like a
| pool shot that is artfully bounced off many bumpers to hit a
| ball, perhaps the LLM can encode a "path" to a destination
| token in the first token generated. Which might be even
| crazier as it suggests that the model is playing a meta-game
| with being able to precisely manipulate the individual layers
| of output, even though those layers are disparate from token
| to token.
| HarHarVeryFunny wrote:
| The "next token predictor" description is a bit too
| literal, and anyways incomplete. A transformer has a lot of
| layers (e.g. 96 for GPT-2) and it's only at the input that
| it has pure token embeddings (ignoring positional
| encoding). At the output it's a built-up embedding that's
| decodable into a token.
|
| One way to think of what all the intermediate layers are
| doing is to consider them as levels of a linguistic parse
| tree with the leaves (words) at the bottom and trunk
| ("sentence") at the top, except in the transformer the
| evolving embeddings at each level contain semantic as well
| as syntactic information. This largely hierarchical view of
| language was the motivation for the transformer design.
|
| It seems we should really think of each layer of the
| transformer as an independent predictor, with increasingly
| abstract and more semantically complete information
| available as we ascend the transformer layers towards the
| output. Predict next token is only what the transformer is
| being trained to do at the output layer. At the inner
| layers (i.e. the bulk of what the transformer is doing), it
| will be predicting at these higher levels of representation
| held at those layers.
|
| These models do struggle (although getting better) at
| ending on a given word rather than starting on it, and
| understandably so since random sampling and continual
| resetting after each output token (= new input sequence)
| means that planning ahead at level of word specificity is
| simply not an option. They have to continually adapt to
| next sampled token, and take it from there.
|
| I'm guessing that ability to end on a chosen word is due to
| continued salience of that word during generation,
| prediction of sentence fragments using/ending with that
| word, and opportunistic stopping when it has been emitted
| and the sentence is complete. Kind of the same way you
| might do it yourself if you just started talking
| immediately without planning, while trying to end on a
| given word.
| wewtyflakes wrote:
| I wonder if this means that these types of models would perform
| better (or worse?) for languages that do not have this sort of
| forward-looking grammar.
| hhcoder wrote:
| I am curious what happens if the multiple tokens predicted
| interfere with one another. Say I ask "What are the colors of the
| rainbow?", if one of the tokens is a repeated color, how do we
| resolve that?
| bjornsing wrote:
| I've been thinking about this, but I'm leaning more towards
| letting the LLM output a small PixelCNN or similar model over the
| next N tokens. That way the LLM can describe conditional
| probabilities over the coming tokens.
| deskamess wrote:
| Side track: There is so much going on this space. I wish there
| was a chronological flow of a machine learning scenario/story
| with all the terms being introduced as we meet them (data, pre-
| training, training, inference, mixture of experts, RAG). Like
| someone walking me through a factory explaining what happens at
| each stage (like Mr Rogers used to do). Most of the time I do not
| know where the terms fit in the big picture. When I first came
| across pre-training I thought it was something done to the data
| before training happened but it was actually another training.
| berkes wrote:
| > Most of the time I do not know where the terms fit in the big
| picture.
|
| Nor do the majority of "AI" experts and consultants that I see
| on LinkedIn, Twitter or in podcasts.
|
| The S/N ratio is very low in this field. Just pick some
| documentation from "industry leaders" like Langchain and see
| that not only is it already and always outdated, it sometimes
| simply contradicts itself.
|
| In the "blockchain hype" this was similar, so I guess it's a
| trait of the hype train.
| highwaylights wrote:
| Totally agree with the above, although I'm not sure that
| documentation on tools like Langchain is a reflection of the
| hype in the way social media is. I think in that case it's
| just a reflection of the pace things are moving at.
| pixl97 wrote:
| >Nor do the majority of "AI" experts...
|
| I mean yes, this is what a rapidly expanding field looks like
| that's probing the boundaries of its problem space. Kind of
| like following physics in the early to mid 1900s. Different
| classes of problems have barely been tested against each
| other, much less fully explored themselves.
|
| In some ways it reminds me of the earlier days of the
| internet when progress was still very rapid.
| highwaylights wrote:
| I feel your pain and excitement in equal measure.
|
| It can be hard to know where to start with some of these
| concepts, especially so given that a lot of recent developments
| (e.g. RAG) are developing so rapidly that there's unlikely to
| be a reference book you could refer to anytime soon that would
| be current.
|
| That said, I do find that documentation is getting better
| depending on where you look. The documentation for higher level
| tools like LlamaIndex is a good starting point for
| understanding the concepts (not so much in terms of
| _explaining_ the concepts, but showing where they fit into the
| overall picture, then you can deep-dive elsewhere on the
| different parts).
|
| YouTube has always been a mixed bag of very little solid
| information in a sea of non-experts trying to attract clicks
| for the latest trends, so it's not a great starting point IMHO.
| phkahler wrote:
| >> YouTube has always been a mixed bag
|
| As an outsider but avid reader of this stuff linked from HN,
| I would recommend the channel 3blue1brown. He's got several
| NN and AI related videos, and the couple I've seen were
| pretty good.
|
| https://www.youtube.com/watch?v=aircAruvnKk
| renonce wrote:
| Yeah but the other side of the coin is that they only
| explain the very basic concepts that are already settled
| for several years, not any of these "latest trends"
| objektif wrote:
| Which latest trends you mean?
| objektif wrote:
| Llamaindex docs are absolutely terrible IMO. I have gone
| through it so many times but still do not understand the
| terms and organization. Router for querying router query
| engine?
| mercer wrote:
| how would you rate Yannic Kilcher, if you're aware of him?
| saddabbas wrote:
| I recommend checking out Machine Learning Q and AI by Sebastian
| Raschka
| jstummbillig wrote:
| People waste too much time building out stuff that is really
| bad in AI right now.
|
| Of course, everything is, but instead of taking on the task of
| patching that up, the better approach would be to pretend there
| will be something that is a lot better than GPT-4 in the near
| future (because there will be) and design a differentiated
| product under that premise.
| sunir wrote:
| That's an interesting idea. What do you mean?
|
| I understand the prompts as a service is short term... but
| what is a long term product you see?
| redblacktree wrote:
| I may be misunderstanding your meaning, but I'm not
| convinced that "prompts as a service" is short term. I
| think we'll see a number of apps pop up that will be
| essentially that, i.e. powered by a generative AI, but with
| a great UX. Not everyone is good at prompting, and although
| it is a skill many will develop, packaging up great prompts
| in niche problem areas still looks like an area of
| opportunity to me. I'm not talking necessarily about chat
| experiences, but apps that can, as an example, maintain
| task lists for you after consuming your incoming
| communications.
| objektif wrote:
| How many times did you have issues communicating with
| your spouse because of prompting issues? And how did you
| resolve it?
|
| Why would you need an extra layer here?
| malnourish wrote:
| Why do PR firms and copy editors exist?
| jstummbillig wrote:
| I think a general way to answer this is by considering for
| any domain you know: What would you pay a human to do right
| now, that LLMs frustratingly can't, but should in theory,
| if only they were a bit better and more consistent?
|
| This could mean: Instead of diving into langchain and
| trying to program your way out of a bad model, or trying to
| do weird prompts, just write a super clear set of
| instructions and wait for a model that is capable of
| understanding clear instructions, because that is an
| obvious goal of everyone working on models right now and
| they are going to solve this better than your custom
| workaround can.
|
| This is not a rigid rule, just a matter of proportions. For
| example, you should probably be willing to try a few weird
| intermediary prompt hacks, if you want to get going with AI
| dev right now. But if most of what most people do will
| probably be solved by a somewhat better model, that's
| probably a cause for pause.
| mercer wrote:
| I suppose with an eye on open-source, an interesting
| 'rule' would be to set a cut-off point for models that
| can run locally, and/or are considered to be feasible
| locally soon.
| JKCalhoun wrote:
| Can I assume AI are continuing their training as they
| interact with people when deployed? Are ChatGPT, Claude,
| learning from my interactions with them? I do, BTW, correct
| them when they unknowingly (I assume) steer me wrong.
|
| One wonders, if that's the case, how quickly an AI might
| improve if it has something close to Google's search site
| throughput. I mean fielding several billion queries a day,
| for a year -- that would be some pretty stellar training
| right there I would think.
| jstummbillig wrote:
| > Can I assume AI are continuing their training as they
| interact with people when deployed?
|
| Yes, you can. Some of the big providers are fairly clear on
| where in their products this happens, and all offer a way
| out (mostly when paying for api access)
|
| > One wonders, if that's the case, how quickly an AI might
| improve if it has something close to Google's search site
| throughput
|
| Indeed. Another possibility is that user input will turn
| out to be increasingly less important for upcoming state of
| the art models.
| astrange wrote:
| They don't train as they go. Training is incredibly
| expensive.
|
| They do take your feedback and presumably do something with
| it. Your actual queries are only indirectly useful since
| they might have private info in them.
| reece_omahoney wrote:
| Check out Lilian Weng's blog
| https://lilianweng.github.io/posts/2023-01-27-the-transforme...
| snorkel wrote:
| Strongly recommend watching Andrej Karpathy's "Let's build
| GPT-2" videos on YouTube, which dive into an actual PyTorch
| implementation, then download the code and study it carefully.
| Then study "Spreadsheets is all you need" to see what the
| internal data structures look like.
| nicklecompte wrote:
| I haven't read the paper in full detail yet, but I do have a
| minor editorial comment: while the appendix L.2 was satisfactory,
| I thought the condensed argument in 5.2 was a bit too sloppy. In
| particular:
|
|     H(X) + H(Y) = H(X | Y) + 2 I(X; Y) + H(Y | X)
|
|     By discarding H(Y | X) - which appears again when predicting
|     at the following position - we observe that 2-token
|     prediction increases the importance of I(X; Y) by a factor
|     of 2.
|
| The argument about "discarding" was not clear to me - if you're
| predicting the third token Z, then shouldn't H(Y | X) be
| contained in the implicit context C, and therefore can't be
| freely discarded? I don't think this argument was clarified in
| the appendix. But this is mostly about presentation, I wasn't so
| confused as to doubt the gist of the argument.
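|
| (The identity itself is easy to sanity-check numerically on a toy
| joint distribution, so my confusion is only about the
| "discarding" step:)
|
|     import numpy as np
|
|     p = np.array([[0.36, 0.04],   # toy joint p(x, y)
|                   [0.30, 0.30]])
|
|     def H(d):
|         d = d[d > 0]
|         return -(d * np.log2(d)).sum()
|
|     Hx, Hy, Hxy = H(p.sum(1)), H(p.sum(0)), H(p)
|     I = Hx + Hy - Hxy             # I(X; Y)
|     lhs = Hx + Hy
|     rhs = (Hxy - Hy) + 2 * I + (Hxy - Hx)   # H(X|Y)+2I+H(Y|X)
|     assert abs(lhs - rhs) < 1e-12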
| faabian wrote:
| Thanks for the feedback! Let me try to state it better:
|
| In the end, we only use the next-token head for generating. So
| which parts of the 2-token target H(X) + H(Y) are "auxiliary"
| in the sense that they help learning and which are "wasted"?
| H(X | Y) and I(X; Y) are useful for next-token generation
| while, by definition, H(Y | X) is the information quantity not
| related to the next token X. So we could say: "multi-token
| prediction trades the useful information I(X; Y) from H(Y) for
| the wasted computations on H(Y | X)". However, note that H(Y |
| X) is a next-token entropy for predicting Y from the prefix (C,
| X). If the attention mechanism allows transferring computations
| already made for predicting Y|X to the next step, these
| computations may actually _not_ have been wasted -- they were
| just pre-computations.
| stealthcat wrote:
| Did you have some small toy experiments to prove this?
| Xcelerate wrote:
| Do LLMs not consider the probability distribution over all
| combinations of tokens up to a certain output length with regard
| to sequence prediction? I assumed they did that already.
|
| If they don't, I'm amazed they work as well as they do. Consider
| 2-bit sequence prediction with the following possible outcomes
| and associated probabilities:
|
|     00: p=0.36
|     01: p=0.04
|     10: p=0.30
|     11: p=0.30
|
| So the most likely 2-bit sequence is 00. But on the basis of
| predicting the next token (bit) alone, we have:
|     0: p=0.40
|     1: p=0.60
|
| which suggests that 1 is the next bit and leads to a suboptimal
| starting point for predicting the bit after that. The error is
| even more prominent with longer sequences as the joint
| probability distribution becomes more unfactorizable into
| marginal distributions (as I would expect any minimal algorithmic
| description of real-world data to be).
|
| Edit: now that I think about this a bit more, a cool research
| project that would be really simple to carry out might be to
| modify the cross-entropy loss function to consider _only_ the nth
| future token in the text training data, and then plot LLM
| performance vs n, assuming that for all current LLM models we
| just have n=1.
|
| My hypothesis is that you can mostly bypass all of the resource
| blow-up involved in predicting the joint probability distribution
| over the next 1 through n tokens (which scales as x^n) by just
| predicting the nth token directly, since doing so would
| implicitly require a better data model (at least for human-
| generated text; this wouldn't be the case for all types of data).
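|
| To make the 2-bit example above concrete (a tiny sketch, nothing
| model-specific):
|
|     joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}
|
|     best = max(joint, key=joint.get)        # "00", p = 0.36
|
|     # greedy next-bit decoding
|     p1 = {"0": joint["00"] + joint["01"],   # 0.40
|           "1": joint["10"] + joint["11"]}   # 0.60
|     b1 = max(p1, key=p1.get)                # picks "1"
|     p2 = {b: joint[b1 + b] / p1[b1] for b in "01"}
|     b2 = max(p2, key=p2.get)
|
|     print(best, b1 + b2)   # greedy lands on "10" or "11"
|                            # (p = 0.30), missing "00" (p = 0.36)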
| faabian wrote:
| Language models factor the joint probability p(y, x) as p(y, x)
| = p(y|x) p(x) which is exact. I.e. if you train a language
| model on your distribution _and sample with temperature 1_ ,
| you will get the exact same distribution out. If you sample at
| lower temperature or even greedily, evidently, you will get
| other distributions.
| puttycat wrote:
| You're mixing training loss (cross-entropy/surprisal of next
| token) and post-training prediction decoding (done e.g. with
| beam search)
| Xcelerate wrote:
| Training loss considers only the next single token, right?
| (I'm not up-to-date on the SOTA.)
|
| I thought post-training prediction still only directly
| predicts the next token and beam search is sort of a meta-
| model applied over that (i.e., it is a model on top of the
| output of the model that performs next-token prediction--beam
| search considers at each iteration a subset of the current
| next-token predictions ranked by their probability to use as
| multiple starting points for predicting the next token, while
| keeping track of the joint probabilities to prune the set of
| candidate sequences at each step).
|
| Seems like beam search would fail drastically in cases where
| the true (unknown) probability distribution over all
| sequences of tokens of length n has very low conditional
| probabilities for the first few tokens, each given the
| computed joint probability of the prior predicted tokens.
| That is, the true values of p(t2|t1), p(t3|t2,t1),
| p(t4|t3,t2,t1), ... as derived from the unknown
| p(t1,t2,...,tn) are very small, but very high when computed
| via a next-token prediction model.
|
| I'm suggesting to modify both. Use cross-entropy of the nth
| token for training loss. Use cross-entropy of nth token for
| post-training prediction and then work backward from there to
| the beginning of your sequence prediction.
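|
| For reference, the beam search meta-model I mean is roughly this
| sketch (next_probs stands in for whatever next-token model you
| have, returning a token -> probability dict):
|
|     import math
|
|     def beam_search(next_probs, prefix, width=3, steps=5):
|         # keep the width best partial sequences by summed
|         # log-prob; still built on next-token probabilities only
|         beams = [(0.0, list(prefix))]
|         for _ in range(steps):
|             cands = [(lp + math.log(p), toks + [t])
|                      for lp, toks in beams
|                      for t, p in next_probs(toks).items() if p > 0]
|             cands.sort(key=lambda c: c[0], reverse=True)
|             beams = cands[:width]
|         return beams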
| namibj wrote:
| The problem is that a position's probability output is
| conditioned via attention on all previous positions.
|
| If you want to be better you need to switch to DDPMs for
| example (e.g. an encoder-only transformer to predict
| diffusion transition probabilities in parallel, then apply
| steps of denoising).
|
| The problem is just that these don't work so well from auto-
| regressive decoder transformers, and encoder-decoder
| architectures like e.g. Google's T5 have fallen out of
| favor since about when LLaMA dropped.
| sebzim4500 wrote:
| This is how they work and it's a real problem when doing
| prediction with low temperatures.
|
| IIRC you see weird patterns in LLM outputs since "an" is often
| less likely than "a" so you end up with fewer nouns beginning
| with vowels than you would expect.
| hiddencost wrote:
| It's called the Markov assumption. It was basically the single
| most important piece of mathematics in the field for decades.
| It allowed us to solve otherwise intractable problems given the
| limited compute budgets of the time.
| Xcelerate wrote:
| Sure, and it's probably the wrong assumption to make in this
| case if our eventual goal is to capture general reasoning
| ability via LLMs.
| HanClinto wrote:
| This is a fascinating point.
|
| If I'm reading you right, you're saying that a simple way to do
| this would be to calculate logits for not just the next token,
| but also n+1 -- all at the same time. If one of the n+1 logits
| is chosen, then do an infill on the skipped token for the next
| step, then resume.
|
| This could get us around the example that you gave for only a
| linear increase in the vocabulary size -- so looking an extra
| token ahead only increases vocab size by a factor of 2, and
| looking at a third token is a total factor of 3.
|
| This seems really promising!
| ljlolel wrote:
| https://arxiv.org/pdf/2404.19737
| elcomet wrote:
| I think you're not looking at this from the right perspective.
| An LLM is designed to sample text that follows the training
| distribution. It is not designed to tell you the "most likely"
| text that follows, and we don't actually want that. This would
| mean you have no diversity in your outputs.
|
| In your example, sampling a 0 in 40% of cases and a 1 in 60% of
| cases does make sense for chat applications.
|
| For applications where we _do_ care about the most likely
| sentence (e.g. question answering), then beam search helps, as
| others have mentioned.
|
| Another thing to consider is that the model can "look ahead"
| and precompute what the future tokens might be. And it can then
| use this to predict the current token. In fact, some work has
| been investigating this, such as [1].
|
| And a final note, predicting one token at a time is what we are
| doing as humans when we speak, so clearly it is not a _wrong_
| approach. We are doing this "look ahead" in our mind before
| speaking.
|
| [1] https://arxiv.org/abs/2404.00859
| tmoertel wrote:
| > And a final note, predicting one token at a time is what we
| are doing as humans when we speak...
|
| I wouldn't be surprised if we could predict token groups.
| When speaking off the cuff, people often rely on well-worn
| phrases and cliches.
| ivalm wrote:
| I think that corresponds to well worn phrases being a
| "single token"
| abakker wrote:
| Indeed. An interesting reference to this is the work
| Milman Parry did to describe the key phrases in the
| Odyssey and the queues they gave to help someone memorize
| the poem.
|
| Also, this is maybe a semantic point, but, I am not
| predicting any words I speak. Not in a statistical sense. I
| have intent behind my words, which means I have an
| abstraction of meaning that I want to convey and I assemble
| the correct words to do that. no part of that is
| "predictive"
| thwarted wrote:
| > _describe the key phrases in the Odyssey and the queues
| they gave to help someone memorize the poem._
|
| Queues of words give cues to help memorize.
| abakker wrote:
| lol. Hurray for speech to text, I guess.
| wantsanagent wrote:
| > "It is not designed to tell you the "most likely" text that
| follows,"
|
| It is exactly designed to do that. With a temperature of 0, this
| is what you are approximating. The crucial point, though, is that
| it is the most likely next word given the preceding multi-
| token context, not just the previous token.
| Xcelerate wrote:
| > It is not designed to tell you the "most likely" text that
| follows, and we don't actually want that. This would mean you
| have no diversity in your outputs.
|
| No, we specifically _do_ want "most likely" to follow; the
| goal is to approximate Solomonoff induction as well as
| possible. See this recent paper by Hutter's team:
| https://arxiv.org/pdf/2401.14953
|
| Quote from the paper:
|
| "LLMs pretrained on long-range coherent documents can learn
| new tasks from a few examples by inferring a shared latent
| concept. They can do so because in-context learning does
| implicit Bayesian inference (in line with our CTW
| experiments) and builds world representations and algorithms
| (necessary to perform SI [Solomonoff Induction]). In fact,
| one could argue that the impressive in-context generalization
| capabilities of LLMs is a sign of a rough approximation of
| Solomonoff induction."
|
| > In your example, sampling a 0 in 40% of cases and a 1 in
| 60% of cases does[n't] make sense for chat applications.
|
| I didn't say anything about sampling. A sequence prediction
| model represents a mapping between an input sequence and a
| probability distribution over all possible output sequences
| up to a certain length.
|
| My example uses a binary alphabet, but LLMs use an alphabet
| of tokens. Any chat application that expresses its output as
| a string of concatenated symbols from a given alphabet has a
| probability distribution defined over all possible output
| sequences. I'm simply comparing the fundamental limitations
| of _any_ approach to inference that restricts its outcome
| space to sequences consisting of one symbol (and then layers
| on a meta-model to generate longer sequences by repeatedly
| calling the core inference capability) vs an approach that
| performs inference over an outcome space consisting of
| sequences longer than one symbol.
| BoiledCabbage wrote:
| > 0: p=0.40 1: p=0.60 which suggests that 1 is the next bit and
| leads to a suboptimal starting point for predicting the bit
| after that. The error is even more prominent with longer
| sequences as the joint probability distribution becomes more
| unfactorizable into marginal distributions (as I would expect
| any minimal algorithmic description of real-world data to be).
|
| Can someone explain this part a bit more? I'm not seeing the
| issue. From what I see, if the first token (t1) output is a
| zero, then the next token (t2) would have probabilities 0:p=.90
| and 1:p=.10. (And t2 0/1:p= .50/.50 if t1=1)
|
| Mathematically, those line up with the initial distribution, so
| what's the concern? That's how conditional probability works.
| cgearhart wrote:
| What you've described is basically the problem with greedy
| sampling in the decoder. Many other local optimization sampling
| strategies exist (e.g., beam search) and there's been a lot of
| work on more global sampling (e.g., speculative decoding).
| ralusek wrote:
| Given that LLMs appear to, in large part, "think" by virtue of
| feeding their output back into themselves, people have consistently
| noticed that insisting that the model "think out loud" results in
| higher quality reasoning. i.e. "chain of thought" reasoning will
| contrast simply having the model answer a question directly with
| first having it write out things like:
|
| - restating what it thinks is being asked of it
|
| - expressing a high level strategy over what sort of information
| it might need in order to answer that question
|
| - stating the information it knows
|
| - describing how that information might inform its initial
| reasoning
|
| etc...
|
| I'd be concerned that going about this by having the model
| predict the next multiple tokens at any given time would
| essentially have the opposite effect.
|
| Chain of thought prompting appears to indicate that a model is
| "smarter" when it has n + m tokens than when it just has n tokens
| as input. As such, getting the next 5 tokens for a given n might
| net worse results than getting the next 1 token at n, then the
| next 1 token at n + 1, and so on.
| imtringued wrote:
| If the LLM had an affordable model it would always generate
| enough tokens for the task at hand. The fact that this
| particular method would require more tokens would be
| irrelevant. If you don't have an affordable model, then you
| would always be at the mercy of the LLM being biased towards
| answering with an estimate instead of the actual answer.
|
| Also, most speculative decoding strategies produce identical
| output compared to running the model sequentially. If the
| prediction is wrong, the token gets discarded and the speedup
| is lost.
| bravura wrote:
| I wonder if, instead of just predicting the next n tokens, it
| could also predict like 128, 512, 2048 etc tokens ahead. Thus
| learning long-term discourse structure.
| HanClinto wrote:
| Might be good to have some flexibility in where those
| particular tokens are placed, but yeah -- I could see value in
| creating a "pool" of tokens that should be used at some point
| in the future in the answer.
| lucidrains wrote:
| wow, so ProphetNet does work! I spent so much time experimenting
| with it back in the day, but just lacked the scale to see a
| positive result.
| riku_iki wrote:
| It's interesting that they got good results with the 200B,
| 0.8-epoch training set, but once they scaled it to 1T and 4
| epochs, they got degradation in the vast majority of benchmarks
| (Table 1).
| jmount wrote:
| After inventing multi-token prediction, one then invents a useful
| language-oriented hierarchy (such as sections, paragraphs,
| sentences, and words).
___________________________________________________________________
(page generated 2024-05-01 23:01 UTC)