[HN Gopher] σ-GPTs: A new approach to autoregressive models
___________________________________________________________________
σ-GPTs: A new approach to autoregressive models
Author : mehulashah
Score : 222 points
Date : 2024-06-07 13:12 UTC (9 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| hammock wrote:
| Add to this some kind of "autofocus" for the user to click on the
| word that is the "center" of the prompt and you've really got
| something
| skilled wrote:
| > The main idea is to train the model to generate sequences in a
| random order, which allows conditional density estimation,
| infilling and generating sequences in bursts using a novel
| rejection sampling method.
|
| > In exploring that idea, we also compared to a discrete
| diffusion baseline, which also allows generating sequences in
| bursts. We were surprised to see that diffusion models were able
| to solve the path-finding task, and we made a short Twitter thread
|
| The said thread:
|
| https://nitter.poast.org/ArnaudPannatier/status/176286434739...
|
| And a showcase here:
|
| https://www.idiap.ch/~apannatier/sigma-gpt/
|
| (excerpt taken from here: https://www.idiap.ch/~apannatier/)
| 3abiton wrote:
| I just wonder if such models based on their method would make
| hallucination even worse.
| arnaudpannatier wrote:
| Hey, I'm Arnaud, first author of the paper. The answer is a
| bit mixed. We actually started looking into this because of a
| repetition problem that appeared in a low-data regime for a
| sequence generation task. Basically, the left-to-right GPT
| got stuck repeating the same token once it had sampled the
| same token twice in a row during generation. To mitigate
| that, we tried generating the sequence in a random order,
| which seemed to help: we see less of this repetition issue.
| We initially thought that when we don't have enough data,
| shuffling would act like data augmentation and might actually
| help the model reach better performance. But this is not what
| we found in the experiments: apparently, since learning in
| any order is a harder task, the model memorises the data
| more.
| nico wrote:
| > We were surprised to see that diffusion models were able to
| solve the path-finding task
|
| I wonder if this type of method might allow for faster
| solutions to the traveling salesman problem
| cs702 wrote:
| This looks _great_.
|
| The authors randomly permute (i.e., shuffle) input tokens in
| training and add two positional encodings to each token: one with
| the token's position and another with the position of the token
| to be predicted. Otherwise, the model is a standard
| autoregressive GPT. The consequences of this seemingly "simple"
| modification are significant:
|
| * The authors can prompt the trained model with part of a
| sequence and then decode the missing tokens, all at once, in
| parallel, regardless of order -- i.e., the model can in-fill in
| parallel.
|
| * The authors can compute conditional probability densities for
| every missing token in a sequence, again in parallel, i.e.,
| densities for all missing tokens at once.
|
| * The authors propose a rejection-sampling method for generating
| in-fill tokens, again in parallel. Their method seems to work
| well in practice.
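|
| For concreteness, here is a minimal, hypothetical PyTorch sketch
| of the double positional encoding as I read it (names, shapes,
| and the wrap-around at the last step are my own, not the
| authors' code):
|
|     import torch
|     import torch.nn as nn
|
|     vocab_size, seq_len, d_model = 1000, 16, 64
|     tok_emb = nn.Embedding(vocab_size, d_model)
|     pos_emb = nn.Embedding(seq_len, d_model)  # position of the input token
|     tgt_emb = nn.Embedding(seq_len, d_model)  # position of the token to predict
|
|     tokens = torch.randint(0, vocab_size, (seq_len,))
|     order = torch.randperm(seq_len)        # random generation order
|
|     shuffled = tokens[order]               # tokens in the sampled order
|     cur_pos = order                        # where each input token sits
|     nxt_pos = torch.roll(order, -1)        # position to be predicted next
|
|     x = tok_emb(shuffled) + pos_emb(cur_pos) + tgt_emb(nxt_pos)
|     # x then goes through a standard causal transformer, trained to
|     # predict the token at nxt_pos[t] from the prefix shuffled[:t+1]
|     # (the final wrapped-around step would be dropped in practice).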
|
| I've added this to my reading list. Thank you for sharing it on
| HN.
| thomashop wrote:
| I don't understand how that parallel prediction can work...
|
| Let's say I give it as input the sentence:
|
| I . . . . . . . . happily.
|
| The second word to be predicted depends on the first word.
| cs702 wrote:
| Give the model the tokens "happily" and "I", and add to each
| input token its respective position embedding _and_ the
| position embedding for the token to be predicted. You can do
| this in parallel for all tokens to be predicted. The model
| has been trained so it can predict tokens in any position.
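|
| Concretely, a rough sketch of my reading (hypothetical embedding
| tables like those in the sketch above; this is not the paper's
| exact mechanism):
|
|     import torch
|     import torch.nn as nn
|
|     d_model = 64
|     tok_emb = nn.Embedding(1000, d_model)
|     pos_emb = nn.Embedding(16, d_model)  # position of each known token
|     tgt_emb = nn.Embedding(16, d_model)  # position we want predicted
|
|     known_tok = torch.tensor([11, 42])   # made-up ids for "I", "happily"
|     known_pos = torch.tensor([0, 9])
|     missing = torch.arange(1, 9)         # positions 1..8 to fill in
|
|     ctx = tok_emb(known_tok) + pos_emb(known_pos)               # (2, d)
|     queries = ctx.unsqueeze(0) + tgt_emb(missing).unsqueeze(1)  # (8, 2, d)
|     # Feeding `queries` through the trained model as one batch gives a
|     # next-token distribution for every missing position in parallel.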
| hexomancer wrote:
| Yes, but is there any guarantee that the complete sentence
| makes sense?
| entropicdrifter wrote:
| That guarantee didn't exist with regular GPT LLMs, did
| it? It just came about as an emergent property of
| throwing more and more compute, training data, and
| training time at the problem
| amluto wrote:
| I think it's effectively built in to the design. The
| model outputs a probability distribution for the first
| unknown token [0]. Then some code outside the model
| chooses a token and runs the model again _with that token
| provided to the model_. So the second output token's
| probability distribution is automatically conditioned on
| the first output token, etc.
|
| Sometimes people will attempt to parallelize this by
| using a faster model to guess a few tokens and then
| evaluating them as a batch with the main model to
| determine whether the choices were good.
|
| [0] Usually it outputs "logits", which become a
| probability distribution when combined with a
| "temperature" parameter.
| qeternity wrote:
| > I think it's effectively built in to the design.
|
| It isn't. There is no guarantee that successive tokens
| will be comprehensible.
|
| > Usually it outputs "logits", which become a probability
| distribution when combined with a "temperature"
| parameter.
|
| The logits _are_ the probability distribution (well
| technically, you would apply softmax). Temperature is a
| parameter for how you sample those logits in a non-greedy
| fashion.
| alextheparrot wrote:
| No, but it makes more conceptual sense given the model
| can consider what was said before it
| toxik wrote:
| That is indeed an issue. Their sampling method rejects
| impossible combinations.
| KRAKRISMOTT wrote:
| Isn't this bag of words all over again? Except with
| positional hints?
| taneq wrote:
| Wow, if that works that's wild (and it also has that "damn,
| now that you say it, it's obvious" flavour that so many really
| cool discoveries share...)
| toxik wrote:
| This problem formulation has been around for a while, it's kind
| of the holy grail of modeling. What is new compared to PixelCNN
| and related is this position embedding idea.
| tripplyons wrote:
| The only difference I see from XLNet is how they use it during
| inference.
| arnaudpannatier wrote:
| Hey! I'm Arnaud, first author of the paper. XLNet also
| shuffles the data during training, but they use a masking
| mechanism instead of the causal + double positional encoding.
| The application differs: XLNet is not, AFAIK, focused on
| generation (even if it can be used for that), and the burst-
| sampling idea is new.
| tripplyons wrote:
| Thanks for the clarification!
| RivieraKid wrote:
| Are there any obvious practical applications of this
| algorithm for existing large (10B+) text / image models?
|
| Does the rejection sampling lead to a statistically correct
| sample from the joint probability distribution or is that
| just a (possibly rough) approximation?
| WanderPanda wrote:
| Wait wasn't BERT all about non-causal masking aka predicting
| words in the middle?!
| nico wrote:
| I know this is for tokens/text, but can the same concept be
| applied to images using something like a diffusion model? And
| then be able to upscale images arbitrarily by infilling?
| gwern wrote:
| Yes. See the related work section in the paper: there is a
| long history of models, recently like MAE and MaskGit, which
| predict pixels in basically arbitrary orders, and that is
| useful because it lets you train on subsets of each image,
| upscale/infill during generation, and so on. (If you know
| what MAE is, that might be the fastest way to summarize OP:
| "it's a GPT trained like a MAE".)
| psb217 wrote:
| People also often forget "orderless autoregression", which
| was introduced a while back and has been reinvented many
| times since. See Sec 4 (pg 8) of "Neural Autoregressive
| Distribution Estimation"
| [https://arxiv.org/abs/1605.02226]. The main difference
| from current work is that this 2016 paper used MLPs and
| convnets on fixed-length observations/sequences, so
| sequence position is matched one-to-one with position in
| the network's output, rather than conditioning on a
| position embedding. Of course, Transformers make this type
| of orderless autoregression more practical for a variety of
| reasons -- TFs are great!
|
| Key quote from Sec 4: "In this section we describe an
| order-agnostic training procedure, DeepNADE (Uria et al.,
| 2014), which will address both of the issues above. This
| procedure trains a single deep neural network that can
| assign a conditional distribution to any variable given any
| subset of the others. This network can then provide the
| conditionals in Equation 1 for any ordering of the input
| observations. Therefore, the network defines a factorial
| number of different models with shared parameters, one for
| each of the D! orderings of the inputs. At test time, given
| an inference task, the most convenient ordering of
| variables can be used."
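|
| A minimal sketch of that order-agnostic objective (my own
| illustration, not the paper's code): mask a random subset of a
| fixed-length input and train the network to predict every masked
| variable, with output slot i always tied to input position i.
|
|     import torch
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     D, vocab = 16, 10   # fixed-length sequence, small token vocabulary
|     net = nn.Sequential(nn.Linear(2 * D, 256), nn.ReLU(),
|                         nn.Linear(256, D * vocab))
|
|     x = torch.randint(0, vocab, (D,))      # one training example
|     observed = torch.rand(D) < 0.5         # random subset of observed slots
|
|     inp = torch.cat([x.float() * observed, observed.float()])
|     logits = net(inp).view(D, vocab)
|     loss = F.cross_entropy(logits[~observed], x[~observed])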
| RivieraKid wrote:
| If there are multiple missing tokens, what's the positional
| encoding for the "token to be predicted"?
| cs702 wrote:
| See this thread, also on this page:
|
| https://news.ycombinator.com/item?id=40609689
| mglikesbikes wrote:
| Off topic, but what do you use for your reading list?
| ofou wrote:
| I use Emergent Mind[1] to keep track of new research
| published on ArXiv. You can bookmark articles once logged in.
| It's very useful for keeping track of articles, reading quick
| summaries, and following conversations on various social
| media.
|
| [1]: https://www.emergentmind.com/papers/2404.09562
| inhumantsar wrote:
| hijacking for a bit of shameless self promotion: if you're an
| obsidian user, I recently built a plugin that simplifies web
| pages, parses out metadata, and saves them to obsidian as
| markdown files: https://github.com/inhumantsar/slurp
|
| arXiv comes through a bit ugly atm but it's high on my to-do
| list. I'm leveraging the library that Firefox uses for reader
| mode, so most sites come through quite well. A lot of my work
| right now is expanding their metadata support and fixing
| parser issues.
| klysm wrote:
| Encoding the sequence like that seems like a really clever
| workaround for some of the data dependency limitations of GPT.
| szvsw wrote:
| Wow, really cool concept! I wonder if this starts to exhibit
| dynamics similar to what we see in image generation models, where
| structure/detail emerges in one region of the image and then the
| surrounding areas start to resolve themselves into place. That
| kind of behavior seems particularly useful for longer
| reasoning/logic/planning, where the big ideas might become
| apparent first, and then the interstitial details and text just
| naturally fill in...
| byteknight wrote:
| The process you describe is referred to as diffusion
| szvsw wrote:
| Yep yep I know, but I was trying to suggest something
| diffusion-like occurring with a language model through a
| totally separate mechanism that does not rely on the
| denoising process (at least not literally).
| immibis wrote:
| I'm fairly certain diffusion refers to the overall
| architecture, not the emergent self-organization process.
| bigyikes wrote:
| Is this applying the learnings from vision transformers to
| language transformers?
|
| If I understand correctly, vision models split an image into
| tiles and append a positional encoding to each so the model can
| understand the relative position of each tile.
|
| I admittedly only read the abstract - a lot of this stuff goes
| over my head - but it seems like this paper proposes a similar
| idea, but for 1D instead of 2D?
| seurimas wrote:
| Positional encoding is standard for transformers of all
| stripes. They introduce a seemingly novel, redundant positional
| encoding scheme. It's more difficult to train, but seems to
| enable producing multiple tokens at once (i.e. you could get an
| answer that is N tokens long in N/x steps instead of N steps).
| optimalsolver wrote:
| Yann LeCun would say [0] that it's autoregression itself that's
| the problem, and ML of this type will never bring us anywhere
| near AGI.
|
| At the very least you can't solve the hallucination problem while
| still in the autoregression paradigm.
|
| [0] https://twitter.com/ylecun/status/1640122342570336267
| vessenes wrote:
| I think this method might not actually be susceptible to the
| exponential divergence argument.
|
| Depending on token sampling methods, this one could look at a
| proposed generation as a whole and revise it. I'm not sure the
| current token sampling method they propose does this right now,
| but I think it's possible with the information they get out of
| the probabilities.
| modeless wrote:
| Yes, to me this seems to address LeCun's objection, or at
| least point the way to something that does. It seems possible
| to modify this into something that can identify and correct
| its own mistakes during the sampling process.
| andreasmetsala wrote:
| Does everything have to take us towards AGI? If someone makes an
| LLM that's faster (cheaper) to run, then that has value.
|
| I don't think we _want_ AGI for most tasks unless the intent is
| to produce suffering in sentient beings.
| ben_w wrote:
| > I don't think we _want_ AGI for most tasks unless the
| intent is to produce suffering in sentient beings.
|
| Each letter of "AGI" means different things to different
| people, and some use the combination to mean something not
| present in any of the initials.
|
| The definition OpenAI uses is for economic impact, so for
| them, they do want what they call AGI for most tasks.
|
| I have the opposite problem with the definition, as for me,
| InstructGPT met my long-standing definition of "artificial
| intelligence" while suddenly demonstrating generality in that
| it could perform arbitrary tasks rather than just next-token
| prediction... but nobody else seems to like that, and I'm a
| linguistic descriptivist, so I have to accept words aren't
| being used the way I expected and adapt rather than huff.
| cs702 wrote:
| LeCun may or may not be right, but I'm not sure this is
| relevant to the discussion here.
|
| The OP's authors make no claims about how their work might help
| get us closer to AGI.
|
| They simply enable autoregressive LLMs to do new things that
| were not possible before.
| TheEzEzz wrote:
| LeCun is very simply wrong in his argument here. His proof
| requires that all decoded tokens are conditionally independent,
| or at least that the chance of a wrong next token is
| independent. This is not the case.
|
| Intuitively, some tokens are harder than others. There may be
| "crux" tokens in an output, after which the remaining tokens
| are substantially easier. It's also possible to recover from an
| incorrect token auto-regressively, by outputting tokens like
| "actually no..."
| behnamoh wrote:
| Title is incorrect: it's σ not Σ.
| modeless wrote:
| Σ is uppercase σ. Maybe this happened automatically? Pretty
| funny if so. Correct in a Greek context; clearly incorrect in a
| math context.
| mehulashah wrote:
| Yes, HN automatically did that.
| modeless wrote:
| For future reference, it is possible to edit the titles of
| stories you've submitted. This allows you to correct any
| errors introduced by HN's title rewriting heuristics at
| submission time, without waiting for a moderator to do it
| for you. Just like for comments, though, the edit window is
| time limited. For comments the window is two hours. I don't
| know if it's the same for story titles.
| smusamashah wrote:
| There is a video on twitter showing it generating text (looks a
| bit like image diffusion)
|
| https://x.com/ArnaudPannatier/status/1799055129829839166
| lukasb wrote:
| Weird that they chose an example that ended up somewhat
| nonsensical.
| mbil wrote:
| I wonder if this would help especially for computer code
| generation, where what is output at a given step may materially
| depend on what would be written at a later step.
| mbil wrote:
| And, though it might be prohibitively slow, perhaps integrate
| some kind of linting or syntax checking as part of the rejection
| sampling, i.e. burst-sample N potential snippets in parallel and
| reject those that are syntactically invalid.
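|
| For Python output, a cheap hypothetical filter (burst_sample here
| is a stand-in for whatever the parallel sampler ends up being):
|
|     import ast
|
|     def syntactically_valid(snippet: str) -> bool:
|         # keep only candidates that at least parse
|         try:
|             ast.parse(snippet)
|             return True
|         except SyntaxError:
|             return False
|
|     # candidates = burst_sample(model, prompt, n=16)  # hypothetical
|     # candidates = [c for c in candidates if syntactically_valid(c)]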
___________________________________________________________________
(page generated 2024-06-07 23:00 UTC)