[HN Gopher] σ-GPTs: A new approach to autoregressive models
       ___________________________________________________________________
        
       σ-GPTs: A new approach to autoregressive models
        
       Author : mehulashah
       Score  : 222 points
       Date   : 2024-06-07 13:12 UTC (9 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | hammock wrote:
       | Add to this some kind of "autofocus" for the user to click on the
       | word that is the "center" of the prompt and you've really got
       | something
        
       | skilled wrote:
        | > The main idea is to train the model to generate sequences in a
        | random order, which allows conditional density estimation,
        | infilling, and generating sequences in bursts using a novel
        | rejection sampling method.
        | 
        | > In exploring that idea, we also compared to a discrete
        | diffusion baseline, which also allows generating sequences in
        | bursts. We were surprised to see that diffusion models were able
        | to solve a path-finding task, and we made a short Twitter thread.
       | 
        | The thread in question:
       | 
       | https://nitter.poast.org/ArnaudPannatier/status/176286434739...
       | 
       | And a showcase here:
       | 
       | https://www.idiap.ch/~apannatier/sigma-gpt/
       | 
       | (excerpt taken from here: https://www.idiap.ch/~apannatier/)
        
         | 3abiton wrote:
          | I just wonder if models based on their method would make
          | hallucination even worse.
        
           | arnaudpannatier wrote:
            | Hey, I'm Arnaud, first author of the paper. The answer is a
            | bit mixed. We actually started looking into this because of a
            | repetition problem that appeared in a low-data regime for a
            | sequence generation task. Basically, the left-to-right GPT
            | got stuck repeating the same token once it had sampled the
            | same token twice in a row during generation. To mitigate
            | that, we tried generating the sequence in a random order,
            | which seemed to help: we see less of this repetition issue.
            | We initially thought that, when there isn't enough data,
            | shuffling would act like data augmentation and might actually
            | help the model reach better performance. But this is not what
            | we found in the experiments: apparently, since learning in
            | any order is a harder task, the model memorises the data
            | more.
        
         | nico wrote:
          | > We were surprised to see that diffusion models were able to
          | solve a path-finding task
         | 
         | I wonder if this type of method might allow for faster
         | solutions to the traveling salesman problem
        
       | cs702 wrote:
       | This looks _great_.
       | 
       | The authors randomly permute (i.e., shuffle) input tokens in
       | training and add two positional encodings to each token: one with
       | the token's position and another with the position of the token
       | to be predicted. Otherwise, the model is a standard
       | autoregressive GPT. The consequences of this seemingly "simple"
       | modification are significant:
       | 
       | * The authors can prompt the trained model with part of a
       | sequence and then decode the missing tokens, all at once, in
       | parallel, regardless of order -- i.e., the model can in-fill in
       | parallel.
       | 
       | * The authors can compute conditional probability densities for
       | every missing token in a sequence, again in parallel, i.e.,
       | densities for all missing tokens at once.
       | 
       | * The authors propose a rejection-sampling method for generating
       | in-fill tokens, again in parallel. Their method seems to work
       | well in practice.
       | 
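        | As a rough illustration of that input construction, here's a
        | minimal PyTorch sketch (my own, not the authors' code; summing
        | the three embeddings is an assumption):
        | 
        |     import torch
        |     import torch.nn as nn
        |     
        |     vocab_size, max_len, d_model = 1000, 128, 256
        |     
        |     # Three embeddings: the token, its own position, and the
        |     # position of the token to be predicted.
        |     tok_emb = nn.Embedding(vocab_size, d_model)
        |     pos_emb = nn.Embedding(max_len, d_model)
        |     tgt_emb = nn.Embedding(max_len, d_model)
        |     
        |     # One training sequence, presented in a shuffled order.
        |     tokens = torch.randint(0, vocab_size, (16,))
        |     order = torch.randperm(16)
        |     
        |     shuffled = tokens[order]
        |     cur_pos = order             # where each token now sits
        |     # Position predicted at each step (the wrapped-around last
        |     # step would simply be dropped from the loss).
        |     target_pos = torch.roll(order, -1)
        |     
        |     # Each input carries both positions; a causal transformer
        |     # over this sequence is trained to predict tokens[target_pos].
        |     x = tok_emb(shuffled) + pos_emb(cur_pos) + tgt_emb(target_pos)
        | 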
       | I've added this to my reading list. Thank you for sharing it on
       | HN.
        
         | thomashop wrote:
         | I don't understand how that parallel prediction can work...
         | 
         | Let's say I give it as input the sentence:
         | 
         | I . . . . . . . . happily.
         | 
         | The second word to be predicted depends on the first word.
        
           | cs702 wrote:
           | Give the model the tokens "happily" and "I", and add to each
           | input token its respective position embedding _and_ the
           | position embedding for the token to be predicted. You can do
           | this in parallel for all tokens to be predicted. The model
           | has been trained so it can predict tokens in any position.
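            | 
            | A toy sketch of that parallel query (purely illustrative:
            | the stand-in predict() below is not the paper's architecture
            | and the token ids are made up; it only shows the shape of the
            | batched call):
            | 
            |     import torch
            |     import torch.nn as nn
            |     
            |     vocab, d = 100, 32
            |     tok_e = nn.Embedding(vocab, d)
            |     pos_e = nn.Embedding(16, d)
            |     tgt_e = nn.Embedding(16, d)
            |     head = nn.Linear(d, vocab)
            |     
            |     def predict(tokens, cur_pos, target_pos):
            |         # Pool the known context, condition on the queried
            |         # target position, return next-token logits.
            |         h = (tok_e(tokens) + pos_e(cur_pos)).mean(-2)
            |         return head(h + tgt_e(target_pos))
            |     
            |     known_tokens = torch.tensor([3, 7])  # "I" ... "happily"
            |     known_pos = torch.tensor([0, 9])
            |     missing = torch.tensor([p for p in range(10)
            |                             if p not in (0, 9)])
            |     
            |     B = len(missing)
            |     logits = predict(known_tokens.expand(B, -1),
            |                      known_pos.expand(B, -1),
            |                      missing)           # all slots at once
            |     proposals = logits.argmax(-1)       # one token per slot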
        
             | hexomancer wrote:
             | Yes, but is there any guarantee that the complete sentence
             | makes sense?
        
               | entropicdrifter wrote:
               | That guarantee didn't exist with regular GPT LLMs, did
               | it? It just came about as an emergent property of
               | throwing more and more compute, training data, and
               | training time at the problem
        
               | amluto wrote:
               | I think it's effectively built in to the design. The
               | model outputs a probability distribution for the first
               | unknown token [0]. Then some code outside the model
               | chooses a token and runs the model again _with that token
               | provided to the model_. So the second output token's
               | probability distribution is automatically conditioned on
               | the first output token, etc.
               | 
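                | A bare-bones version of that loop (illustrative only;
                | it assumes a "model" callable mapping a tensor of token
                | ids to per-position logits):
                | 
                |     import torch
                |     
                |     def sample(model, prompt, n, temp=1.0):
                |         seq = list(prompt)
                |         for _ in range(n):
                |             # Distribution over the next token,
                |             # conditioned on everything chosen so far.
                |             logits = model(torch.tensor(seq))[-1]
                |             probs = torch.softmax(logits / temp, -1)
                |             nxt = torch.multinomial(probs, 1).item()
                |             seq.append(nxt)  # feed the choice back in
                |         return seq
                | 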
               | Sometimes people will attempt to parallelize this by
               | using a faster model to guess a few tokens and then
                | evaluating them as a batch with the main model to
               | determine whether the choices were good.
               | 
               | [0] Usually it outputs "logits", which become a
               | probability distribution when combined with a
               | "temperature" parameter.
        
               | qeternity wrote:
               | > I think it's effectively built in to the design.
               | 
               | It isn't. There is no guarantee that successive tokens
               | will be comprehensible.
               | 
               | > Usually it outputs "logits", which become a probability
               | distribution when combined with a "temperature"
               | parameter.
               | 
               | The logits _are_ the probability distribution (well
               | technically, you would apply softmax). Temperature is a
               | parameter for how you sample those logits in a non-greedy
               | fashion.
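                | 
                | Concretely (a tiny illustration, not tied to any
                | particular model):
                | 
                |     import torch
                |     
                |     logits = torch.tensor([2.0, 1.0, 0.1])
                |     probs = torch.softmax(logits, -1)  # the distribution
                |     greedy = probs.argmax()            # no temperature
                |     # Temperature only reshapes the sampling step.
                |     t = 0.7
                |     scaled = torch.softmax(logits / t, -1)
                |     pick = torch.multinomial(scaled, 1)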
        
               | alextheparrot wrote:
               | No, but it makes more conceptual sense given the model
               | can consider what was said before it
        
               | toxik wrote:
               | That is indeed an issue. Their sampling method rejects
               | impossible combinations.
        
             | KRAKRISMOTT wrote:
             | Isn't this bag of words all over again? Except with
             | positional hints?
        
         | taneq wrote:
          | Wow, if that works, that's wild (and it also has that "damn,
          | now you say it, it's obvious" flavour that so many really cool
          | discoveries share...)
        
         | toxik wrote:
          | This problem formulation has been around for a while; it's kind
          | of the holy grail of modeling. What is new compared to PixelCNN
          | and related models is this position embedding idea.
        
         | tripplyons wrote:
         | The only difference I see from XLNet is how they use it during
         | inference.
        
           | arnaudpannatier wrote:
            | Hey! I'm Arnaud, first author of the paper. XLNet also
            | shuffles the data during training, but it uses a masking
            | mechanism instead of the causal + double positional encoding.
            | The application also differs: XLNet is not, AFAIK, focused on
            | generation (even if it can be used for that), and the burst-
            | sampling idea is new.
        
             | tripplyons wrote:
             | Thanks for the clarification!
        
             | RivieraKid wrote:
              | Are there any obvious practical applications of this
              | algorithm for existing large (10B+) text / image models?
             | 
             | Does the rejection sampling lead to a statistically correct
             | sample from the joint probability distribution or is that
             | just a (possibly rough) approximation?
        
         | WanderPanda wrote:
         | Wait wasn't BERT all about non-causal masking aka predicting
         | words in the middle?!
        
         | nico wrote:
         | I know this is for tokens/text, but can the same concept be
         | applied to images using something like a diffusion model? And
         | then be able to upscale images arbitrarily by infilling?
        
           | gwern wrote:
           | Yes. See the related work section in the paper: there is a
           | long history of models, recently like MAE and MaskGit, which
           | predict pixels in basically arbitrary orders, and that is
           | useful because it lets you train on subsets of each image,
           | upscale/infill during generation, and so on. (If you know
           | what MAE is, that might be the fastest way to summarize OP:
           | "it's a GPT trained like a MAE".)
        
             | psb217 wrote:
             | People also often forget "orderless autoregression", which
             | was introduced a while back and has been reinvented many
             | times since. See Sec 4 (pg 8) of "Neural Autoregressive
             | Distribution Estimation"
             | [https://arxiv.org/abs/1605.02226]. The main difference
             | from current work is that this 2016 paper used MLPs and
             | convnets on fixed-length observations/sequences, so
             | sequence position is matched one-to-one with position in
             | the network's output, rather than conditioning on a
             | position embedding. Of course, Transformers make this type
             | of orderless autoregression more practical for a variety of
             | reasons -- TFs are great!
             | 
             | Key quote from Sec 4: "In this section we describe an
             | order-agnostic training procedure, DeepNADE (Uria et al.,
             | 2014), which will address both of the issues above. This
             | procedure trains a single deep neural network that can
             | assign a conditional distribution to any variable given any
             | subset of the others. This network can then provide the
             | conditionals in Equation 1 for any ordering of the input
             | observations. Therefore, the network defines a factorial
             | number of different models with shared parameters, one for
             | each of the D! orderings of the inputs. At test time, given
             | an inference task, the most convenient ordering of
             | variables can be used."
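              | 
              | A toy sketch of that order-agnostic idea (condition on a
              | random subset, predict the rest; this omits details of the
              | original DeepNADE objective, such as loss reweighting):
              | 
              |     import torch
              |     import torch.nn as nn
              |     import torch.nn.functional as F
              |     
              |     D = 8  # fixed-length binary observations
              |     net = nn.Sequential(nn.Linear(2 * D, 64), nn.ReLU(),
              |                         nn.Linear(64, D))
              |     
              |     x = torch.randint(0, 2, (32, D)).float()
              |     order = torch.rand(32, D).argsort(-1)  # random order
              |     cut = torch.randint(0, D, (32, 1))
              |     observed = (order < cut).float()       # random subset
              |     
              |     # Input: the observed values plus the mask itself.
              |     logits = net(torch.cat([x * observed, observed], -1))
              |     # Score only the dimensions that were not observed.
              |     loss = F.binary_cross_entropy_with_logits(
              |         logits, x, weight=1 - observed)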
        
         | RivieraKid wrote:
         | If there are multiple missing tokens, what's the positional
         | encoding for the "token to be predicted"?
        
           | cs702 wrote:
           | See this thread, also on this page:
           | 
           | https://news.ycombinator.com/item?id=40609689
        
         | mglikesbikes wrote:
         | Off topic, but what do you use for your reading list?
        
           | ofou wrote:
           | I use Emergent Mind[1] to keep track of new research
           | published on ArXiv. You can bookmark articles once logged in.
           | It's very useful for keeping track of articles, reading quick
           | summaries, and following conversations on various social
           | media.
           | 
           | [1]: https://www.emergentmind.com/papers/2404.09562
        
           | inhumantsar wrote:
            | Hijacking for a bit of shameless self-promotion: if you're an
            | Obsidian user, I recently built a plugin that simplifies web
           | pages, parses out metadata, and saves them to obsidian as
           | markdown files: https://github.com/inhumantsar/slurp
           | 
           | arXiv comes through a bit ugly atm but it's high on my to-do
           | list. I'm leveraging the library that Firefox uses for reader
           | mode, so most sites come through quite well. A lot of my work
           | right now is expanding their metadata support and fixing
           | parser issues.
        
       | klysm wrote:
       | Encoding the sequence like that seems like a really clever
       | workaround for some of the data dependency limitations of GPT.
        
       | szvsw wrote:
          | Wow, really cool concept! I wonder if this starts to exhibit
          | dynamics similar to what we see in image generation models,
          | where structure/detail emerges in one region of the image and
          | then the surrounding areas start to resolve themselves into
          | place. That kind of behavior seems particularly useful for
          | longer reasoning/logic/planning, where the big ideas might
          | become apparent first, and then the interstitial details and
          | text just naturally fill in...
        
         | byteknight wrote:
         | The process you describe is referred to as diffusion
        
           | szvsw wrote:
           | Yep yep I know, but I was trying to suggest something
           | diffusion-like occurring with a language model through a
           | totally separate mechanism that does not rely on the
           | denoising process (at least not literally).
        
           | immibis wrote:
           | I'm fairly certain diffusion refers to the overall
           | architecture, not the emergent self-organization process.
        
       | bigyikes wrote:
       | Is this applying the learnings from vision transformers to
       | language transformers?
       | 
       | If I understand correctly, vision models split an image into
       | tiles and append a positional encoding to each so the model can
       | understand the relative position of each tile.
       | 
       | I admittedly only read the abstract - a lot of this stuff goes
       | over my head - but it seems like this paper proposes a similar
       | idea, but for 1D instead of 2D?
        
         | seurimas wrote:
         | Positional encoding is standard for transformers of all
         | stripes. They introduce a seemingly novel, redundant positional
         | encoding scheme. It's more difficult to train, but seems to
         | enable producing multiple tokens at once (i.e. you could get an
          | answer that is N tokens long in N/x steps instead of N steps).
        
       | optimalsolver wrote:
       | Yann LeCun would say [0] that it's autoregression itself that's
       | the problem, and ML of this type will never bring us anywhere
       | near AGI.
       | 
       | At the very least you can't solve the hallucination problem while
       | still in the autoregression paradigm.
       | 
       | [0] https://twitter.com/ylecun/status/1640122342570336267
        
         | vessenes wrote:
         | I think this method might not be amenable to the exponential
         | divergence argument actually.
         | 
         | Depending on token sampling methods, this one could look at a
         | proposed generation as a whole and revise it. I'm not sure the
         | current token sampling method they propose does this right now,
         | but I think it's possible with the information they get out of
         | the probabilities.
        
           | modeless wrote:
           | Yes, to me this seems to address LeCun's objection, or at
           | least point the way to something that does. It seems possible
           | to modify this into something that can identify and correct
           | its own mistakes during the sampling process.
        
         | andreasmetsala wrote:
          | Does everything have to take us towards AGI? If someone makes
          | an LLM that's faster (cheaper) to run, then that has value.
         | 
         | I don't think we _want_ AGI for most tasks unless the intent is
         | to produce suffering in sentient beings.
        
           | ben_w wrote:
           | > I don't think we _want_ AGI for most tasks unless the
           | intent is to produce suffering in sentient beings.
           | 
           | Each letter of "AGI" means different things to different
           | people, and some use the combination to mean something not
           | present in any of the initials.
           | 
           | The definition OpenAI uses is for economic impact, so for
           | them, they do want what they call AGI for most tasks.
           | 
           | I have the opposite problem with the definition, as for me,
           | InstructGPT met my long-standing definition of "artificial
           | intelligence" while suddenly demonstrating generality in that
           | it could perform arbitrary tasks rather than just next-token
           | prediction... but nobody else seems to like that, and I'm a
           | linguistic descriptivist, so I have to accept words aren't
           | being used the way I expected and adapt rather than huff.
        
         | cs702 wrote:
         | LeCun may or may not be right, but I'm not sure this is
         | relevant to the discussion here.
         | 
         | The OP's authors make no claims about how their work might help
         | get us closer to AGI.
         | 
         | They simply enable autoregressive LLMs to do new things that
         | were not possible before.
        
         | TheEzEzz wrote:
         | LeCun is very simply wrong in his argument here. His proof
         | requires that all decoded tokens are conditionally independent,
         | or at least that the chance of a wrong next token is
         | independent. This is not the case.
         | 
         | Intuitively, some tokens are harder than others. There may be
         | "crux" tokens in an output, after which the remaining tokens
         | are substantially easier. It's also possible to recover from an
         | incorrect token auto-regressively, by outputting tokens like
         | "actually no..."
        
       | behnamoh wrote:
        | Title is incorrect: it's σ not Σ.
        
         | modeless wrote:
          | Σ is uppercase σ. Maybe this happened automatically? Pretty
         | funny if so. Correct in a Greek context; clearly incorrect in a
         | math context.
        
           | mehulashah wrote:
           | Yes, HN automatically did that.
        
             | modeless wrote:
             | For future reference, it is possible to edit the titles of
             | stories you've submitted. This allows you to correct any
             | errors introduced by HN's title rewriting heuristics at
             | submission time, without waiting for a moderator to do it
             | for you. Just like for comments, though, the edit window is
             | time limited. For comments the window is two hours. I don't
             | know if it's the same for story titles.
        
       | smusamashah wrote:
       | There is a video on twitter showing it generating text (looks a
       | bit like image diffusion)
       | 
       | https://x.com/ArnaudPannatier/status/1799055129829839166
        
         | lukasb wrote:
         | Weird that they chose an example that ended up somewhat
         | nonsensical.
        
       | mbil wrote:
       | I wonder if this would help especially for computer code
       | generation, where what is output at a given step may materially
       | depend on what would be written at a later step.
        
         | mbil wrote:
          | And, though it might be prohibitively slow, perhaps integrate
          | some kind of linting or syntax checking as part of the
          | rejection sampling: i.e., burst-sample N potential generated
          | snippets in parallel and reject those that are syntactically
          | invalid.
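          | 
          | For Python output, that filter could be as cheap as an
          | ast.parse call (a toy illustration, not something from the
          | paper):
          | 
          |     import ast
          |     
          |     def syntactically_valid(snippet: str) -> bool:
          |         try:
          |             ast.parse(snippet)
          |             return True
          |         except SyntaxError:
          |             return False
          |     
          |     candidates = ["def f(x): return x + 1",
          |                   "def f(x) return x +"]
          |     survivors = [c for c in candidates
          |                  if syntactically_valid(c)]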
        
       ___________________________________________________________________
       (page generated 2024-06-07 23:00 UTC)