[HN Gopher] Lossless Acceleration of LLM via Adaptive N-Gram Parallel Decoding
       ___________________________________________________________________
        
       Lossless Acceleration of LLM via Adaptive N-Gram Parallel Decoding
        
       Author : PaulHoule
       Score  : 79 points
       Date   : 2024-04-21 18:02 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | SonOfLilit wrote:
        | Good idea. Generate 8 tokens with a small model, then give them
        | to a large model as one batch (much faster); if tokens 1..3
        | agree but 4..8 do not, take only 1..4 and start again. Most
        | tokens are easy to guess, so you'll get roughly 3.5x gains.
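        | 
        | As a rough sketch of that accept-and-correct step (toy code, not
        | from the paper): the large model scores all drafted positions in
        | one batched forward pass, and you keep the agreeing prefix plus
        | the large model's own token at the first mismatch, so every pass
        | advances by at least one token.
        | 
        |     def accept_drafts(draft_tokens, verify_tokens):
        |         # draft_tokens:  k tokens proposed by the small model
        |         # verify_tokens: the large model's argmax at each of the
        |         #                drafted positions, from one batched pass
        |         accepted = []
        |         for d, v in zip(draft_tokens, verify_tokens):
        |             if d == v:
        |                 accepted.append(d)  # draft agrees with the LLM
        |             else:
        |                 accepted.append(v)  # keep the LLM's correction
        |                 break               # and drop the rest
        |         return accepted             # advances by >= 1 token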
       | 
        | I do have a feeling of deja vu, like I've seen this before on HN.
        
         | sdenton4 wrote:
          | Very reminiscent of Parallel WaveNet, which sped things up by
          | generating multiple audio samples at a time.
        
         | milkey_mouse wrote:
          | > I do have a feeling of deja vu, like I've seen this before
          | on HN.
          | 
          | You're either thinking of speculative decoding more generally,
          | or of Medusa: https://arxiv.org/abs/2401.10774
        
         | outofpaper wrote:
          | I think a number of us have been using this method already;
          | these folks decided to write up a paper and release their
          | findings.
        
         | johntb86 wrote:
         | They mention previous work on speculative decoding using
         | similar techniques, but "ANPD dynamically generates draft
         | outputs via an adaptive N-gram module using real-time
         | statistics, after which the drafts are verified by the LLM.
         | This characteristic is exactly the difference between ANPD and
         | the previous speculative decoding methods."
        
       | MyFirstSass wrote:
       | If this is "plug and play" can it be added to say llama.cpp and
       | give ~3.67 speedup to existing models or is there some
       | complication?
        
         | jncraton wrote:
         | The speedup would not be that high in practice for folks
         | already using speculative decoding[1]. ANPD is similar but uses
         | a simpler and faster drafting approach. These two enhancements
         | can't be meaningfully stacked. Here's how the paper describes
         | it:
         | 
         | > ANPD dynamically generates draft outputs via an adaptive
         | N-gram module using real-time statistics, after which the
         | drafts are verified by the LLM. This characteristic is exactly
         | the difference between ANPD and the previous speculative
         | decoding methods.
         | 
         | ANPD does provide a more general-purpose solution to drafting
         | that does not require training, loading, and running draft
         | LLMs.
         | 
         | [1] https://github.com/ggerganov/llama.cpp/pull/2926
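          | 
          | As a minimal sketch of what such a drafter can look like
          | (purely illustrative; the paper gives few details on its
          | n-gram module, so the greedy most-frequent-continuation rule
          | here is an assumption): keep counts of which token follows
          | each (n-1)-token prefix in the text generated so far, and
          | propose continuations until the statistics run out.
          | 
          |     from collections import Counter, defaultdict
          | 
          |     class NGramDrafter:
          |         def __init__(self, n=3):
          |             self.n = n
          |             # (n-1)-token prefix -> Counter of next tokens
          |             self.stats = defaultdict(Counter)
          | 
          |         def update(self, tokens):
          |             # refresh statistics from the tokens seen so far
          |             for i in range(len(tokens) - self.n + 1):
          |                 prefix = tuple(tokens[i:i + self.n - 1])
          |                 nxt = tokens[i + self.n - 1]
          |                 self.stats[prefix][nxt] += 1
          | 
          |         def draft(self, context, k=8):
          |             # propose up to k tokens for the LLM to verify
          |             out = list(context)
          |             for _ in range(k):
          |                 prefix = tuple(out[-(self.n - 1):])
          |                 if prefix not in self.stats:
          |                     break   # nothing seen for this prefix yet
          |                 nxt, _ = self.stats[prefix].most_common(1)[0]
          |                 out.append(nxt)
          |             return out[len(context):]   # drafted tokens only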
        
       | nsagent wrote:
       | How does this differ from the 2018 NeurIPS paper, Blockwise
       | Parallel Decoding for Deep Autoregressive Models?
       | 
       | https://arxiv.org/abs/1811.03115
        
       | kristjansson wrote:
        | This is speculative decoding with an n-gram Markov chain instead
        | of a weaker transformer model in the "speculating" position.
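        | 
        | Concretely, the loop looks like ordinary speculative decoding
        | with the n-gram model in the drafter slot. A compressed sketch
        | (the drafter object and the llm_argmax callable are hypothetical
        | stand-ins, not anything from the paper):
        | 
        |     def generate(prompt, llm_argmax, drafter, max_new=256, k=8):
        |         # llm_argmax(tokens, draft) returns the large model's
        |         # argmax after `tokens` and after each drafted token,
        |         # i.e. len(draft) + 1 tokens from one batched pass
        |         out = list(prompt)
        |         while len(out) - len(prompt) < max_new:
        |             drafter.update(out)       # adapt stats on the fly
        |             draft = drafter.draft(out, k)
        |             verified = llm_argmax(out, draft)
        |             accepted = []
        |             for i, v in enumerate(verified):
        |                 accepted.append(v)    # always take the LLM token
        |                 if i < len(draft) and draft[i] != v:
        |                     break             # stop at first mismatch
        |             out += accepted
        |         return out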
        
         | huevosabio wrote:
         | Thanks, this is a great, succinct way of summarizing the paper.
        
       | fspeech wrote:
        | The speedup here will depend heavily on the context, i.e. the
        | kind of text the model is working with, because the paper
        | proposes a rather naive n-gram generator (or rather, it gives no
        | details on this critical component and simply refers to the
        | Jurafsky textbook). It might not be robust. In contrast, Apple's
        | work on using the same model to produce the n-gram lookahead is
        | robust, since there the n-gram generator is as good as the model
        | itself: https://arxiv.org/abs/2402.11131
        
       ___________________________________________________________________
       (page generated 2024-04-21 23:00 UTC)