[HN Gopher] Lossless Acceleration of LLM via Adaptive N-Gram Par...
___________________________________________________________________
Lossless Acceleration of LLM via Adaptive N-Gram Parallel Decoding
Author : PaulHoule
Score : 79 points
Date : 2024-04-21 18:02 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| SonOfLilit wrote:
| Good idea. Generate 8 tokens with a small model, then give them
| to a large model as one batch (much faster), and if tokens 1..3
| are in agreement but 4..8 are not, keep only 1..4 (token 4 being
| the large model's own prediction) and start again. Most tokens
| are easy to guess, so you get roughly a 3.5x gain.
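|
| A minimal sketch of that accept/keep rule in Python, with
| made-up token lists standing in for the two models' outputs:
|
|     def accept_drafts(draft, verified):
|         """Keep the longest agreeing prefix of the draft, plus
|         the large model's correction at the first mismatch."""
|         accepted = []
|         for d, v in zip(draft, verified):
|             if d == v:
|                 accepted.append(d)
|             else:
|                 accepted.append(v)  # large model's token wins
|                 break
|         return accepted
|
|     # Toy run: draft tokens 1..3 agree, token 4 does not, so
|     # tokens 1..4 are kept (token 4 from the large model).
|     draft    = ["the", "cat", "sat", "on", "a", "mat"]
|     verified = ["the", "cat", "sat", "in", "the", "hat"]
|     print(accept_drafts(draft, verified))
|     # -> ['the', 'cat', 'sat', 'in']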
|
| I do have a feeling of déjà vu, like I've seen this before on HN.
| sdenton4 wrote:
| Very reminiscent of Parallel WaveNet, which sped things up by
| generating multiple audio samples at a time.
| milkey_mouse wrote:
| > I do have a feeling of déjà vu, like I've seen this before
| > on HN.
|
| You're either thinking of speculative decoding more generally,
| or Medusa: https://arxiv.org/abs/2401.10774
| outofpaper wrote:
| I think a number of us are using the method and these folks
| decided to do a paper and release their findings.
| johntb86 wrote:
| They mention previous work on speculative decoding using
| similar techniques, but "ANPD dynamically generates draft
| outputs via an adaptive N-gram module using real-time
| statistics, after which the drafts are verified by the LLM.
| This characteristic is exactly the difference between ANPD and
| the previous speculative decoding methods."
| MyFirstSass wrote:
| If this is "plug and play", can it be added to, say, llama.cpp
| and give the ~3.67x speedup to existing models, or is there
| some complication?
| jncraton wrote:
| The speedup would not be that high in practice for folks
| already using speculative decoding[1]. ANPD is similar but uses
| a simpler and faster drafting approach. These two enhancements
| can't be meaningfully stacked. Here's how the paper describes
| it:
|
| > ANPD dynamically generates draft outputs via an adaptive
| N-gram module using real-time statistics, after which the
| drafts are verified by the LLM. This characteristic is exactly
| the difference between ANPD and the previous speculative
| decoding methods.
|
| ANPD does provide a more general-purpose solution to drafting
| that does not require training, loading, and running draft
| LLMs.
|
| [1] https://github.com/ggerganov/llama.cpp/pull/2926
| nsagent wrote:
| How does this differ from the 2018 NeurIPS paper, Blockwise
| Parallel Decoding for Deep Autoregressive Models?
|
| https://arxiv.org/abs/1811.03115
| kristjansson wrote:
| This is speculative decoding with an n-gram Markov chain
| instead of a weaker transformer model in the "speculating"
| position.
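|
| A rough sketch of an n-gram Markov chain in that role (counts
| built online from the tokens seen so far; illustrative only,
| not the paper's actual module):
|
|     from collections import defaultdict
|
|     class NGramDrafter:
|         """Order-n Markov chain whose counts are updated as
|         tokens are accepted, so no separate draft model is
|         needed."""
|         def __init__(self, n=3):
|             self.n = n
|             self.counts = defaultdict(lambda: defaultdict(int))
|
|         def update(self, tokens):
|             # Record (n-1)-gram -> next-token statistics.
|             for i in range(len(tokens) - self.n + 1):
|                 ctx = tuple(tokens[i:i + self.n - 1])
|                 self.counts[ctx][tokens[i + self.n - 1]] += 1
|
|         def draft(self, tokens, k=8):
|             out, drafted = list(tokens), []
|             for _ in range(k):
|                 ctx = tuple(out[-(self.n - 1):])
|                 nxt = self.counts.get(ctx)
|                 if not nxt:
|                     break  # no statistics yet; let the LLM decode
|                 tok = max(nxt, key=nxt.get)  # most frequent next
|                 drafted.append(tok)
|                 out.append(tok)
|             return drafted
|
|     drafter = NGramDrafter(n=3)
|     drafter.update("the cat sat on the mat and the cat".split())
|     print(drafter.draft("and the cat".split(), k=4))
|     # -> ['sat', 'on', 'the', 'mat']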
| huevosabio wrote:
| Thanks, this is a great, succinct way of summarizing the paper.
| fspeech wrote:
| The speedup here would be very dependent on the context -- the
| kind of texts that the models are working with, as it proposes a
| rather naive n-gram generator (maybe I should say it does not
| provide any details on this critical component, instead simply
| refers to Jurafsky textbook). It might not be robust. Instead
| Apple's work on using the same model to produce n-gram lookahead
| is robust -- the n-gram generator works as well as the model
| itself: https://arxiv.org/abs/2402.11131
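|
| For contrast, one common model-free way to draft from text you
| already have is prompt lookup: reuse the continuation that
| followed the previous occurrence of the current n-gram. A tiny
| sketch (generic illustration, not necessarily the mechanism of
| the linked paper):
|
|     def lookup_draft(tokens, n=2, k=8):
|         """Find the most recent earlier occurrence of the last
|         n tokens and propose the k tokens that followed it."""
|         suffix = tokens[-n:]
|         # Scan backwards so the most recent match wins.
|         for i in range(len(tokens) - n - 1, -1, -1):
|             if tokens[i:i + n] == suffix:
|                 return tokens[i + n:i + n + k]
|         return []  # no match: nothing to speculate
|
|     tokens = "A B C D E A B".split()
|     print(lookup_draft(tokens, n=2, k=3))  # -> ['C', 'D', 'E']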
___________________________________________________________________
(page generated 2024-04-21 23:00 UTC)