[HN Gopher] Consistency LLM: converting LLMs to parallel decoder...
___________________________________________________________________
Consistency LLM: converting LLMs to parallel decoders accelerates
inference 3.5x
Author : zhisbug
Score : 183 points
Date : 2024-05-08 19:55 UTC (3 hours ago)
(HTM) web link (hao-ai-lab.github.io)
(TXT) w3m dump (hao-ai-lab.github.io)
| toxik wrote:
| Interesting stuff. I guess the idea has occurred to many, but here
| it is well written and presented.
| andy12_ wrote:
| At first I thought that this was another Medusa-like paper, simply
| using more unembed heads for guessing subsequent tokens, but damn,
| not at all. This is amazing. And it doesn't even use extra
| parameters; it's just an auxiliary training loss.
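|
| If I read the blog post right, the objective is roughly the
| following sketch (HF/torch-flavored pseudocode; all names are
| mine, not theirs): predictions made from a partially wrong
| intermediate Jacobi state get pulled toward the converged fixed
| point.
|
|     import torch
|     import torch.nn.functional as F
|
|     # Sketch only: `model` is any causal LM returning `.logits`,
|     # `prefix` holds the prompt ids, `y_k` is an intermediate
|     # n-token Jacobi state (some guesses still wrong), and
|     # `y_star` is the converged fixed point for the same block.
|     def consistency_loss(model, prefix, y_k, y_star):
|         inp = torch.cat([prefix, y_k], dim=-1)  # condition on guesses
|         # logits at positions P-1 .. P+n-2 predict the n block slots
|         logits = model(inp).logits[:, prefix.size(-1) - 1 : -1, :]
|         return F.cross_entropy(
|             logits.reshape(-1, logits.size(-1)),
|             y_star.reshape(-1),  # target: the fixed-point tokens
|         )
|
| No new heads, no new weights - just an extra term alongside the
| usual AR loss.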
| fermuch wrote:
| Would something like this apply to MAMBA/JAMBA too?
| alfalfasprout wrote:
| Wow, I'm mindblown this isn't getting more attention. This seems
| like a clear win for inference. The fine-tuning cost is reasonable
| (around 0.01% of the original pre-training cost), and the
| performance wins seem fairly consistent.
| lopuhin wrote:
| Similar or greater inference wins are achieved with speculative
| decoding, which is already widely used, so while this is really
| interesting (and was tried before with less success, AFAIK),
| it's not yet clear how impactful it would be.
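|
| For contrast, greedy speculative decoding in a heavily simplified
| sketch (function names are mine): a cheap draft model proposes k
| tokens and the target model checks them all in one forward pass.
|
|     # `draft` maps a token list to its argmax next token; `verify`
|     # returns the target model's argmax at every proposed position,
|     # computed in a single batched forward pass.
|     def speculative_step(draft, verify, ctx, k=4):
|         guesses = []
|         for _ in range(k):                    # cheap serial proposals
|             guesses.append(draft(ctx + guesses))
|         corrected = verify(ctx, guesses)      # one big-model pass
|         out = []
|         for g, v in zip(guesses, corrected):  # keep agreeing prefix;
|             out.append(v)                     # on a mismatch the big
|             if g != v:                        # model's token wins and
|                 break                         # the rest is discarded
|         return ctx + out                      # always >= 1 new token
|
| The trade-off vs. CLLMs: you have to serve a second model, and the
| speedup hinges on how often the draft agrees with the target.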
| paulclark wrote:
| Is this how Groq (https://groq.com/) is so fast, or are they
| doing something different?
| buildbot wrote:
| Groq is serving an LLM from (100s of chips' worth of) SRAM, so
| the effective bandwidth, and thus the token generation speed, is
| an order of magnitude higher than with HBM. This would 3.5x their
| speed as well; it is orthogonal.
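|
| Back-of-envelope, with illustrative numbers: each autoregressive
| step has to stream every weight once, so memory bandwidth caps
| the token rate regardless of compute.
|
|     # tokens/s ceiling ~= bandwidth / bytes of weights per step;
|     # ballpark figures, fp16 weights assumed.
|     def max_tok_per_s(params_billion, bw_gb_s, bytes_per_param=2):
|         return bw_gb_s / (params_billion * bytes_per_param)
|
|     print(max_tok_per_s(70, 3_350))   # ~24 tok/s, one HBM3 GPU
|     print(max_tok_per_s(70, 80_000))  # ~570 tok/s, SRAM aggregate
|
| Parallel decoding attacks the same bottleneck from the other
| side: more tokens emitted per pass over the weights.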
| miven wrote:
| The authors mention that Jacobi decoding is equivalent to greedy
| autoregressive decoding, but in practice don't we often want the
| sampling temperature to be above zero to avoid repetitions and
| excessively generic responses?
|
| I'm completely unfamiliar with this decoding strategy so maybe
| I'm just missing a simple way to account for that.
| matheist wrote:
| Agreed. It's straightforward to check that a token was the
| argmax, but it seems difficult to check that a token appeared
| with the probability you wanted it to. You could still do the
| fine-tuning step I guess, where you train the trajectories to
| approach n-token completions with the statistics you want, but
| I can't see how you can replace the "check for a fixed point"
| step. Maybe "check the result was above this fixed threshold
| for likelihood".
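|
| To make that concrete, here is my rough mental model of the
| greedy loop (variable names are mine, not the paper's):
|
|     # `argmax_all` stands in for one model call that returns, for
|     # every slot of the block, the argmax token given the prefix
|     # plus the current guesses at the earlier slots.
|     def jacobi_decode(argmax_all, prefix, n, pad_id=0, max_iters=64):
|         block = [pad_id] * n              # arbitrary initial guesses
|         for _ in range(max_iters):
|             new_block = argmax_all(prefix, block)  # refresh all slots
|             if new_block == block:        # fixed point: identical to
|                 break                     # greedy one-token decoding
|             # a sampled variant might instead accept slot i whenever
|             # probs[i][block[i]] >= threshold, per the idea above,
|             # giving up the exact-greedy guarantee
|             block = new_block
|         return block
|
| The fixed-point test is exactly where the greedy equivalence
| comes from; a likelihood threshold would trade that guarantee for
| some sampling diversity.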
| doctor_eval wrote:
| > Our research shows this process - mimicking human cognitive
| process of forming complete sentences in mind before articulating
| word by word
|
| This is not how I work. Is there something wrong with me?
| jerbear4328 wrote:
| Nor is it how I work; I think that's normal enough. I do have
| an idea of what I'm going to say before I say it, and I think
| that's closer to what they meant. I think and speak in
| increments of ideas, not words.
| paulmd wrote:
| > I think and speak in increments of ideas
|
| extremely common among (but not unique to) people with ASD,
| those "increments of ideas" are called "gestalts".
|
| https://kidtherapy.org/helpful-articles/what-is-gestalt-lang...
| Filligree wrote:
| You might not have an internal monologue. A lot of us don't,
| and the ones that do are equally shocked every time they find
| out. For what it's worth, I'm in the same boat -- I _can_ form
| sentences, but why would I? It'd slow me down.
|
| People who don't have inner monologues tend to assume that all
| that stuff is some form of analogy or metaphor. It's not. It's
| entirely literal.
| oceanplexian wrote:
| Do you mean in a real time conversation?
|
| Because I definitely don't "have an internal monologue about
| what I'm going to say" in the 100ms between when someone asks
| a casual question and I respond to it.
| DrSiemer wrote:
| They probably do not mean people form entire sentences before
| expressing them; I am not aware of anybody doing that. I assume
| it refers to people first coming up with a high-level outline of
| what they want to say before they start speaking.
| mdp2021 wrote:
| "Rem tene, verba sequentur" (you hold the matter, then words
| come) is largely "how it works".
|
| You form logical ideas as you speak, as you speak your speech
| develops, so the translation is from ideas to sentences. It is
| not clear in which phase one would mentally form a complete
| sentence, nor why it should be relevant. You "see something
| [that makes sense]", then you describe it - iteratively.
| giardini wrote:
| Probably.
| rcarmo wrote:
| Can't wait to see something like this merged into ollama (I'm
| sure there would be plenty of people fine-tuning models for it).
| Me1000 wrote:
| Ollama doesn't have their own inference engine; they just wrap
| llama.cpp. But yes, it will be awesome when it's more generally
| available.
| helloericsf wrote:
| The lab is tied to the vLLM project. I would say it might get
| picked up by vLLM sooner than by other inference frameworks.
| dvt wrote:
| There's no free lunch(tm), so from what I can tell there's some
| pathway loss here. E.g., some Jacobi trajectories definitionally
| exclude higher-temperature paths. That might actually be a
| positive for data retrieval (but a negative if we want to
| maximize creativity?).
| nico wrote:
| Interesting
|
| I think soon we are going to realize that we don't really need
| to train the models
|
| We just need good indexing and sampling
|
| Essentially at some level any LLM is equivalent to a DB of the
| dataset, with a great NLP interface on top
|
| Both are just different methods of navigating stored data
| DoctorOetker wrote:
| This mirrors what I experienced when I enrolled in "free drawing"
| (no teaching) classes:
|
| While people have considered me a good drawer since I was a child,
| I remember either just repeating detailed drawings similar to ones
| I had drawn before, or otherwise taking plenty of time to draw. I
| believe anyone with time and patience can make a nice drawing of
| a scene.
|
| The "free drawing" class had no rules or lectures: you brought
| the materials you wanted to work with (some brought ink, others
| pencils, while I brought charcoal). The only thing determined was
| the timing of the model's poses: in each session the first few
| poses were very short (say a minute), and then the pose durations
| would progressively lengthen to, say, 5-minute poses.
| At all times you were free to tear your picture up and retry
| drawing the pose again.
|
| My drawing skills improved considerably. The short "warmups"
| actually force you to get proportions and outlines right on the
| first try. Conventional wisdom says haste makes waste, but when
| learning or refining skills, it seems natural selection has
| hardcoded the sensation of haste as a stressor prompting
| attention and learning.
|
| I am convinced I could have produced drawings of similar quality
| before enrolling in those classes, except they would have easily
| taken me 5 or 10x as long. Being forced not to beat around the
| bush, and feeling the penalty of a hasty mistake (which eats into
| the time left for a second try), does seem to work.
|
| My only gripe is that the technique is termed "Consistency",
| whereas I would reserve such a term for an improvement in
| _performance_, not inference speed, although I understand that
| they indicate "consistency with what would ultimately have been
| generated one token at a time". I would rather dub it
| "Proficiency LLM", where the same output is expected, only
| without the inhibition of stuttering to the same conclusion.
| manmal wrote:
| Systems generally become more efficient when under stress. They
| are also forced into local optima - everything has upsides and
| downsides.
| ec109685 wrote:
| Could someone please explain the intuition around this technique
| in more lament terms?
| zozbot234 wrote:
| > Could someone please explain the intuition around this
| technique in more lament terms?
|
| Why should we? Look, if you think this is a really sad paper
| why not just say so outright. Maybe you're right about that,
| the authors are complete lightweights and you're the very smart
| and stable genius who tells it like it is. But you haven't
| provided any evidence of that whatsoever. What a Sad comment!
___________________________________________________________________
(page generated 2024-05-08 23:00 UTC)