[HN Gopher] Consistency LLM: converting LLMs to parallel decoder...
       ___________________________________________________________________
        
       Consistency LLM: converting LLMs to parallel decoders accelerates
       inference 3.5x
        
       Author : zhisbug
       Score  : 183 points
       Date   : 2024-05-08 19:55 UTC (3 hours ago)
        
 (HTM) web link (hao-ai-lab.github.io)
 (TXT) w3m dump (hao-ai-lab.github.io)
        
       | toxik wrote:
        | Interesting stuff. I guess the idea has occurred to many, but
        | it is well written and presented here.
        
       | andy12_ wrote:
        | At first I thought that this was another Medusa-like paper, simply
        | using more unembed heads for guessing subsequent tokens, but damn,
       | not at all. This is amazing. And it doesn't even use extra
       | parameters, it's just an auxiliary training loss.
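        | 
        | For anyone curious, a minimal sketch of what such an
        | auxiliary consistency loss could look like (PyTorch-style;
        | the function name and the HF-style .logits access are my
        | assumptions, not the paper's actual code):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     def consistency_loss(model, prefix, draft, fixed_point):
        |         # prefix: [B, P] prompt, draft: [B, N] an
        |         # intermediate Jacobi state, fixed_point: [B, N]
        |         # the converged (greedy autoregressive) tokens
        |         logits = model(torch.cat([prefix, draft], -1)).logits
        |         n = draft.shape[-1]
        |         # logits predicting the N draft slots, in parallel
        |         draft_logits = logits[:, -n - 1:-1, :]
        |         # pull every intermediate state toward the
        |         # trajectory's fixed point, not just the next token
        |         return F.cross_entropy(
        |             draft_logits.reshape(-1, draft_logits.size(-1)),
        |             fixed_point.reshape(-1))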
        
       | fermuch wrote:
       | Would something like this apply to MAMBA/JAMBA too?
        
       | alfalfasprout wrote:
       | Wow, I'm mindblown this isn't getting more attention. This seems
        | like a clear win for inference. Fine-tuning cost for this is
       | reasonable (around 0.01% of the original pre-training cost). And
       | the performance wins seem fairly consistent.
        
         | lopuhin wrote:
         | Similar or greater inference wins are achieved with speculative
          | decoding, which is already widely used, so while this is really
         | interesting (and was tried before with less success AFAIK),
         | it's not yet clear how impactful it would be.
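          | 
          | For comparison, a greedy-acceptance sketch of speculative
          | decoding (illustrative only; real implementations also
          | handle sampling via rejection and batch the verification):
          | 
          |     import torch
          | 
          |     @torch.no_grad()
          |     def speculative_step(target, draft, ids, k=4):
          |         # small draft model proposes k tokens, one by one
          |         prop = ids
          |         for _ in range(k):
          |             nxt = draft(prop).logits[:, -1:].argmax(-1)
          |             prop = torch.cat([prop, nxt], -1)
          |         # big target model checks all k in ONE forward pass
          |         tgt = target(prop).logits.argmax(-1)
          |         n = ids.shape[-1]
          |         acc = 0
          |         while (acc < k and
          |                tgt[0, n - 1 + acc] == prop[0, n + acc]):
          |             acc += 1
          |         # keep the agreed prefix plus the target's own
          |         # token at the first mismatch (>= 1 token gained)
          |         return torch.cat([prop[:, :n + acc],
          |                           tgt[:, n - 1 + acc:n + acc]], -1)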
        
       | paulclark wrote:
       | Is this how Groq (https://groq.com/) is so fast, or are they
       | doing something different?
        
         | buildbot wrote:
          | Groq is serving an LLM from (100s of chips' worth of) SRAM,
          | so the effective bandwidth, and thus token generation speed,
          | is an order of magnitude higher than with HBM. This would
          | 3.5x their speed as well; it is orthogonal.
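          | 
          | Back-of-envelope, with illustrative numbers (not actual
          | Groq or NVIDIA specs): memory-bound decode speed is roughly
          | bandwidth divided by bytes read per token.
          | 
          |     model_bytes = 70e9 * 2       # 70B params at fp16
          |     hbm_bw = 3.35e12             # ~one HBM3 GPU, bytes/s
          |     sram_bw = 80e12              # aggregate on-chip SRAM
          |     print(hbm_bw / model_bytes)   # ~24 tokens/s
          |     print(sram_bw / model_bytes)  # ~570 tokens/s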
        
       | miven wrote:
       | The authors mention that Jacobi decoding is equivalent to greedy
       | autoregressive decoding, but in practice don't we often want the
       | sampling temperature to be above zero to avoid repetitions and
       | excessively generic responses?
       | 
       | I'm completely unfamiliar with this decoding strategy so maybe
       | I'm just missing a simple way to account for that.
        
         | matheist wrote:
         | Agreed. It's straightforward to check that a token was the
         | argmax, but it seems difficult to check that a token appeared
         | with the probability you wanted it to. You could still do the
         | fine-tuning step I guess, where you train the trajectories to
         | approach n-token completions with the statistics you want, but
         | I can't see how you can replace the "check for a fixed point"
          | step. Maybe "check that the result was above some fixed
          | likelihood threshold".
        
       | doctor_eval wrote:
       | > Our research shows this process - mimicking human cognitive
       | process of forming complete sentences in mind before articulating
       | word by word
       | 
       | This is not how I work. Is there something wrong with me?
        
         | jerbear4328 wrote:
          | Nor is it how I work; I think that's normal enough. I do
          | have an idea of what I'm going to say before I say it, and
          | I think that's closer to what they meant. I think and speak
          | in increments of ideas, not words.
        
           | paulmd wrote:
           | > I think and speak in increments of ideas
           | 
            | Extremely common among (but not unique to) people with
            | ASD; those "increments of ideas" are called "gestalts".
           | 
           | https://kidtherapy.org/helpful-articles/what-is-gestalt-
           | lang...
        
         | Filligree wrote:
         | You might not have an internal monologue. A lot of us don't,
         | and the ones that do are equally shocked every time they find
          | out. For what it's worth, I'm in the same boat -- I _can_
          | form sentences, but why would I? It'd slow me down.
         | 
         | People who don't have inner monologues tend to assume that all
         | that stuff is some form of analogy or metaphor. It's not. It's
         | entirely literal.
        
           | oceanplexian wrote:
            | Do you mean in a real-time conversation?
           | 
            | Because I definitely don't "have an internal monologue about
           | what I'm going to say" in the 100ms between when someone asks
           | a casual question and I respond to it.
        
         | DrSiemer wrote:
          | They probably do not mean people form entire sentences
          | before expressing them; I am not aware of anybody doing
          | that. I assume it refers to people first coming up with a
          | general outline of what they want to say before they start
          | speaking.
        
         | mdp2021 wrote:
         | "Rem tene, verba sequentur" (you hold the matter, then words
         | come) is largely "how it works".
         | 
          | You form logical ideas as you speak; as you speak, your
          | speech develops, so the translation is from ideas to
          | sentences. It is
         | not clear in which phase one would mentally form a complete
         | sentence, nor why it should be relevant. You "see something
         | [that makes sense]", then you describe it - iteratively.
        
         | giardini wrote:
         | Probably.
        
       | rcarmo wrote:
       | Can't wait to see something like this merged into ollama (I'm
       | sure there would be plenty of people fine-tuning models for it).
        
         | Me1000 wrote:
          | Ollama doesn't have its own inference engine; it just wraps
          | llama.cpp. But yes, it will be awesome when this is more
          | generally available.
        
         | helloericsf wrote:
          | The lab is tied to the vLLM project. I would say it might
          | get picked up by vLLM sooner than by other inference
          | frameworks.
        
       | dvt wrote:
       | There's no free lunch(tm), so from what I can tell there's some
        | pathway loss here. E.g. some Jacobi trajectories definitionally
        | exclude higher-temperature paths, which might actually be a
        | positive for data retrieval (but a negative if we want to
       | maximize for creativity?).
        
       | nico wrote:
       | Interesting
       | 
        | I think soon we are going to realize that we don't really
        | need to train the models
       | 
       | We just need good indexing and sampling
       | 
       | Essentially at some level any LLM is equivalent to a DB of the
       | dataset, with a great NLP interface on top
       | 
       | Both are just different methods of navigating stored data
        
       | DoctorOetker wrote:
       | This mirrors what I experienced when I enrolled in "free drawing"
       | (no teaching) classes:
       | 
        | While people have considered me a good drawer since childhood, I
       | remember just repeating either similar detailed drawings I drew
       | before, or otherwise just taking plenty of time to draw. I
       | believe anyone with time and patience can make a nice drawing of
       | a scene.
       | 
       | The "free drawing" class had no rules or lectures: you brought
       | the materials you wanted to work with (some brought ink, others
       | pencils, while I brought charcoal). The only thing determined was
       | the timing between poses for the model: for each session the
        | first few poses were very short (say a minute), and then the
        | pose durations would progressively lengthen to, say, 5-minute
        | poses.
       | At all times you were free to tear your picture up and retry
       | drawing the pose again.
       | 
       | My drawing skills improved considerably. The short "warmups"
       | actually force you to get proportions and outlines correct on the
       | first tries. Conventional wisdom says haste makes waste, but when
       | learning or refining skills, it seems natural selection has
       | hardcoded the sensation of haste as a stressor prompting
       | attention and learning.
       | 
       | I am convinced I could have drawn similar quality drawings before
       | enrolling in those classes, except they would have taken me
        | easily 5 or 10x as long to draw. Being forced not to beat
        | around the bush, and feeling the penalty of a hasty mistake
        | (further decreasing the time left for a second try), does seem
        | to work.
       | 
       | My only gripe is that the technique is termed "Consistency"
        | whereas I would reserve such a term for an improvement in
        | _performance_, not inference speed, although I understand that
       | they indicate  "consistency with what would ultimately have been
       | generated one token at a time". I would rather dub it
       | "Proficiency LLM", where the same output is expected, only
       | without the inhibition of stuttering to the same conclusion.
        
         | manmal wrote:
         | Systems generally become more efficient when under stress. They
         | are also forced into local optima - everything has upsides and
         | downsides.
        
       | ec109685 wrote:
       | Could someone please explain the intuition around this technique
       | in more lament terms?
        
         | zozbot234 wrote:
         | > Could someone please explain the intuition around this
         | technique in more lament terms?
         | 
         | Why should we? Look, if you think this is a really sad paper
         | why not just say so outright. Maybe you're right about that,
         | the authors are complete lightweights and you're the very smart
         | and stable genius who tells it like it is. But you haven't
         | provided any evidence of that whatsoever. What a Sad comment!
        
       ___________________________________________________________________
       (page generated 2024-05-08 23:00 UTC)