[HN Gopher] Consistency LLM: converting LLMs to parallel decoder...
       ___________________________________________________________________
        
       Consistency LLM: converting LLMs to parallel decoders accelerates
       inference 3.5x
        
       Author : zhisbug
       Score  : 437 points
        Date   : 2024-05-08 19:55 UTC (1 day ago)
        
 (HTM) web link (hao-ai-lab.github.io)
 (TXT) w3m dump (hao-ai-lab.github.io)
        
       | toxik wrote:
        | Interesting stuff. I guess the idea has occurred to many, but
        | this one is well written and presented.
        
         | programjames wrote:
         | Yep. My roommate and I were talking about this a year ago. You
         | can also do something similar for LLM steering.
        
       | andy12_ wrote:
        | At first I thought that this was another Medusa-like paper, simply
        | using more unembed heads for guessing subsequent tokens, but damn,
        | not at all. This is amazing. And it doesn't even use extra
        | parameters, it's just an auxiliary training loss.
        
         | snyhlxde wrote:
          | The only similarity between Medusa and CLLM is that both train
          | and adapt LLMs for fast inference. But they use completely
          | different training and decoding techniques, and as you pointed
          | out, CLLMs don't need extra parameters or an attention mask
          | configured for tree-based verification.
        
       | fermuch wrote:
       | Would something like this apply to MAMBA/JAMBA too?
        
         | wrsh07 wrote:
          | I think any next-token predictor will benefit. IIUC Mamba is a
          | next-token predictor.
          | 
          | I just skimmed the gradient article, but if their only change
          | is swapping out the transformer block for the Mamba block, I
          | don't think it's already using this optimization.
        
       | alfalfasprout wrote:
       | Wow, I'm mindblown this isn't getting more attention. This seems
       | like a clear win for inference. Fine tuning cost for this is
       | reasonable (around 0.01% of the original pre-training cost). And
       | the performance wins seem fairly consistent.
        
         | lopuhin wrote:
          | Similar or greater inference wins are achieved with speculative
          | decoding, which is already widely used, so while this is really
          | interesting (and was tried before with less success, AFAIK),
          | it's not yet clear how impactful it would be.
        
           | WhitneyLand wrote:
           | I don't see where similar wins have ever been achieved.
           | 
           | Speculative decoding can reduce latency, but at the cost of
           | using a lot more compute. The amazing thing here is latency
           | _and_ global throughput improvements would be realized
           | because of the increase in efficiency.
           | 
           | From what I understand speculative decoding can also come
           | with more challenges insofar as trying to maintain overall
           | output quality.
        
         | snyhlxde wrote:
          | Thanks for your interest in our work! Yes, we found that
          | training with consistency loss + AR loss on even a subset of a
          | dataset results in significant speedup (0.01% of pre-training
          | cost). Training on more data permits even further speedup: the
          | model is able to learn from more frequently appearing
          | collocations and phrases.
          | 
          | For more details, please check out our paper, where you can
          | also see that the speedup saturates as the size of the training
          | data grows.
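          | 
          | For intuition, here is a toy sketch of how the two losses
          | combine (the tiny model, random tensors, and the plain
          | cross-entropy form are stand-ins for illustration, not our
          | actual training code):
          | 
          |     import torch
          |     import torch.nn.functional as F
          |     
          |     vocab, hidden, n_new = 100, 32, 5    # toy sizes
          |     model = torch.nn.Sequential(         # stand-in for an LLM
          |         torch.nn.Embedding(vocab, hidden),
          |         torch.nn.Linear(hidden, vocab),
          |     )
          |     
          |     prompt = torch.randint(0, vocab, (1, 8))
          |     middle = torch.randint(0, vocab, (1, n_new))  # Jacobi state
          |     fixed  = torch.randint(0, vocab, (1, n_new))  # its fixed point
          |     
          |     # consistency loss: the logits that predict the n_new guessed
          |     # positions should already point at the fixed-point tokens
          |     inp = torch.cat([prompt, middle], dim=1)
          |     out = model(inp)[:, -n_new - 1:-1, :]   # causal shift by one
          |     loss_c = F.cross_entropy(out.reshape(-1, vocab),
          |                              fixed.reshape(-1))
          |     
          |     # AR loss: ordinary next-token prediction on the clean sequence
          |     full = torch.cat([prompt, fixed], dim=1)
          |     out_ar = model(full[:, :-1])
          |     loss_ar = F.cross_entropy(out_ar.reshape(-1, vocab),
          |                               full[:, 1:].reshape(-1))
          |     
          |     (loss_c + loss_ar).backward()  # weighted sum in practice
          | 
          | In the actual setup the targets come from Jacobi trajectories
          | collected on the training prompts, not random tensors.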
        
         | WhitneyLand wrote:
          | Yes, this seems like a hugely important result for LLM
          | performance.
          | 
          | I'm not aware of any other paper that has offered to increase
          | LLM inference performance to this degree. Has there ever been
          | one before?
         | 
         | At least while also:
         | 
         | - Maintaining output quality. The benchmarks used were somewhat
         | narrow but so far so good.
         | 
         | - Improving not just query latency but also global throughput
         | 
         | - Not requiring more compute
         | 
         | - Having a relatively practical implementation and not adding
         | big challenges and complexity
         | 
         | You could argue the insight is incremental, as it builds on
         | what's been done with parallel/jacobi decoding. Those previous
         | results were necessary and important, but this may be the one
         | that finally extracts real world value from the promise of
         | parallel decoding.
        
       | paulclark wrote:
       | Is this how Groq (https://groq.com/) is so fast, or are they
       | doing something different?
        
         | buildbot wrote:
          | Groq is serving an LLM from (100s of chips' worth of) SRAM, so
          | the effective bandwidth, and thus token generation speed, is an
          | order of magnitude higher than HBM. This would 3.5x their speed
          | as well; it is orthogonal.
        
           | gdiamos wrote:
           | I'm surprised no one has done this for a GPU cluster yet - we
           | used to do this for RNNs on GPUs & FPGAs at Baidu:
           | 
           | https://proceedings.mlr.press/v48/diamos16.pdf
           | 
           | Or better yet - on Cerebras
           | 
           | Kudos to groq for writing that kernel
        
         | wrsh07 wrote:
         | My understanding is that theirs is a pure hardware solution.
         | The hardware is flexible enough to model any current NN
         | architecture.
         | 
          | (Incidentally, there are black box optimization algorithms, so
          | a system as good as Groq at inference might be useful for
          | training even if it can't support gradient descent)
        
         | throwawaymaths wrote:
          | According to someone I talked to at a Groq event I was invited
          | to (I did not sign an NDA), they are putting ~8 racks of
          | hardware per LLM. Of course, coordinating those racks to have
          | exact timings between them to pull tokens through is definitely
          | "part of the hard part".
        
       | miven wrote:
       | The authors mention that Jacobi decoding is equivalent to greedy
       | autoregressive decoding, but in practice don't we often want the
       | sampling temperature to be above zero to avoid repetitions and
       | excessively generic responses?
       | 
       | I'm completely unfamiliar with this decoding strategy so maybe
       | I'm just missing a simple way to account for that.
        
         | matheist wrote:
         | Agreed. It's straightforward to check that a token was the
         | argmax, but it seems difficult to check that a token appeared
         | with the probability you wanted it to. You could still do the
         | fine-tuning step I guess, where you train the trajectories to
         | approach n-token completions with the statistics you want, but
         | I can't see how you can replace the "check for a fixed point"
         | step. Maybe "check the result was above this fixed threshold
         | for likelihood".
        
         | snyhlxde wrote:
          | Yes, this is a great question! We are actively working on
          | supporting sampling strategies other than greedy sampling. In
          | the context of CLLM training, instead of mapping to a static
          | fixed point obtained from Jacobi decoding as the training
          | objective, we would map to what we term a dynamic fixed point.
          | You can keep an eye on our GitHub repo for new progress.
        
       | doctor_eval wrote:
       | > Our research shows this process - mimicking human cognitive
       | process of forming complete sentences in mind before articulating
       | word by word
       | 
       | This is not how I work. Is there something wrong with me?
        
         | jerbear4328 wrote:
         | Nor is it how I work, I think that's normal enough. I do have
         | an idea of what I'm going to say before I say it, I think
         | that's closer to what they meant. I think and speak in
         | increments of ideas, not words.
        
           | paulmd wrote:
           | > I think and speak in increments of ideas
           | 
           | extremely common among (but not unique to) people with ASD,
           | those "increments of ideas" are called "gestalts".
           | 
           | https://kidtherapy.org/helpful-articles/what-is-gestalt-
           | lang...
        
         | Filligree wrote:
         | You might not have an internal monologue. A lot of us don't,
         | and the ones that do are equally shocked every time they find
          | out. For what it's worth, I'm in the same boat -- _can_ form
          | sentences, but why would I? It'd slow me down.
         | 
         | People who don't have inner monologues tend to assume that all
         | that stuff is some form of analogy or metaphor. It's not. It's
         | entirely literal.
        
           | oceanplexian wrote:
           | Do you mean in a real time conversation?
           | 
            | Because I definitely don't "have an internal monologue about
            | what I'm going to say" in the 100ms between when someone asks
            | a casual question and I respond to it.
        
             | int_19h wrote:
             | Yes, it is possible to maintain an internal monologue in
             | real time conversation. That is one of the reasons why some
             | people usually take longer than 100ms to respond.
        
         | DrSiemer wrote:
         | They probably do not mean people form entire sentences before
         | expressing them, I am not aware of anybody doing that. I assume
         | it refers to people first coming up with a global outline of
         | what they want to say before they start speaking.
        
         | mdp2021 wrote:
         | "Rem tene, verba sequentur" (you hold the matter, then words
         | come) is largely "how it works".
         | 
          | You form logical ideas as you speak; as you speak, your speech
          | develops, so the translation is from ideas to sentences. It is
          | not clear in which phase one would mentally form a complete
          | sentence, nor why it should be relevant. You "see something
          | [that makes sense]", then you describe it - iteratively.
        
         | giardini wrote:
         | Probably.
        
         | snyhlxde wrote:
         | In some conversations, maybe it's easier to form complete
         | sentences. In some others, the best we can do is: have a rough
         | draft about what to say in mind and then refine it word by word
         | while speaking.
        
         | throwawaymaths wrote:
          | Are you sure? It might not be the whole sentence, but I would
          | find it hard to believe that in practice the way you speak or
          | write is like
         | 
         | hello <think> May <think> be <think> I'll <think> go <think>
         | get <think> break <think> fast
        
         | causal wrote:
         | You are probably pretty far from the LLM extreme, though, of
         | thinking one token at a time.
        
       | rcarmo wrote:
       | Can't wait to see something like this merged into ollama (I'm
       | sure there would be plenty of people fine-tuning models for it).
        
         | Me1000 wrote:
         | Ollama doesn't have their own inference engine, they just wrap
         | llama.cpp. But yes, it will be awesome when it's more generally
         | available.
        
         | helloericsf wrote:
         | The lab is tied to the vLLM project. I would say it might get
         | picked up sooner by vLLM than other inference frameworks.
        
       | dvt wrote:
       | There's no free lunch(tm), so from what I can tell there's some
       | pathway loss here. E.g. some Jacobi trajectories definitionally
       | exclude higher temperature paths. Which might actually be a
       | positive given data retrieval (but a negative if we want to
       | maximize for creativity?).
        
         | wrsh07 wrote:
         | There are better and worse algorithms. I'm not sure "there is
         | no free lunch" always applies in a particularly meaningful way.
         | Some things aren't on the pareto frontier.
        
           | factormeta wrote:
            | Kinda like the AIFF -> MP3 conversion process. A lot of data
            | is lost, but can we humans really tell too much of a
            | difference?
        
             | wrsh07 wrote:
             | There's no reason to think the current next token
             | prediction models are optimal for predicting sentences
             | (they aren't!)
             | 
             | > An algorithm may outperform another on a problem when
             | neither is specialized to the problem
             | 
             | https://en.m.wikipedia.org/wiki/No_free_lunch_in_search_and
             | _...
        
               | stkdump wrote:
               | I would go even further and say there isn't any
               | indication that we are even close to what is possible. My
               | subjective feeling is that with the current rate of
               | progress it is entirely possible that we will have GPT-4
               | level performance locally on smartphone hardware within
               | 3-10 years (unless companies decide again that they don't
               | want to give this kind of power away)
        
               | naasking wrote:
               | Probably. Advancements in ML algorithms, like this one,
               | have been outpacing advancements in hardware for awhile
               | now, so both are converging on making ML faster and
               | ubiquitous.
        
       | nico wrote:
       | Interesting
       | 
        | I think soon we are going to realize that we don't really need
        | to train the models
       | 
       | We just need good indexing and sampling
       | 
       | Essentially at some level any LLM is equivalent to a DB of the
       | dataset, with a great NLP interface on top
       | 
       | Both are just different methods of navigating stored data
        
         | nsagent wrote:
          | You might like the Infinigram paper, then. It was discussed
          | recently:
         | 
         | https://news.ycombinator.com/item?id=40266791
        
         | sdrg822 wrote:
         | But indexing *is* training. It's just not using end-to-end
         | gradient descent.
        
         | tempusalaria wrote:
         | LLMs can easily produce data not in training dataset.
         | 
         | LLMs do not navigate stored data. An LLM is not a DB of the
         | training data.
        
           | carlthome wrote:
            | I've had the same thought as above, but unfounded (just a
            | feeling, pretty much), so I'm curious to learn more. Do you
            | have any references I can check out that support these
            | claims?
        
             | int_19h wrote:
             | Come up with a novel puzzle that is guaranteed to not be in
             | the training set, and ask GPT-4 to solve it.
        
         | PeterisP wrote:
          | The models are multiple orders of magnitude smaller than the
          | compressed versions of their training data, so they cannot be
          | the equivalent of a DB of it.
        
           | lainga wrote:
           | The training data is ideo-semantically compressed? News to
           | me... is it perhaps stored in kanji?
        
       | DoctorOetker wrote:
       | This mirrors what I experienced when I enrolled in "free drawing"
       | (no teaching) classes:
       | 
       | While people considered me a good drawer since I was a child, I
       | remember just repeating either similar detailed drawings I drew
       | before, or otherwise just taking plenty of time to draw. I
       | believe anyone with time and patience can make a nice drawing of
       | a scene.
       | 
       | The "free drawing" class had no rules or lectures: you brought
       | the materials you wanted to work with (some brought ink, others
       | pencils, while I brought charcoal). The only thing determined was
       | the timing between poses for the model: for each session the
       | first few poses were very short (say a minute), and then the pose
       | durations would progressively lengthen until say 5 minute poses.
       | At all times you were free to tear your picture up and retry
       | drawing the pose again.
       | 
       | My drawing skills improved considerably. The short "warmups"
       | actually force you to get proportions and outlines correct on the
       | first tries. Conventional wisdom says haste makes waste, but when
       | learning or refining skills, it seems natural selection has
       | hardcoded the sensation of haste as a stressor prompting
       | attention and learning.
       | 
       | I am convinced I could have drawn similar quality drawings before
       | enrolling in those classes, except they would have taken me
        | easily 5 or 10x as long to draw. Being forced not to beat around
       | the bush and feeling the penalty of making a hasty mistake
       | (further decreasing time left for the second try in the remaining
       | time) does seem to work.
       | 
       | My only gripe is that the technique is termed "Consistency"
       | whereas I would reserve such a term for an improvement in
       | _performance_ not inference speed, although I understand that
       | they indicate  "consistency with what would ultimately have been
       | generated one token at a time". I would rather dub it
       | "Proficiency LLM", where the same output is expected, only
       | without the inhibition of stuttering to the same conclusion.
        
         | manmal wrote:
         | Systems generally become more efficient when under stress. They
         | are also forced into local optima - everything has upsides and
         | downsides.
        
           | sheepscreek wrote:
           | Interestingly - this is the idea behind Nassim Taleb's book
           | "Antifragile" and the concept of "anti-fragility".
           | 
            | In essence, it promotes dynamic/evolutionary/always-learning
            | behaviour rather than performing the same set of steps every
            | time, and in the process, becoming stronger than before.
            | 
            | An example he shares is how the breakdown of muscle tissue
            | through exercise leads to more muscle development and an
            | increase in strength. I guess it's similar to LLM training
            | using error/loss-reducing functions (practice makes perfect)
            | but dissimilar in the sense that training is a one-time
            | action.
        
           | TeMPOraL wrote:
           | > _They are also forced into local optima_
           | 
           | The good ol', "under pressure, you don't rise to the
           | occasion, but sink to the level of your training"?
        
         | snyhlxde wrote:
          | Hi, we are the CLLM authors. Thanks for sharing your experience
          | and insights! I can see how this skill-refining process echoes
          | the training process in CLLM; the only thing is that, at this
          | point, the stressor in CLLM training does not get progressively
          | more demanding.
          | 
          | For example, while drawing, you can set a very specific time
          | limit on how long you are allowed to draw in each trial and
          | make the time progressively shorter. In CLLM, maybe we can make
          | the learning process more and more difficult by mapping more
          | and more distant states in the Jacobi trajectory to its final
          | state.
          | 
          | We are using the term "consistency" because we draw a parallel
          | between consistency LLMs and the consistency models in diffusion
          | image generation, where the training processes are analogous.
        
           | Quarrel wrote:
           | Is it just me, or does this read like it was written by an
           | LLM ... ?!
        
             | snyhlxde wrote:
             | lol I take that as a compliment. Good try but sadly no LLM
             | in this writing :)
        
             | jasonjmcghee wrote:
             | It's just much more formal than people generally speak on
             | HN.
        
           | boroboro4 wrote:
            | Do you use the same dataset to train/eval the model? Was the
            | model used here, for example, trained on the GSM8K dataset?
        
             | snyhlxde wrote:
              | Yes, we consider both domain-specific applications (Spider
              | for text2SQL, GSM8K for math, CodeSearchNet for Python) and
              | open-domain conversational applications (ShareGPT). We use
              | the test set from each application to evaluate CLLMs'
              | performance in our paper.
              | 
              | On the other hand, technically CLLMs work on any kind of
              | query, but the speedup might vary. Feel free to try out our
              | codebase for your use cases!
        
         | aamargulies wrote:
         | I had an interesting experience in an Invertebrate Zoology lab
         | class one summer.
         | 
         | We students were brought into a lab, given specimens to draw,
         | and the only instructions we received were 'You have 30 minutes
         | to draw this. Go.'
         | 
         | There was no "here's how to draw. here's what to do and not to
         | do". It was just basically "We don't care about any
         | insecurities you might have. We don't care if you think you
         | can't draw. No excuses, just fucking draw it. Now."
         | 
         | Not only did we draw, but we (all of us) improved enormously
         | over the course of the class as more animals were brought in
         | and the exercise was repeated over and over and over again
         | throughout the summer.
         | 
         | What it taught us is that everyone, and I mean _everyone_ , can
         | draw. Our collective attitude shifted from "don't know if this
         | is even possible" to "of course we can do this. this is easy.
         | routine. trivial."
         | 
         | Highly recommended approach.
         | 
         | It was the most freeing and amazing class I had in college.
        
           | Version467 wrote:
           | That sounds like a pretty awesome experience. Thanks for
           | sharing.
        
       | ec109685 wrote:
        | Could someone please explain the intuition around this technique
        | in more layman's terms?
        
         | TomatoCo wrote:
         | For all of these "how can we batch predicting the next n
         | tokens?" the intuition is basically that it takes a buttload of
         | math to predict _some_ of the tokens, but that most tokens are
         | actually easy to guess. For example, if I asked  "What was that
         | phone number from that 80's song?" as soon as a model generates
         | 867- it shouldn't take that much math at all to finish
         | predicting 5309.
        
           | snyhlxde wrote:
            | A bit more intuition on how training works: in natural
            | language processing, some phrases/collocations, for example
            | "remind ... of ...", "make a decision", "learn a skill", etc.,
            | are used together. We can ask LLMs to learn such collocations
            | and frequently appearing n-grams. After learning, the model
            | can use parallel decoding to predict, in one forward pass,
            | many tokens that frequently appear together.
        
         | programjames wrote:
         | "Try to fix all the words in a sentence at once. Keep iterating
         | until you don't think it needs fixing."
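          | 
          | A minimal sketch of that loop (greedy Jacobi decoding), with a
          | toy bigram table standing in for the LLM (names and sizes are
          | made up for illustration):
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng(0)
          |     VOCAB = 50
          |     W = rng.standard_normal((VOCAB, VOCAB))  # toy bigram logits
          |     
          |     def greedy_next(prev_tokens):
          |         # greedy prediction for every position, in parallel
          |         return W[prev_tokens].argmax(axis=-1)
          |     
          |     def jacobi_decode(prompt, n_new, max_iters=100):
          |         guess = rng.integers(0, VOCAB, size=n_new)  # arbitrary start
          |         for _ in range(max_iters):
          |             # the token feeding each of the n_new positions
          |             prev = np.concatenate(([prompt[-1]], guess[:-1]))
          |             new = greedy_next(prev)         # one parallel "forward pass"
          |             if np.array_equal(new, guess):  # fixed point reached
          |                 break
          |             guess = new
          |         return guess
          |     
          |     print(jacobi_decode(np.array([1, 2, 3]), n_new=8))
          | 
          | Each pass gets at least one more position right, so the loop
          | converges to exactly what greedy autoregressive decoding would
          | have produced.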
        
       | m3kw9 wrote:
        | They could quickly try this with one of the open source models,
        | then show a side-by-side demo.
        
       | JKCalhoun wrote:
       | Anyone know somewhere someone dumb like me can "Ask an AI
       | expert"?
       | 
       | I want to ask, for example, how is it that an LLM when given the
       | same prompt does not respond in the same deterministic way?
       | 
       | I guess I want to learn this stuff and should maybe follow one of
       | those "write an LLM in an hour" type videos on YouTube.
        
         | 8note wrote:
         | For that answer, you can refer to the 3blue1brown videos
         | 
          | The LLM outputs a vector of probabilities for tokens, and the
          | LLM user picks a token from the most-likely list using a random
          | number.
        
         | zozbot234 wrote:
         | > I want to ask, for example, how is it that an LLM when given
         | the same prompt does not respond in the same deterministic way?
         | 
          | You can control that in most systems with an inference-time
          | parameter called "temperature". But setting the temperature as
         | low as possible tends to lead to very low-quality answers - the
         | system can't crawl out of some local optimum and ends up
         | repeating itself over and over. Such answers may be
         | "deterministic" but they're also not good.
        
         | rahimnathwani wrote:
         | For this particular question, ask chatgpt how temperature
         | affects llm softmax sampling.
         | 
         | For other things, study using Karpathy's videos.
        
         | zipfcharge wrote:
          | It's because an LLM is essentially a probability matrix. You
          | type a prompt, then it calculates the probability of the next
          | word, and so on, eventually forming a sentence. The
          | probabilities learned are based on the training data.
          | 
          | Because of the underlying probability model, it's not going to
          | be 100% deterministic. Plus, a model like ChatGPT purposefully
          | has a "temperature" parameter that further adds randomisation
          | to the whole process.
         | 
         | My answer is based on this paper if you're interested to read
         | more: The Matrix: A Bayesian learning model for LLMs,
         | https://arxiv.org/abs/2402.03175
        
           | flopriore wrote:
           | Are there any ways to show the source of the information
           | retrieved by the model? For instance, the LLM forms a
           | sentence and it points to a stackoverflow answer with the
           | same or similar content.
        
             | JKCalhoun wrote:
              | As I understand it, pretty sure that is impossible. When it
              | has been fed only a single datum, sure, trivial. As soon as
              | it is fed a second one, though, the weights are already a
              | kind of blend of the two tokens (so to speak).
        
               | spmurrayzzz wrote:
                | It's not impossible, but it's definitely difficult. There
                | is some overlap in the methods used to detect benchmark
                | data contamination, though it's not entirely the same
               | thing. For the detection use case, you already know the
               | text you're looking for and you are just trying to
               | demonstrate that the model has "seen" the data in its
               | training set. The challenge is proving that it is
               | statistically improbable that the model could
               | stochastically generate the same tokens without having
               | seen them during training.
               | 
               | Some great research exists in this area [1] and I expect
               | much of it may be repurposed for black box attribution in
               | the future (in addition to all the work being done in the
               | mechanistic interpretability field)
               | 
               | [1] https://arxiv.org/abs/2311.04850
        
         | throwawaymaths wrote:
         | > how is it that an LLM when given the same prompt does not
         | respond in the same deterministic way?
         | 
          | In software (not in the model) there's literally a random
          | number generator that picks from a weighted set of "next-token"
         | choices that the model spits out. The selection process can
         | have a series of knobs to manipulate the responses. If you want
         | it to be deterministic (if you have direct access to the
         | software) you can tell it to set "top-k = 1" or "temperature =
         | 0.0" (depending on your software) and it will be deterministic.
         | 
         | Usually the default settings are not for determinism, because
         | for whatever reason the quality of the results tends to not be
         | that good when you go fully d.
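          | 
          | A rough sketch of that selection step (plain NumPy, not any
          | particular library's implementation):
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng()
          |     
          |     def sample_next_token(logits, temperature=1.0, top_k=None):
          |         if temperature == 0.0 or top_k == 1:
          |             return int(np.argmax(logits))    # deterministic / greedy
          |         scaled = np.asarray(logits, dtype=float) / temperature
          |         if top_k is not None:
          |             cutoff = np.sort(scaled)[-top_k]
          |             scaled = np.where(scaled >= cutoff, scaled, -np.inf)
          |         probs = np.exp(scaled - scaled.max())
          |         probs /= probs.sum()
          |         return int(rng.choice(len(probs), p=probs))  # weighted random pick
          |     
          |     fake_logits = [2.0, 1.0, 0.5, -1.0]  # pretend the model emitted these
          |     print(sample_next_token(fake_logits, temperature=0.8, top_k=3))
          |     print(sample_next_token(fake_logits, temperature=0.0))  # always token 0
          | 
          | Same prompt, same logits -- the only nondeterminism is that
          | last rng.choice, which is why temperature = 0 or top-k = 1
          | makes the output repeatable.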
        
         | int_19h wrote:
         | I found this to be a good start that explains things fairly
         | methodically, but without losing the high-level perspective.
         | 
         | https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
        
       | snyhlxde wrote:
       | from CLLM authors:
       | 
        | Thank you guys for the great questions and insights! We have made
        | a Twitter post with some more details, and we invite you to
        | engage with us on Twitter as well.
       | 
       | https://twitter.com/haoailab/status/1788269848788869299
        
       | renonce wrote:
       | > ... speculative decoding methods ... incurs extra memory cost
       | during inference time.
       | 
       | Any detail on this? For speculative decoding you need a smaller
       | model to generate "branches" which are fast but maybe inaccurate
       | and verify these branches later with a larger model. However,
       | only memory equivalent to a single token is needed for
       | speculative decoding, and tokens in other branches are simply
       | masked out during inference. With a context size of 1000 and ~30
        | branches for 5 tokens, the memory overhead would be 3%, which is
        | negligible. If your context size is much smaller compared to the
        | number of branches - would someone who uses a generative LLM with
        | a context window of just 50 tokens care about generation speed?
       | 
       | Also, speculative decoding techniques are not restricted to
       | greedy sampling - it's expected to behave exactly the same as the
       | original model and sample with the expected probabilities. Most
       | literature on speculative decoding already reports 2.6x-3.5x
       | speedup. The blog post here reports 2.4x-3.4x generation speed -
       | which isn't that much of an upgrade?
       | 
        | While I mentioned speculative decoding above, and Medusa2 and
        | Eagle seem to be the techniques that the authors compare
        | against, the core problem remains: whatever method you use to
       | predict tokens ahead of time, there is a specific point where the
       | previous tokens are absolutely needed before predicting the next
       | token. It doesn't depend on what your model is or what your
       | techniques are, it's just about what is mathematically
       | achievable. How can you predict 5 tokens at once if the
       | probability distribution of the 5th next token depends heavily on
       | the previous 4 tokens? Speculative decoding, Jacobi decoding,
       | multi-token parallel decoding, whatever.
       | 
       | If only greedy sampling is supported for this, then I wonder what
       | are the advantages of this method, not to mention that other
       | techniques already achieve the expected speedup. Comparing greedy
       | sampling speedups to random sampling speedups is comparing apples
       | to oranges, and I doubt if the speedup described by the method
       | would remain after this method is adapted to random sampling (due
       | to the core problem mentioned above).
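        | 
        | For reference, a rough sketch of the draft-and-verify loop I'm
        | describing (greedy variant for brevity; production
        | implementations use rejection sampling so the output
        | distribution matches the large model exactly; the two "models"
        | here are toy stand-ins):
        | 
        |     import numpy as np
        |     
        |     rng = np.random.default_rng(0)
        |     VOCAB = 50
        |     W_big = rng.standard_normal((VOCAB, VOCAB))    # "large" model
        |     W_small = W_big + 0.3 * rng.standard_normal((VOCAB, VOCAB))  # drafter
        |     
        |     def greedy(W, prev):   # next-token argmax given the previous token
        |         return int(np.argmax(W[prev]))
        |     
        |     def speculative_step(tokens, k=5):
        |         # 1. the small model drafts k tokens autoregressively (cheap)
        |         draft, prev = [], tokens[-1]
        |         for _ in range(k):
        |             prev = greedy(W_small, prev)
        |             draft.append(prev)
        |         # 2. the big model scores all k draft positions in ONE pass
        |         prevs = [tokens[-1]] + draft[:-1]
        |         verified = [greedy(W_big, p) for p in prevs]  # parallel on real HW
        |         # 3. accept drafted tokens until the first disagreement, then
        |         #    take the big model's own token there and stop
        |         accepted = []
        |         for d, v in zip(draft, verified):
        |             accepted.append(v)
        |             if d != v:
        |                 break
        |         return tokens + accepted
        |     
        |     print(speculative_step([1, 2, 3]))
        | 
        | The win is that the big model runs once per batch of drafted
        | tokens instead of once per token.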
        
         | Palmik wrote:
         | Speculative decoding requires you to load the smaller model
         | into memory and run inference on it.
        
           | renonce wrote:
            | I think the smaller model is at least 20 times smaller. If
            | you do speculative decoding on a 70B model, a 1B model would
            | be appropriate.
        
         | cxczz wrote:
          | > the previous tokens are absolutely needed before predicting
          | the next token
          | 
          | Maybe this is the key contribution of this paper: demonstrating
          | that, through consistency training, LLMs can predict the next n
          | tokens even if there are incorrect guesses in the previous
          | tokens?
         | 
          | On the other hand, while mathematically it is true that
          | p(x_t | x_1, ..., x_{t-1}) depends on all of x_1 to x_{t-1}, in
          | practice it is possible that predicting x_t only requires x_1
          | to x_{t-2}, and the attention to x_{t-1} is minimal. Thus,
          | predicting x_t with x_1 to x_{t-2} and an inaccurate x_{t-1} is
          | possible.
        
       | wangii wrote:
        | I feel it's a pretty dangerous optimization before we REALLY
        | understand what's going on inside of the LLM, e.g. people who
        | believe in the geometric interpretation will have something to
        | say, and it would probably hurt if you are using "filler" tokens.
        | 
        | Besides, the assumption (not a universal fact) of "forming
        | complete sentences in mind before articulating word by word"
        | seems to overly simplify the activities that happen in our mind:
        | do we really have a complete plan before we start talking/typing?
        | As a Buddhist, I lean towards it being an illusion. Furthermore,
        | what about simultaneous thoughts? Are we linear thinkers at the
        | sentence level?
        | 
        | Anyway, pretty neat math!
        
         | Etheryte wrote:
         | That assumption might be useful in this context, but I think
         | it's pretty clearly not true. Ask anyone to tell you about a
         | complex past event with a lot of parallel branches and you'll
         | quickly see them add bits, pieces and tangents midsentence to
         | cover the full range of events. I don't think I've seen the
         | sentence granularity hypothesis in any serious scientific
         | context before.
        
         | renonce wrote:
          | The optimization does not affect the result of the LLM; it's
          | guaranteed to produce results equivalent to decoding directly.
          | Let's not treat the LLM as some magic that resembles our mind;
          | it's just another program that produces sentences that happen
          | to make sense.
        
           | sigmoid10 wrote:
            | Let's not treat our mind as something magical. It's just
           | another program that learned to speak by consuming lots of
           | training input. The implementation might look slightly
           | different from the outside, but from a mathematical
           | perspective, artificial neural networks are proven to be at
           | least as capable as the human mind.
        
             | baq wrote:
             | The best part is, your comment works both when sarcastic
             | and completely serious.
        
           | wangii wrote:
            | According to the original Jacobi decoding paper, it's set in
            | machine translation tasks, with an encoder + decoder, in
            | which the parallel algorithm is applied only to the decoder
            | part.
        
           | naasking wrote:
            | > Let's not treat the LLM as some magic that resembles our
            | mind; it's just another program that produces sentences that
            | happen to make sense.
           | 
           | "That happen to make sense" is hiding a lot of magic. It
           | would be statistically impossible to make as much sense as
           | LLMs do in response to prompts if it did not actually make
           | semantic distinctions. If it makes semantic distinctions,
           | then it does resemble the human mind in at least one way.
        
         | causal wrote:
         | What is the geometric interpretation?
        
         | hatthew wrote:
         | Can't speak for everyone but I definitely don't mentally form
         | complete sentences before talking. Sometimes I grammatically
         | talk myself into a corner in the middle of a sentence and need
         | to use some awkward words/phrases to finish my thought, or
         | simply pause and restart the phrase from the beginning.
        
         | int_19h wrote:
         | We don't appear to be forming words sequentially from
         | underlying parts, even though in many languages they are broken
         | down in smaller units that carry semantic meaning themselves.
         | There doesn't seem to be any clear reason for this to break
         | down suddenly at sentence level.
        
       | programjames wrote:
       | > Surprisingly, we find such an objective is analogous to that of
       | consistency models
       | 
       | This is why numerical methods should be part of the ML
       | curriculum.
        
       ___________________________________________________________________
       (page generated 2024-05-09 23:01 UTC)