[HN Gopher] Consistency LLM: converting LLMs to parallel decoder...
___________________________________________________________________
Consistency LLM: converting LLMs to parallel decoders accelerates
inference 3.5x
Author : zhisbug
Score : 437 points
Date : 2024-05-08 19:55 UTC (1 day ago)
(HTM) web link (hao-ai-lab.github.io)
(TXT) w3m dump (hao-ai-lab.github.io)
| toxik wrote:
| Interesting stuff. I guess the idea has occurred to many, but it
| was well written and presented here.
| programjames wrote:
| Yep. My roommate and I were talking about this a year ago. You
| can also do something similar for LLM steering.
| andy12_ wrote:
| At first I thought that this was another Medusa-like paper, simply
| using more unembed heads for guessing subsequent tokens, but damn,
| not at all. This is amazing. And it doesn't even use extra
| parameters, it's just an auxiliary training loss.
| snyhlxde wrote:
| The only similarity between Medusa and CLLM is that both train
| and adapt LLMs for fast inference. But they use completely
| different training and decoding techniques, and as you pointed
| out, CLLMs don't need extra parameters or an attention mask
| configured for tree-based verification.
| fermuch wrote:
| Would something like this apply to MAMBA/JAMBA too?
| wrsh07 wrote:
| I think any next-token predictor will benefit. IIUC, Mamba is a
| next-token predictor.
|
| I just skimmed the Gradient article, but if their only change
| is swapping out the transformer block for the Mamba block, I
| don't think it's already using this optimization.
| alfalfasprout wrote:
| Wow, I'm mindblown this isn't getting more attention. This seems
| like a clear win for inference. Fine tuning cost for this is
| reasonable (around 0.01% of the original pre-training cost). And
| the performance wins seem fairly consistent.
| lopuhin wrote:
| Similar or greater inference wins are achieved with speculative
| decoding which is already widely used, so while this is really
| interesting (and was tried before with less success AFAIK),
| it's not yet clear how impactful it would be.
| WhitneyLand wrote:
| I don't see where similar wins have ever been achieved.
|
| Speculative decoding can reduce latency, but at the cost of
| using a lot more compute. The amazing thing here is that latency
| _and_ global throughput improvements would be realized
| because of the increase in efficiency.
|
| From what I understand speculative decoding can also come
| with more challenges insofar as trying to maintain overall
| output quality.
| snyhlxde wrote:
| Thanks for your interest in our work! Yes, we found that
| training with consistency loss + AR loss on even a subset of a
| dataset results in a significant speedup (~0.01% of the
| pre-training cost). Training on more data permits even further
| speedup: the model is able to learn from more
| frequently-appearing collocations and phrases.
|
| For more details, please check out our paper, where you can
| also see that the speedup saturates as the size of the training
| data grows.
| WhitneyLand wrote:
| Yes, seems like a huge important result for LLM performance.
|
| I'm not aware of any other paper that has offered to increase
| LLM inference performance to this degree. Has there ever been
| one before?
|
| At least while also:
|
| - Maintaining output quality. The benchmarks used were somewhat
| narrow but so far so good.
|
| - Improving not just query latency but also global throughput
|
| - Not requiring more compute
|
| - Having a relatively practical implementation and not adding
| big challenges and complexity
|
| You could argue the insight is incremental, as it builds on
| what's been done with parallel/jacobi decoding. Those previous
| results were necessary and important, but this may be the one
| that finally extracts real world value from the promise of
| parallel decoding.
| paulclark wrote:
| Is this how Groq (https://groq.com/) is so fast, or are they
| doing something different?
| buildbot wrote:
| Groq is serving an LLM from (100s of chips' worth of) SRAM, so
| the effective bandwidth, and thus token generation speed, is an
| order of magnitude higher than with HBM. This would 3.5x their
| speed as well; it is orthogonal.
| gdiamos wrote:
| I'm surprised no one has done this for a GPU cluster yet - we
| used to do this for RNNs on GPUs & FPGAs at Baidu:
|
| https://proceedings.mlr.press/v48/diamos16.pdf
|
| Or better yet - on Cerebras
|
| Kudos to groq for writing that kernel
| wrsh07 wrote:
| My understanding is that theirs is a pure hardware solution.
| The hardware is flexible enough to model any current NN
| architecture.
|
| (Incidentally, there are black-box optimization algorithms, so
| a system as good as Groq at inference might be useful for
| training even if it can't support gradient descent)
| throwawaymaths wrote:
| According to someone I talked to at a Groq event I was invited
| to (I did not sign an NDA), they are putting ~8 racks of
| hardware per LLM. Of course, coordinating those racks to have
| exact timings between them to pull tokens through is definitely
| "part of the hard part".
| miven wrote:
| The authors mention that Jacobi decoding is equivalent to greedy
| autoregressive decoding, but in practice don't we often want the
| sampling temperature to be above zero to avoid repetitions and
| excessively generic responses?
|
| I'm completely unfamiliar with this decoding strategy so maybe
| I'm just missing a simple way to account for that.
| matheist wrote:
| Agreed. It's straightforward to check that a token was the
| argmax, but it seems difficult to check that a token appeared
| with the probability you wanted it to. You could still do the
| fine-tuning step I guess, where you train the trajectories to
| approach n-token completions with the statistics you want, but
| I can't see how you can replace the "check for a fixed point"
| step. Maybe "check the result was above this fixed threshold
| for likelihood".
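|
| For concreteness, a minimal sketch of the greedy Jacobi
| fixed-point loop being discussed (illustrative only, not from
| the paper; model_logits_fn is a hypothetical callable that
| returns per-position logits for the whole draft in one causal
| forward pass):
|
|     import numpy as np
|
|     def jacobi_decode_greedy(model_logits_fn, prompt_ids,
|                              n_tokens, max_iters=50):
|         # start from an arbitrary guess for the next n tokens
|         draft = np.zeros(n_tokens, dtype=np.int64)
|         for _ in range(max_iters):
|             # one parallel pass scores all draft positions;
|             # position i is conditioned on prompt + draft[:i]
|             logits = model_logits_fn(prompt_ids, draft)
|             new_draft = logits.argmax(axis=-1)  # greedy pick
|             if np.array_equal(new_draft, draft):
|                 break  # fixed point: same as greedy AR output
|             draft = new_draft
|         return draft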
| snyhlxde wrote:
| Yes, this is a great question! We are actively working on
| supporting sampling strategies other than greedy sampling. In
| the context of CLLM training, instead of mapping to a static
| fixed point obtained from Jacobi decoding as the training
| objective, we map to what we term a dynamic fixed point. You
| can keep an eye on our GitHub repo for new progress.
| doctor_eval wrote:
| > Our research shows this process - mimicking human cognitive
| process of forming complete sentences in mind before articulating
| word by word
|
| This is not how I work. Is there something wrong with me?
| jerbear4328 wrote:
| Nor is it how I work, I think that's normal enough. I do have
| an idea of what I'm going to say before I say it, I think
| that's closer to what they meant. I think and speak in
| increments of ideas, not words.
| paulmd wrote:
| > I think and speak in increments of ideas
|
| extremely common among (but not unique to) people with ASD,
| those "increments of ideas" are called "gestalts".
|
| https://kidtherapy.org/helpful-articles/what-is-gestalt-
| lang...
| Filligree wrote:
| You might not have an internal monologue. A lot of us don't,
| and the ones that do are equally shocked every time they find
| out. For what it's worth, I'm in the same boat -- I _can_ form
| sentences, but why would I? It'd slow me down.
|
| People who don't have inner monologues tend to assume that all
| that stuff is some form of analogy or metaphor. It's not. It's
| entirely literal.
| oceanplexian wrote:
| Do you mean in a real time conversation?
|
| Because I definitely don't "have an internal monologue about
| what I'm going to say" in the 100ms between when someone asks
| a casual question and I respond to it.
| int_19h wrote:
| Yes, it is possible to maintain an internal monologue in
| real time conversation. That is one of the reasons why some
| people usually take longer than 100ms to respond.
| DrSiemer wrote:
| They probably do not mean people form entire sentences before
| expressing them, I am not aware of anybody doing that. I assume
| it refers to people first coming up with a global outline of
| what they want to say before they start speaking.
| mdp2021 wrote:
| "Rem tene, verba sequentur" (you hold the matter, then words
| come) is largely "how it works".
|
| You form logical ideas as you speak; as you speak, your speech
| develops, so the translation is from ideas to sentences. It is
| not clear in which phase one would mentally form a complete
| sentence, nor why it should be relevant. You "see something
| [that makes sense]", then you describe it - iteratively.
| giardini wrote:
| Probably.
| snyhlxde wrote:
| In some conversations, maybe it's easier to form complete
| sentences. In others, the best we can do is have a rough draft
| in mind of what to say and then refine it word by word while
| speaking.
| throwawaymaths wrote:
| Are you sure? It might not be the whole sentence, but I would
| find it hard to believe that in practice the way you speak or
| write is like
|
| hello <think> May <think> be <think> I'll <think> go <think>
| get <think> break <think> fast
| causal wrote:
| You are probably pretty far from the LLM extreme, though, of
| thinking one token at a time.
| rcarmo wrote:
| Can't wait to see something like this merged into ollama (I'm
| sure there would be plenty of people fine-tuning models for it).
| Me1000 wrote:
| Ollama doesn't have their own inference engine, they just wrap
| llama.cpp. But yes, it will be awesome when it's more generally
| available.
| helloericsf wrote:
| The lab is tied to the vLLM project. I would say it might get
| picked up sooner by vLLM than other inference frameworks.
| dvt wrote:
| There's no free lunch(tm), so from what I can tell there's some
| pathway loss here. E.g. some Jacobi trajectories definitionally
| exclude higher temperature paths. Which might actually be a
| positive given data retrieval (but a negative if we want to
| maximize for creativity?).
| wrsh07 wrote:
| There are better and worse algorithms. I'm not sure "there is
| no free lunch" always applies in a particularly meaningful way.
| Some things aren't on the pareto frontier.
| factormeta wrote:
| Kinda like the AIFF -> MP3 conversion process. A lot of data
| is lost, but can we humans really tell much of a difference?
| wrsh07 wrote:
| There's no reason to think the current next token
| prediction models are optimal for predicting sentences
| (they aren't!)
|
| > An algorithm may outperform another on a problem when
| neither is specialized to the problem
|
| https://en.m.wikipedia.org/wiki/No_free_lunch_in_search_and
| _...
| stkdump wrote:
| I would go even further and say there isn't any
| indication that we are even close to what is possible. My
| subjective feeling is that with the current rate of
| progress it is entirely possible that we will have GPT-4
| level performance locally on smartphone hardware within
| 3-10 years (unless companies decide again that they don't
| want to give this kind of power away)
| naasking wrote:
| Probably. Advancements in ML algorithms, like this one,
| have been outpacing advancements in hardware for a while
| now, so both are converging on making ML faster and
| ubiquitous.
| nico wrote:
| Interesting
|
| I think soon we are going to realize that we don't really need
| to train the models
|
| We just need good indexing and sampling
|
| Essentially at some level any LLM is equivalent to a DB of the
| dataset, with a great NLP interface on top
|
| Both are just different methods of navigating stored data
| nsagent wrote:
| You might like the Infinigram paper, then. It was discussed
| recently:
|
| https://news.ycombinator.com/item?id=40266791
| sdrg822 wrote:
| But indexing *is* training. It's just not using end-to-end
| gradient descent.
| tempusalaria wrote:
| LLMs can easily produce data not in training dataset.
|
| LLMs do not navigate stored data. An LLM is not a DB of the
| training data.
| carlthome wrote:
| I've had the same thought as above, but unfounded (just a
| feeling, pretty much), so I'm curious to learn more. Do you
| have any references I can check out that support these
| claims?
| int_19h wrote:
| Come up with a novel puzzle that is guaranteed to not be in
| the training set, and ask GPT-4 to solve it.
| PeterisP wrote:
| The models are multiple orders of magnitude smaller than the
| compressed versions of their training data; they cannot be the
| equivalent of a DB of it.
| lainga wrote:
| The training data is ideo-semantically compressed? News to
| me... is it perhaps stored in kanji?
| DoctorOetker wrote:
| This mirrors what I experienced when I enrolled in "free drawing"
| (no teaching) classes:
|
| While people considered me a good drawer since I was a child, I
| remember just repeating either similar detailed drawings I drew
| before, or otherwise just taking plenty of time to draw. I
| believe anyone with time and patience can make a nice drawing of
| a scene.
|
| The "free drawing" class had no rules or lectures: you brought
| the materials you wanted to work with (some brought ink, others
| pencils, while I brought charcoal). The only thing determined was
| the timing between poses for the model: for each session the
| first few poses were very short (say a minute), and then the pose
| durations would progressively lengthen until say 5 minute poses.
| At all times you were free to tear your picture up and retry
| drawing the pose again.
|
| My drawing skills improved considerably. The short "warmups"
| actually force you to get proportions and outlines correct on the
| first tries. Conventional wisdom says haste makes waste, but when
| learning or refining skills, it seems natural selection has
| hardcoded the sensation of haste as a stressor prompting
| attention and learning.
|
| I am convinced I could have drawn similar quality drawings before
| enrolling in those classes, except they would have taken me
| easily 5 or 10x as long to draw. Being forced not to beat around
| the bush, and feeling the penalty of making a hasty mistake
| (which further decreases the time left for a second try), does
| seem to work.
|
| My only gripe is that the technique is termed "Consistency"
| whereas I would reserve such a term for an improvement in
| _performance_ not inference speed, although I understand that
| they indicate "consistency with what would ultimately have been
| generated one token at a time". I would rather dub it
| "Proficiency LLM", where the same output is expected, only
| without the inhibition of stuttering to the same conclusion.
| manmal wrote:
| Systems generally become more efficient when under stress. They
| are also forced into local optima - everything has upsides and
| downsides.
| sheepscreek wrote:
| Interestingly - this is the idea behind Nassim Taleb's book
| "Antifragile" and the concept of "anti-fragility".
|
| In essence, it promotes dynamic/evolutionary/always-learning
| behaviour rather than performing the same set of steps every
| time, and in the process, becoming stronger than before.
|
| An example he shares is how the breakdown of muscle tissue
| through exercise leads to more muscle development and an
| increase in strength. I guess it's similar to LLM training
| using error/loss-reducing functions (practice makes perfect),
| but dissimilar in the sense that training is a one-time
| action.
| TeMPOraL wrote:
| > _They are also forced into local optima_
|
| The good ol', "under pressure, you don't rise to the
| occasion, but sink to the level of your training"?
| snyhlxde wrote:
| Hi, we are the CLLM authors; thanks for sharing your experience
| and insights! I can see how this drawing-skill-refining process
| echoes the training process in CLLM; the only thing is that, at
| this point, the stressor in CLLM training does not get
| progressively more demanding.
|
| For example, while drawing, you can set a very specific time
| limit on how long you are allowed to draw in each trial and
| make the time progressively shorter. In CLLM, maybe we can make
| the learning process more and more difficult by mapping more
| and more distant states in the Jacobi trajectory to its final
| state.
|
| We are using the term "consistency" because we draw a parallel
| between consistency LLMs and consistency models in diffusion
| image generation, where the training processes are analogous.
| Quarrel wrote:
| Is it just me, or does this read like it was written by an
| LLM ... ?!
| snyhlxde wrote:
| lol I take that as a compliment. Good try but sadly no LLM
| in this writing :)
| jasonjmcghee wrote:
| It's just much more formal than people generally speak on
| HN.
| boroboro4 wrote:
| Do you use the same dataset to train / eval the model? Was the
| model, for example, trained on the GSM8K dataset?
| snyhlxde wrote:
| Yes, we consider both domain-specific applications (Spider
| for text2SQL, GSM8K for math, CodeSearchNet for Python) and
| open-domain conversational applications (ShareGPT). We use the
| test set from each application to evaluate CLLMs' performance
| in our paper.
|
| On the other hand, technically CLLM works on any kind of
| query, but the speedup might vary. Feel free to try out our
| codebase for your use cases!
| aamargulies wrote:
| I had an interesting experience in an Invertebrate Zoology lab
| class one summer.
|
| We students were brought into a lab, given specimens to draw,
| and the only instructions we received were 'You have 30 minutes
| to draw this. Go.'
|
| There was no "here's how to draw. here's what to do and not to
| do". It was just basically "We don't care about any
| insecurities you might have. We don't care if you think you
| can't draw. No excuses, just fucking draw it. Now."
|
| Not only did we draw, but we (all of us) improved enormously
| over the course of the class as more animals were brought in
| and the exercise was repeated over and over and over again
| throughout the summer.
|
| What it taught us is that everyone, and I mean _everyone_ , can
| draw. Our collective attitude shifted from "don't know if this
| is even possible" to "of course we can do this. this is easy.
| routine. trivial."
|
| Highly recommended approach.
|
| It was the most freeing and amazing class I had in college.
| Version467 wrote:
| That sounds like a pretty awesome experience. Thanks for
| sharing.
| ec109685 wrote:
| Could someone please explain the intuition around this technique
| in more layman's terms?
| TomatoCo wrote:
| For all of these "how can we batch-predict the next n tokens?"
| approaches, the intuition is basically that it takes a buttload
| of math to predict _some_ of the tokens, but that most tokens are
| actually easy to guess. For example, if I asked "What was that
| phone number from that 80's song?" as soon as a model generates
| 867- it shouldn't take that much math at all to finish
| predicting 5309.
| snyhlxde wrote:
| A bit more intuition on how training works: in natural
| language processing, some phrases/collocations, for example
| "remind ... of ...", "make a decision", "learn a skill", etc.,
| are used together. We can ask LLMs to learn such collocations
| & frequently appearing n-grams. After learning, the model can
| use parallel decoding to predict, in one forward pass, many
| tokens that frequently appear together.
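|
| A very simplified, unofficial sketch of what such a
| consistency objective could look like for a Hugging Face-style
| causal LM (the actual recipe in the paper also includes an AR
| loss and further details):
|
|     import torch
|     import torch.nn.functional as F
|
|     def consistency_loss(model, prompt_ids, jacobi_state,
|                          fixed_point):
|         # jacobi_state: an intermediate (partly wrong) draft
|         #               taken from a Jacobi trajectory
|         # fixed_point:  the converged greedy output for the
|         #               same prompt
|         # Teach the model to jump straight to the fixed point
|         # from the noisy draft, all positions in one pass.
|         inputs = torch.cat([prompt_ids, jacobi_state], dim=-1)
|         logits = model(inputs).logits
|         preds = logits[:, prompt_ids.size(-1) - 1:-1, :]
|         return F.cross_entropy(
|             preds.reshape(-1, preds.size(-1)),
|             fixed_point.reshape(-1))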
| programjames wrote:
| "Try to fix all the words in a sentence at once. Keep iterating
| until you don't think it needs fixing."
| m3kw9 wrote:
| They could quickly try it with one of the open-source models,
| then show a side-by-side demo.
| JKCalhoun wrote:
| Anyone know somewhere someone dumb like me can "Ask an AI
| expert"?
|
| I want to ask, for example, how is it that an LLM when given the
| same prompt does not respond in the same deterministic way?
|
| I guess I want to learn this stuff and should maybe follow one of
| those "write an LLM in an hour" type videos on YouTube.
| 8note wrote:
| For that answer, you can refer to the 3blue1brown videos
|
| The LLM outputs a vector of probabilities over tokens, and the
| sampling code picks a token from the most-likely list using a
| random number.
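|
| A tiny sketch of that sampling step, assuming the model has
| already produced a vector of raw logits:
|
|     import numpy as np
|
|     def sample_next_token(logits, rng):
|         # softmax turns scores into probabilities; the random
|         # draw is where two runs with the same prompt diverge
|         probs = np.exp(logits - logits.max())
|         probs /= probs.sum()
|         return rng.choice(len(probs), p=probs)
|
|     rng = np.random.default_rng()
|     print(sample_next_token(np.array([2.0, 1.0, 0.1]), rng))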
| zozbot234 wrote:
| > I want to ask, for example, how is it that an LLM when given
| the same prompt does not respond in the same deterministic way?
|
| You can control that in most systems with an inference-time
| parameter called "temperature". But setting the temperature as
| low as possible tends to lead to very low-quality answers - the
| system can't crawl out of some local optimum and ends up
| repeating itself over and over. Such answers may be
| "deterministic" but they're also not good.
| rahimnathwani wrote:
| For this particular question, ask chatgpt how temperature
| affects llm softmax sampling.
|
| For other things, study using Karpathy's videos.
| zipfcharge wrote:
| It's because an LLM is essentially a probability matrix. You
| type a prompt, then it calculates the probability of each next
| word, and so on, eventually forming a sentence. The
| probabilities learned are based on the training data.
|
| Because of the underlying probability model, it's not going to
| be 100% deterministic. Plus, a model like ChatGPT purposely has
| a "temperature" parameter that further adds randomisation to
| the whole process.
|
| My answer is based on this paper if you're interested to read
| more: The Matrix: A Bayesian learning model for LLMs,
| https://arxiv.org/abs/2402.03175
| flopriore wrote:
| Are there any ways to show the source of the information
| retrieved by the model? For instance, the LLM forms a
| sentence and it points to a stackoverflow answer with the
| same or similar content.
| JKCalhoun wrote:
| As I understand it, I'm pretty sure that is impossible. If it
| were trained on a single datum, sure, trivial. As soon as it is
| fed a second one, though, the weights are already a kind of
| blend of the two (so to speak).
| spmurrayzzz wrote:
| It's not impossible, but it's definitely difficult. There
| is some overlap with the methods used to detect benchmark
| data contamination, though it's not entirely the same
| thing. For the detection use case, you already know the
| text you're looking for and you are just trying to
| demonstrate that the model has "seen" the data in its
| training set. The challenge is proving that it is
| statistically improbable that the model could
| stochastically generate the same tokens without having
| seen them during training.
|
| Some great research exists in this area [1] and I expect
| much of it may be repurposed for black box attribution in
| the future (in addition to all the work being done in the
| mechanistic interpretability field)
|
| [1] https://arxiv.org/abs/2311.04850
| throwawaymaths wrote:
| > how is it that an LLM when given the same prompt does not
| respond in the same deterministic way?
|
| In software (not in the model) there's literally a random number
| generator that picks from a weighted set of "next-token"
| choices that the model spits out. The selection process can
| have a series of knobs to manipulate the responses. If you want
| it to be deterministic (and you have direct access to the
| software), you can set "top-k = 1" or "temperature = 0.0"
| (depending on your software).
|
| Usually the default settings are not for determinism, because
| for whatever reason the quality of the results tends to not be
| that good when you go fully deterministic.
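|
| A rough sketch of how those knobs plug into the sampling step
| (names and exact behaviour vary between inference stacks;
| temperature 0 is treated as plain argmax here to avoid a
| divide-by-zero):
|
|     import numpy as np
|
|     def pick_token(logits, temperature=1.0, top_k=None,
|                    rng=None):
|         if temperature == 0.0 or top_k == 1:
|             return int(np.argmax(logits))  # deterministic
|         scaled = logits / temperature      # sharpen / flatten
|         if top_k is not None:
|             cutoff = np.sort(scaled)[-top_k]
|             scaled = np.where(scaled < cutoff, -np.inf, scaled)
|         probs = np.exp(scaled - scaled.max())
|         probs /= probs.sum()
|         rng = rng or np.random.default_rng()
|         return int(rng.choice(len(probs), p=probs))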
| int_19h wrote:
| I found this to be a good start that explains things fairly
| methodically, but without losing the high-level perspective.
|
| https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
| snyhlxde wrote:
| from CLLM authors:
|
| Thank you guys for the great questions and insights! We have
| made a Twitter post with some more details, and we invite you
| to engage with us on Twitter as well.
|
| https://twitter.com/haoailab/status/1788269848788869299
| renonce wrote:
| > ... speculative decoding methods ... incurs extra memory cost
| during inference time.
|
| Any detail on this? For speculative decoding you need a smaller
| model to generate "branches", which are fast but maybe
| inaccurate, and these branches are verified later with the
| larger model. However, only memory equivalent to a single token
| is needed for speculative decoding, and tokens in other
| branches are simply masked out during inference. With a context
| size of 1000 and ~30 branches for 5 tokens, the memory overhead
| would be ~3%, which is negligible. If your context size is much
| smaller compared to the number of branches - would someone who
| uses a generative LLM with a context window of just 50 tokens
| care about generation speed?
|
| Also, speculative decoding techniques are not restricted to
| greedy sampling - they're expected to behave exactly the same
| as the original model and sample with the expected
| probabilities. Most literature on speculative decoding already
| reports a 2.6x-3.5x speedup. The blog post here reports
| 2.4x-3.4x generation speed - which isn't that much of an
| upgrade?
|
| While I mentioned speculative decoding above, and Medusa2 and
| Eagle seem to be the techniques that the author compares
| against, the core problem remains: whatever method you use to
| predict tokens ahead of time, there is a specific point where the
| previous tokens are absolutely needed before predicting the next
| token. It doesn't depend on what your model is or what your
| techniques are, it's just about what is mathematically
| achievable. How can you predict 5 tokens at once if the
| probability distribution of the 5th next token depends heavily on
| the previous 4 tokens? Speculative decoding, Jacobi decoding,
| multi-token parallel decoding, whatever.
|
| If only greedy sampling is supported for this, then I wonder what
| the advantages of this method are, not to mention that other
| techniques already achieve the expected speedup. Comparing greedy
| sampling speedups to random sampling speedups is comparing apples
| to oranges, and I doubt if the speedup described by the method
| would remain after this method is adapted to random sampling (due
| to the core problem mentioned above).
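|
| For contrast, a toy greedy draft-and-verify loop in the spirit
| of speculative decoding (draft_next and target_argmax are
| hypothetical callables; real implementations use an acceptance
| rule that preserves the target model's sampling distribution
| rather than exact matching):
|
|     def speculative_step(target_argmax, draft_next, context,
|                          k=5):
|         # 1. the small model cheaply proposes k tokens
|         proposed, ctx = [], list(context)
|         for _ in range(k):
|             t = draft_next(ctx)
|             proposed.append(t)
|             ctx.append(t)
|         # 2. the big model scores all k positions in a single
|         #    forward pass
|         verified = target_argmax(context, proposed)
|         # 3. keep the agreeing prefix, plus one corrected token
|         accepted = []
|         for guess, truth in zip(proposed, verified):
|             accepted.append(truth)
|             if guess != truth:
|                 break
|         return list(context) + accepted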
| Palmik wrote:
| Speculative decoding requires you to load the smaller model
| into memory and run inference on it.
| renonce wrote:
| I think the smaller model is at least 20 times smaller. If
| you do speculative decoding on a 70B model, a 1B model would
| be appropriate.
| cxczz wrote:
| > the previous tokens are absolutely needed before predicting
| the next token
|
| Maybe this is the key contribution of this paper: demonstrating
| that, through consistency training, LLMs can predict the next n
| tokens even if there are incorrect guesses among the previous
| tokens?
|
| On the other hand, while mathematically it is true that
| p(x_t | x_1, ..., x_{t-1}) depends on all of x_1 to x_{t-1}, in
| practice it is possible that predicting x_t only requires x_1
| to x_{t-2}, and the attention to x_{t-1} is minimal. Thus,
| predicting x_t from x_1 to x_{t-2} and an inaccurate x_{t-1} is
| possible.
| wangii wrote:
| I feel it's a pretty dangerous optimization before we REALLY
| understand what's going on inside of the LLM. E.g. those who
| believe in the geometric interpretation will have something to
| say, and it would probably hurt if you are using "filler"
| tokens.
|
| Besides, the assumption (not a universal fact) of "forming
| complete sentences in mind before articulating word by word"
| seems to overly simplify the activities happening in our minds:
| do we really have complete planning before we start
| talking/typing? As a Buddhist, I lean towards it being an
| illusion. Furthermore, what about simultaneous thoughts? Are we
| linear thinkers at the sentence level?
|
| anyway, pretty neat math!
| Etheryte wrote:
| That assumption might be useful in this context, but I think
| it's pretty clearly not true. Ask anyone to tell you about a
| complex past event with a lot of parallel branches and you'll
| quickly see them add bits, pieces and tangents midsentence to
| cover the full range of events. I don't think I've seen the
| sentence granularity hypothesis in any serious scientific
| context before.
| renonce wrote:
| The optimization does not affect the result of the LLM; it's
| guaranteed to produce results equivalent to decoding directly.
| Let's not treat the LLM as some magic that resembles our mind;
| it's just another program that produces sentences that happen
| to make sense.
| sigmoid10 wrote:
| Let's not treat our mind as something magical. It's just
| another program that learned to speak by consuming lots of
| training input. The implementation might look slightly
| different from the outside, but from a mathematical
| perspective, artificial neural networks are proven to be at
| least as capable as the human mind.
| baq wrote:
| The best part is, your comment works both when sarcastic
| and completely serious.
| wangii wrote:
| According to the original Jacobi decoding paper, it's set in
| machine translation tasks, with an encoder + decoder, in which
| the parallel algorithm is applied only to the decoder part.
| naasking wrote:
| > Let's not treat the LLM as some magic that resembles our
| mind; it's just another program that produces sentences that
| happen to make sense.
|
| "That happen to make sense" is hiding a lot of magic. It
| would be statistically impossible to make as much sense as
| LLMs do in response to prompts if it did not actually make
| semantic distinctions. If it makes semantic distinctions,
| then it does resemble the human mind in at least one way.
| causal wrote:
| What is the geometric interpretation?
| hatthew wrote:
| Can't speak for everyone but I definitely don't mentally form
| complete sentences before talking. Sometimes I grammatically
| talk myself into a corner in the middle of a sentence and need
| to use some awkward words/phrases to finish my thought, or
| simply pause and restart the phrase from the beginning.
| int_19h wrote:
| We don't appear to be forming words sequentially from
| underlying parts, even though in many languages they are broken
| down into smaller units that carry semantic meaning themselves.
| There doesn't seem to be any clear reason for this to break
| down suddenly at the sentence level.
| programjames wrote:
| > Surprisingly, we find such an objective is analogous to that of
| consistency models
|
| This is why numerical methods should be part of the ML
| curriculum.
___________________________________________________________________
(page generated 2024-05-09 23:01 UTC)