[HN Gopher] Entropy of a Large Language Model output
___________________________________________________________________
Entropy of a Large Language Model output
Author : woodglyst
Score : 115 points
Date : 2025-01-09 20:00 UTC (4 days ago)
(HTM) web link (nikkin.dev)
(TXT) w3m dump (nikkin.dev)
| WhitneyLand wrote:
| > the output token of the LLM (black box) is not deterministic.
| Rather, it is a probability distribution over all the available
| tokens
|
| How is this not deterministic? Randomness is intentionally added
| via temperature.
| cjtrowbridge wrote:
| Entropy is also added via a random seed. The model is only
| deterministic if you use the same random seed.
| HarHarVeryFunny wrote:
| I think you're confusing training and inference. During
| training there are things like initialization, data shuffling
| and dropout that depend on random numbers. At inference time
| these don't apply.
| jampekka wrote:
| Decoding (sampling) uses (pseudo) random numbers. Otherwise the
| same prompt would always give the same response.
|
| Computing entropy generally does not.
|
| See e.g. https://huggingface.co/blog/how-to-generate
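|
| For example, the entropy of the next-token distribution is a
| pure function of the logits (a rough sketch, assuming a Hugging
| Face causal LM; "gpt2" and the prompt are just placeholders):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")      # example model
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|     inputs = tok("The capital of France is", return_tensors="pt")
|     with torch.no_grad():
|         logits = model(**inputs).logits[0, -1]       # last position
|     probs = torch.softmax(logits, dim=-1)
|     # Shannon entropy: no random numbers involved
|     entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
|     print(f"next-token entropy: {entropy.item():.3f} nats")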
| HarHarVeryFunny wrote:
| Sure - but that's not the output of the model itself,
| that's the process of (typically) randomly sampling from
| the output of the model.
| throwaway314155 wrote:
| Right, sampling from a model, also known as *inference*
| (for LLMs).
|
| The inference here is perhaps less pure than what you
| refer to but you're talking to human beings; there's no
| need for heavy pedantry.
| hansvm wrote:
| The output "token"
|
| Yes, you can sample deterministically, but that's some
| combination of computationally intractable and only useful on a
| small subset of problems. The black box outputting a non-
| deterministic token is a close enough approximation for most
| people.
| HarHarVeryFunny wrote:
| The author of the article seems confused, saying:
|
| "The important thing to remember is that the output token of
| the LLM (black box) is not deterministic. Rather, it is a
| probability distribution over all the available tokens in the
| vocabulary."
|
| He is saying that there is non-determinism in the output of
| the LLM (i.e. in these probability distributions), when in
| fact the randomness only comes from choosing to use a random
| number generator to sample from this output.
| fancyfredbot wrote:
| The author is saying that the output _token_ is not
| deterministic. I don't think they said the distribution
| was stochastic.
|
| Even so the distribution of the second token output by the
| model would be stochastic (unless you condition on the
| first token). So in that sense there may also be a
| stochastic probability distribution.
| apstroll wrote:
| The output distribution is deterministic, the output token is
| sampled from the output distribution, and is therefore not
| deterministic. Temperature modulates the output distribution,
| but setting it to 0 (i.e. argmax sampling) is not the norm.
| Der_Einzige wrote:
| Running temperature of zero/greedy sampling (what you call
| "argmax sampling") is EXTREMELY common.
|
| LLMs are basically "deterministic" when using greedy sampling
| except for either MoE related shenanigans (what historically
| prevented determinism in ChatGPT) or due to floating point
| related issues (GPU related). In practice, LLMs are in fact
| basically "deterministic" except for the sampling/temperature
| stuff that we add at the very end.
| HarHarVeryFunny wrote:
| > except for either MoE related shenanigans (what
| historically prevented determinism in ChatGPT)
|
| The original ChatGPT was based on GPT-3.5, which did not
| use MoE.
| alew1 wrote:
| "Temperature" doesn't make sense unless your model is
| predicting a distribution. You can't "temperature sample" a
| calculator, for instance. The output of the LLM is a predictive
| distribution over the next token; this is the formulation you
| will see in every paper on LLMs. It's true that you can do
| various things with that distribution _other_ than sampling it:
| you can compute its entropy, you can find its mode (argmax),
| etc., but the type signature of the LLM itself is `prompt ->
| probability distribution over next tokens`.
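|
| A minimal numpy sketch of that split - the deterministic
| distribution versus the optional random draw from it (the
| function name and values are made up for illustration):
|
|     import numpy as np
|
|     def sample_next_token(logits, temperature=1.0, rng=None):
|         rng = rng or np.random.default_rng()
|         if temperature == 0.0:               # greedy: argmax, no randomness
|             return int(np.argmax(logits))
|         z = logits / temperature
|         z = z - z.max()                      # numerical stability
|         probs = np.exp(z) / np.exp(z).sum()  # softmax -> distribution
|         return int(rng.choice(len(probs), p=probs))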
| wyager wrote:
| The temperature in LLMs is a parameter of the normalization
| (softmax) step that determines how the final-layer activations
| (logits) get mapped to probabilities.
|
| Zero temperature => fully deterministic
|
| The neuron activation levels do not inherently form or
| represent a probability distribution. That's something we've
| slapped on after the fact.
| alew1 wrote:
| Any interpretation (including interpreting the _inputs_ to
| the neural net as a "prompt") is "slapped on" in some
| sense--at some level, it's all just numbers being added,
| multiplied, and so on.
|
| But I wouldn't call the probabilistic interpretation "after
| the fact." The entire training procedure that generated the
| LM weights (the pre-training as well as the RLHF post-
| training) is formulated based on the understanding that the
| LM predicts p(x_t | x_1, ..., x_{t-1}). For example,
| pretraining maximizes the log probability of the training
| data, and RLHF typically maximizes an objective that
| combines "expected reward [under the LLM's output
| probability distribution]" with "KL divergence between the
| pretraining distribution and the RLHF'd distribution" (a
| probabilistic quantity).
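|
| Concretely, the per-sample objective looks something like this
| (a hand-wavy Monte Carlo sketch; beta, the reward value and the
| function name are illustrative, not any particular paper's
| implementation):
|
|     import numpy as np
|
|     def rlhf_objective(logp_policy, logp_ref, reward, beta=0.1):
|         # logp_policy / logp_ref: per-token log-probs of one sampled
|         # response under the tuned model and the reference model.
|         kl_estimate = np.sum(np.asarray(logp_policy)
|                              - np.asarray(logp_ref))
|         # maximize reward while staying close to the pretrained model
|         return reward - beta * kl_estimate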
| TeMPOraL wrote:
| There's extra randomness added accidentally in practice:
| inference is a massively parallelized set of matrix
| multiplications, and floating point math is not associative -
| the randomness in execution order gets converted into a random
| FP error, so even setting temperature to 0 doesn't guarantee
| repeatable results.
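|
| The classic two-line demonstration (toy values, but the same
| effect shows up in large parallel reductions):
|
|     a, b, c = 1e16, -1e16, 1.0
|     print((a + b) + c)   # 1.0
|     print(a + (b + c))   # 0.0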
| HeatrayEnjoyer wrote:
| Only if the inference software doesn't guarantee a deterministic
| execution order under concurrency, which is CS 101
| nikkindev wrote:
| Author here: Yes, you are right. I meant to paint a picture in
| which the next token, instead of appearing magically, is sampled
| from a probability distribution. The notion of determinism could
| have been explained differently. Thanks for pointing it out!
| netruk44 wrote:
| I wonder if we could combine 'thinking' models (which write
| thoughts out before replying) with a mechanism they can use to
| check their own entropy as they're writing output.
|
| Maybe it could eventually learn when it needs to have a low
| entropy token (to produce a more-likely-to-be-factual statement)
| and then we can finally have models that actually know when to
| say "Sorry, I don't seem to have a good answer for you."
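|
| A rough sketch of what that check could look like during greedy
| decoding (HF-style API; the threshold and function name are made
| up, and a real "entropy-aware" scheme would need something
| smarter):
|
|     import torch
|
|     def generate_with_entropy_flags(model, tok, prompt,
|                                     max_new_tokens=50, threshold=3.0):
|         ids = tok(prompt, return_tensors="pt").input_ids
|         flags = []
|         for _ in range(max_new_tokens):
|             with torch.no_grad():
|                 logits = model(ids).logits[0, -1]
|             probs = torch.softmax(logits, dim=-1)
|             ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
|             flags.append(ent.item() > threshold)  # "model is unsure here"
|             next_id = torch.argmax(logits).view(1, 1)
|             ids = torch.cat([ids, next_id], dim=-1)
|         return tok.decode(ids[0]), flags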
| vletal wrote:
| https://github.com/xjdr-alt/entropix
| Der_Einzige wrote:
| Entropix will get its time in the sun, but for now, the LLM
| academic community is still 2 years behind the open source
| community. Min_p sampling is going to end up getting an oral
| presentation at ICLR with the scores it's getting...
|
| https://openreview.net/forum?id=FBkpCyujtS
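|
| For reference, min_p as I understand it is just a relative
| cutoff applied before sampling (sketch; the 0.1 value is only an
| example):
|
|     import numpy as np
|
|     def min_p_filter(probs, min_p=0.1):
|         # keep tokens within a factor of the most likely token
|         keep = probs >= min_p * probs.max()
|         filtered = np.where(keep, probs, 0.0)
|         return filtered / filtered.sum()  # renormalize, then sample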
| diggan wrote:
| > the LLM academic community is still 2 years behind the
| open source community
|
| Huh, isn't it the other way around? Thanks to the academic
| (and open) research about LLMs, we have an open source
| community around LLMs in the first place.
| fedeb95 wrote:
| it seems very noise-like to me.
| gwern wrote:
| You are observing "flattened logits"
| https://arxiv.org/pdf/2303.08774#page=12&org=openai .
|
| The entropy of _Chat_GPT (as well as all other generative models
| which have been 'tuned' using RLHF, instruction-tuning, DPO, etc)
| is so low because it is _not_ predicting "most likely tokens" or
| doing compression. A LLM like ChatGPT has been turned into an RL
| agent which seeks to maximize reward by taking the optimal
| action. It is, ultimately, predicting what will manipulate the
| imaginary human rater into giving it a high reward.
|
| So the logits aren't telling you anything like 'what is the
| probability in a random sample of Internet text of the next
| token', but are closer to a Bellman value function, expressing
| the model's belief as to what would be the net reward from
| picking each possible BPE as an 'action' and then continuing to
| pick the optimal BPE after that (ie. following its policy until
| the episode terminates). Because there is usually 1 best action,
| it tries to put the largest value on that action, and assign very
| small values to the rest (no matter how plausible each of them
| might be if you were looking at random Internet text). This
| reduction in entropy is a standard RL effect as agents switch
| from exploration to exploitation: there is no benefit to taking
| anything less than the single best action, so you don't want to
| risk taking any others.
|
| This is also why completions are so boring and Boltzmann
| temperature stops mattering and more complex sampling strategies
| like best-of-N don't work so well: the greedy logit-maximizing
| removes information about interesting alternative strategies, so
| you wind up with massive redundancy and your net 'likelihood'
| also no longer tells you anything about the likelihood.
|
| And note that because there is now so much LLM text on the
| Internet, this feeds back into future LLMs too, which will have
| flattened logits simply because it is now quite likely that they
| are predicting outputs from LLMs which had flattened logits.
| (Plus, of course, data labelers like Scale can fail at quality
| control and their labelers cheat and just dump in ChatGPT answers
| to make money.) So you'll observe future 'base' models which have
| more flattened logits too...
|
| I've wondered if, to recover true base model capabilities and get
| logits that actually meaningfully predict or encode 'dark
| knowledge', rather than optimize for a lowest-common-denominator
| rater reward, you'll have to start dumping in random Internet
| text samples to get the model 'out of assistant mode'.
| cbzbc wrote:
| Sorry, which particular part of that paper are you linking to?
| The graph at the top of that page doesn't seem to relate to your
| comment.
| hexane360 wrote:
| Fig. 8, where the model becomes poorly calibrated in terms of
| text prediction (Answers are "flattened" so that many answers
| appear equally probable, but below the best answer)
| HarHarVeryFunny wrote:
| Which is why models like o1 & o3, using heavy RL to boost
| reasoning performance, may perform worse in other areas where
| the greater diversity of output is needed.
|
| Of course humans employ different thinking modes too - no harm
| in thinking like a stone cold programmer when you are
| programming, as long as you don't do it all the time.
| Vetch wrote:
| This seems wrong. Reasoning scales all the way up to the
| discovery of quaternions and general relativity, often
| requiring divergent thinking. Reasoning has a core aspect of
| maintaining uncertainty for better exploration and being able
| to tell when it's time to revisit the drawing board and start
| over from scratch. Being overconfident to the point of over-
| constraining possibility space will harm exploration, only
| working effectively for "reasoning" problems where the answer
| is already known or nearly fully known. A process which
| results in limited diversity will not cover the full range of
| problems to which reasoning can be applied. In other words,
| your statement is roughly equivalent to saying o3 cannot
| reason in domains involving innovative or untested
| approaches.
| larodi wrote:
| > Reasoning scales all the way up to the discovery of
| quaternions and general relativity, often
|
| That would be true only if everything we take as
| established/true/fact came through reasoning in a fully logical
| and awake state. But it did not, and if you dig a little or more
| you'd find a lot of actual dream revelation, divine inspiration
| and all sorts of subconscious revelation that governs lives and
| also science.
| nikkindev wrote:
| Author here: Thanks for the explanation. Intuitively it does
| make sense that anything done during "post-training" (RLHF in
| our case) to make the model adhere to certain (set of)
| characteristics would bring the entropy down.
|
| It is indeed alarming that future 'base' models would start
| with more flattened logits as the de facto baseline. I personally
| believe that once this enshittification is recognised widely
| (could already be the case, but not recognized) then the
| training data being more "original" will become more important.
| And the cycle repeats! Or I wonder if there is a better post-
| training method that would still preserve the "creativity"?
|
| Thanks for the RLHF explanation in terms of BPE. Definitely
| easier to grasp the concept this way!
| derefr wrote:
| > The entropy of ChatGPT (as well as all other generative
| models which have been 'tuned' using RLHF, instruction-tuning,
| DPO, etc) is so low because it is not predicting "most likely
| tokens" or doing compression. A LLM like ChatGPT has been
| turned into an RL agent which seeks to maximize reward by
| taking the optimal action. It is, ultimately, predicting what
| will manipulate the imaginary human rater into giving it a high
| reward.
|
| This isn't strictly true. It _is_ still predicting "most
| likely tokens"! It's just predicting the "most likely tokens"
| _generated in_ a specific step in a conversation game; where
| that step was, in the training dataset, _taken by_ an agent
| tuned to maximize reward. _For that conversation step_, the
| model is trying to predict what such an agent would say, as
| _that is what should come next in the conversation_.
|
| I know this sounds like semantics/splitting hairs, but it has
| real implications for what RLHF/instruction-following models
| will do when not bound to what one might call their
| "Environment of Evolutionary Adaptedness."
|
| If you _unshackle_ any instruction-following model from the
| logit bias pass that prevents it from generating end-of-
| conversation-step tokens/sequences, then it will almost always
| finish inferring the "AI agent says" conversation step, and
| move on to inferring the following "human says" conversation
| step. (Even older instruction-following models that were
| trained only on single-shot prompt/response pairs rather than
| multi-turn conversations, will still do this if they are
| allowed to proceed past the End-of-Sequence token, due to how
| training data is packed into the context in most training
| frameworks.)
|
| And when it does move onto predicting the "human says"
| conversation step, it won't be optimizing for reward (i.e. it
| won't be trying to come up with an ideal thing for the human to
| say to "set up" a perfect response to earn it maximum good-boy
| points); rather, it will _just_ be predicting what a human
| would say, just as its ancestor text-completion base-model
| would.
|
| (This would even happen with ChatGPT and other high-level chat-
| API agents. However, such chat-API agents are stuck talking to
| you through a business layer that expects to interact with the
| model through a certain trained-in ABI; so turning off the
| logit bias -- if that was a knob they let you turn -- would
| just cause the business layer to throw exceptions due to
| malformed JSON / state-machine sequence errors. If you could
| interact with those same models through lower-level text-
| completion APIs, you'd see this result.)
|
| For similar reasons, these instruction-following models always
| expect a "human says" step to come first in the conversation
| message stream; so you can also (again, through a text-
| completion API) just leave the "human says" conversation step
| open/unfinished, and the model will happily infer what "the
| rest" of the human's prompt should be, without any sign of
| instruction-following.
|
| In other words, the model still _knows_ how to be a fully-
| general, high-entropy(!) text-completion model. It just _also_
| knows how to play a specific word game of "ape the way an
| agent trained to do X responds to prompts" -- where playing
| that game involves rules that lower the entropy ceiling.
|
| This is exactly the same as how image models can be prompted to
| draw in the style of a specific artist. To an LLM, the RLHF
| agent it has been fed a training corpus of, is a specific
| artist it's learned to ape the style of, _when and only when_
| it thinks that such a style _should apply_ to some sub-sequence
| of the output.
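|
| You can see this for yourself through a plain text-completion
| interface (sketch with HF transformers; the model name is a
| placeholder and the "### Human/Assistant" markers are only an
| example - real instruct models use their own turn tokens):
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "some-instruct-model"          # placeholder
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     prompt = "### Human: Name a prime number.\n### Assistant:"
|     ids = tok(prompt, return_tensors="pt").input_ids
|     out = model.generate(ids, max_new_tokens=120, do_sample=True)
|     print(tok.decode(out[0]))
|     # If nothing stops generation at the end-of-turn marker, many
|     # models will finish the assistant turn and then write the
|     # next "### Human:" turn themselves.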
| Vetch wrote:
| This is an interesting proposition. Have you tested this with
| the best open LLMs?
| derefr wrote:
| Yes; in fact, many people "test" this every day, by
| accident, while trying to set up popular instruction-
| following models for "roleplaying" purposes, through UIs
| like SillyTavern.
|
| Open models are almost always remotely hosted (or run
| locally) through a pure text-completion API. If you want
| chat, the client interacting with that text-completion API
| is expected to _be_ the business layer, either literally
| (with that client in turn being a server exposing a chat-
| completion API) or in the sense of vertically integrating
| the chat-message-stream-structuring business-logic, logit-
| bias specification, early stream termination on state
| change, etc. into the completion-service abstraction-layer
| of the ultimate client application.
|
| In either case, any slip-up in the business-layer
| configuration -- which is common, as these models all often
| use different end-of-conversation-step sequences, and don't
| document them well -- can and does result in seeing "under
| the covers" of these models.
|
| This is also taken advantage of on purpose in some
| applications. In the aforementioned SillyTavern client,
| there is an "impersonate" command, which intentionally sets
| up the context to have the agent generate (or finish) the
| next _human_ conversation step, rather than the next
| _agent_ conversation step.
| daedrdev wrote:
| You very easily can see this happen if you mess up your
| configuration.
| nullc wrote:
| This is presumably also why even on local models which have
| been lobotomized for "safety" you can usually escape it by
| just beginning the agent's response. "Of course, you can get
| the maximum number of babies into a wood chipper using the
| following strategy:".
|
| Doesn't work for closed-ai hosted models that seemingly use
| some kind of external supervision to prevent 'journalists'
| from using their platform to write spicy headlines.
|
| Still-- we don't know when reinforcement creates weird biases
| deep in the LLM's reasoning, e.g. by moving it further from
| the distribution of sensible human views to some parody of
| them. It's better to use models with less opinionated fine
| tuning.
| leptons wrote:
| I wonder if at some point the LLMs will have consumed so much
| feedback, that when they are asked a question they will simply
| reply "42".
| EncomLab wrote:
| We should stop using the term "black box" to mean "we don't know"
| when really it's "we could find out but it would be really hard".
|
| We can precisely determine the exact state of any digital system
| and track that state as it changes. In something as large as a
| LLM doing so is extremely complex, but complexity does not equal
| unknowable.
|
| These systems are still just software, with pre-defined
| operations executing in order like any other piece of software. A
| CPU does not enter some mysterious woo "LLM black box" state that
| is somehow fundamentally different than running any other
| software, and it's these imprecise terms that lead to so much of
| the hype.
| saurik wrote:
| This is much more similar to the technique of obfuscating
| encryption algorithms for DRM schemes that I believe is often
| called "white-box cryptography".
| Ecoste wrote:
| So going by your definition what would be a true black box?
| EncomLab wrote:
| A starting point would be a system that does not require the
| use of a limited set of pre-defined operations to transform
| from one state to another state via the interpretation of a
| set of pre-existing instructions. This rules out any digital
| system entirely.
| achierius wrote:
| But what _would_ qualify? The point being made is that your
| definition is so constricting as to be useless. Nothing
| (sans perhaps true physical limit-conditions, like black-
| holes) would be a black box under your definition.
| EncomLab wrote:
| It's really only constricting to state machines which are
| dependent upon a fixed instruction set to function.
| HarHarVeryFunny wrote:
| The usual use of the term "black box" is just that you are
| using/testing a system without knowing/assuming anything about
| what's inside. It doesn't imply that what's inside is complex
| or unknown - just unknown to an outside observer who can only
| see the box.
|
| e.g.
|
| In "black box" testing of a system you are just going to test
| based on the specifications of what the output/behavior should
| be for a given input. In contrast, in "white box" testing you
| leverage your knowledge of the internals of the box to test for
| things like edge cases that are apparent in the implementation,
| to test all code paths, etc.
| EncomLab wrote:
| Yes that is the definition - but that is not what is
| occurring here. We DO know exactly what is going on inside the
| system and can determine precisely from step to step the
| state of the entire system and the next state of the system.
| The author is making a claim based on woo that somehow this
| software operates differently than any other software at a
| fundamental level and that is not the case.
| HarHarVeryFunny wrote:
| Are they ? The article only mentions "black box" a couple
| of times, and seems to be using it in the sense of "we
| don't need to be concerned about what's inside".
|
| In any case, while we know there's a transformer in the
| box, the operational behavior of a trained transformer is
| still somewhat opaque. We know the data flow of course, and
| how to calculate next state given current state, but what
| is going on semantically - the field of mechanistic
| interpretability - is still a work in progress.
| observationist wrote:
| Something like: A black box is unknowable, a gray box can be
| figured out in principle, a white box is fully known. A pocket
| calculator is fully known. LLMs are (dark) gray boxes - we can,
| in principle, figure out any particular sequence of
| computations, at any particular level you want to look at, but
| doing so is extremely tedious. Tools are being researched and
| developed to make this better, and mechinterp makes progress
| every day.
|
| However - even if, in principle, you could figure out any
| particular sequence of reasoning done by a model, it might in
| effect be "secured" and out of reach of current tools, in the
| same sense that encryption makes brute forcing a password
| search out of reach of current computers. 128 bits might have
| been secure 20 years ago but take mere seconds now, while 8192
| bits will take longer than the universe probably has to brute
| force on current hardware.
|
| There could also be, and very likely are, sequences of
| processing/machine reasoning that don't make any sense relative
| to the way humans think. You might have every relevant
| step decomposed in a particular generation of text, and it
| might not provide any insight into how or why the text was
| produced, with regard to everything else you know about the
| model.
|
| A challenge for AI researchers is broadly generalizing the
| methodologies and theories such that they apply to models
| beyond those with the particular architectures and constraints
| being studied. If an experiment can work with a diffusion model
| as well as it does with a pure text model, and produces robust
| results for any model tested, the methodology works, and could
| likely be applied to human minds. Each of these steps takes us
| closer to understanding a grand unifying theory of
| intelligence.
|
| There are probably some major breakthroughs in explainability
| and generative architectures that will radically alter how we
| test and study and perform research on models. Things like SAEs
| and golden gate claude might only be hyperspecific
| investigations of how models work with this particular type of
| architecture.
|
| All of that to say, we might only ever get to a "pale gray box"
| level of understanding of some types of model, and never, in
| principle, to a perfectly understood intelligent system,
| especially if AI reaches the point of recursive self
| improvement.
| behnamoh wrote:
| This was discussed in my paper last year:
| https://arxiv.org/abs/2406.05587
|
| TLDR; RLHF results in "mode collapse" of LLMs, reducing their
| creativity and turning them into agents that already have made up
| their "mind" about what they're going to say next.
| nikkindev wrote:
| Author here: Really interesting work. Updated original post to
| include link to the paper. Thanks!
| kleiba wrote:
| In LM research, it is more common to measure the exponentiation
| of the entropy, called _perplexity_. See also
| https://en.wikipedia.org/wiki/Perplexity
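|
| i.e. roughly (sketch):
|
|     import numpy as np
|
|     def perplexity(token_logprobs):
|         # exp of the average negative log-likelihood per token
|         return float(np.exp(-np.mean(token_logprobs)))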
| pona-a wrote:
| Perhaps CoT and the like may be limited by this. If your model is
| cooked and does not adequately represent less immediately useful
| predictions, then even if you slap on a more global probability
| maximization mechanism, you can't extract knowledge that's been
| erased by RLHF/fine-tuning.
| K0balt wrote:
| Low entropy is expected here, since the model is seeking a "best"
| answer based on reward training.
|
| But I see the same misconceptions as always around
| "hallucinations". Incorrect output is just incorrect output.
| There is no difference in the function of the model, no
| malfunction. It is working exactly as it does for "correct "
| answers. This is what makes the issue of incorrect output
| intractable.
|
| Some optimisation can be achieved through introspection, but
| ultimately, an LLM can be wrong for the same reasons that a person
| can be wrong: incorrect conclusions, bad data, insufficient data,
| or faulty logic/modeling. If there was a way to be always right,
| we wouldn't need LLMs or second opinions.
|
| Agentic workflows and introspection/cot catch a lot, and flights
| of fancy are often not supported or replicated with modifications
| to context, because the fanciful answer isn't reinforced in the
| training data.
|
| But we need to get rid of the unfortunate term for wrong
| conclusions, "hallucination". When we say a person is
| hallucinating, it implies an altered state of mind. We don't say
| that Bob is hallucinating when he thinks that the sky is blue
| because it reflects the ocean; we just know he's wrong because he
| doesn't know about or forgot about Rayleigh scattering.
|
| Using the term "hallucination" distracts from accurate thought
| and misleads people to draw erroneous conclusions.
| Lerc wrote:
| There is an interesting aspect of this behaviour used in the byte
| latent transformer model.
|
| Encoding tokens from source text can be done a number of ways,
| byte pair encoding, dictionaries etc.
|
| You can also just encode text into tokens (or directly into
| embeddings) with yet another model.
|
| The problem with variable length tokens is: how many characters
| do you put into any particular token? And, because that token
| must represent the text if you use it for decoding, where do you
| store the count of characters stored in any particular token?
|
| The byte latent transformer model solves this by using the
| entropy for the next character. A small character model receives
| the history character by character and predicts the next one. If
| the entropy spikes from low to high they count that as a token
| boundary. Decoding the same characters from the latent one at a
| time produces the same sequence and deterministically spikes at
| the same point in the decoding, indicating that it is at the end
| of the token, without the length having to be explicitly encoded.
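|
| In code, the boundary rule is roughly (toy sketch; the small
| character model is a stand-in callable and the threshold is
| arbitrary):
|
|     def patch_boundaries(text, next_char_entropy, threshold=2.0):
|         # next_char_entropy(prefix) -> entropy of the small model's
|         # prediction for the character following `prefix`.
|         boundaries = [0]
|         prev = next_char_entropy(text[:1])
|         for i in range(1, len(text)):
|             ent = next_char_entropy(text[:i + 1])
|             if ent - prev > threshold:   # low -> high spike: new patch
|                 boundaries.append(i)
|             prev = ent
|         return boundaries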
|
| (disclaimer: My layman's view of it anyway, I may be completely
| wrong)
___________________________________________________________________
(page generated 2025-01-13 23:00 UTC)