Entropy of a Large Language Model output
created: 2025-01-06 updated: 2025-01-13
Large language models like ChatGPT and Claude have become ubiquitous in
today's world. I visited a friend for New Year's Eve and was impressed
by how extensively ChatGPT was being used within the family. My
friend's little kid is going to grow up in a world where turning to
large language models (LLMs) for quick answers is the norm rather than
the exception.
LLMs do have their fair share of pitfalls. The primary one is
hallucination: the model spitting out factually incorrect answers.
There has been (and will be) a lot of work on understanding when and
how these models hallucinate, but I wanted to understand the output of
LLMs from an information-theoretic perspective.
Now, how does an LLM work? Language models are probability
distributions over sequences of tokens/words^1 (in natural language
processing, tokens are the building blocks of text, but they can be
used interchangeably with "words" in our context for now). Most of the
time, we are dealing with something called autoregressive language
models. The easiest way to think about one is as a black box: give it
a sequence of tokens (or words) and it spits out the next token (or
word). You then append this generated token to the previous sequence
of tokens and send it to the black box again. It spits out the next
one, and so on. You get the idea!
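To make the loop concrete, here is a minimal sketch with a made-up stand-in for the model (the real black box is a large neural network, not a random number generator):

    import random

    # Toy stand-in for the "black box": returns a probability distribution
    # over a tiny made-up vocabulary (a real LLM works over ~100k tokens).
    def next_token_distribution(tokens):
        vocab = ["the", "corner", "saw", "a", "shadow", "."]
        weights = [random.random() for _ in vocab]
        total = sum(weights)
        return dict(zip(vocab, [w / total for w in weights]))

    def generate(prompt_tokens, n_new_tokens=5):
        tokens = list(prompt_tokens)
        for _ in range(n_new_tokens):
            dist = next_token_distribution(tokens)   # distribution over the next token
            next_tok = max(dist, key=dist.get)       # greedy choice (sampling is also common)
            tokens.append(next_tok)                  # feed the new token back in
        return tokens

    print(generate(["The", "protagonist", "went", "around"]))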
LLMs have been made possible thanks to the concept of attention^2 and
a specific neural network architecture called the transformer. For
interested readers, I would divert your attention to the wonderful
talk by Grant Sanderson^3 on the same topic.
The important thing to remember is that the output of the LLM (black
box) is not a single deterministic token. Rather, it is a probability
distribution over all the available tokens in the vocabulary. This
might seem like a really large set, but in reality most of the
probability mass is concentrated on only a few tokens, while the rest
have a negligible probability of appearing as the next token.
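To see how that concentration happens, here is a toy softmax over made-up logits (illustration only, not real model output):

    import math

    # Made-up scores (logits) for a tiny 6-token "vocabulary"
    logits = {"door": 5.2, "street": 4.8, "room": 3.1, "the": 0.5, "banana": -1.0, "entropy": -2.3}

    # Softmax: exponentiate and normalize so the values sum to 1
    exps = {tok: math.exp(v) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        print(f"{tok:>8}: {p:.3f}")
    # Most of the mass lands on "door" and "street"; the rest are negligible.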
For the rest of the post, let us use ChatGPT as an example to
understand the concepts. When I type in the following text and look at
the top 3 candidates for the next token
The protagonist went around the corner and ...
[Image: top 3 candidate next tokens and their probabilities for the prompt above]
We can already see that the model finds certain tokens/words far more
probable than the rest.
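For reference, this is roughly how you can peek at those candidates through the OpenAI Python SDK (a sketch: the model name is a placeholder, since the post does not say which ChatGPT model was used, and an API key is assumed to be configured):

    import math
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": "The protagonist went around the corner and"}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=3,        # ask for the top 3 candidate next tokens
    )

    for cand in response.choices[0].logprobs.content[0].top_logprobs:
        print(cand.token, math.exp(cand.logprob))   # convert log-probability to probability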
When a probability distribution enters the picture, there are lots of
different ways to reason about it. Here, I wanted to explore it from
the perspective of entropy.
Entropy is a tricky concept and it can mean different things in
different contexts. Quanta Magazine has a nice post on what entropy
means^4. I would particularly recommend playing around with the
animation by Jonas Parnow & Mark Belan in that post to get an
intuitive understanding of what the "measure of disorder" actually
means. For the sake of this post, we will stick to an
information-theoretic understanding and borrow Shannon's definition of
entropy for a probability distribution. It is defined as
\[H(p)=-\sum_{x}p(x)\log{p(x)}\]
where \(x\) ranges over the available tokens and \(p(x)\) is the
probability of that token appearing next. Now, what does the value of
the entropy say about the underlying probability distribution? If all
the states/tokens are equally probable, the entropy is high. If the
probability is concentrated on a very few tokens, the entropy is low.
In other words, higher entropy can be interpreted as the model being
less sure of the next token, while lower entropy means the model is
more confident about which token should come next.
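As a quick sanity check (toy numbers, not model output), here is the entropy of a uniform distribution versus a sharply peaked one over the same four tokens:

    import math

    def entropy(probs):
        """Shannon entropy in nats (use math.log2 instead for bits)."""
        return -sum(p * math.log(p) for p in probs if p > 0)

    uniform = [0.25, 0.25, 0.25, 0.25]   # model has no idea which token comes next
    peaked  = [0.97, 0.01, 0.01, 0.01]   # model is almost certain

    print(entropy(uniform))   # ~1.386, the maximum (log 4)
    print(entropy(peaked))    # ~0.168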
Let's put this into practice and ask ChatGPT the following
write a haiku about ai ...
Now, for each token in the output sequence, we obtain a probability
distribution for which the entropy can be calculated.
Note: ChatGPT's API only lets you look at the probability values of
the top 20 candidate tokens. This means we might be missing next
tokens with non-zero probability beyond this range. Also, the
definition of entropy over a probability distribution assumes that it
is normalized (meaning all the probability values of a single
distribution sum to 1). So, we normalize the probability values of the
top 20 tokens before calculating the entropy. Now, looking at the
result...
[Image: per-token entropy for the haiku output]
This is interesting! The entropy is low (meaning the model is
relatively sure of the next token) in the cases where a sentence ends
or where the next token is the remaining piece of a word that was
split across tokens ("Wis-dom" above).
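For reference, the per-token entropy calculation behind these plots looks roughly like this (a sketch against the OpenAI Python SDK; the model name is again a placeholder, and we only ever see the top 20 candidates per position):

    import math
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": "write a haiku about ai"}],
        logprobs=True,
        top_logprobs=20,       # the API caps this at 20
    )

    entropies = []
    for position in response.choices[0].logprobs.content:
        probs = [math.exp(c.logprob) for c in position.top_logprobs]
        total = sum(probs)                    # slightly less than 1, since we only see the top 20
        probs = [p / total for p in probs]    # renormalize before computing the entropy
        h = -sum(p * math.log(p) for p in probs if p > 0)
        entropies.append((position.token, h))

    for token, h in entropies:
        print(f"{token!r}: {h:.3f}")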
What if we did it for a longer text? Asking ChatGPT to
Write me an essay about Claude Shannon ...
[Image: per-token entropy for the essay output]
That is too many tokens to visualize individually, so let's just look
at a histogram of the entropy values.
[Image: histogram of the entropy values]
In most cases, the model seems to be pretty sure of the next token.
Since the per-token plot above is a bit too erratic to make sense of,
I tried averaging the entropy values over a moving window to smooth
the plot. Below are the plots for different window sizes.
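The smoothing itself is nothing fancy, just a simple moving average (a sketch, assuming the per-token entropies have already been collected in a list):

    import numpy as np

    def moving_average(values, window=10):
        """Smooth the per-token entropy curve with a simple moving average."""
        kernel = np.ones(window) / window
        return np.convolve(values, kernel, mode="valid")

    # entropy_values: list of per-token entropies computed as above
    # smoothed = moving_average(entropy_values, window=20)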
[Image: moving-average entropy plots for different window sizes]
The red lines that you see mark tokens which are, or contain, a full
stop (interpreted as the end of a sentence). The interpretation from
the first simple example still holds: the entropy drops in the cases
where a sentence ends. I also wanted to see what kind of tokens were
selected in the low- and high-entropy cases respectively. Looking at
the 10 tokens with the lowest entropies
...
194 | Institute | of | Michigan | th | 191 | Technology | | of | ,
and comparing them with the 10 tokens with the highest entropies ...
led | wrote | considered | established | revolution | technology | , | is | regarded | conducted
it can be crudely interpreted that factual statements (involving
proper nouns) come with low entropy, since the model has to be sure of
what comes next, while descriptive statements have many
near-synonymous options, which widens the pool of possible and
probable next tokens.
Next, I went and asked ChatGPT what kind of analysis I could build on
top of this, and it told me to look at the clusters surrounding the
high-entropy tokens (the words before and after):
... and innovative thinking led to groundbreaking contributions ...
... time that Shannon wrote his master's thesis ...
... This work, considered one of the ...
... paper, Shannon established the principles of ...
... research that would revolution ize the field ...
... have profoundly influenced technology, telecommunications, ...
... This work , considered one of ...
... circuit design and is regarded as a cornerstone ...
... where he conducted research that would ...
Yeah, I honestly didn't know what to make of it. But it was
interesting to give ChatGPT a voice and let it contribute to this post.
I wanted to try a text completion in Tamil ^5 to see how ChatGPT
performs. I asked
Finish the rest... epporull yaaryaarvaayk keettpinnnum
And it gave me
epporull yaaryaarvaayk keettpinnnum apporull
meypporull kaannptu arrivu.
This is a couplet from the ancient Tamil text "Thirukkural,"
written by the poet Thiruvalluvar. It emphasizes the importance
of discerning the truth regardless of who speaks it.
And the probability distributions of the tokens look like ...
[Image: per-token probability distributions for the Tamil completion]
Here, you can clearly see that the tokens are individual letters.
While this is not ideal for autoregressive language models, since it
leads to more compute time, it can actually be helpful for
fine-grained evaluation of model performance.
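If you want to see how the text gets chopped up, a tokenizer library like tiktoken can show the splits directly (a sketch; cl100k_base is an assumption about which encoding the model uses, and the exact splits vary by encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # assumed encoding

    for text in ["The protagonist went around the corner", "epporull yaaryaarvaayk keettpinnnum"]:
        tokens = enc.encode(text)
        print(text)
        print([enc.decode([t]) for t in tokens])  # how the text is split into tokens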
Why did I do the things above? I found it interesting that OpenAI's
API for text generation with the ChatGPT models gives back information
about the next-token probabilities, and I wanted to play around with
it from the point of view of entropy. While no substantial conclusions
can be drawn from this little experiment, it is of paramount
importance that we pay heed to the Thirukkural couplet above and
discern the truth regardless of who (or what) speaks it!
(You can find the code used to create all the plots here)
Hacker news discussion - https://news.ycombinator.com/item?id=42649315
Relevant and interesting paper mentioned in the comments there -
Creativity Has Left the Chat: The Price of Debiasing Language Models
by Behnam Mohammadi
---------------------------------------------------------------------
1. Stanford CS229 I Machine Learning I Building Large Language Models (LLMs) (youtube)
2. Attention is all you need (wikipedia)
3. Visualizing transformers and attention | Talk for TNG Big Tech Day '24 - Grant Sanderson (youtube)
4. What Is Entropy? A Measure of Just How Little We Really Know. - Quanta Magazine
5. Tamil language (wikipedia)