[HN Gopher] Cross-Entropy and KL Divergence
___________________________________________________________________
Cross-Entropy and KL Divergence
Author : mfrw
Score : 60 points
Date : 2025-04-13 04:48 UTC (18 hours ago)
(HTM) web link (eli.thegreenplace.net)
(TXT) w3m dump (eli.thegreenplace.net)
| meanpp wrote:
| After the phrase:Manipulating the logarithms, we can also get ...
| the formula is incorrect, since p_j have disappeared.
|
| \\[D_{KL}(P,Q)=-\sum_{j=1}^{n}\log_2
| \frac{q_j}{p_j}=\sum_{j=1}^{n}\log_2 \frac{p_j}{q_j}\\]
|
| The post is just basic definitions and simple examples for cross
| entropy and KL divergence.
|
| There is a section about the relation of cross entropy and
| maximum likelihood estimation at the end that seems not so easy
| to understand but implies that the limit of a estimator applied
| to a sample from a distribution is the KL divergence when the
| sample length tends to infinity.
| dynm wrote:
| In the notation of this page, the entropy H(P) is best thought of
| as:
|
| "The mean number of bits to encode a member of P, assuming an
| optimal code."
|
| And the KL divergence KL(P,Q) is probably best thought of as:
|
| "The mean number of WASTED bits if you encode members of P
| assuming that they had come from Q."
| jhrmnn wrote:
| I and clearly many other people have run into what one could only
| call "KL variance", but it doesn't seem to have an established
| name.
|
| https://mathoverflow.net/questions/210469/kullback-leibler-v...
| keepamovin wrote:
| I often wondered about an alternative but related metric called
| "organization"
|
| Entropy, in some sense would seem to measure "complexity", but
| it's more accurately related as "surprise" I think.
|
| It's useful but limited (for example, you can measure the
| "entropy" present in a string -- of keystrokes, or text -- and
| determine how likely it is that it's "coherent" or "intelligent"
| but this is fuzzy, i.e., "too much" entropy, and you are at
| "randomness", too little and you are at "banality"). It seems
| like a more precise (but still 0 - 1 bounded) metric would be
| possible to measure "order" or "organization". Entropy fails at
| this: 0 entropy does not equal "total order". Just "total
| boringness" (heh :))
|
| I considered something related to some archetypal canonical
| compression scheme (like LZ), but didn't flesh it out.
| Considering again now, what about the "self similarity" of the
| dictionary, combined with the diversity of the dictionary?
|
| It's more of a "two-axis" metric but surely we can find a way to
| corral it into 0..1.
|
| Very self-similar, and rather diverse? Highly organized.
|
| Low self-similarity, and highly diverse? High entropy / highly
| disorganized.
|
| Low self-similarity, and low diversity? Low entropy / high
| banality. I.e., simplicity heh :)
|
| High self-similarity, low diversity - organized, but "less
| organized" than something with more diversity.
|
| I don't think this is quite there yet, but there's intuitive sync
| with this.
|
| Any takers???? :)
| chermi wrote:
| We know entropy != Complexity(1). There's still no satisfactory
| answer for what complexity is, and the problem is only getting
| worse I think as more people decide to use the term however
| they wish. I'm partial to kolmogorov(2).
|
| You're train of thought re. compressibility vs. organization is
| very in line with kolmogorov's.
|
| 1.https://www.taylorfrancis.com/books/mono/10.1201/97804295028.
| .. 2.https://en.m.wikipedia.org/wiki/Kolmogorov_complexity
| daveguy wrote:
| I'm not sure you can calculate "organization" of a sequence, in
| a completely generic and universally applicable sense. It seems
| you would already have to know something about how a sequence
| might be organized -- the constraints for a specific example /
| organization scheme. Digging through hypothetical organization
| that includes self reference would require an additional layer
| of understanding. Or in other words, wouldn't a sufficiently
| complex sequence be indistinguishable from random noise and the
| analysis would have to be outside of the expression system for
| at least some examples? This seems very related to the halting
| problem and the existence of proofs impossible to express in
| the base axiomatic system. It's been a long time since I
| studied finite state automatons, just a vague intuition from
| old memories.
|
| Edit: this is not to say that there couldn't be a practical
| measure of organization for specific examples/types/subsets of
| sequences. The halting problem doesn't prevent us from
| analyzing program "correctness" in specific formal systems. It
| just prevents us from having universal ability to fully
| determine the answer for a generic example.
| samsartor wrote:
| Worth clarifying that you are talking about information
| content, not entropy. A single text file or png has
| information, the distribution of all possible text files or all
| possible pngs has entropy.
|
| I'm not an expert, but let me brainstorm a bit here. Something
| closer to the specific correlation might be what you want? In
| vague terms it would measure how much the various bytes of a
| file are related to each other, by considering how much more
| likely they are to be given values taken together vs
| considering each byte individually.
|
| But once again, having extremely high specific correlation
| might indicate a trival low-complexity example? I'd have to
| play around with it some more to get a good intuition for how
| it behaves.
|
| Edit: It seems like this might also be very sensitive to
| parametrization. The specific correlation in terms of byte
| values would not be much more useful than the information
| content, because the marginal distributions aren't very
| interesting (eg over all files, how likely is byte #57 to be
| 0xf3 or whatever). It would be a better measure with more
| meaningful variables, or even something like a markov chain
| where you consider many short substrings.
|
| Anyway, specific correlation (like information) measures a
| specific file. The expected specific correlation over all
| possible files gives the total correlation, which (like
| entropy) measures the whole distribution. Total correlation is
| also the KL divergence between the joint distribution and the
| product of the marginal distributions! Total correlation is
| also the same thing as mutual information, just generalized to
| more than two variables. And specific correlation is the
| generalization of pointwise mutual information.
| kurikuri wrote:
| In the first definition of D(P,Q), the author dropped a p_j
| within the sum.
___________________________________________________________________
(page generated 2025-04-13 23:01 UTC)