[HN Gopher] Cross-Entropy and KL Divergence
       ___________________________________________________________________
        
       Cross-Entropy and KL Divergence
        
       Author : mfrw
       Score  : 60 points
       Date   : 2025-04-13 04:48 UTC (18 hours ago)
        
 (HTM) web link (eli.thegreenplace.net)
 (TXT) w3m dump (eli.thegreenplace.net)
        
       | meanpp wrote:
       | After the phrase:Manipulating the logarithms, we can also get ...
       | the formula is incorrect, since p_j have disappeared.
       | 
       | \\[D_{KL}(P,Q)=-\sum_{j=1}^{n}\log_2
       | \frac{q_j}{p_j}=\sum_{j=1}^{n}\log_2 \frac{p_j}{q_j}\\]
       | 
       | The post is just basic definitions and simple examples for cross
       | entropy and KL divergence.
       | 
       | There is a section about the relation of cross entropy and
       | maximum likelihood estimation at the end that seems not so easy
       | to understand but implies that the limit of a estimator applied
       | to a sample from a distribution is the KL divergence when the
       | sample length tends to infinity.
        
       | dynm wrote:
       | In the notation of this page, the entropy H(P) is best thought of
       | as:
       | 
       | "The mean number of bits to encode a member of P, assuming an
       | optimal code."
       | 
       | And the KL divergence KL(P,Q) is probably best thought of as:
       | 
       | "The mean number of WASTED bits if you encode members of P
       | assuming that they had come from Q."
        
       | jhrmnn wrote:
       | I and clearly many other people have run into what one could only
       | call "KL variance", but it doesn't seem to have an established
       | name.
       | 
       | https://mathoverflow.net/questions/210469/kullback-leibler-v...
        
       | keepamovin wrote:
       | I often wondered about an alternative but related metric called
       | "organization"
       | 
       | Entropy, in some sense would seem to measure "complexity", but
       | it's more accurately related as "surprise" I think.
       | 
       | It's useful but limited (for example, you can measure the
       | "entropy" present in a string -- of keystrokes, or text -- and
       | determine how likely it is that it's "coherent" or "intelligent"
       | but this is fuzzy, i.e., "too much" entropy, and you are at
       | "randomness", too little and you are at "banality"). It seems
       | like a more precise (but still 0 - 1 bounded) metric would be
       | possible to measure "order" or "organization". Entropy fails at
       | this: 0 entropy does not equal "total order". Just "total
       | boringness" (heh :))
       | 
       | I considered something related to some archetypal canonical
       | compression scheme (like LZ), but didn't flesh it out.
       | Considering again now, what about the "self similarity" of the
       | dictionary, combined with the diversity of the dictionary?
       | 
       | It's more of a "two-axis" metric but surely we can find a way to
       | corral it into 0..1.
       | 
       | Very self-similar, and rather diverse? Highly organized.
       | 
       | Low self-similarity, and highly diverse? High entropy / highly
       | disorganized.
       | 
       | Low self-similarity, and low diversity? Low entropy / high
       | banality. I.e., simplicity heh :)
       | 
       | High self-similarity, low diversity - organized, but "less
       | organized" than something with more diversity.
       | 
       | I don't think this is quite there yet, but there's intuitive sync
       | with this.
       | 
       | Any takers???? :)
        
         | chermi wrote:
         | We know entropy != Complexity(1). There's still no satisfactory
         | answer for what complexity is, and the problem is only getting
         | worse I think as more people decide to use the term however
         | they wish. I'm partial to kolmogorov(2).
         | 
         | You're train of thought re. compressibility vs. organization is
         | very in line with kolmogorov's.
         | 
         | 1.https://www.taylorfrancis.com/books/mono/10.1201/97804295028.
         | .. 2.https://en.m.wikipedia.org/wiki/Kolmogorov_complexity
        
         | daveguy wrote:
         | I'm not sure you can calculate "organization" of a sequence, in
         | a completely generic and universally applicable sense. It seems
         | you would already have to know something about how a sequence
         | might be organized -- the constraints for a specific example /
         | organization scheme. Digging through hypothetical organization
         | that includes self reference would require an additional layer
         | of understanding. Or in other words, wouldn't a sufficiently
         | complex sequence be indistinguishable from random noise and the
         | analysis would have to be outside of the expression system for
         | at least some examples? This seems very related to the halting
         | problem and the existence of proofs impossible to express in
         | the base axiomatic system. It's been a long time since I
         | studied finite state automatons, just a vague intuition from
         | old memories.
         | 
         | Edit: this is not to say that there couldn't be a practical
         | measure of organization for specific examples/types/subsets of
         | sequences. The halting problem doesn't prevent us from
         | analyzing program "correctness" in specific formal systems. It
         | just prevents us from having universal ability to fully
         | determine the answer for a generic example.
        
         | samsartor wrote:
         | Worth clarifying that you are talking about information
         | content, not entropy. A single text file or png has
         | information, the distribution of all possible text files or all
         | possible pngs has entropy.
         | 
         | I'm not an expert, but let me brainstorm a bit here. Something
         | closer to the specific correlation might be what you want? In
         | vague terms it would measure how much the various bytes of a
         | file are related to each other, by considering how much more
         | likely they are to be given values taken together vs
         | considering each byte individually.
         | 
         | But once again, having extremely high specific correlation
         | might indicate a trival low-complexity example? I'd have to
         | play around with it some more to get a good intuition for how
         | it behaves.
         | 
         | Edit: It seems like this might also be very sensitive to
         | parametrization. The specific correlation in terms of byte
         | values would not be much more useful than the information
         | content, because the marginal distributions aren't very
         | interesting (eg over all files, how likely is byte #57 to be
         | 0xf3 or whatever). It would be a better measure with more
         | meaningful variables, or even something like a markov chain
         | where you consider many short substrings.
         | 
         | Anyway, specific correlation (like information) measures a
         | specific file. The expected specific correlation over all
         | possible files gives the total correlation, which (like
         | entropy) measures the whole distribution. Total correlation is
         | also the KL divergence between the joint distribution and the
         | product of the marginal distributions! Total correlation is
         | also the same thing as mutual information, just generalized to
         | more than two variables. And specific correlation is the
         | generalization of pointwise mutual information.
        
       | kurikuri wrote:
       | In the first definition of D(P,Q), the author dropped a p_j
       | within the sum.
        
       ___________________________________________________________________
       (page generated 2025-04-13 23:01 UTC)