[HN Gopher] Kullback-Leibler divergence
       ___________________________________________________________________
        
       Kullback-Leibler divergence
        
       Author : dedalus
       Score  : 140 points
       Date   : 2023-08-21 20:26 UTC (1 days ago)
        
 (HTM) web link (en.wikipedia.org)
 (TXT) w3m dump (en.wikipedia.org)
        
       | golwengaud wrote:
       | I found https://www.lesswrong.com/posts/no5jDTut5Byjqb4j5/six-
       | and-a-... very helpful for getting intuition for what the K-L
       | divergence is and why it's useful. The six intuitions:
        | 1. Expected surprise
        | 2. Hypothesis testing
        | 3. MLEs
        | 4. Suboptimal coding
        | 5a. Gambling games -- beating the house
        | 5b. Gambling games -- gaming the lottery
        | 6. Bregman divergence
        
       | riemannzeta wrote:
       | KL divergence has also been used to generalize the second law of
       | thermodynamics for systems far from equilibrium:
       | 
       | https://arxiv.org/abs/1508.02421
       | 
       | And to explain the relationship between the rate of evolution and
       | evolutionary fitness:
       | 
       | https://math.ucr.edu/home/baez/bio_asu/bio_asu_web.pdf
       | 
       | The connection between all of these manifestations of KL
       | divergence is that a system far from equilibrium contains more
        | information (in the Shannon sense) than a system in equilibrium.
       | That "excess information" is what drives fitness within some
       | environment.
        
         | tysam_and wrote:
         | I'm not sure if I follow the fitness-driving argument. Without
         | an informational sieve of some kind, isn't it just random
         | entropy?
         | 
         | In this case, sure, far from equilibrium has more information,
         | but that just means the system has more flexibility to move,
         | not that the system actually inherently contains more
         | information (stored information vs inbound throughput).
         | 
          | Unless it's some kind of Nash game, or a problem with
          | multiple similarly-deep valleys, I don't see how that would
          | do much more from a fitness perspective than allow us to
          | produce something approaching an unbiased estimator of the
          | minima density of whatever problem we're working on.
          | Or... whatever, I'm not quite sure.
         | 
         | Anywho, however it is, I'm a bit confused on the last part
         | there.
        
           | riemannzeta wrote:
           | I will confess that at least the biology work is still
           | somewhat mysterious to me. John Baez used to be active on
           | Twitter. I believe he's still active on Mathstodon now. If
           | you're curious, you might ask him.
        
       | jwarden wrote:
       | Here's how I describe KL Divergence, building up from simple to
       | complex concepts.
       | 
        | surprisal: how surprised I am when I learn the value of X
        |     Surprisal(x) = -log p(X=x)
        | 
        | entropy: how surprised I expect to be
        |     H(p) = E_X[-log p(X)] = Σ_x p(X=x) * -log p(X=x)
        | 
        | cross-entropy: how surprised I expect Bob to be (if Bob's
        | beliefs are q instead of p)
        |     H(p,q) = E_X[-log q(X)] = Σ_x p(X=x) * -log q(X=x)
        | 
        | KL divergence: how much *more* surprised I expect Bob to be
        | than me
        |     Dkl(p || q) = H(p,q) - H(p,p)
        |                 = Σ_x p(X=x) * log p(X=x)/q(X=x)
        | 
        | information gain: how much less surprised I expect Bob to be
        | if he knew that Y=y
        |     IG(q|Y=y) = Dkl(q(X|Y=y) || q(X))
        | 
        | mutual information: how much information I expect to gain
        | about X from learning the value of Y
        |     I(X;Y) = E_Y IG(q|Y=y) = E_Y Dkl(q(X|Y=y) || q(X))
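        | 
        | A minimal Python sketch of these quantities, assuming two
        | small made-up distributions p (mine) and q (Bob's) over the
        | same three outcomes:
        | 
        |     import math
        | 
        |     p = {"a": 0.5, "b": 0.25, "c": 0.25}   # my beliefs
        |     q = {"a": 0.25, "b": 0.25, "c": 0.5}   # Bob's beliefs
        | 
        |     def surprisal(dist, x):    # -log p(X=x), in nats
        |         return -math.log(dist[x])
        | 
        |     def entropy(p):            # sum_x p(x) * -log p(x)
        |         return sum(p[x] * surprisal(p, x) for x in p)
        | 
        |     def cross_entropy(p, q):   # sum_x p(x) * -log q(x)
        |         return sum(p[x] * surprisal(q, x) for x in p)
        | 
        |     def kl(p, q):              # H(p,q) - H(p,p), always >= 0
        |         return cross_entropy(p, q) - entropy(p)
        | 
        |     print(entropy(p), cross_entropy(p, q), kl(p, q))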
        
         | sarma17 wrote:
          | qq, can you explain a few noob thoughts? Why is surprisal
          | defined as `log(1/p(X=x))`? I'm lacking the intuition for:
          | 1. why did someone think to use 1/p? 2. why do we have a log
          | term here?
          | 
          | Log shows up a lot but I don't think I ever really understood
          | when/how people decide to use it in their formulas. One KL
          | divergence youtube video said, `let's normalize the p(P)/p(Q)
          | using (log(p(P)/q(P)))^N` and then shows how to derive the KL
          | formula. I'd also appreciate if you know how the N (sample
          | size, I guess) is being used here.
          | 
          | For a lot of math stuff I end up being confused about how
          | people use operations. Like using 1/p(X=x) is like magic to
          | me because I don't understand what the context is when
          | someone thinks about the problem and then decides to do
          | something. What's their thinking process here, what tool or
          | process do I not know about that makes me confused?
        
         | mrfox321 wrote:
         | Typo in cross entropy, should be
         | 
         | p log q
        
           | jwarden wrote:
           | Thank you, fixed.
        
         | 3abiton wrote:
         | Forgive my peasant mind, I've heard about K-L divergence
         | before, I just don't know what it is used for nor what's
         | special about it compared to other metrics?
        
           | zmmmmm wrote:
            | I think people who love math expressed in nomenclature
            | will never understand those who don't ...
            | 
            | K-L divergence is pretty much the really dumb, obvious
            | thing you would probably think up if someone asked you to
            | compare two probability distributions.
           | 
           | You would start by going "oh I suppose I will look at how far
           | apart they are at all the different values" and add it up.
           | Then you would say, "oh, but 0.1 and 0.01 are really an order
           | of magnitude apart, just like 0.01 and 0.001. Perhaps it will
           | work better if I use the log of the probability". Then you
           | would pause for thought and say, "Hmmmm hang on some of the
           | values are really extreme but almost never happen, shouldn't
           | it be weighted by how frequently they occur?".
           | 
           | But of course paragraphs of mathematical symbols are the way
           | many people prefer to express this.
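            | 
            | Spelled out as a few lines of Python (with two made-up
            | distributions), that thought process lands directly on
            | the usual formula:
            | 
            |     import math
            | 
            |     p = [0.70, 0.20, 0.10]   # how often values really occur
            |     q = [0.50, 0.30, 0.20]   # my guess
            | 
            |     # per value: how far apart the log-probabilities are,
            |     # weighted by how often that value occurs under p
            |     kl = sum(pi * (math.log(pi) - math.log(qi))
            |              for pi, qi in zip(p, q))
            |     print(kl)   # ~0.085 nats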
        
           | JonyEpsilon wrote:
           | Practically speaking, it's a simple measure of how similar
           | two probability distributions are, minimised (with value
           | zero) when they are the same. So it's often used as a loss
           | term in optimisations when you want two distributions to be
            | pushed towards being similar. Sometimes this is motivated
            | by clever reasoning about information/probability ... but
            | often it's more just "slap a KL on it", because it tends
            | to work.
        
       | max_ wrote:
        | K-L Divergence is something that keeps coming up in my research
        | but I still don't understand what it is.
        | 
        | Could someone give me a simple explanation as to what it is?
       | 
       | And also, what practical use cases does it have?
        
         | eob wrote:
         | As for practical use cases, one is to find an approximate
         | optimization to a function
         | 
         | - You want to find the min/max of some probability distribution
         | P(x)
         | 
         | - P(x) is too complicated to find a closed-form min, but you
         | can draw samples from it.
         | 
          | - So instead, you carefully construct some OTHER probability
          | distribution Q(x|θ) that you claim is structurally similar
          | "enough" to P(x), parameterized by θ.
          | 
          | - Now you find the θ which minimizes the KL divergence
          | KL(P(x) || Q(x|θ)), which is equivalent to finding the
          | parameters θ of Q(x|θ) that make it [approximately] "most"
          | similar to P(x), without ever having minimized P(x) directly.
         | 
         | It was a trick that came up a lot when AI consisted of giant
         | Bayesian plate models for each specific task that you had to
         | hand-optimize.
        
           | versteegen wrote:
           | Note that "drawing samples from P(x)" means to have training
           | data drawn from P(x).
           | 
           | You can form the 'empirical' probability distribution P'(x)
           | from your n training samples {x_i}, with P'(x_i) = 1/n and
           | P'(x) = 0 for all other x.
           | 
            | Then finding the θ which minimizes KL(P'(x) || Q(x|θ)) is
            | equivalent to finding the maximum likelihood estimate (MLE)
            | given your training data.
           | 
           | (Note: I don't know what's meant by "the min/max of some
           | probability distribution P(x)" and suggest ignoring that)
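            | 
            | A small sketch of that equivalence, assuming Q(x|θ) is a
            | unit-variance Gaussian with unknown mean θ: only the
            | average log-likelihood term of KL(P' || Q_θ) depends on θ,
            | so minimizing the KL and maximizing the likelihood pick
            | the same θ (here, roughly the sample mean):
            | 
            |     import math, random, statistics
            | 
            |     random.seed(0)
            |     xs = [random.gauss(2.0, 1.0) for _ in range(1000)]
            | 
            |     def avg_log_lik(theta):
            |         # average log Q(x|theta), unit-variance Gaussian
            |         return sum(-0.5 * (x - theta) ** 2
            |                    - 0.5 * math.log(2 * math.pi)
            |                    for x in xs) / len(xs)
            | 
            |     # only this average log-likelihood depends on theta, so
            |     # argmin of the KL is the argmax of the likelihood
            |     thetas = [i / 100 for i in range(400)]
            |     best = max(thetas, key=avg_log_lik)
            |     print(best, statistics.mean(xs))   # both ≈ sample mean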
        
             | eob wrote:
             | MLE | training data
             | 
             | Just writing hand wavily :)
        
         | leourbina wrote:
          | One intuition is that KL-divergence represents a sort of
          | "distance" between probability distributions. However, this
          | isn't quite right as it doesn't satisfy some basic properties
          | a real distance (a metric) would satisfy, including the fact
          | that it isn't symmetric: KL(Q, P) != KL(P, Q), and it does not
          | satisfy the triangle inequality. Nonetheless, KL(P, Q) gives
          | you a good idea of how "far" P is from Q: in the context of
          | encoding, if you wanted to come up with an ideal encoding of
          | symbols coming from P, but you guessed Q as the distribution
          | of these symbols, then KL(P, Q) is the extra number of bits
          | you'd have to use. One nice property is that in the case that
          | KL(P,Q) = 0, P and Q are equal (almost everywhere, which for
          | most applications is irrelevant). This makes it useful in the
          | ML context as you can minimize KL divergence and know that
          | the resulting "guessed" distribution is getting closer to the
          | data distribution you're trying to guess using some
          | parametrized function (an NN).
        
           | kgwgk wrote:
            | > it doesn't satisfy some basic properties a real distance
            | (a metric) would satisfy, including the fact that it isn't
            | symmetric [...] and it does not satisfy the triangle
            | inequality.
            | 
            | Not sure about "real" but one can have useful distances
            | which are not symmetric, like the distance between cities
            | measured in time or in gallons.
        
         | SleekEagle wrote:
         | One use case is in the training of Diffusion Models. In the
         | original formulation, the likelihood-maximizing objective is
         | recast in terms of KL divergences. This is done because the KL
         | divergence between Gaussians has a closed form, and the
         | transition distributions in Diffusion Models are taken to be
          | Gaussian, which makes the problem tractable.
         | 
         | https://www.assemblyai.com/blog/diffusion-models-for-machine...
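          | 
          | For reference, a sketch of the closed form in the simplest
          | (univariate) case, KL( N(m1, s1^2) || N(m2, s2^2) ):
          | 
          |     import math
          | 
          |     def kl_gauss(m1, s1, m2, s2):
          |         # log(s2/s1) + (s1^2 + (m1-m2)^2) / (2*s2^2) - 1/2
          |         return (math.log(s2 / s1)
          |                 + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2)
          |                 - 0.5)
          | 
          |     print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0, identical
          |     print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # > 0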
        
         | ljlolel wrote:
         | It just describes how two probability distributions are
         | different. If they are the same then it's 0.
         | 
         | Example: an LLM gives a probability distribution of the next
         | word. If it is perfectly accurate at predicting the next word
         | then divergence is 0 (100% probability on the actual next
         | word). If it is slightly off or unsure then the divergence goes
         | up.
        
         | glial wrote:
         | In the context of Bayesian inference, the KL divergence between
         | the prior p(m) and the posterior p(m|x) is the information
         | gained by observing x.
         | 
         | With VAEs, adding a KL divergence to the loss term can be
         | thought of as regularizing information gain from individual
         | inputs.
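          | 
          | In the common VAE setup where the encoder outputs a diagonal
          | Gaussian N(mu, sigma^2) and the prior is N(0, I), that KL
          | regularizer has a simple closed form; a sketch (assuming
          | that setup, with made-up numbers):
          | 
          |     import math
          | 
          |     def vae_kl(mu, logvar):
          |         # KL( N(mu, sigma^2) || N(0, 1) ) summed over latent
          |         # dimensions, with logvar = log(sigma^2)
          |         return sum(-0.5 * (1 + lv - m ** 2 - math.exp(lv))
          |                    for m, lv in zip(mu, logvar))
          | 
          |     print(vae_kl([0.0, 0.0], [0.0, 0.0]))    # 0: equals prior
          |     print(vae_kl([1.0, -0.5], [0.3, -0.2]))  # > 0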
        
         | SilasX wrote:
         | My favorite physical interpretation (discrete version):
         | 
         | How many extra bits per character your data compression
         | algorithm would need to store text from distribution P if it
         | (mistakenly) assumed it were drawn from Q.
         | 
         | That is, reserve the shortest "words" for the most common
         | characters based on the assumption that the data will be drawn
         | from Q. Then KL(P||Q) is how much bigger the compressed data
         | will be (per input character) if the data is actually drawn
         | from P.
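          | 
          | A tiny numeric sketch of that picture, with made-up character
          | frequencies and idealized code lengths of log2(1/prob) bits:
          | 
          |     import math
          | 
          |     P = {"a": 0.5, "b": 0.25, "c": 0.25}   # actual freqs
          |     Q = {"a": 0.25, "b": 0.25, "c": 0.5}   # assumed freqs
          | 
          |     # expected bits per character with a code built for Q,
          |     # versus the best possible code (built for P)
          |     bits_wrong = sum(P[x] * -math.log2(Q[x]) for x in P)
          |     bits_best  = sum(P[x] * -math.log2(P[x]) for x in P)
          | 
          |     print(bits_wrong - bits_best)   # extra bits per character
          |     print(sum(P[x] * math.log2(P[x] / Q[x]) for x in P))  # KL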
        
         | tech_ken wrote:
         | Lots of different ways of motivating KL, but I think one that's
         | frequently neglected in the ML era is its relationship to a
         | likelihood ratio test. Kullback and Leibler originally
          | characterized the KL divergence as a measure of the ability
          | to discriminate between two distributions (i.e. perform a
         | hypothesis test between them), given some set of observations
         | (the thing you're averaging over). You can read their original
         | paper for free if you want to hear it in their own words,
         | Kullback and Leibler 1951, although they go kind of deep on the
         | stat theory.
         | 
         | Say you're flipping a coin N times, and you get outcomes x_1,
         | x_2, ...., x_N. You want to determine whether the coin comes up
         | heads with probability P or probability Q (kind of a weird
         | fiction that there are only two options, but roll with it). The
         | classic way to do this would be a likelihood (log) ratio test,
         | you compute this test statistic:
         | 
         | Y = log( Pr[x_1,...,x_n|P] / Pr[x_1,...,x_N|Q] )
         | 
         | Depending on the value of Y you make a decision about whether
         | to go with P or Q, and IMO the definition makes intuitive sense
         | for this purpose: if Y is positive then you favor P, if Y is
         | negative you favor Q. If it's 0 then you can't pick between
         | them. Simple and easy, plus the Neyman-Pearson lemma basically
         | says that it's the best you could do (among the set of decision
         | making criteria which satisfy a certain set of desirable
         | properties).
         | 
         | Having defined your test statistic Y, you might then ask what
         | kind of values you can expect to get, even before you make any
         | experimental measurements. Well, if you assume that P is the
          | true probability (the null hypothesis) then E_X[Y|P,Q] =
          | N * KL[P||Q], i.e. KL[P||Q] per observation. Basically the
          | expected value of the log-likelihood
         | ratio, the thing you would use to decide between P and Q,
         | characterizes how similar or different they are. When the
         | expected value is close to 0, and hence P and Q are similar,
         | then before conducting the experiment you can expect that it
         | will be hard to distinguish between them. When the divergence
         | is very large then you can expect it will be easy.
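          | 
          | A quick simulation of that coin example (made-up values
          | P=0.6, Q=0.5), checking that the average per-flip
          | log-likelihood ratio under P comes out at KL(P||Q):
          | 
          |     import math, random
          | 
          |     random.seed(1)
          |     P, Q, N = 0.6, 0.5, 100_000
          |     flips = [random.random() < P for _ in range(N)]  # truth: P
          | 
          |     def llr(heads):   # per-flip log( Pr[x|P] / Pr[x|Q] )
          |         return (math.log(P / Q) if heads
          |                 else math.log((1 - P) / (1 - Q)))
          | 
          |     avg_llr = sum(llr(x) for x in flips) / N
          |     kl = (P * math.log(P / Q)
          |           + (1 - P) * math.log((1 - P) / (1 - Q)))
          |     print(avg_llr, kl)   # both ≈ 0.02 nats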
        
           | mreid wrote:
           | This likelihood ratio approach highlights the fact that KL
           | divergence is a member of the family of Csiszar
           | F-divergences. These are measures of "distance" between
           | distributions of the form E_Q[ F(p(x)/q(x)) ] where F is any
           | convex function with F(1) = 0. This is kind of a
           | generalization of log-likelihood where F kind of "weights"
           | the badness of ratios different to 1. When F is -log you get
           | KL divergence.
           | 
           | Another curious fact about KL divergences is they are also a
           | Bregman divergence: take a convex function H and define
            | B_H(P, Q) = \sum_x H(p(x)) - H(q(x)) - <∇H(q(x)), p(x) -
            | q(x)>. These generalize pointwise squared Euclidean
            | distance.
           | KL is obtained when H(P) is negative entropy \sum_x p(x) log
           | p(x).
           | 
           | I spent a bunch of time studying divergences over
           | distributions (e.g., see my blog post[1]) and in particular
           | these two classes and the _really_ neat fact about KL
           | divergence is that it is essentially the only divergence that
           | is both an F-divergence and a Bregman divergence. This is
           | basically due to the property of log that turns logs of
           | products into sums.
           | 
           | [1]: https://mark.reid.name/blog/meet-the-bregman-
           | divergences.htm...
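            | 
            | A small numeric check of both views (distributions made
            | up): the f-divergence E_Q[F(p/q)] with F(t) = t*log(t) and
            | the Bregman divergence generated by negative entropy both
            | come out equal to the usual sum form of KL(P||Q):
            | 
            |     import math
            | 
            |     p = [0.5, 0.25, 0.25]
            |     q = [0.25, 0.25, 0.5]
            | 
            |     kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
            | 
            |     # f-divergence with F(t) = t*log(t), convex and F(1)=0
            |     f_div = sum(qi * (pi / qi) * math.log(pi / qi)
            |                 for pi, qi in zip(p, q))
            | 
            |     # Bregman divergence of H(t) = t*log(t), H'(t) = log(t)+1
            |     breg = sum(pi * math.log(pi) - qi * math.log(qi)
            |                - (math.log(qi) + 1) * (pi - qi)
            |                for pi, qi in zip(p, q))
            | 
            |     print(kl, f_div, breg)   # all three agree
            |     # (F(t) = -log(t) gives the same quantity with the roles
            |     # of p and q swapped, i.e. KL(Q||P))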
        
         | bjornsing wrote:
         | I think KL divergence is best understood through the lens of
         | variational inference. Inference in the Bayesian regime is a
         | balancing act between best explaining the data and "keeping it
         | simple", by staying close to the prior. KL divergence is the
         | (only) divergence measure that places just the right amount of
         | weight on the prior to make variational inference proper
         | Bayesian inference. I've written about it here:
         | http://www.openias.org/variational-coin-toss
        
         | wdrw wrote:
          | For me, the intuitive way of understanding it is, "how badly
          | would a gambler lose in the long term, if they keep betting
          | on a game believing the probability distribution is Y but it
          | is in actual fact X". It also explains why KL divergence is
          | asymmetric, and why it goes to infinity / undefined when the
          | expected probability distribution has zeros where the true
          | distribution has non-zeros.
          | 
          | Suppose an urn can have red, blue and green balls. If the
          | true distribution (X) is that there are no red balls at all,
          | but the gambler believes (Y) that there is a small fraction
          | of red balls, the gambler would lose a bit of money with
          | every bet on red, but overall the loss is finite. But suppose
          | the gambler believes (Y) there are absolutely no red balls in
          | the urn, but in actual fact (X) there is some small fraction
          | of them. According to the gambler's beliefs it would be
          | rational to gamble potentially infinite money on the ball not
          | being red, so the loss is potentially infinite.
          | 
          | There is a parallel here to data compression, transmission,
          | etc (KL divergence between expected and actual distributions
          | in information theory) - if you believe a certain bit
          | sequence will _never_ occur in the input sequence, you won't
          | assign it a code, and so if it ever does actually occur you
          | won't be able to transmit it at all ("infinite loss"). If you
          | believe it will occur very infrequently, you will assign it a
          | very long code, and so if it actually occurs very frequently
          | your output data will be very long (large loss, large KL
          | divergence).
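          | 
          | A small numeric sketch of that setup (made-up numbers): with
          | fair odds of 1/p_i on each colour and a gambler who stakes a
          | fraction q_i of their wealth on colour i each round, the
          | long-run loss rate in log-wealth, relative to betting the
          | true p, works out to exactly KL(p||q):
          | 
          |     import math
          | 
          |     p = [0.5, 0.3, 0.2]   # true frequencies of the colours
          |     q = [0.3, 0.3, 0.4]   # the gambler's beliefs
          | 
          |     # expected log-growth of wealth per round when betting q
          |     # at fair odds 1/p_i (betting p instead would give 0)
          |     growth = sum(pi * math.log(qi / pi)
          |                  for pi, qi in zip(p, q))
          | 
          |     kl = sum(pi * math.log(pi / qi)
          |              for pi, qi in zip(p, q))
          |     print(-growth, kl)   # equal; and kl blows up to infinity
          |                          # if some q_i = 0 while p_i > 0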
        
         | abetusk wrote:
         | Intuitively, it measures the difference between two probability
         | distributions. It's not symmetric, so it's not quite that, but
         | in my opinion, it's good intuition.
         | 
         | As motivation, say you're an internet provider, providing
         | internet service to a business. You naturally want to save
         | money, so you perhaps want to compress packets before they go
         | over the wire. Let's say the business you're providing service
         | to also compresses their data, but they've made a mistake and
         | do it inefficiently.
         | 
         | Let's say the business has, incorrectly, determined the
         | probability distribution for their data to be $q(x)$. That is,
         | they assign probability of seeing symbol $x$ to be $q(x)$.
         | Let's say you've determined the "true" distribution to be
         | $p(x)$. The entropy, or number of bits, they expect to transmit
         | per packet/symbol will be $-\sum p(x) lg(q(x))$. Meaning,
         | they'll compress their stream under the assumption that the
          | distribution is $q(x)$ but the actual probability of seeing a
         | packet, $x$, is $p(x)$, which is why the term $p(x) lg(q(x))$
         | shows up.
         | 
         | The number of bits you're transmitting is just $-\sum p(x)
         | lg(p(x))$. Now we ask, how many bits, per packet, is the
          | savings of your method over the business's? This is $-\sum p(x)
         | lg(q(x)/p(x))$, which is exactly the Kullback-Leibler
         | divergence (maybe up to a sign difference).
         | 
          | In other words, given a "guess" at a distribution and the
          | "true" distribution, how far off is the guess? This is the
          | Kullback-Leibler divergence and why it shows up (I believe)
          | in machine learning and fitness functions.
         | 
          | As a more concrete example, I just ran across a paper [0]
          | talking about using WFC [1] to assess how well it, and other
          | algorithms, do when trying to create generative "super mario
          | brothers"-like levels. Take a 2x2 or 3x3 grid, make a library
         | of tiles, use that to generate a random level, then use the K-L
         | divergence to determine how well your generative algorithm has
         | done compared to the observed distribution from an example
         | image.
         | 
         | [0] https://arxiv.org/pdf/1905.05077.pdf
         | 
         | [1] https://github.com/mxgmn/WaveFunctionCollapse
        
           | enthdegree wrote:
           | Your characterization (originally due to Cover, I think) is a
           | strong one because it concretely ties KL divergence to
           | nature. I also like Sanov's theorem for this.
           | 
           | I have sat through many frustrating anti-explanations of the
           | following sort:
           | 
           | >What is KL divergence you ask? Why, it's simply a
           | quantitative difference between distributions. The further
           | away distributions are, the higher KL divergence is... It's
           | like a distance-squared between distributions... but it isn't
           | symmetric and it doesn't obey any usual triangle inequality,
           | so this analogy isn't helpful for analysis... Pinsker's
           | inequality gives a useful lower bound. A useful general upper
           | bound is, uhh,... uh...
           | 
           | This class of answer is totally uninformative (and
           | discrediting if given, IMO) because it does not provide a
           | useful, unique characterization of KL divergence, only
           | fundamentally inaccurate descriptions of it.
        
         | tel wrote:
         | It's a useful way of measuring how different two probability
          | distributions are. If F and G are distributions, KL(F||G) is a
          | non-negative real number which is larger if F and G are less
         | similar.
         | 
         | In a lot of statistical estimation procedures, you have some
         | kind of "current estimate" distribution which has nice
         | properties and some kind of "true distribution" which you'd
         | like to use your nice distribution to approximate. It's then
         | common to create a system which manipulates the parameters of
         | your current estimate distribution to minimize the KL-
         | divergence with the true distribution.
         | 
         | A relatively simple example of this is fitting a Gaussian
         | mixture model. If you look up how that process is derived
         | you'll see it depends centrally on minimizing the KL-divergence
         | between two distributions.
         | 
         | There are other ways to measure the difference between two
         | probability distributions, but the KL-divergence has some nice
         | properties. It shows up as the answer to lots of well-motivated
         | questions around statistical inference (Neyman-Pearson testing,
         | information geometry, Bayesian inference, entropy) and it has a
         | form which is somewhat amenable to algebraic manipulation.
         | 
         | In some sense it's popular because it keeps showing up and
         | working. People recognize its form, consider it relatively
         | simple, and find it meaningful to talk about. It's common
         | enough that you might even begin considering a problem by
         | asking if you can minimize the KL-divergence between your
         | estimate and some goal just knowing that outright it will
         | likely lead to a successful solution to your problem.
        
         | tnecniv wrote:
         | In addition to what others have said, it fixes a "bug" in
         | Shannon's continuous version of the entropy. Shannon assumed
         | you could just replace the sum with an integral and call it a
         | day. However, two bad things happen:
         | 
         | 1. This definition of entropy is not invariant to coordinate
         | transforms. If you change the parameters of your distribution,
         | you get a different value for the entropy, despite the change
         | of parameters not adding or removing information.
         | 
         | 2. You can get negative values for the entropy.
         | 
         | Jaynes argued (in a way that's quite readable if you find his
         | original paper, I'm on mobile) that really you should pick a
         | base reference measure q and define the entropy as -KL(p; q).
         | This fixes the first bug which is the critical one. The second
         | bug is halfway fixed because this quantity is always non-
         | positive. However that's alright because often we really care
         | about the change in entropy not the absolute value (like how we
         | care about the change in potential energy not the absolute
         | value).
         | 
         | This gets at why the KL divergence is often called the relative
         | entropy. It is the entropy relative to the reference measure q.
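          | 
          | Both bugs are easy to see with uniform distributions, whose
          | differential entropy is just the log of the interval length
          | (a sketch with made-up numbers):
          | 
          |     import math
          | 
          |     def h_uniform(a):
          |         # differential entropy of Uniform(0, a) is log(a)
          |         return math.log(a)
          | 
          |     def kl_uniform(a, b):
          |         # KL( Uniform(0,a) || Uniform(0,b) ) = log(b/a), a <= b
          |         return math.log(b / a)
          | 
          |     print(h_uniform(0.5))        # negative "entropy" (bug 2)
          |     print(h_uniform(1.0))        # rescaling x -> 2x changed
          |                                  # the entropy (bug 1)
          |     print(kl_uniform(0.5, 1.0),
          |           kl_uniform(1.0, 2.0))  # the relative version is
          |                                  # unchanged by that rescaling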
        
           | chermi wrote:
           | I assume you're referring to his most famous work, where he
           | introduces his maxent stuff?
           | http://bayes.wustl.edu/etj/articles/theory.1.pdf
           | 
           | I would also highly recommend the free and excellent book by
           | MacKay for understanding this:
           | http://www.inference.org.uk/mackay/itila/book.html
        
         | nonotmenonono wrote:
         | "The content of an information itself, maybe a signal, random
         | variable, or event",
         | 
         | "...defined through a negative logarithm of probability",
         | 
         | "...to model a given outcome", occurred.
         | 
         | P-:
        
         | 317070 wrote:
         | Here is another view, which also explains where the logarithm
         | comes from.
         | 
         | The KL between two distributions of a random variable, say
         | Kl[p|q], says that if you made a perfect compression algorithm
         | for samples from distribution q, how many extra bits/nats you
         | expect to need to code samples that actually come from p
         | instead if you use that compression algorithm.
         | 
         | And compression is all about keeping only the true information
         | that is encoded in a sample.
        
         | puzzlingcaptcha wrote:
         | One practical use might be assessing the similarity of a high-
         | dimensional data set and its low-dimensional projection, as
         | used in e.g. t-SNE.
        
       | mrv_asura wrote:
       | I learnt about KL Divergence recently and it was pretty cool to
       | know that cross-entropy loss originated from KL Divergence. But
       | could someone give me the cases where it is preferred to use
        | Mean-squared Error loss vs Cross-entropy loss? Are there any
        | merits or demerits of using either?
        
         | mufasachan wrote:
          | TL;DR: One is about distance in space, the other is about
          | spread in space.
          | 
          | KL-Divergence is not a metric; it's not symmetric, and it's
          | biased toward your reference distribution. It does, however,
          | give a probabilistic / information view of the difference
          | between distributions. One consequence is that KL will
          | highlight the tails of your distributions.
          | 
          | Euclidean distance, L2, is a metric, so it is suited when you
          | need a metric. But it gives no insight into the shape of the
          | distributions beyond their means.
          | 
          | For example, you are a teacher and you have two classes. You
          | want to compare the grades. L2 can be a summary of how far
          | apart the grades are. The length of the tails of the two
          | grade distributions won't have an impact if they have the
          | same mean. That's good if you want to know the average level
          | of both classes. KL instead gives a view of how alike the
          | spreads of the class grades are. Two classes can have a small
          | KL Divergence if their grade distributions have the same
          | shape. If your classes are very different - one very
          | homogeneous and the other very heterogeneous - then your KL
          | will be big, even if the averages are very close.
        
         | uoaei wrote:
         | This is the NN 101 explanation: mean-square loss is for
         | regression, cross-entropy is for classification.
         | 
         | NN 201 explanation: mean-square is about finite but continuous
         | errors (residuals) of predicted value vs true value of the
         | output, cross-entropy is for distributions over discrete sets
         | of categories.
         | 
         | NN 501 explanation: the task of the NN and the form of its
         | outputs should be defined in terms of the "shape" or nature of
         | the residuals of its predictions. Mean-square corresponds to
         | predicting means with Gaussian residuals, cross-entropy
         | corresponds to predicting over discrete outputs with
         | multinomial (mutually exclusive) structure. Indeed you can
         | derive any loss function you want by first defining the
         | expected form of the residuals and then deriving the negative
         | log-likelihood of the associated distribution.
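          | 
          | A sketch of that last point, assuming Gaussian residuals with
          | a fixed scale: writing out the negative log-likelihood makes
          | the mean-squared error appear directly (up to constants):
          | 
          |     import math
          | 
          |     y_true = [1.0, 2.0, 0.5]
          |     y_pred = [0.8, 2.3, 0.4]
          |     sigma, n = 1.0, 3
          | 
          |     # negative log-likelihood of y_true under N(y_pred, sigma^2)
          |     nll = sum(0.5 * ((yt - yp) / sigma) ** 2
          |               + math.log(sigma) + 0.5 * math.log(2 * math.pi)
          |               for yt, yp in zip(y_true, y_pred))
          | 
          |     mse = sum((yt - yp) ** 2
          |               for yt, yp in zip(y_true, y_pred)) / n
          | 
          |     # nll = n/(2*sigma^2) * mse + constant, so minimizing one
          |     # minimizes the other
          |     const = n * (math.log(sigma) + 0.5 * math.log(2 * math.pi))
          |     print(nll, n / (2 * sigma ** 2) * mse + const)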
        
         | imjonse wrote:
         | If you look deep enough they are not exclusive. From the
          | Deep Learning book:
         | 
         | "Many authors use the term "cross-entropy" to identify
         | specifically the negative log-likelihood of a Bernoulli or
         | softmax distribution, but that is a misnomer. Any loss
          | consisting of a negative log-likelihood is a cross-entropy
          | between the empirical distribution defined by the training set
          | and the probability distribution defined by the model. For
          | example,
         | mean squared error is the cross-entropy between the empirical
         | distribution and a Gaussian model"
         | 
         | https://stats.stackexchange.com/questions/288451/why-is-mean...
        
         | canjobear wrote:
         | In NN training, minimizing cross entropy is equivalent to
         | minimizing KL divergence. This is because cross entropy is
         | equal to (entropy of the true distribution) + (KL divergence
         | from the true distribution to the model). Obviously by changing
         | the model you can't change the first term, only the second. So
         | when you minimize cross entropy, you are minimizing KL
         | divergence.
         | 
         | Minimizing mean-squared-error loss is equivalent to minimizing
         | KL divergence (and thus cross entropy) under the assumption
         | that your model produces a vector that parameterizes the mean
         | of a multivariate Gaussian distribution which is then used to
         | predict your data. This is the most natural way to set up a
         | model that predicts continuous data.
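          | 
          | A quick numeric check of that decomposition (made-up "true"
          | distribution p and model q):
          | 
          |     import math
          | 
          |     p = [0.7, 0.2, 0.1]   # true next-token distribution
          |     q = [0.6, 0.3, 0.1]   # model's distribution
          | 
          |     cross_entropy = -sum(pi * math.log(qi)
          |                          for pi, qi in zip(p, q))
          |     entropy = -sum(pi * math.log(pi) for pi in p)
          |     kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
          | 
          |     # H(p, q) = H(p) + KL(p || q), and only the KL term
          |     # depends on the model q
          |     print(cross_entropy, entropy + kl)   # equal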
        
       | zerojames wrote:
       | I have used KL-divergence in authorship verification:
       | https://github.com/capjamesg/pysurprisal/blob/main/pysurpris...
       | 
       | My theory was: calculate entropy ("surprisal") of used words in a
       | language (in my case, from an NYT corpus), then calculate KL-
       | divergence between a given prose and a collection of surprisals
       | for different authors. The author to whom the prose had the
       | highest KL-divergence was assumed to be the author. I think it
       | has been used in stylometry a bit.
        
         | versteegen wrote:
         | *lowest KL-divergence
        
           | zerojames wrote:
           | Yes indeed -- thank you! :facepalm:
        
       | ljlolel wrote:
       | Equivalent to cross entropy for loss on NN
        
       | janalsncm wrote:
        | Btw, KL divergence isn't symmetrical, so D(P,Q) != D(Q,P). If
        | you need a symmetrical version you can use the Jensen-Shannon
        | divergence, which averages D(P,M) and D(Q,M) where M is the
        | 50/50 mixture of P and Q. Or you can symmetrize KL directly by
        | taking the mean of D(P,Q) and D(Q,P) (or just the sum, if you
        | only care about relative distances).
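        | 
        | A small sketch of both points with two made-up distributions:
        | 
        |     import math
        | 
        |     def kl(p, q):
        |         return sum(pi * math.log(pi / qi)
        |                    for pi, qi in zip(p, q) if pi > 0)
        | 
        |     def js(p, q):
        |         # average KL of P and Q to their 50/50 mixture M
        |         m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        |         return 0.5 * kl(p, m) + 0.5 * kl(q, m)
        | 
        |     P, Q = [0.8, 0.2], [0.3, 0.7]
        |     print(kl(P, Q), kl(Q, P))   # not equal
        |     print(js(P, Q), js(Q, P))   # equal, and always finite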
        
       | techwizrd wrote:
       | We use KL-divergence to calculate how surprising a time-series
       | anomaly is and rank them for aviation safety, e.g., give me a
       | ranked list of the most surprising increases in a safety metric.
       | It's quite handy!
        
         | acc_297 wrote:
         | That's very cool what are the most surprising increases in
         | aviation safety metrics lately?
        
         | j2kun wrote:
         | Would you be interested in chatting with me about this topic
         | for a book I'm working on? If so please reach out at
         | mathintersectprogramming@gmail.com and we can set up a time to
         | chat!
        
       | jszymborski wrote:
       | VAEs have made me both love and hate KLD. Goddamn mode collapse.
        
         | Q6T46nT668w6i3m wrote:
         | Yeah. I feel the same. But I also feel the same about mode
         | collapse! It's often useful for debugging issues that might
         | otherwise go undetected.
        
       | nravic wrote:
       | IIRC (and in my experience) KL divergence doesn't account for
       | double counting. Wrote a paper where I ended up having to use a
       | custom metric instead:
       | https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=4...
        
         | tysam_and wrote:
         | ...what does this mean?
        
       | tysam_and wrote:
       | Here is the simplest way of explaining the KL divergence:
       | 
       | The KL divergence yields a concrete value that tells you how many
       | actual bits of space on disk you will waste if you try to use an
       | encoding table from one ZIP file of data to encode another ZIP
       | file of data. It's not just theoretical, this is exactly the type
       | of task that it's used for.
       | 
        | The closer the two files are to each other in content, the fewer
       | wasted bits. So, we can use this to measure how similar two sets
       | of information are, in a manner of speaking.
       | 
       | These 'wasted bits' are also known as relative entropy, since
       | entropy basically is a measure of how disordered something can
       | be. The more disordered, the more possibilities we have to choose
       | from, thus the more information possible.
       | 
       | Entropy does not guarantee that the information is usable. It
       | only guarantees how much of this quantity we can get, much like
       | pipes serving water. Yes, they will likely serve water, but you
       | can accidentally have sludge come through instead. Still, their
       | capacity is the same.
       | 
       | One thing to note is that with our ZIP files, if you use the
       | encoding tables from one to encode the other, then you will end
       | up with different relative entropy (i.e. our 'wasted bits')
        | numbers than if you did it the other way around. This is
        | because the KL is not what's called symmetric. That is, it can
        | have a different meaning depending on which direction it goes.
       | 
       | Can you pull out a piece of paper, make yourself an example
       | problem, and tease out an intuition as to why?
        
         | nuancebydefault wrote:
          | Thanks for giving a concrete example. For me, formulas like
          | -log p(x) are much harder to understand.
        
           | tysam_and wrote:
           | First -- thank you for the thanks! It means a lot. :)
           | Secondly -- same. It's the way my brain works. In a
           | particular test I was administered a while back, 'coding'
           | (i.e. mapping information transiently to symbols) was by far
           | one of my weakest skills, oddly enough.
           | 
           | To me, I think in shapes. Ideas have shapes. Some ideas have
           | very similar shapes.
           | 
           | This makes me very good at neural network engineering and
           | research.
        
       ___________________________________________________________________
       (page generated 2023-08-22 23:01 UTC)