[HN Gopher] Kullback-Leibler divergence
___________________________________________________________________
Kullback-Leibler divergence
Author : dedalus
Score : 140 points
  Date   : 2023-08-21 20:26 UTC (1 day ago)
(HTM) web link (en.wikipedia.org)
(TXT) w3m dump (en.wikipedia.org)
| golwengaud wrote:
| I found https://www.lesswrong.com/posts/no5jDTut5Byjqb4j5/six-
| and-a-... very helpful for getting intuition for what the K-L
| divergence is and why it's useful. The six intuitions:
|     1. Expected surprise
|     2. Hypothesis testing
|     3. MLEs
|     4. Suboptimal coding
|     5a. Gambling games -- beating the house
|     5b. Gambling games -- gaming the lottery
|     6. Bregman divergence
| riemannzeta wrote:
| KL divergence has also been used to generalize the second law of
| thermodynamics for systems far from equilibrium:
|
| https://arxiv.org/abs/1508.02421
|
| And to explain the relationship between the rate of evolution and
| evolutionary fitness:
|
| https://math.ucr.edu/home/baez/bio_asu/bio_asu_web.pdf
|
| The connection between all of these manifestations of KL
| divergence is that a system far from equilibrium contains more
| information (in the Shannon sense) than a system in equilibrium.
| That "excess information" is what drives fitness within some
| environment.
| tysam_and wrote:
| I'm not sure if I follow the fitness-driving argument. Without
| an informational sieve of some kind, isn't it just random
| entropy?
|
| In this case, sure, far from equilibrium has more information,
| but that just means the system has more flexibility to move,
| not that the system actually inherently contains more
| information (stored information vs inbound throughput).
|
| Unless it's some kind of Nash game, or problem with multiple
| similarly-deep valleys, I don't see how that would do much more
| from a fitness perspective other than allow us to effect
| something approaching an unbiased estimator of the minima
| density of whatever problem we're working on. Or... whatever,
| I'm not quite sure.
|
| Anywho, however it is, I'm a bit confused on the last part
| there.
| riemannzeta wrote:
| I will confess that at least the biology work is still
| somewhat mysterious to me. John Baez used to be active on
| Twitter. I believe he's still active on Mathstodon now. If
| you're curious, you might ask him.
| jwarden wrote:
| Here's how I describe KL Divergence, building up from simple to
| complex concepts.
|
| surprisal: how surprised I am when I learn the value of X
|     Surprisal(x) = -log p(X=x)
|
| entropy: how surprised I expect to be
|     H(p) = E_X[-log p(X)] = Σ_x p(X=x) * -log p(X=x)
|
| cross-entropy: how surprised I expect Bob to be (if Bob's
| beliefs are q instead of p)
|     H(p,q) = E_X[-log q(X)] = Σ_x p(X=x) * -log q(X=x)
|
| KL divergence: how much *more* surprised I expect Bob to be
| than me
|     Dkl(p || q) = H(p,q) - H(p,p) = Σ_x p(X=x) * log p(X=x)/q(X=x)
|
| information gain: how much less surprised I expect Bob to be
| if he knew that Y=y
|     IG(q|Y=y) = Dkl(q(X|Y=y) || q(X))
|
| mutual information: how much information I expect to gain
| about X from learning the value of Y
|     I(X;Y) = E_Y IG(q|Y=y) = E_Y Dkl(q(X|Y=y) || q(X))
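|
| A minimal numeric sketch of these quantities (made-up
| three-outcome beliefs p and q, base-2 logs so everything is in
| bits):
|
|     import numpy as np
|
|     # my beliefs p vs Bob's beliefs q about a 3-outcome variable
|     p = np.array([0.5, 0.3, 0.2])
|     q = np.array([0.2, 0.5, 0.3])
|
|     surprisal_p = -np.log2(p)                # Surprisal(x) under p
|     entropy = (p * -np.log2(p)).sum()        # H(p)
|     cross_entropy = (p * -np.log2(q)).sum()  # H(p,q)
|     kl = (p * np.log2(p / q)).sum()          # Dkl(p || q)
|
|     print(kl, cross_entropy - entropy)       # the two agree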
| sarma17 wrote:
| qq, can you explain a few noob thoughts. Why is surprisal
| defined as `log (1/p(X=x)`? I'm lacking the intuition for 1.
| why did someone think to use 1/p? 2. why do we have a log term
| here.
|
| Log shows up a lot, but I don't think I ever really understood
| when or how people decide to use it in their formulas. One KL
| divergence youtube video said, `let's normalize the p(P)/p(Q)
| using (log(p(P)/q(P)))^N` and then shows how to derive the KL
| formula. I'd also appreciate it if you know how the N (sample
| size, I guess) is being used here.
|
| For a lot of math stuff I end up being confused about how people
| use operations. Like using 1/p(X=x) is like magic to me because
| I don't understand what the context is when someone thinks
| about the problem and then decides to do something. What's
| their thinking process here, what tool or process do I not know
| about that makes me confused?
| mrfox321 wrote:
| Typo in cross entropy, should be
|
| p log q
| jwarden wrote:
| Thank you, fixed.
| 3abiton wrote:
| Forgive my peasant mind, I've heard about K-L divergence
| before, I just don't know what it is used for nor what's
| special about it compared to other metrics?
| zmmmmm wrote:
| I think people who love math expressed in nomenclature will
| never understand those who don't ...
|
| K-L divergence is pretty much the really dumb, obvious thing
| you would probably think up if someone asked you to compare
| two probability distributions.
|
| You would start by going "oh I suppose I will look at how far
| apart they are at all the different values" and add it up.
| Then you would say, "oh, but 0.1 and 0.01 are really an order
| of magnitude apart, just like 0.01 and 0.001. Perhaps it will
| work better if I use the log of the probability". Then you
| would pause for thought and say, "Hmmmm hang on some of the
| values are really extreme but almost never happen, shouldn't
| it be weighted by how frequently they occur?".
|
| But of course paragraphs of mathematical symbols are the way
| many people prefer to express this.
| JonyEpsilon wrote:
| Practically speaking, it's a simple measure of how similar
| two probability distributions are, minimised (with value
| zero) when they are the same. So it's often used as a loss
| term in optimisations when you want two distributions to be
| pushed towards being similar. Sometimes this is motivated by
| clever reasoning about information/probability ... but often
| it's more just "slap a KL on it", because it tends to work.
| max_ wrote:
| K-L Divergence is something that keeps coming up in my research
| but I still don't understand what it is.
|
| Could someone give me a simple explanation as to what it is?
|
| And also, what practical use cases does it have?
| eob wrote:
| As for practical use cases, one is to find an approximate
| optimization to a function
|
| - You want to find the min/max of some probability distribution
| P(x)
|
| - P(x) is too complicated to find a closed-form min, but you
| can draw samples from it.
|
| - So instead, you carefully construct some OTHER probability
| distribution Q(x|th) that you claim is structurally similar
| "enough" to P(x), parameterized by th.
|
| - Now you find the th which minimizes the KL divergence
| KL(P(x) || Q(x|th)). That is equivalent to finding the
| parameters th of Q(x|th) that make it [approximately] "most"
| similar to P(x), without ever having optimized P(x) directly.
|
| It was a trick that came up a lot when AI consisted of giant
| Bayesian plate models for each specific task that you had to
| hand-optimize.
| versteegen wrote:
| Note that "drawing samples from P(x)" means to have training
| data drawn from P(x).
|
| You can form the 'empirical' probability distribution P'(x)
| from your n training samples {x_i}, with P'(x_i) = 1/n and
| P'(x) = 0 for all other x.
|
| Then finding the th which minimizes KL(P'(x) || Q(x|th)) is
| equivalent to finding the maximum likelihood estimate (MLE)
| given your training data.
|
| (Note: I don't know what's meant by "the min/max of some
| probability distribution P(x)" and suggest ignoring that)
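|
| A grid-search sketch of that equivalence, assuming a Bernoulli
| model Q(x|th) and made-up coin flips:
|
|     import numpy as np
|
|     x = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])     # 1 = heads
|     p_emp = np.array([(x == 0).mean(), (x == 1).mean()])
|
|     thetas = np.linspace(0.01, 0.99, 99)
|     kls = [np.sum(p_emp * np.log(p_emp / np.array([1 - t, t])))
|            for t in thetas]
|     logliks = [np.sum(x * np.log(t) + (1 - x) * np.log(1 - t))
|                for t in thetas]
|
|     # the KL minimizer and the MLE coincide at the sample mean
|     print(thetas[np.argmin(kls)], thetas[np.argmax(logliks)],
|           x.mean())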
| eob wrote:
| MLE | training data
|
| Just writing hand wavily :)
| leourbina wrote:
| One intuition is that KL-divergence represents a sort of
| "distance" between probability distributions. However, this
| isn't quite right, as it doesn't satisfy some basic properties a
| real distance (a metric) would satisfy: it isn't symmetric,
| KL(Q, P) != KL(P, Q), and it does not satisfy the triangle
| inequality. Nonetheless, KL(P, Q) gives you
| a good idea of how "far" P is from Q: in the context of
| encoding, if you wanted to come up with an ideal encoding of
| symbols coming from P, but you guessed Q as the distribution of
| these symbols, then KL(P, Q) is the extra number of bits you'd
| have to use. One nice property is that in the case that KL(P,Q)
| = 0, P and Q are equal (almost everywhere, which for most
| applications is irrelevant). This makes it useful in the ML
| context as you can minimize KL divergence and know that the
| resulting "guessed" distribution is getting closer to the data
| distribution you're trying to guess using some parametrized
| function (an NN).
| kgwgk wrote:
| > it doesn't satisfy some basic properties a real distance (a
| norm) would satisfy, including the fact that it isn't
| symmetric [...] and it does not satisfy the triangle
| inequality.
|
| Not sure about "real" but one can have useful distances which
| are not symmetric like the distance between cities measured
| in time or in gallons.
| SleekEagle wrote:
| One use case is in the training of Diffusion Models. In the
| original formulation, the likelihood-maximizing objective is
| recast in terms of KL divergences. This is done because the KL
| divergence between Gaussians has a closed form, and the
| transition distributions in Diffusion Models are taken to be
| Gaussian, which makes the problem tractable.
|
| https://www.assemblyai.com/blog/diffusion-models-for-machine...
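|
| For reference, a minimal sketch of that closed form in the
| univariate case (the multivariate diagonal version used in the
| diffusion papers is a sum of terms like this):
|
|     import numpy as np
|
|     def kl_gauss(mu1, sigma1, mu2, sigma2):
|         """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), nats."""
|         return (np.log(sigma2 / sigma1)
|                 + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
|                 - 0.5)
|
|     print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0, identical
|     print(kl_gauss(0.0, 1.0, 1.0, 2.0))  # small positive value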
| ljlolel wrote:
| It just describes how two probability distributions are
| different. If they are the same then it's 0.
|
| Example: an LLM gives a probability distribution of the next
| word. If it is perfectly accurate at predicting the next word
| then divergence is 0 (100% probability on the actual next
| word). If it is slightly off or unsure then the divergence goes
| up.
| glial wrote:
| In the context of Bayesian inference, the KL divergence between
| the prior p(m) and the posterior p(m|x) is the information
| gained by observing x.
|
| With VAEs, adding a KL divergence to the loss term can be
| thought of as regularizing information gain from individual
| inputs.
| SilasX wrote:
| My favorite physical interpretation (discrete version):
|
| How many extra bits per character your data compression
| algorithm would need to store text from distribution P if it
| (mistakenly) assumed it were drawn from Q.
|
| That is, reserve the shortest "words" for the most common
| characters based on the assumption that the data will be drawn
| from Q. Then KL(P||Q) is how much bigger the compressed data
| will be (per input character) if the data is actually drawn
| from P.
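|
| A tiny sketch of that bookkeeping, with made-up three-character
| frequencies and idealized (fractional-bit) code lengths:
|
|     import numpy as np
|
|     p = np.array([0.70, 0.20, 0.10])  # actual character frequencies
|     q = np.array([0.25, 0.25, 0.50])  # what the code was built for
|
|     code_len = -np.log2(q)            # ideal word lengths for Q
|     actual = (p * code_len).sum()     # bits/char you actually pay
|     best = (p * -np.log2(p)).sum()    # bits/char with the right code
|
|     print(actual - best, (p * np.log2(p / q)).sum())  # = KL(P||Q)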
| tech_ken wrote:
| Lots of different ways of motivating KL, but I think one that's
| frequently neglected in the ML era is its relationship to a
| likelihood ratio test. Kullback and Leibler originally
| characterized the KL divergence as a measure of the ability
| to discriminate between two distributions (i.e. perform a
| hypothesis test between them), given some set of observations
| (the thing you're averaging over). You can read their original
| paper for free if you want to hear it in their own words,
| Kullback and Leibler 1951, although they go kind of deep on the
| stat theory.
|
| Say you're flipping a coin N times, and you get outcomes x_1,
| x_2, ...., x_N. You want to determine whether the coin comes up
| heads with probability P or probability Q (kind of a weird
| fiction that there are only two options, but roll with it). The
| classic way to do this would be a likelihood (log) ratio test,
| you compute this test statistic:
|
|         Y = log( Pr[x_1,...,x_N|P] / Pr[x_1,...,x_N|Q] )
|
| Depending on the value of Y you make a decision about whether
| to go with P or Q, and IMO the definition makes intuitive sense
| for this purpose: if Y is positive then you favor P, if Y is
| negative you favor Q. If it's 0 then you can't pick between
| them. Simple and easy, plus the Neyman-Pearson lemma basically
| says that it's the best you could do (among the set of decision
| making criteria which satisfy a certain set of desirable
| properties).
|
| Having defined your test statistic Y, you might then ask what
| kind of values you can expect to get, even before you make any
| experimental measurements. Well, if you assume that P is the
| true probability (the null hypothesis) then E_X[Y|P,Q] =
| N * KL[P|Q], i.e. KL[P|Q] per flip. Basically the expected
| value of the log-likelihood
| ratio, the thing you would use to decide between P and Q,
| characterizes how similar or different they are. When the
| expected value is close to 0, and hence P and Q are similar,
| then before conducting the experiment you can expect that it
| will be hard to distinguish between them. When the divergence
| is very large then you can expect it will be easy.
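|
| A quick simulation of that last statement for the coin example
| (per-flip expectation of Y when P is the true coin):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     P, Q, N, trials = 0.6, 0.5, 100, 20000
|
|     flips = rng.random((trials, N)) < P      # data under the null
|     heads = flips.sum(axis=1)
|     Y = (heads * np.log(P / Q)
|          + (N - heads) * np.log((1 - P) / (1 - Q)))
|
|     kl = P * np.log(P / Q) + (1 - P) * np.log((1 - P) / (1 - Q))
|     print(Y.mean() / N, kl)   # per-flip mean of Y approaches KL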
| mreid wrote:
| This likelihood ratio approach highlights the fact that KL
| divergence is a member of the family of Csiszar
| F-divergences. These are measures of "distance" between
| distributions of the form E_Q[ F(p(x)/q(x)) ] where F is any
| convex function with F(1) = 0. This is kind of a
| generalization of log-likelihood where F "weights"
| the badness of ratios different from 1. When F(t) = t log t you
| get KL(P||Q); when F is -log you get the reversed KL(Q||P).
|
| Another curious fact about KL divergences is they are also a
| Bregman divergence: take a convex function H and define
| B_H(P, Q) = H(P) - H(Q) - <∇H(Q), P - Q>. These generalize
| squared Euclidean distance. KL is obtained when H(P) is the
| negative entropy \sum_x p(x) log p(x).
|
| I spent a bunch of time studying divergences over
| distributions (e.g., see my blog post[1]) and in particular
| these two classes and the _really_ neat fact about KL
| divergence is that it is essentially the only divergence that
| is both an F-divergence and a Bregman divergence. This is
| basically due to the property of log that turns logs of
| products into sums.
|
| [1]: https://mark.reid.name/blog/meet-the-bregman-
| divergences.htm...
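|
| A small sketch of the F-divergence recipe on a discrete example;
| F(t) = t log t recovers KL(P||Q), while F(t) = -log t gives the
| reversed KL(Q||P), so both orderings live in the family:
|
|     import numpy as np
|
|     p = np.array([0.5, 0.3, 0.2])
|     q = np.array([0.2, 0.5, 0.3])
|
|     def f_div(F, p, q):
|         """Csiszar f-divergence E_Q[ F(p(x)/q(x)) ]."""
|         return np.sum(q * F(p / q))
|
|     print(f_div(lambda t: t * np.log(t), p, q),
|           np.sum(p * np.log(p / q)))   # KL(P || Q), twice
|     print(f_div(lambda t: -np.log(t), p, q),
|           np.sum(q * np.log(q / p)))   # KL(Q || P), twice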
| bjornsing wrote:
| I think KL divergence is best understood through the lens of
| variational inference. Inference in the Bayesian regime is a
| balancing act between best explaining the data and "keeping it
| simple", by staying close to the prior. KL divergence is the
| (only) divergence measure that places just the right amount of
| weight on the prior to make variational inference proper
| Bayesian inference. I've written about it here:
| http://www.openias.org/variational-coin-toss
| wdrw wrote:
| For me, the intuitive way of understanding it is, "how badly
| would a gambler lose in the long term, if they keep betting on
| a game believing the probability distribution is X but it is in
| actual fact Y". It also explains why KL divergence is
| asymmetric, and why it goes to infinity / undefined when the
| expected probability distribution has zeros where the true
| distribution has non-zeros. Suppose an urn can have red, blue
| and green balls. If the true distribution (X) is that there are
| no red balls at all, but the gambler believes (Y) that there is
| a small fraction of red balls, the gambler would lose a bit of
| money with every bet on red, but overall the loss is finite.
| But suppose the gambler believes (Y) there are absolutely no
| red balls in the urn, but in actual fact (X) there is some
| small fraction of them. According to the gambler's beliefs it
| would be rational to gamble potentially infinite money on the
| ball not being red, so the loss is potentially infinite. There
| is a parallel here to data compression, transmission, etc (KL
| divergence between expected and actual distributions in
| information theory) - if you believe a certain bit sequence
| will _never_ occur in the input sequence, you won't assign it
| a code, and so if it ever does actually occur you won't be able
| to transmit it at all ("infinite loss"). If you believe it will
| occur very infrequently, you will assign it a very long code,
| and so if it actually occurs very frequently your output data
| will be very long (large loss, large KL divergence).
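|
| That long-run loss can be made concrete with proportional
| (Kelly-style) betting at fair odds, which is an extra assumption
| on top of the story above: the gap in log-wealth growth between
| betting with belief Y and betting with the truth X comes out to
| exactly KL(X || Y). A quick simulation:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     x_true = np.array([0.6, 0.3, 0.1])    # true urn proportions X
|     y_belief = np.array([0.4, 0.4, 0.2])  # gambler's belief Y
|     odds = 1.0 / x_true                   # fair payout odds
|
|     draws = rng.choice(3, size=200_000, p=x_true)
|     # each round, all wealth is split in proportion to the belief
|     log_growth = np.log(y_belief[draws] * odds[draws])
|
|     kl = np.sum(x_true * np.log(x_true / y_belief))
|     print(-log_growth.mean(), kl)   # loss rate vs optimal ~ KL
|
| If y_belief puts zero mass on an outcome that actually occurs,
| the log blows up, which is the "infinite loss" case above.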
| abetusk wrote:
| Intuitively, it measures the difference between two probability
| distributions. It's not symmetric, so it's not quite that, but
| in my opinion, it's good intuition.
|
| As motivation, say you're an internet provider, providing
| internet service to a business. You naturally want to save
| money, so you perhaps want to compress packets before they go
| over the wire. Let's say the business you're providing service
| to also compresses their data, but they've made a mistake and
| do it inefficiently.
|
| Let's say the business has, incorrectly, determined the
| probability distribution for their data to be $q(x)$. That is,
| they assign probability of seeing symbol $x$ to be $q(x)$.
| Let's say you've determined the "true" distribution to be
| $p(x)$. The expected number of bits they transmit per
| packet/symbol (the cross-entropy) will be $-\sum p(x)
| lg(q(x))$. Meaning,
| they'll compress their stream under the assumption that the
| distribution is $q(x)$ but the actual probability of seeing a
| packet, $x$, is $p(x)$, which is why the term $p(x) lg(q(x))$
| shows up.
|
| The number of bits you're transmitting is just $-\sum p(x)
| lg(p(x))$. Now we ask: how many bits, per packet, does your
| method save over the business's? This is $-\sum p(x)
| lg(q(x)/p(x)) = \sum p(x) lg(p(x)/q(x))$, which is exactly the
| Kullback-Leibler divergence.
|
| In other words, given a "guess" at a distribution and the
| "true" distribution, how bad is the mismatch between them? This
| is the Kullback-Leibler divergence and why it shows up (I
| believe) in machine learning and fitness functions.
|
| As a more concrete example, I just ran across a paper talking
| [0] about using WFC [1] to assess how well it, and other
| algorithms, do when trying to create generative "super mario
| brothers" like levels. Take a 2x2 or 3x3 grid, make a library
| of tiles, use that to generate a random level, then use the K-L
| divergence to determine how well your generative algorithm has
| done compared to the observed distribution from an example
| image.
|
| [0] https://arxiv.org/pdf/1905.05077.pdf
|
| [1] https://github.com/mxgmn/WaveFunctionCollapse
| enthdegree wrote:
| Your characterization (originally due to Cover, I think) is a
| strong one because it concretely ties KL divergence to
| nature. I also like Sanov's theorem for this.
|
| I have sat through many frustrating anti-explanations of the
| following sort:
|
| >What is KL divergence you ask? Why, it's simply a
| quantitative difference between distributions. The further
| away distributions are, the higher KL divergence is... It's
| like a distance-squared between distributions... but it isn't
| symmetric and it doesn't obey any usual triangle inequality,
| so this analogy isn't helpful for analysis... Pinsker's
| inequality gives a useful lower bound. A useful general upper
| bound is, uhh,... uh...
|
| This class of answer is totally uninformative (and
| discrediting if given, IMO) because it does not provide a
| useful, unique characterization of KL divergence, only
| fundamentally inaccurate descriptions of it.
| tel wrote:
| It's a useful way of measuring how different two probability
| distributions are. If F and G are distributions, KL(F||G) is a
| non-negative real number which is larger if F and G are less
| similar.
|
| In a lot of statistical estimation procedures, you have some
| kind of "current estimate" distribution which has nice
| properties and some kind of "true distribution" which you'd
| like to use your nice distribution to approximate. It's then
| common to create a system which manipulates the parameters of
| your current estimate distribution to minimize the KL-
| divergence with the true distribution.
|
| A relatively simple example of this is fitting a Gaussian
| mixture model. If you look up how that process is derived
| you'll see it depends centrally on minimizing the KL-divergence
| between two distributions.
|
| There are other ways to measure the difference between two
| probability distributions, but the KL-divergence has some nice
| properties. It shows up as the answer to lots of well-motivated
| questions around statistical inference (Neyman-Pearson testing,
| information geometry, Bayesian inference, entropy) and it has a
| form which is somewhat amenable to algebraic manipulation.
|
| In some sense it's popular because it keeps showing up and
| working. People recognize its form, consider it relatively
| simple, and find it meaningful to talk about. It's common
| enough that you might even begin considering a problem by
| asking if you can minimize the KL-divergence between your
| estimate and some goal just knowing that outright it will
| likely lead to a successful solution to your problem.
| tnecniv wrote:
| In addition to what others have said, it fixes a "bug" in
| Shannon's continuous version of the entropy. Shannon assumed
| you could just replace the sum with an integral and call it a
| day. However, two bad things happen:
|
| 1. This definition of entropy is not invariant to coordinate
| transforms. If you change the parameters of your distribution,
| you get a different value for the entropy, despite the change
| of parameters not adding or removing information.
|
| 2. You can get negative values for the entropy.
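|
| Both bugs show up already for uniform distributions (a sketch of
| the standard example, in nats):
|
|     import numpy as np
|
|     def h_uniform(a, b):
|         """Differential entropy of Uniform(a, b): log(b - a)."""
|         return np.log(b - a)
|
|     print(h_uniform(0, 1))    #  0.0
|     print(h_uniform(0, 2))    #  log 2, after rescaling x -> 2x
|     print(h_uniform(0, 0.5))  # -log 2, a negative "entropy"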
|
| Jaynes argued (in a way that's quite readable if you find his
| original paper, I'm on mobile) that really you should pick a
| base reference measure q and define the entropy as -KL(p; q).
| This fixes the first bug which is the critical one. The second
| bug is halfway fixed because this quantity is always non-
| positive. However that's alright because often we really care
| about the change in entropy not the absolute value (like how we
| care about the change in potential energy not the absolute
| value).
|
| This gets at why the KL divergence is often called the relative
| entropy. It is the entropy relative to the reference measure q.
| chermi wrote:
| I assume you're referring to his most famous work, where he
| introduces his maxent stuff?
| http://bayes.wustl.edu/etj/articles/theory.1.pdf
|
| I would also highly recommend the free and excellent book by
| MacKay for understanding this:
| http://www.inference.org.uk/mackay/itila/book.html
| nonotmenonono wrote:
| "The content of an information itself, maybe a signal, random
| variable, or event",
|
| "...defined through a negative logarithm of probability",
|
| "...to model a given outcome", occurred.
|
| P-:
| 317070 wrote:
| Here is another view, which also explains where the logarithm
| comes from.
|
| The KL between two distributions of a random variable, say
| KL[p|q], tells you: if you made a perfect compression algorithm
| for samples from distribution q, how many extra bits/nats you
| should expect to need to code samples that actually come from p
| instead, using that compression algorithm.
|
| And compression is all about keeping only the true information
| that is encoded in a sample.
| puzzlingcaptcha wrote:
| One practical use might be assessing the similarity of a high-
| dimensional data set and its low-dimensional projection, as
| used in e.g. t-SNE.
| mrv_asura wrote:
| I learnt about KL Divergence recently and it was pretty cool to
| know that cross-entropy loss originated from KL Divergence. But
| could someone give me the cases where it is preferred to use
| Mean-squared Error loss vs Cross-entropy loss? Are there any
| merits or demerits to using either?
| mufasachan wrote:
| TL;DR: One is about distance in space, the other is about
| spread in space.
|
| KL-Divergence is not a metric; it's not symmetric. It's biased
| toward your reference distribution. It does, however, give a
| probabilistic / information view of the difference between
| distributions. One outcome of this is that KL will
| highlight the tails of your distribution.
|
| Euclidean distance, L2, is a metric, so it is suited when you
| need a metric. However, it does not give insight into any
| distributional phenomena except the means of the distributions.
|
| For example, you are a teacher, you have two classes. You want
| to compare the grades. L2 can be a summary of how far the
| grades are apart. The length of the tails of the two grade
| distributions won't have an impact if they have the same mean.
| That's good if you want to know the average level of both
| classes. KL will instead tell you how alike the spreads of the
| class grades are. Two classes can have a small KL Divergence if
| they have the same shapes of distributions. If your classes are
| very different - one is very homogeneous and the other one is
| very heterogeneous - then your KL will be big, even if the
| averages are very close.
| uoaei wrote:
| This is the NN 101 explanation: mean-square loss is for
| regression, cross-entropy is for classification.
|
| NN 201 explanation: mean-square is about finite but continuous
| errors (residuals) of predicted value vs true value of the
| output, cross-entropy is for distributions over discrete sets
| of categories.
|
| NN 501 explanation: the task of the NN and the form of its
| outputs should be defined in terms of the "shape" or nature of
| the residuals of its predictions. Mean-square corresponds to
| predicting means with Gaussian residuals, cross-entropy
| corresponds to predicting over discrete outputs with
| multinomial (mutually exclusive) structure. Indeed you can
| derive any loss function you want by first defining the
| expected form of the residuals and then deriving the negative
| log-likelihood of the associated distribution.
| imjonse wrote:
| If you look deep enough they are not exclusive. From the
| deeplearning book:
|
| "Many authors use the term "cross-entropy" to identify
| specifically the negative log-likelihood of a Bernoulli or
| softmax distribution, but that is a misnomer. Any loss
| consisting of a negative log-likelihood is a cross-entropy
| between the empirical distribution defined by the training set
| and the probability distribution defined by the model. For
| example,
| mean squared error is the cross-entropy between the empirical
| distribution and a Gaussian model"
|
| https://stats.stackexchange.com/questions/288451/why-is-mean...
| canjobear wrote:
| In NN training, minimizing cross entropy is equivalent to
| minimizing KL divergence. This is because cross entropy is
| equal to (entropy of the true distribution) + (KL divergence
| from the true distribution to the model). Obviously by changing
| the model you can't change the first term, only the second. So
| when you minimize cross entropy, you are minimizing KL
| divergence.
|
| Minimizing mean-squared-error loss is equivalent to minimizing
| KL divergence (and thus cross entropy) under the assumption
| that your model produces a vector that parameterizes the mean
| of a multivariate Gaussian distribution which is then used to
| predict your data. This is the most natural way to set up a
| model that predicts continuous data.
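|
| A one-line check of that last point (a sketch assuming a fixed
| unit variance): the Gaussian negative log-likelihood is half the
| squared error plus a constant that doesn't depend on the model,
| so the two losses share minimizers and gradients.
|
|     import numpy as np
|
|     y = np.array([1.2, -0.3, 0.7])    # targets
|     mu = np.array([1.0, 0.0, 0.5])    # model predictions
|
|     nll = 0.5 * (y - mu) ** 2 + 0.5 * np.log(2 * np.pi)
|     half_mse = 0.5 * (y - mu) ** 2
|     print((nll - half_mse).mean())    # constant: log(2*pi)/2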
| zerojames wrote:
| I have used KL-divergence in authorship verification:
| https://github.com/capjamesg/pysurprisal/blob/main/pysurpris...
|
| My theory was: calculate entropy ("surprisal") of used words in a
| language (in my case, from an NYT corpus), then calculate KL-
| divergence between a given prose and a collection of surprisals
| for different authors. The author to whom the prose had the
| highest KL-divergence was assumed to be the author. I think it
| has been used in stylometry a bit.
| versteegen wrote:
| *lowest KL-divergence
| zerojames wrote:
| Yes indeed -- thank you! :facepalm:
| ljlolel wrote:
| Equivalent to cross entropy for loss on NN
| janalsncm wrote:
| Btw, KL divergence isn't symmetric, so D(P,Q) != D(Q,P). If you
| need a symmetric version of it, you can use the Jensen-Shannon
| divergence, which is the mean of D(P,M) and D(Q,M) where M is
| the mixture (P+Q)/2. Or, if you only need a symmetric score,
| you can just use the sum D(P,Q) + D(Q,P).
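|
| A minimal sketch of the Jensen-Shannon construction on toy
| distributions:
|
|     import numpy as np
|
|     def kl(p, q):
|         return np.sum(p * np.log2(p / q))
|
|     def js(p, q):
|         m = 0.5 * (p + q)                       # mixture midpoint
|         return 0.5 * kl(p, m) + 0.5 * kl(q, m)  # <= 1 bit
|
|     p = np.array([0.5, 0.3, 0.2])
|     q = np.array([0.2, 0.5, 0.3])
|     print(js(p, q), js(q, p))   # equal, unlike kl(p,q) vs kl(q,p)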
| techwizrd wrote:
| We use KL-divergence to calculate how surprising a time-series
| anomaly is and rank them for aviation safety, e.g., give me a
| ranked list of the most surprising increases in a safety metric.
| It's quite handy!
| acc_297 wrote:
| That's very cool. What are the most surprising increases in
| aviation safety metrics lately?
| j2kun wrote:
| Would you be interested in chatting with me about this topic
| for a book I'm working on? If so please reach out at
| mathintersectprogramming@gmail.com and we can set up a time to
| chat!
| jszymborski wrote:
| VAEs have made me both love and hate KLD. Goddamn mode collapse.
| Q6T46nT668w6i3m wrote:
| Yeah. I feel the same. But I also feel the same about mode
| collapse! It's often useful for debugging issues that might
| otherwise go undetected.
| nravic wrote:
| IIRC (and in my experience) KL divergence doesn't account for
| double counting. Wrote a paper where I ended up having to use a
| custom metric instead:
| https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=4...
| tysam_and wrote:
| ...what does this mean?
| tysam_and wrote:
| Here is the simplest way of explaining the KL divergence:
|
| The KL divergence yields a concrete value that tells you how many
| actual bits of space on disk you will waste if you try to use an
| encoding table from one ZIP file of data to encode another ZIP
| file of data. It's not just theoretical, this is exactly the type
| of task that it's used for.
|
| The closer the files are to each other in content, the fewer
| wasted bits. So, we can use this to measure how similar two sets
| of information are, in a manner of speaking.
|
| These 'wasted bits' are also known as relative entropy, since
| entropy basically is a measure of how disordered something can
| be. The more disordered, the more possibilities we have to choose
| from, thus the more information possible.
|
| Entropy does not guarantee that the information is usable. It
| only guarantees how much of this quantity we can get, much like
| pipes serving water. Yes, they will likely serve water, but you
| can accidentally have sludge come through instead. Still, their
| capacity is the same.
|
| One thing to note is that with our ZIP files, if you use the
| encoding tables from one to encode the other, then you will end
| up with different relative entropy (i.e. our 'wasted bits')
| numbers than if you did it the other way around. This is because the KL is
| not what's called symmetric. That is, it can have different
| meaning based upon which direction it goes.
|
| Can you pull out a piece of paper, make yourself an example
| problem, and tease out an intuition as to why?
| nuancebydefault wrote:
| Thanks for giving a concrete example. For me, formulas like -
| log p (x) are much harder to understand.
| tysam_and wrote:
| First -- thank you for the thanks! It means a lot. :)
| Secondly -- same. It's the way my brain works. In a
| particular test I was administered a while back, 'coding'
| (i.e. mapping information transiently to symbols) was by far
| one of my weakest skills, oddly enough.
|
| To me, I think in shapes. Ideas have shapes. Some ideas have
| very similar shapes.
|
| This makes me very good at neural network engineering and
| research.
___________________________________________________________________
(page generated 2023-08-22 23:01 UTC)