Talk:Kullback–Leibler divergence

From Wikipedia, the free encyclopedia


To whom it may concern. Recently the statement "KL(p,q) = 0 iff p=q" was added. I suspect that's not quite the case; maybe we want "KL(p,q) = 0 iff p=q (except for a set of measure 0 wrt p)" ?? Happy editing, Wile E. Heresiarch 01:41, 21 Oct 2004 (UTC)

I added "KL(p,q) = 0 iff p=q", which is a stronger claim than "KL(p,p)=0" in an earlier revision. My own understanding of measure theory is pretty limited; moreover, the article does not explicitly mention measures in connection with the integrals. However, Kullback and Leibler in their 1951 paper (lemma 3.1) did consider this and say that divergence is equal to zero if and only if the measurable functions p and q are equivalent with respect to the measure (i.e. p and q are equal except on a null set). That would include the case you mentioned, wouldn't it? --MarkSweep 03:29, 21 Oct 2004 (UTC)
Yes, that's what I'm getting at. What's not clear to me is that when we say p and q are equal except for a set of $FOO-measure 0, which measure $FOO are we talking about? I guessed the measure induced by p; but K & L must have specified which one in their paper. Wile E. Heresiarch 00:59, 22 Oct 2004 (UTC)
I checked the K&L paper again, but they simply define a compound measure λ as the pair of the measures associated with the functions p and q. The lemma is stated in terms of the measure λ. --MarkSweep 09:28, 24 Oct 2004 (UTC)


You could say "KL(p,q) = 0 iff p=q almost everywhere" which is concise and says what I think you're trying to say. - grubber 01:57, 17 October 2005 (UTC)
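For reference, one way to state the lemma precisely (a sketch in standard notation, not quoted from K&L): if $\lambda$ is any measure dominating both $\mathbb P$ and $\mathbb Q$ (for instance $\lambda = \mathbb P + \mathbb Q$), with densities $p = d\mathbb P/d\lambda$ and $q = d\mathbb Q/d\lambda$, then

\[
D_{KL}(\mathbb P \| \mathbb Q) \;=\; \int p \,\log\frac{p}{q}\, d\lambda \;\ge\; 0,
\]

with equality if and only if $p = q$ $\lambda$-almost everywhere, which is the same as saying $\mathbb P = \mathbb Q$ as measures. So the "almost everywhere" qualification applies to the densities; at the level of the measures themselves, "$D_{KL} = 0$ iff $\mathbb P = \mathbb Q$" is exact.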
The K-L divergence is between two probability measures on the same space. The support of one measure must be contained in the support of the other measure for the K-L divergence to be defined: for $D_{KL}(\mathbb P \| \mathbb Q)$ to be defined, it is necessary that $\mathrm{supp}\,\mathbb P \subseteq \mathrm{supp}\,\mathbb Q$. Otherwise there is an unavoidable zero in the denominator in the integral. Here the "support" is the intersection of all closed sets of measure 1, and has itself measure 1 because the space of real numbers is second-countable. The integral should actually be taken over $\mathrm{supp}\,\mathbb P$ rather than over all the reals. The $\log N$ that appears in some of the equations in the article should be the logarithm of the cardinality of the support of some probability measure on a discrete space. -- 130.94.162.61 20:35, 25 February 2006 (UTC)
Isn't the necessary condition $\mathbb P \ll \mathbb Q$, and isn't $D_{KL}(\mathbb P \| \mathbb Q) = \int_{\mathrm{supp}\,\mathbb P} \log\frac{d\mathbb P}{d\mathbb Q}\,d\mathbb P = \int_{\mathrm{supp}\,\mathbb P} \frac{d\mathbb P}{d\mathbb Q}\log\frac{d\mathbb P}{d\mathbb Q}\,d\mathbb Q$? -- 130.94.162.61 19:15, 27 February 2006 (UTC)
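In the discrete case the absolute-continuity condition $\mathbb P \ll \mathbb Q$ reduces to "$q_i = 0$ implies $p_i = 0$", which is easy to check directly. A minimal sketch in Python (the function name and the example vectors are invented for illustration):

import math

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats; infinite unless P << Q."""
    if any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
        return math.inf                      # P puts mass where Q has none
    # Terms with p_i == 0 contribute nothing (the 0 log 0 = 0 convention).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5, 0.0], [0.4, 0.4, 0.2]))   # finite, about 0.223
print(kl_divergence([0.5, 0.3, 0.2], [0.5, 0.5, 0.0]))   # inf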
Re reversion: Why have a whole article on "cross entropy", then, if it's not significant? -- 130.94.162.61 04:18, 2 March 2006 (UTC)
Excellent article, by the way. Has the context, clear explanation of technical details, related concepts, everything that is needed in a technical article. -- 130.94.162.61 16:37, 11 March 2006 (UTC)
I agree that the article is excellent. I have several comments and a question.
Regarding the comment of 25 February 2006 and the follow-on, this has special importance when analyzing natural data (as opposed to data composed of man-made alphabets). If the P distribution is based on a corpus of pre-existing data, then, as soon as you discover a new "letter" in an observed sample (upon which the Q distribution is based), you can no longer use Dkl, because then there will be a zero in the denominator.
To give a concrete example, suppose you are looking at amino-acid distributions in proteins and are using Dkl as a measure of how different the composition of a certain class of proteins (Q) is from that of a broad sample (P) comprising many classes. If the Q set lacks some amino acids that the P set contains, you can still compute a Dkl. But suppose that all of a sudden you discover a new amino acid in the Q set; this isn't as far-fetched as it sounds, if you admit somatically modified species. Then, no matter how infrequent that new species is, Dkl goes to infinity (or, if you prefer, hell in a handbasket).
This may be one of the reasons why K&L's symmetric measure has not met with favor: as the above example shows, you could easily come across a message that lacks some of the characters in the alphabet that P was computed over. In this case, the reverse -- and therefore the symmetric -- measure cannot be computed, even though the forward one can.
For example, imagine trying to create a Dkl for the composition of a single protein, compared to the composition of a broad set of proteins. It would be quite possible that some amino acid present in the large sample might not be represented in the small sample. But the one-way Dkl can still be computed and is useful.
Question: Is there a literature on expected error bounds on a Dkl estimate due to finite sample size (as there is for the Shannon entropy)? --Shenkin 03:05, 4 July 2006 (UTC)
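To make the asymmetry described above concrete, a small numeric sketch (the "amino-acid" frequencies are made up for illustration, and the divergence is computed in nats):

import math

def kl_divergence(p, q):
    """Discrete D_KL(P || Q); infinite if P puts mass on a symbol Q lacks."""
    if any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
        return math.inf
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# P: broad sample over four "amino acids"; Q: a single protein that
# happens to lack the fourth one entirely.
P = [0.40, 0.30, 0.20, 0.10]
Q = [0.50, 0.30, 0.20, 0.00]

print(kl_divergence(Q, P))   # finite (about 0.112): zero-frequency terms in Q drop out
print(kl_divergence(P, Q))   # inf: the reverse (and hence the symmetric) measure is undefined

# The same thing happens in the other direction if Q contains a symbol
# that the reference sample P lacks: then D(Q||P) is the one that blows up.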

regarding recent revert

Regarding jheald's recent revert: I spent a lot of time reorganizing the article and wanted to discuss the changes. Maybe some of them can be reimplemented? Here's a list:

  • you cannot have two probability distributions on the same random variable -- it's nonsense! What you have are two random variables -- but why even talk about that. Just say given two discrete probability distributions.
  • Absolutely you can, if they are conditioned on different information, or reflect different individuals' different knowledge, or different degrees of belief; or if one distribution is based on deliberate approximation. P(X|I) is different from P(X|D,I), but they are both distributions on the random variable X.
I understand what you mean. How does this mesh with the definition at random variable, though? It says that every random variable follows a (single?) distribution. Probability theory and measurable function say that random variables are functions that map outcomes to real numbers (in the discrete case, at least).
Strictly speaking, the random variable is the mapping X: Ω -> R. That is a function that can be applied to many probability spaces, distinguished by different measures: (Ω,Σ,P), (Ω,Σ,Q), (Ω,Σ,R), (Ω,Σ,S) etc. More loosely, we tend to talk of the random variable X as a quantity to which we can assign probability distribution(s), where these are the distributions induced from the measures P,Q,R,S by applying the mapping X to Ω. Either way, it is entirely conventional to talk about "the probability distribution of the random variable X". Jheald 21:59, 3 March 2007 (UTC)
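A toy illustration of that point, with one mapping X and two measures on the same space (all numbers invented):

import math
from collections import defaultdict

omega = [1, 2, 3, 4, 5, 6]                   # one sample space (a die)
X = lambda w: w % 2                          # one random variable X: Omega -> {0, 1}

P = {w: 1 / 6 for w in omega}                # two different probability measures
Q = {1: 0.25, 2: 0.05, 3: 0.25, 4: 0.05, 5: 0.25, 6: 0.15}   # on the same (Omega, Sigma)

def push_forward(measure):
    """Distribution of X induced by a probability measure on Omega."""
    dist = defaultdict(float)
    for w, prob in measure.items():
        dist[X(w)] += prob
    return dict(dist)

pX, qX = push_forward(P), push_forward(Q)    # two distributions of the same X
print(pX)                                    # about {1: 0.5, 0: 0.5}
print(qX)                                    # about {1: 0.75, 0: 0.25}

# Because they are distributions of the same random variable, comparing
# them with D_KL(pX || qX) makes sense.
print(sum(pX[x] * math.log(pX[x] / qX[x]) for x in pX))   # about 0.144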
  • use distinguish template instead of clouding the text
  • Don't use ugly hat notes for mere asides (and that template is particularly ugly). Nobody is going to type in Kullback-Leibler if they want an article on vector calculus.
You're probably right :) I was just trying to get rid of the note "(not to be confused with..." from the text.
  • list of alternative names all together instead of mixed into the text at the author's whim
  • Lists which are too long are hard to read, and break up the flow. Better to stress only information divergence, information gain, and relative entropy first. Information gain and relative entropy are particularly important because they are different ways to think about the K-L divergence. They should stand out. On the other hand K-L distance is just an obvious loose abbreviation.
Okay... :)
  • Generalizing the two examples is a new paragraph
  • Unnecessary, and visually less appealing. Better rhythm without the break.
Yeah... but poor grammar :(
  • Gibbs inequality is the most basic property and comes first
  • No, the most important property is that this functional means something. The anchor for that meaning is the Kraft-McMillan theorem. And that meaning informs what the other properties mean.
Hmmm. that's a really good point. I didn't see it that way before.
  • the motivation, properties and terminology section is split up into three sections
  • Sections only two lines long suggest over-division. Besides, the whole point of calling it a "directed" divergence was about the non-symmetry of D(P||Q) and D(Q||P).
  • the note about the KL divergence being well-defined for continuous distributions is superfluous given that it is defined for continuous distributions in the introduction.
  • Not superfluous. It is hugely important in this context, in that the Shannon entropy does not have a reliably interpretable meaning for continuous distributions. The K-L divergence (a.k.a. relative entropy) does.
Oh! :) We could make that clear: "Unlike Shannon entropy, the KL divergence remains..." (see the sketch below).

--MisterSheik 13:51, 3 March 2007 (UTC)
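On the last point in the exchange above, a sketch of the standard argument for why relative entropy stays meaningful in the continuous case while differential entropy does not:

\[
D_{KL}(P\|Q) \;=\; \int p(x)\,\log\frac{p(x)}{q(x)}\,dx,
\qquad
h(P) \;=\; -\int p(x)\,\log p(x)\,dx .
\]

Under an invertible change of variables $y = f(x)$ the densities transform as $\tilde p(y) = p(x)/|f'(x)|$ and $\tilde q(y) = q(x)/|f'(x)|$, so the Jacobian cancels inside the logarithm and $D_{KL}$ is unchanged, whereas the differential entropy shifts by $\mathbb{E}[\log|f'(X)|]$ (and can even be negative). The relative entropy therefore keeps its interpretation regardless of the parametrization; the Shannon/differential entropy does not.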

Sheik, backing revisions out wholesale is not something I do lightly. I can see from the log that you spent time thinking about them. But in this case, as with Beta distribution, I thought that not one of your edits was positive for the article. That contrasts with Probability distribution, Quantities of information, Probability theory and Information theory, where although I have issues with some of the changes you made, I thought some of the steps were in the right direction. Jheald 15:37, 3 March 2007 (UTC)
Thanks for getting back to me Jheald. I'm glad that backing out revisions isn't something you take lightly :) I'm going to take a break from this article so that I can think it over some more. I'll add any ideas to the talk page so that we can discuss them.
Also, I'm glad you were okay with (most of) my changes to those other articles. I didn't actually add any information to probability theory; I just brought together information that was spread over many pages and mostly reduplicated. I think it's unfortunate that some pages (like the information entropy pages) seem disorganized (individual pages are organized, but the group as a whole is hard to read--you end up reading the same thing on many pages). Do you think that these pages could be organized? Do you have any idea how we could start?

MisterSheik 18:28, 3 March 2007 (UTC)

motivation

I read the motivation, but I am not really sure what it means. The first two sentences have nothing to do with the rest of the article. Could someone make this clearer? —Preceding unsigned comment added by Forwardmeasure (talkcontribs) 03:38, 25 March 2007

In fact, there is no "motivation" in there at all. That section really should be renamed. MisterSheik 03:54, 25 March 2007 (UTC)

f-divergence

The link to f-divergence I placed in the opening section was removed and not placed anywhere else in the article. Is there a reason for this? This family of divergences is a fairly important generalisation and may lead readers who find the KL-divergence unsuitable to something more appropriate. Should I put it back in somewhere else? If not, a reason would be helpful as I can't understand the motivation to remove it completely. MDReid (talk) 11:26, 15 January 2008 (UTC)
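For context, the relationship is a standard one (sketched here, not sourced from the article): the KL divergence is the f-divergence generated by $f(t) = t\log t$,

\[
D_f(P\|Q) \;=\; \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx,
\qquad
f(t) = t\log t
\;\Longrightarrow\;
D_f(P\|Q) \;=\; \int p(x)\,\log\frac{p(x)}{q(x)}\,dx \;=\; D_{KL}(P\|Q),
\]

while other convex $f$ with $f(1) = 0$ give, for example, the total variation distance ($f(t) = \tfrac12|t-1|$) and the $\chi^2$ divergence ($f(t) = (t-1)^2$).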