Information divergence
The information divergence[^1] measures the information gained when going from a prior distribution $Q$ to a posterior distribution $P$:
$$D_{\mathrm{KL}}(P \| Q) := \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right].$$
We can extend the notion to random variables: for $X \sim P$ and $Y \sim Q$, we set $D_{\mathrm{KL}}(X \| Y) := D_{\mathrm{KL}}(P \| Q)$ (note that this is only a function of the distributions of $X$ and $Y$, not of any joint distribution over them).
Interpretations
The information divergence can be understood in three main ways, as a measure of
- the information gained when updating from $Q$ to $P$,
- the specificity of draws from $P$ compared to $Q$,
- and the unlikelihood of $Q$ as a model given that your observations suggest $P$.
All of these links ultimately rely on the fact that the information divergence adds up nicely if we compare several independent samples from both distributions: $D_{\mathrm{KL}}(P^{\otimes n} \,\|\, Q^{\otimes n}) = n \, D_{\mathrm{KL}}(P \| Q)$.
See Six (and a half) intuitions for KL divergence for more intuitions, in particular ones related to betting.
Information gained
The information divergence is the core of information theory because it defines what it means to gain information based on observations. In particular,
- the mutual information $I[X; Y]$ is the average information we gain about $X$ when observing $Y$ (and vice versa): $I[X; Y] = \mathbb{E}_{y \sim P_Y}\!\left[D_{\mathrm{KL}}\!\left(P_{X \mid Y=y} \,\|\, P_X\right)\right]$;
- and the entropy $H[X]$ is the average information we gain about $X$ when observing it: $H[X] = \mathbb{E}_{x \sim P_X}\!\left[D_{\mathrm{KL}}\!\left(\delta_x \,\|\, P_X\right)\right]$.
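A quick numerical sanity check of the mutual-information identity, on a small made-up joint distribution (the table below and the variable sizes are just for illustration):

```python
import numpy as np

# Made-up joint distribution of (X, Y) on {0,1} x {0,1,2}; rows = x, columns = y.
joint = np.array([[0.10, 0.25, 0.15],
                  [0.20, 0.05, 0.25]])

def kl(p, q):
    """Information divergence (in nats) between two distributions on the same finite support."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p_x = joint.sum(axis=1)   # marginal of X
p_y = joint.sum(axis=0)   # marginal of Y

# I[X;Y] as the average divergence of the conditional P(X | Y=y) from the marginal P(X)...
avg_update = sum(p_y[y] * kl(joint[:, y] / p_y[y], p_x) for y in range(len(p_y)))
# ...and directly as D_KL(joint || product of marginals).
direct = kl(joint.flatten(), np.outer(p_x, p_y).flatten())

print(avg_update, direct)  # the two numbers agree
```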
Specificity of draws from $P$
The prior $Q$ is typically more spread out than the posterior $P$, so a typical draw from $P$ looks unusually specific from $Q$'s point of view; the information divergence quantifies by how much, in two closely related ways.
Asymptotic likelihood ratio
The information divergence is the asymptotic log-likelihood ratio between $P$ and $Q$ on data drawn from $P$: for independent samples $x_1, \ldots, x_n \sim P$,
$$\frac{P(x_1, \ldots, x_n)}{Q(x_1, \ldots, x_n)} = \exp\!\big(n \left(D_{\mathrm{KL}}(P \| Q) \pm \varepsilon\right)\big)$$
except with probability $o(1)$ as $n \to \infty$ (for any fixed $\varepsilon > 0$), by the law of large numbers applied to $\log \frac{P(x_i)}{Q(x_i)}$.
In particular, this means that the information divergence measures the amount of selection pressure you typically need to pick out a sample that roughly looks like it came from $P$ when you are actually drawing from $Q$: such samples have probability about $\exp\!\big({-n \, D_{\mathrm{KL}}(P \| Q)}\big)$ under $Q$.
Entropy deficit
If $Q$ is the uniform distribution over a set of size $N$, then $D_{\mathrm{KL}}(P \| Q) = \log N - H[P]$: the divergence is the entropy deficit of $P$, i.e. how far $P$ falls short of being maximally uncertain. More generally, $D_{\mathrm{KL}}(P \| Q) = H[P, Q] - H[P]$, where $H[P, Q] := \mathbb{E}_{x \sim P}[-\log Q(x)]$ is the cross-entropy.
Unlikelihood of $Q$
Hypothesis testing
As a consequence, there exists a test for “detecting $P$” based on a large enough number $n$ of samples, such that
- if the samples are from $P$, the test comes up positive with high probability in $n$ (the test is sensitive),
- and if the samples are from $Q$, the test comes up positive with probability at most $\exp\!\big({-n \left(D_{\mathrm{KL}}(P \| Q) - o(1)\right)}\big)$ (the test is extremely specific).[^2]
So if the test comes up positive, the belief that the samples came from $Q$ rather than $P$ takes a hit of roughly a factor of $\exp\!\big(n \, D_{\mathrm{KL}}(P \| Q)\big)$: the divergence measures how unlikely $Q$ becomes as a model once the observations look like $P$.
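A small numerical illustration with Bernoulli distributions (the parameters $0.7$, $0.3$ and the slack $\varepsilon$ are arbitrary choices for this sketch): declaring “$P$” whenever the empirical log-likelihood ratio exceeds $n(D_{\mathrm{KL}}(P \| Q) - \varepsilon)$ is sensitive under $P$, while its false-positive rate under $Q$ decays at roughly the rate $D_{\mathrm{KL}}(P \| Q)$.

```python
import numpy as np
from scipy.stats import binom

p, q = 0.7, 0.3                                        # arbitrary Bernoulli parameters for P and Q
D = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))   # D_KL(P || Q) in nats
llr1, llr0 = np.log(p / q), np.log((1 - p) / (1 - q))  # per-sample log-likelihood ratios
eps = 0.05

for n in [50, 200, 800]:
    # The test is positive when the total LLR exceeds n * (D - eps), i.e. when the number
    # of ones k satisfies k * llr1 + (n - k) * llr0 >= n * (D - eps).
    k_min = np.ceil(n * (D - eps - llr0) / (llr1 - llr0))
    sensitivity = binom.sf(k_min - 1, n, p)            # P[test positive] under P
    false_positive = binom.sf(k_min - 1, n, q)         # P[test positive] under Q
    print(n, sensitivity, np.log(false_positive) / n, -D)
    # sensitivity climbs toward 1, while log(false positive rate)/n stays within about eps of -D
```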
Learning theory: minimizing divergence maximizes likelihood of the data
In learning theory, we can let $P$ be the data distribution (nature) and $Q$ be the model we are fitting. Since $D_{\mathrm{KL}}(P \| Q) = -H[P] - \mathbb{E}_{x \sim P}[\log Q(x)]$ and $H[P]$ doesn't depend on the model, minimizing the divergence over $Q$ is the same as maximizing the expected log-likelihood of the data under the model.
In terms of dynamics, the model knows less than nature, so it starts out more uncertain, and therefore training mostly consists of the model concentrating its mass onto the things nature actually does; the penalty is worst wherever nature puts mass that the model barely covers, so the model is pushed to cover all of $P$.
If instead we define the loss as the reverse divergence $D_{\mathrm{KL}}(Q \| P)$, the pressure flips: the model is mostly penalized for putting mass on things nature doesn't do, so it tends to latch onto a mode of $P$ rather than cover all of it.
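A sketch of this asymmetry, fitting a single (discretized) Gaussian to a bimodal target under each divergence; the grid, the target mixture, and the use of Nelder–Mead are arbitrary choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)            # fixed grid, so both divergences are finite sums

def normalize(w):
    return w / w.sum()

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300)))

# Bimodal "nature" distribution P: a mixture of two narrow Gaussians.
P = normalize(norm.pdf(x, -3, 0.7) + norm.pdf(x, 3, 0.7))

def Q(params):                            # single-Gaussian model family
    mu, log_sigma = params
    return normalize(norm.pdf(x, mu, np.exp(log_sigma)))

forward = minimize(lambda t: kl(P, Q(t)), x0=[0.5, 0.0], method="Nelder-Mead").x
reverse = minimize(lambda t: kl(Q(t), P), x0=[0.5, 0.0], method="Nelder-Mead").x

print("forward KL fit: mu=%.2f, sigma=%.2f" % (forward[0], np.exp(forward[1])))
print("reverse KL fit: mu=%.2f, sigma=%.2f" % (reverse[0], np.exp(reverse[1])))
# The forward fit straddles both modes with a large sigma (mass-covering);
# the reverse fit locks onto the mode near +3 with a small sigma (mode-seeking).
```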
#to-write explicitly model Q as P conditioned on E?
Properties
Nonnegativity
As a ratio divergence, it is always nonnegative, and is zero if and only if $P = Q$.
Asymptotics
Using its nonnegative form
$$D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{x \sim Q}\left[\frac{P(x)}{Q(x)} \log \frac{P(x)}{Q(x)} - \frac{P(x)}{Q(x)} + 1\right],$$
we get
$$D_{\mathrm{KL}}(P \| Q) \approx \frac{1}{2} \, \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)} - 1\right)^2\right] = \frac{1}{2} \, \chi^2(P \| Q)$$
when $P$ is close to $Q$ (change variables to the ratio $t := \frac{P(x)}{Q(x)}$ and Taylor-expand $t \log t - t + 1 \approx \frac{(t-1)^2}{2}$ around $t = 1$).
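A quick numerical check of this approximation (the base distribution and the perturbation direction are arbitrary):

```python
import numpy as np

q = np.array([0.1, 0.2, 0.3, 0.4])               # arbitrary base distribution Q
direction = np.array([0.5, -0.25, -0.5, 0.25])   # arbitrary perturbation, sums to zero

def kl(a, b):
    return np.sum(a * np.log(a / b))

for scale in [0.1, 0.01, 0.001]:
    p = q + scale * direction                    # a nearby distribution P
    chi2 = np.sum((p - q) ** 2 / q)
    print(scale, kl(p, q), chi2 / 2)             # the two columns agree better and better
```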
Additivity
Independent priors and posteriors
If the priors and posteriors are both product distributions, then the divergence can be decomposed into a sum:
$$D_{\mathrm{KL}}\!\left(\bigotimes_i P_i \,\Big\|\, \bigotimes_i Q_i\right) = \sum_i D_{\mathrm{KL}}(P_i \| Q_i).$$
This situation mostly occurs when we believe these variables to be independent and our updates treat each of them separately.
Independent priors
If the prior variables are independent (i.e. $Q = \bigotimes_i Q_i$ is a product distribution) but the posterior $P$ isn't necessarily one, then the divergence is superadditive:
$$D_{\mathrm{KL}}(P \| Q) = \sum_i D_{\mathrm{KL}}(P_i \| Q_i) + D_{\mathrm{KL}}\!\left(P \,\Big\|\, \bigotimes_i P_i\right) \geq \sum_i D_{\mathrm{KL}}(P_i \| Q_i),$$
where the $P_i$ are the marginals of $P$ and the extra term is the total correlation of $P$.
Intuitively, this is because you can notice more ways in which the posterior differs from the prior: each marginal moves, and on top of that the variables become correlated, which is itself information gained.
I'm not sure about the case where only the posterior variables are independent but the prior variables aren't.
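A quick check of this decomposition on a made-up correlated posterior $P$ over two bits with a product prior $Q$:

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

# Correlated posterior P over (X1, X2) in {0,1}^2; rows = x1, columns = x2 (made up).
P = np.array([[0.40, 0.15],
              [0.10, 0.35]])
# Product prior Q with P(X1=1) = 0.5 and P(X2=1) = 0.4.
Q = np.outer([0.5, 0.5], [0.6, 0.4])

P1, P2 = P.sum(axis=1), P.sum(axis=0)            # marginals of P
Q1, Q2 = Q.sum(axis=1), Q.sum(axis=0)            # marginals of Q (= its factors)

joint_div = kl(P.flatten(), Q.flatten())
marginal_divs = kl(P1, Q1) + kl(P2, Q2)
total_correlation = kl(P.flatten(), np.outer(P1, P2).flatten())

print(joint_div, marginal_divs + total_correlation)   # equal
print(joint_div >= marginal_divs)                      # superadditivity: True
```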
#to-write
- also rewrite in distribution form?
- no actually maybe it's better to write it like $D_{\mathrm{KL}}(P \| Q) = \sum_i D_{\mathrm{KL}}(P_i \| Q_i) + D_{\mathrm{KL}}(P \| \bigotimes_i P_i)$?
    - ohh hold on this is where "total correlation" comes in right?
- can you also write that one as a sum of mutual informations? or of interaction informations?
    - is it the sum of all the interaction informations?
    - is it $I[X_1; X_2] + I[X_3; X_1, X_2] + I[X_4; X_1, X_2, X_3]$, etc.?
    - ==okay this is highly relevant to my plight with the small-read local lemma!==
- also explain the typical situation: the variables were drawn from a common process but now we have observed something about their values in aggregate
- link to the information theoretic proof of the Chernoff bound
General case
In general, though, the quantities $D_{\mathrm{KL}}(P \| Q)$ and $\sum_i D_{\mathrm{KL}}(P_i \| Q_i)$ can compare either way: the joint divergence can be larger or smaller than the sum of the marginal divergences.
On the other hand, by the section on conditional divergence below, we know that the joint divergence is at least each individual marginal divergence, so that for $n$ variables
$$\sum_i D_{\mathrm{KL}}(P_i \| Q_i) \leq n \, D_{\mathrm{KL}}(P \| Q),$$
and the second example below approximately saturates this bound.[^3]

#figure redo the figure with the new notation
Triangle inequality
Sadly, the information divergence doesn't satisfy the triangle inequality. In general, $D_{\mathrm{KL}}(P \| R)$ can be much larger than $D_{\mathrm{KL}}(P \| Q) + D_{\mathrm{KL}}(Q \| R)$.
Some representative examples:
- "zoom regime":
    - example 1:
        - if the distributions are: $P$ is deterministically $x^*$, $Q$ gives probability $\varepsilon$ to the value $x^*$ and is otherwise uniform over a set of size $N$, and $R$ is uniform over that set,
        - then with $\varepsilon = \frac{1}{\log N}$ we get $D_{\mathrm{KL}}(P \| Q) = \log \log N$ and $D_{\mathrm{KL}}(Q \| R) \leq 1$, while $D_{\mathrm{KL}}(P \| R) = \log N$ is much larger than their sum.
    - example 2 (even worse):
        - if the distributions are chosen similarly but more extremely (zooming in through intermediate levels),
        - then the divergences through the middle are all small, but the direct divergence is even larger in comparison.
- "tweak regime":
    - if $P$, $Q$, $R$ are small perturbations of one another with $Q$ halfway between $P$ and $R$ (e.g. $\mathrm{Ber}(\tfrac{1}{2})$, $\mathrm{Ber}(\tfrac{1}{2} + \delta)$, $\mathrm{Ber}(\tfrac{1}{2} + 2\delta)$ for small $\delta$),
    - then the divergences behave like squared distances (see the asymptotics above), so $D_{\mathrm{KL}}(P \| R) \approx 4 \, D_{\mathrm{KL}}(P \| Q) \approx 2 \left(D_{\mathrm{KL}}(P \| Q) + D_{\mathrm{KL}}(Q \| R)\right)$: the inequality still fails, but only by a factor of about $2$.
#to-think I think in general, for any distributions
Minimization
Information projection: minimizing over $P$
Minimizing $D_{\mathrm{KL}}(P \| Q)$ over the posterior $P$ trades off two terms, since
$$D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{x \sim P}[-\log Q(x)] - H[P].$$
That is, the best $P$ maximizes its own entropy while keeping its mass where $Q$ is likely; with no constraints the optimum is just $P = Q$, and minimizing under constraints on $P$ gives the information projection of $Q$ onto the constraint set.
More generally,
- the expression $-\log Q(x)$ doesn't have to correspond to an actual distribution: we can let it be any energy function $E(x)$,
- and we can weight the entropy maximization term by a "temperature" $T > 0$,
in which case the problem becomes
$$\min_P \; \mathbb{E}_{x \sim P}[E(x)] - T \, H[P].$$
That is, the minimizer is the Boltzmann distribution $P(x) \propto e^{-E(x)/T}$, regardless of what the energy values actually are.
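A minimal numerical check that the Boltzmann distribution beats other candidates on this objective (the energies and the temperature are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
E = np.array([0.3, 1.2, 2.0, 0.1, 0.7])     # arbitrary energies on a 5-element space
T = 0.5                                      # arbitrary temperature

def free_energy(p):
    """Expected energy minus T times entropy."""
    return np.sum(p * E) + T * np.sum(p * np.log(p))

boltzmann = np.exp(-E / T)
boltzmann /= boltzmann.sum()

candidates = rng.dirichlet(np.ones(5), size=10_000)   # random competing distributions
best_random = min(free_energy(c) for c in candidates)
print(free_energy(boltzmann), best_random)             # the Boltzmann value is smaller
```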
#to-write
- link to Variational representations#Information divergence
- generalize this optimization pattern (keeps popping up and is almost magical, see lee--entropy-opt-1.pdf; how magical it is that the density takes a simple form, completely unaffected by the actual values)
Moment projection: minimizing over $Q$
Minimizing $D_{\mathrm{KL}}(P \| Q)$ over the prior $Q$ is just maximum likelihood (see the learning theory section above): the best $Q$ maximizes $\mathbb{E}_{x \sim P}[\log Q(x)]$. If $Q$ ranges over an exponential family, the optimum is the member whose expected sufficient statistics match those of $P$, hence the name moment projection.
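A sketch of the moment-matching property, fitting a discretized Gaussian (an exponential family with sufficient statistics $x$ and $x^2$) to an arbitrary target by minimizing $D_{\mathrm{KL}}(P \| Q)$ over its parameters:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.linspace(-10, 10, 2001)               # fixed grid, so everything is a finite sum

def normalize(w):
    return w / w.sum()

def kl(a, b):
    return np.sum(a * np.log(a / np.maximum(b, 1e-300)))

# Arbitrary skewed target distribution P on the grid.
P = normalize(norm.pdf(x, -2, 1.0) + 2 * norm.pdf(x, 1, 2.0))

def Q(params):                               # discretized Gaussian family
    mu, log_sigma = params
    return normalize(norm.pdf(x, mu, np.exp(log_sigma)))

best = minimize(lambda t: kl(P, Q(t)), x0=[0.0, 0.0], method="Nelder-Mead").x
Qstar = Q(best)

# The optimal Q matches the first two moments of P (up to optimizer tolerance).
print(np.sum(P * x), np.sum(Qstar * x))          # means agree
print(np.sum(P * x**2), np.sum(Qstar * x**2))    # second moments agree
```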
Special cases
Binary divergence
Let
$$d(p \| q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$
be the divergence between two Bernoullis $\mathrm{Ber}(p)$ and $\mathrm{Ber}(q)$.
Just like the general case, $d(p \| q)$ is nonnegative, vanishes only when $p = q$, and is approximately quadratic when the two parameters are close.
Assuming wlog that $p \geq q$, for small $p - q$ we have
$$d(p \| q) \approx \frac{(p - q)^2}{2 \, q (1 - q)}$$
(where the asymptotic constant $\frac{1}{2 q (1 - q)}$ is half the second derivative of $d$ at the diagonal).
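Relatedly, $d$ is exactly the exponent governing binomial tail probabilities (this is the information-theoretic Chernoff bound mentioned in the notes above); a quick check with arbitrary parameters:

```python
import numpy as np
from scipy.stats import binom

def d(p, q):
    """Divergence between Ber(p) and Ber(q), in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, q = 0.6, 0.4                                   # arbitrary parameters with p > q
for n in [100, 500, 2000]:
    tail = binom.sf(np.ceil(n * p) - 1, n, q)     # P[ Bin(n, q) >= p * n ]
    print(n, -np.log(tail) / n, d(p, q))          # the two columns converge
```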
#figure
#to-write
- expected score for logarithmic scoring rule
- full second derivative matrix (I believe? yup seems right since the second derivative is $\frac{1}{p(1-p)}$)
Normals
#to-write
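In the meantime, a quick numerical check of the textbook closed form for two univariate normals (parameters arbitrary) against a brute-force sum on a grid:

```python
import numpy as np
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.5, 1.2, -1.0, 2.0       # arbitrary parameters

# Textbook closed form for D_KL( N(mu1, s1^2) || N(mu2, s2^2) ), in nats.
closed_form = np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# Brute-force Riemann sum of p(x) * log(p(x) / q(x)) on a wide grid.
x = np.linspace(-20, 20, 200_001)
p, pdf_q = norm.pdf(x, mu1, s1), norm.pdf(x, mu2, s2)
numerical = np.sum(p * np.log(p / pdf_q)) * (x[1] - x[0])

print(closed_form, numerical)                # agree to several decimal places
```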
Conditional divergence
Similar to entropy (but unlike most ratio divergences), we can define the conditional information divergence by taking an average of the divergences of the conditional distributions:
$$D_{\mathrm{KL}}\!\left(P_{Y \mid X} \,\|\, Q_{Y \mid X} \mid P_X\right) := \mathbb{E}_{x \sim P_X}\!\left[D_{\mathrm{KL}}\!\left(P_{Y \mid X = x} \,\|\, Q_{Y \mid X = x}\right)\right],$$
where the average over the conditioning value $x$ is taken under the posterior marginal $P_X$.
Chain rule
Similar to conditional entropy, this gives a chain rule:
$$D_{\mathrm{KL}}(P_{XY} \| Q_{XY}) = D_{\mathrm{KL}}(P_X \| Q_X) + D_{\mathrm{KL}}\!\left(P_{Y \mid X} \,\|\, Q_{Y \mid X} \mid P_X\right),$$
which immediately shows that adding data can only increase the divergence: $D_{\mathrm{KL}}(P_{XY} \| Q_{XY}) \geq D_{\mathrm{KL}}(P_X \| Q_X)$.
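A quick check of the chain rule on random joint distributions:

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

rng = np.random.default_rng(0)
# Random joint distributions P and Q over (X, Y) with |X| = 3 and |Y| = 4.
P = rng.dirichlet(np.ones(12)).reshape(3, 4)
Q = rng.dirichlet(np.ones(12)).reshape(3, 4)

P_X, Q_X = P.sum(axis=1), Q.sum(axis=1)
# Conditional divergence: average over x ~ P_X of D( P(Y | x) || Q(Y | x) ).
conditional = sum(P_X[i] * kl(P[i] / P_X[i], Q[i] / Q_X[i]) for i in range(3))

print(kl(P.flatten(), Q.flatten()), kl(P_X, Q_X) + conditional)   # equal
```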
Mixtures and convexity
Since information divergence is neither subadditive nor superadditive, we can't hope for an exact decomposition over mixtures, but there is a one-sided comparison:
$$D_{\mathrm{KL}}\!\left(\mathbb{E}_{x \sim \mu}[P_x] \,\Big\|\, \mathbb{E}_{x \sim \mu}[Q_x]\right) \leq \mathbb{E}_{x \sim \mu}\!\left[D_{\mathrm{KL}}(P_x \| Q_x)\right].$$
That is, conditioning by the same variable can only increase the divergence, which is equivalent to saying that the information divergence is convex.
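A quick numerical check of this convexity on random distributions (any mixture weight works):

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

rng = np.random.default_rng(0)
lam = 0.3                                     # arbitrary mixture weight
P1, P2, Q1, Q2 = rng.dirichlet(np.ones(6), size=4)

lhs = kl(lam * P1 + (1 - lam) * P2, lam * Q1 + (1 - lam) * Q2)
rhs = lam * kl(P1, Q1) + (1 - lam) * kl(P2, Q2)
print(lhs <= rhs)                             # True: the divergence is jointly convex
```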
See also
[^1]: also known as Kullback–Leibler divergence and relative entropy
[^2]: Intuitively, this is because a typical sample $x_1, \ldots, x_n$ from $P$ is about $\exp\!\big(n \, D_{\mathrm{KL}}(P \| Q)\big)$ times more likely according to $P$ than to $Q$, so the set of values on which the test has to come up positive in order to have decent sensitivity will have very small weight when measured according to $Q$.
[^3]: And maybe it even saturates it optimally as a function of $n$? #to-think