Variational representations
In the formula for ratio divergences
$$D_f(p \| q) = \mathbb{E}_q\!\left[f\!\left(\frac{p(X)}{q(X)}\right)\right],$$
since the function $f$ is convex, it is the supremum of its tangent lines, or equivalently, using the convex dual of $f$,
$$f(x) = \sup_y\ x y - f^*(y),$$
with equality whenever $y \in \partial f(x)$. Plugging this into the divergence, for any function $g$,
$$D_f(p \| q) \ge \mathbb{E}_q\!\left[g(X)\,\frac{p(X)}{q(X)} - f^*(g(X))\right] = \mathbb{E}_p[g(X)] - \mathbb{E}_q[f^*(g(X))],$$
with equality when $g(X) \in \partial f\!\left(p(X)/q(X)\right)$ pointwise. Rearranging, this becomes
$$\mathbb{E}_p[g(X)] \le D_f(p \| q) + \mathbb{E}_q[f^*(g(X))].$$
That is, when we can control the dual term $\mathbb{E}_q[f^*(g)]$, we can transfer bounds on expectations from $q$ to $p$ at the price of the divergence.
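As a sanity check, here is a small numeric sketch (the distributions and test functions are made up) of the lower bound in the case $f(x) = x \log x$, whose convex dual is $f^*(y) = e^{y-1}$ and whose derivative is $f'(x) = 1 + \log x$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up distributions on 5 points.
p = np.array([0.1, 0.3, 0.2, 0.25, 0.15])
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# f(x) = x*log(x): the ratio divergence D_f(p||q) = E_q[f(p/q)].
div = np.sum(q * (p / q) * np.log(p / q))

# E_p[g] - E_q[f*(g)] <= D_f(p||q) for every test function g ...
for _ in range(1000):
    g = rng.normal(size=5)
    assert np.sum(p * g) - np.sum(q * np.exp(g - 1)) <= div + 1e-12

# ... with equality at the subgradient g = f'(p/q) = 1 + log(p/q).
g_star = 1 + np.log(p / q)
gap = div - (np.sum(p * g_star) - np.sum(q * np.exp(g_star - 1)))
print(abs(gap) < 1e-12)  # True
```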
Total variation distance
Things are particularly nice in the case of total variation distance: the correction term disappears entirely. Indeed, $f(x) = \tfrac12 |x - 1|$, whose convex dual is
$$f^*(y) = \sup_x\ x y - \tfrac12 |x - 1| = \begin{cases} y & \text{if } |y| \le \tfrac12, \\ +\infty & \text{otherwise}, \end{cases}$$
for any $y$: the supremum is attained at $x = 1$ when $|y| \le \tfrac12$, and is infinite otherwise, taking $x \to +\infty$ for slope $y > \tfrac12$, $x \to -\infty$ for slope $y < -\tfrac12$. Therefore we can write it as
$$\operatorname{TV}(p, q) = \sup_{g:\, \|g\|_\infty \le 1/2} \mathbb{E}_p[g] - \mathbb{E}_q[g],$$
or rephrasing (shifting $g$ by a constant changes nothing, and the supremum is attained at indicators),
$$\operatorname{TV}(p, q) = \sup_{g:\, 0 \le g \le 1} \mathbb{E}_p[g] - \mathbb{E}_q[g] = \sup_A\ p(A) - q(A).$$
That is, the total variation distance is the maximum difference between the probabilities that $p$ and $q$ assign to the same event.
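A quick numeric sketch (made-up distributions) checking $\tfrac12 \|p - q\|_1 = \max_A\, p(A) - q(A)$ by brute force over all events:

```python
import itertools
import numpy as np

p = np.array([0.1, 0.3, 0.2, 0.25, 0.15])
q = np.array([0.3, 0.1, 0.2, 0.05, 0.35])

# TV as half the L1 distance between the mass functions.
tv = 0.5 * np.abs(p - q).sum()

# TV as the maximum of p(A) - q(A) over all 2^5 events A.
best = max(
    p[list(A)].sum() - q[list(A)].sum()
    for r in range(6)
    for A in itertools.combinations(range(5), r)
)
print(np.isclose(tv, best))  # True
```

The maximizing event is exactly $\{x : p(x) > q(x)\}$, i.e. the indicator the derivation above predicts.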
Information divergence
For the information divergence we get the Donsker–Varadhan representation
$$D(p \| q) = \sup_g\ \mathbb{E}_p[g] - \log \mathbb{E}_q\!\left[e^g\right].$$
Indeed, we know that the inequality will be tight iff $g = \log \frac{p}{q}$ up to an additive constant, so the supremum is attained. The convex dual of $f(x) = x \log x$ is $f^*(y) = e^{y - 1}$, so
$$D(p \| q) \ge \mathbb{E}_p[g] - \mathbb{E}_q\!\left[e^{g - 1}\right],$$
and optimizing over additive shifts $g \mapsto g + c$ (the optimum is $c = 1 - \log \mathbb{E}_q[e^g]$) gives the representation above; or more simply, interpreting $e^g / \mathbb{E}_q[e^g]$ as the density of a tilted distribution with respect to $q$, the gap in the bound is itself an information divergence, hence nonnegative.
Choosing the base
The above is² actually true for any base of the logarithm and the exponentiation, as long as the two bases match: for base $b$,
$$D_b(p \| q) = \sup_g\ \mathbb{E}_p[g] - \log_b \mathbb{E}_q\!\left[b^g\right].$$
That is, when the divergence is measured in base-$b$ units, we exponentiate and take logarithms in base $b$ as well.
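A numeric sketch of Donsker–Varadhan with made-up distributions, also checking that the optimizer $g = \log(p/q)$ is insensitive to additive constants:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.1, 0.3, 0.2, 0.25, 0.15])
q = np.array([0.3, 0.1, 0.2, 0.05, 0.35])
kl = np.sum(p * np.log(p / q))

# Donsker-Varadhan: E_p[g] - log E_q[e^g] <= D(p||q) for every g ...
for _ in range(1000):
    g = rng.normal(scale=3, size=5)
    assert np.sum(p * g) - np.log(np.sum(q * np.exp(g))) <= kl + 1e-12

# ... with equality at g = log(p/q) + c for any constant c.
for c in [0.0, -5.0, 7.0]:
    g = np.log(p / q) + c
    val = np.sum(p * g) - np.log(np.sum(q * np.exp(g)))
    assert np.isclose(val, kl)
print("ok")
```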
#to-write
- special case where $q$ is uniform
    - see it as a generalization of the union bound:
        - when you show that … (say), usually it's because you pick it in a way that depends on the data and want to take a worst-case view, so what you really care about is the worst case, in which case the union bound becomes …?
            - actually i'm not sure that this one is an axis on which DV generalizes the union bound
        - ok and i guess it's mostly smoothing it in another axis, which is from $\{0, 1\}$ values to arbitrary real values, i.e. instead of bounding probabilities of events, you want to bound expectations
    - … in addition to generalizing by making the prior non-uniform
- note: the most obvious result that generalizes in this way uses max-divergence instead
- and as compensation for that sharp penalty, it doesn’t have to be about exponentials
- i.e. just get $\mathbb{E}_p[g] \le e^{D_\infty(p \| q)}\, \mathbb{E}_q[g]$ for nonnegative $g$
- #to-think can you use this to make the variational representations of power divergences make sense in general?
- actually maybe this is all cleaner if e.g. you look at the posterior form?
- do they all take the form … for some function …?
- ELBO is just taking the bound $\mathbb{E}_p[g] \le D(p \| q) + \log \mathbb{E}_q[e^g]$ and plugging in the log likelihood?
    - yuppp: with $q$ the prior over latents $z$ and $g(z) = \log p(x \mid z)$, the log-partition term is $\log \mathbb{E}_{z \sim q}[e^{g(z)}] = \log p(x)$
    - or (for SUB) rearranging, …
        - so $\log p(x) \ge \mathbb{E}_{z \sim p}[\log p(x \mid z)] - D(p \| q)$ for any candidate posterior $p$
        - so the right-hand side is exactly the ELBO
- and could maybe present it alternatively as …
- and/or …
- and/or …
    - (which is worse than if we had …)
- (but the current formulation still seems good as the main one; e.g. for PAC-Bayes bounds $\mathbb{E}_p[g] \le D(p \| q) + \log \mathbb{E}_q[e^g]$ is indeed the core thing)
#to-think
- make it make a little more sense to you that you're adding KL (which is dimensionless) to a dimensional quantity?
    - maybe that one's really just because you should be using the non-logged version of KL, i.e. …
        - (where … is defined as in Power entropies and divergences)
$\chi^2$-divergence
We get
$$\chi^2(p \| q) = \sup_g\ \frac{\left(\mathbb{E}_p[g] - \mathbb{E}_q[g]\right)^2}{\operatorname{Var}_q[g]},$$
but I haven’t yet found an intuitive proof for this.
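Even without an intuitive proof, the identity can at least be checked numerically (made-up distributions; the ratio is maximized at $g = p/q$, as a Cauchy–Schwarz argument suggests):

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.1, 0.3, 0.2, 0.25, 0.15])
q = np.array([0.3, 0.1, 0.2, 0.05, 0.35])
chi2 = np.sum((p - q) ** 2 / q)

# (E_p[g] - E_q[g])^2 / Var_q[g] <= chi^2(p||q) for every non-constant g ...
for _ in range(1000):
    g = rng.normal(size=5)
    gap = np.sum(p * g) - np.sum(q * g)
    var = np.sum(q * g**2) - np.sum(q * g) ** 2
    assert gap**2 / var <= chi2 + 1e-9

# ... with equality at g = p/q.
g = p / q
gap = np.sum(p * g) - np.sum(q * g)
var = np.sum(q * g**2) - np.sum(q * g) ** 2
assert np.isclose(gap**2 / var, chi2)
print("ok")
```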

#to-think is something like …
- then let …, giving …
    - ok nice that sounds very promising actually
    - and it suggests that a multiplicative version is cleaner in many cases?
As concentration bounds
These variational representations are in a sense a generalization of concentration bounds. Instead of only bounding the probability that a random variable exceeds a threshold, they bound expectations of arbitrary functions under a change of distribution.
Formally, if we want to bound $q(X \ge t)$, we can take $p$ to be $q$ conditioned on the event $\{X \ge t\}$: then $D_f(p \| q)$ is an explicit decreasing function of $q(X \ge t)$ alone, so plugging a well-chosen $g$ into the variational representation lower-bounds the divergence, which then gives an upper bound on $q(X \ge t)$.
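For instance, for the information divergence, conditioning turns the divergence into exactly the negative log of the tail probability; a small sketch with made-up numbers:

```python
import numpy as np

# Distribution of X on {0,...,4} with X equal to its index (made-up numbers).
q = np.array([0.3, 0.25, 0.2, 0.15, 0.1])
x = np.arange(5)
t = 3
A = x >= t

# Condition q on the event {X >= t}.
p = np.where(A, q, 0.0) / q[A].sum()

# The information divergence of the conditional distribution is exactly
# -log q(X >= t), so any lower bound on D(p||q) upper-bounds the probability.
kl = np.sum(p[A] * np.log(p[A] / q[A]))
print(np.isclose(kl, -np.log(q[A].sum())))  # True
```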
Information divergence
We have¹
$$-\log q(X \ge t) = D(p \| q) \ge \mathbb{E}_p[g] - \log \mathbb{E}_q\!\left[e^g\right];$$
that is, for $g = \lambda X$ with $\lambda > 0$ (so that $\mathbb{E}_p[g] \ge \lambda t$),
$$q(X \ge t) \le e^{-\lambda t}\, \mathbb{E}_q\!\left[e^{\lambda X}\right],$$
which is the moment-generating function method (also known as Chernoff bound).
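A sketch of the resulting bound on a made-up distribution (any $\lambda > 0$ for which the moment-generating function is finite works):

```python
import numpy as np

# Roughly geometric X on {0,...,9} under q (made-up numbers).
x = np.arange(10)
q = 2.0 ** -(x + 1)
q[-1] *= 2  # make the masses sum to 1
t = 6

tail = q[x >= t].sum()

# Chernoff / MGF bound from Donsker-Varadhan with g = lam * x:
# q(X >= t) <= exp(-lam * t) * E_q[exp(lam * X)] for every lam > 0.
for lam in np.linspace(0.01, 0.6, 50):
    mgf = np.sum(q * np.exp(lam * x))
    assert tail <= np.exp(-lam * t) * mgf + 1e-15
print("ok")
```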
$\chi^2$-divergence
Let $\mu = \mathbb{E}_q[X]$ and $\sigma^2 = \operatorname{Var}_q[X]$, and let’s rewrite $\chi^2(p \| q) = \frac{1}{q(X \ge t)} - 1$ for the conditioned $p$, so that the variational bound reads
$$\frac{1}{q(X \ge t)} - 1 \ge \frac{\left(\mathbb{E}_p[g] - \mathbb{E}_q[g]\right)^2}{\operatorname{Var}_q[g]};$$
that is, for $g = X$ (so that $\mathbb{E}_p[g] \ge t$),
$$q(X \ge t) \le \frac{\sigma^2}{\sigma^2 + (t - \mu)^2} \le \frac{\sigma^2}{(t - \mu)^2},$$
which is Chebyshev’s inequality (in its one-sided Cantelli form).
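Checking both forms on a made-up uniform example:

```python
import numpy as np

x = np.arange(10)
q = np.full(10, 0.1)  # uniform on {0,...,9} (made-up example)
mu = np.sum(q * x)
var = np.sum(q * x**2) - mu**2
t = 8

tail = q[x >= t].sum()

# chi^2 of the conditional distribution is 1/q(A) - 1; combined with the
# variational bound at g = X this gives (t - mu)^2 / var <= 1/q(A) - 1, i.e.
assert tail <= var / (var + (t - mu) ** 2)  # Cantelli form
assert tail <= var / (t - mu) ** 2          # Chebyshev form
print("ok")
```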
Total variation distance
This one requires a bit more adaptation (and because of this is perhaps more of a hairpull). We have
$$1 - q(X \ge t) = \operatorname{TV}(p, q) \ge \mathbb{E}_p[g] - \mathbb{E}_q[g],$$
but this only holds if $0 \le g \le 1$, which means that for any such $g$ that equals $1$ on $\{X \ge t\}$ we get $q(X \ge t) \le \mathbb{E}_q[g]$; so for $g = \min(X / t, 1)$ with $X \ge 0$,
$$q(X \ge t) \le \mathbb{E}_q\!\left[\min(X / t, 1)\right] \le \frac{\mathbb{E}_q[X]}{t},$$
which is Markov’s inequality.
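A sketch with a made-up nonnegative $X$, checking each inequality in the chain:

```python
import numpy as np

x = np.arange(10)
q = 2.0 ** -(x + 1)
q[-1] *= 2  # make the masses sum to 1
t = 4

tail = q[x >= t].sum()

# TV bound with g = min(x/t, 1), which lies in [0, 1] and equals 1 on {X >= t}:
# 1 - q(A) = TV(p, q) >= E_p[g] - E_q[g] = 1 - E_q[g],
# so q(X >= t) <= E_q[min(X/t, 1)] <= E_q[X] / t (Markov).
g = np.minimum(x / t, 1.0)
assert tail <= np.sum(q * g) + 1e-15
assert np.sum(q * g) <= np.sum(q * x) / t
print("ok")
```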
See also
#to-think are most/all of these basically versions of Hölder’s inequality, the same way that Cauchy–Schwarz inequality can be understood in terms of weighting one sequence by the other? see SRLL p.128
- We can do slightly better in general by letting $g$ be the identity, so that the information divergence is that of $X$ itself and not of the underlying variable (which is better by the data processing inequality for divergences). This is also why we can afford to be a bit sloppy about what exactly the subscripts mean in $\mathbb{E}_p$ and $\mathbb{E}_q$. ↩
- by the change of variables $g \mapsto g \ln b$ ↩