In the formula for ratio divergences

$$D_f(Q \,\|\, P) := \mathbb{E}_{x\sim P}\left[f\!\left(\frac{Q(x)}{P(x)}\right)\right],$$

since the function $f$ is convex, it can be lower bounded by the tangent at any point $t^*$:

$$f(t) \ge f(t^*) + f'(t^*)\,(t - t^*)$$

or equivalently, using the convex dual $f^*$ of $f$, it can be lower bounded by a tangent of any slope $a$:

$$f(t) \ge at - f^*(a),$$

with equality whenever $a = f'(t)$.
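For example, for $f(t) = t\log t$ the dual is $f^*(a) = e^{a-1}$, so the tangent of slope $a$ is $f(t) \ge at - e^{a-1}$, with equality at $t = e^{a-1}$.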

Plugging this into $D_f(Q \,\|\, P)$, we can even pick different values of $a$ depending on $x$, which gives us a very general family of lower bounds on the ratio divergence:

$$D_f(Q \,\|\, P) = \mathbb{E}_{x\sim P}\left[f\!\left(\frac{Q(x)}{P(x)}\right)\right] \ge \mathbb{E}_{x\sim P}\left[a(x)\,\frac{Q(x)}{P(x)} - f^*(a(x))\right] = \mathbb{E}_{x\sim Q}[a(x)] - \mathbb{E}_{x\sim P}[f^*(a(x))]$$

with equality when $a(x) = f'\!\left(\frac{Q(x)}{P(x)}\right)$ for all $x$.

Rearranging, this becomes

$$\mathbb{E}_{x\sim Q}[a(x)] \le \mathbb{E}_{x\sim P}[f^*(a(x))] + D_f(Q \,\|\, P).$$

That is, when $D_f(Q \,\|\, P)$ is not too large, it gives us a generic upper bound on expectations over $Q$ in terms of some other expectation over $P$!
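As a quick numerical sanity check (on a hypothetical discrete example, with $f(t) = t\log t$ so that $D_f$ is the KL divergence, $f^*(a) = e^{a-1}$, and the optimal slope is $a(x) = \log\frac{Q(x)}{P(x)} + 1$):

```python
import numpy as np

# Verify  E_Q[a] <= E_P[f*(a)] + D_f(Q || P)  for f(t) = t log t,
# where f*(a) = exp(a - 1) and equality holds at a(x) = log(Q(x)/P(x)) + 1.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5))  # an arbitrary distribution on 5 points
Q = rng.dirichlet(np.ones(5))  # another arbitrary distribution

kl = np.sum(Q * np.log(Q / P))  # D_f(Q || P) = KL(Q || P) for this f

def gap(a):
    """RHS minus LHS of the bound; nonnegative, and zero at the optimal a."""
    return np.sum(P * np.exp(a - 1)) + kl - np.sum(Q * a)

print(gap(rng.normal(size=5)))   # arbitrary a(x): strictly positive
print(gap(np.log(Q / P) + 1))    # optimal a(x): ~0 (up to float error)
```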

Total variation distance

Things are particularly nice in the case of total variation distance:

$$d_{\mathrm{TV}}(P, Q) = \mathbb{E}_{x\sim P}\left[\left(\frac{Q(x)}{P(x)} - 1\right)^+\right].$$

Indeed, $f(t) = (t-1)^+$ has the tangents

$$f(t) \ge a\,(t - 1)$$

for any $a \in [0,1]$, though in general only two of them are interesting:

  • $f(t) \ge 0$ for slope $a = 0$,
  • $f(t) \ge t - 1$ for slope $a = 1$.

Therefore we can write it as

$$d_{\mathrm{TV}}(P, Q) = \max_{a(\cdot)\,\in\,\{0,1\}}\ \bigl(\mathbb{E}_{x\sim Q}[a(x)] - \mathbb{E}_{x\sim P}[a(x)]\bigr),$$

or rephrasing $a(\cdot)$ as the indicator of a set $A$,

$$d_{\mathrm{TV}}(P, Q) = \max_A\ \bigl(Q(A) - P(A)\bigr).$$

That is, the total variation distance is the maximum difference between the probabilities that P and Q give to some event.
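For a concrete instance: if $P = \mathrm{Bernoulli}(\tfrac12)$ and $Q = \mathrm{Bernoulli}(\tfrac34)$, the maximizing event is $A = \{1\}$, and $d_{\mathrm{TV}}(P, Q) = Q(A) - P(A) = \tfrac34 - \tfrac12 = \tfrac14$, matching $\mathbb{E}_{x\sim P}\!\left[\left(\tfrac{Q(x)}{P(x)} - 1\right)^+\right] = \tfrac12\left(\tfrac{3/4}{1/2} - 1\right) = \tfrac14$.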

Information divergence

For the information divergence $D(Q \,\|\, P)$, we will in general be able to get a better bound if we consider all possible tilts of $f$ around $t = 1$:

$$f(t) := t\log t + (r-1)(t-1).$$

Indeed, we know that the inequality will be tight iff

$$a = f'(t) = \log t + r \quad\Longleftrightarrow\quad e^{a} \propto t,$$

so $a(x)$ can be interpreted as the logarithm of an (unnormalized) density function, with optimality (for some $r$) when $e^{a(x)} \propto \frac{Q(x)}{P(x)}$ for all $x$.

The convex dual of $f$ at the slope $a = f'(t)$ is:

$$f^*(a) = at - t\log t - (r-1)(t-1) = at - t(a-r) - (r-1)(t-1) = t + r - 1 = e^{a-r} + r - 1,$$

so

$$\mathbb{E}_{x\sim Q}[a(x)] \le \mathbb{E}_{x\sim P}\left[e^{a(x)-r} + r - 1\right] + D(Q \,\|\, P) = \mathbb{E}_{x\sim P}\left[e^{a(x)}\right]e^{-r} + r - 1 + D(Q \,\|\, P),$$

and optimizing over $r$ requires $e^{r} = \mathbb{E}_{x\sim P}[e^{a(x)}]$ (i.e. $e^{r}$ is the normalization constant of $e^{a}$ as a density), which gives

$$\mathbb{E}_{x\sim Q}[a(x)] \le r + D(Q \,\|\, P) = \log\mathbb{E}_{x\sim P}[\exp a(x)] + D(Q \,\|\, P),$$

or more simply, interpreting $a(x)$ as a random variable $a$,¹

$$\mathbb{E}_Q[a] \le \log\mathbb{E}_P[\exp a] + D(Q \,\|\, P).$$
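As a worked example of the equality case: take $P = \mathcal{N}(0, 1)$, $Q = \mathcal{N}(\theta, 1)$, and $a(x) = \lambda x$. Then $\mathbb{E}_Q[a] = \lambda\theta$, $\log\mathbb{E}_P[\exp a] = \tfrac{\lambda^2}{2}$, and $D(Q \,\|\, P) = \tfrac{\theta^2}{2}$, so the inequality reads $\lambda\theta \le \tfrac{\lambda^2}{2} + \tfrac{\theta^2}{2}$, i.e. $0 \le \tfrac{(\lambda - \theta)^2}{2}$, with equality exactly at $\lambda = \theta$ (where indeed $a(x) = \theta x = \log\frac{Q(x)}{P(x)} + \tfrac{\theta^2}{2}$).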

Choosing the base

The above is² actually true for any base of the logarithm and exponentiation, as long as $D(Q \,\|\, P)$ is also expressed in that same base:

$$\mathbb{E}_Q[a] \le \log_b \mathbb{E}_P\left[b^{a}\right] + \underbrace{\mathbb{E}_Q\left[\log_b \frac{Q(x)}{P(x)}\right]}_{\text{``$D(Q\,\|\,P)$ in base $b$''}} = \frac{\ln \overbrace{\mathbb{E}_P\left[e^{sa}\right]}^{\text{moment-generating function}} + D(Q \,\|\, P)}{s},$$

for $s := \ln b$.

That is, when $D(Q \,\|\, P)$ is small, we can bound the expected value of $a$ over $Q$ in terms of its moment-generating function over $P$. The smaller $s$ is, the more $D(Q \,\|\, P)$ gets amplified, so it is generally best to choose one of the largest possible values of $s$ for which the moment-generating function doesn't blow up.
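For instance, if $a \sim \mathcal{N}(\mu, \sigma^2)$ under $P$, then $\ln\mathbb{E}_P[e^{sa}] = s\mu + \tfrac{s^2\sigma^2}{2}$ for every $s$, so the bound becomes $\mathbb{E}_Q[a] \le \mu + \tfrac{s\sigma^2}{2} + \tfrac{D(Q\,\|\,P)}{s}$; optimizing over $s$ (at $s = \sqrt{2D(Q\,\|\,P)}/\sigma$) gives $\mathbb{E}_Q[a] \le \mu + \sigma\sqrt{2D(Q\,\|\,P)}$.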

#to-write

  • special case where P is uniform
  • see it as generalization of union bound:
    • when you show that $\Pr_x[\exists\theta\colon B_x(\theta)] \le \tfrac13$ (say), usually it's because you pick $\theta$ in a way that depends on $x$ and want to take a worst-case view, so what you really care about is $\forall f\colon \Pr_x[B_x(f(x))]$, in which case the union bound becomes $\forall f\colon \Pr_x[B_x(f(x))] \le |\Theta| \cdot \Pr_{x,\,\theta\sim U(\Theta)}[B_x(\theta)]$?
      • actually I'm not sure that this one is an axis on which DV generalizes the union bound
    • ok and I guess it's mostly smoothing it in another axis, which is from $\{0,1\}$ values to arbitrary real values, i.e. instead of $\Pr[\exists\theta\colon B(\theta)]$, you want to bound $\mathbb{E}_x[\max_\theta B_x(\theta)] \le |\Theta| \cdot \mathbb{E}_{x,\theta}[B_x(\theta)]$
    • … in addition to generalizing by making the prior non-uniform
      • note: the most obvious result that generalizes in this way uses max-divergence instead
        • and as compensation for that sharp penalty, it doesn’t have to be about exponentials
        • i.e. just get $\mathbb{E}_Q[a] \le e^{D_\infty(Q\,\|\,P)}\,\mathbb{E}_P[a]$ (for nonnegative $a$)
        • #to-think can you use this to make the variational representations of power divergences make sense in general?
          • actually maybe this is all cleaner if e.g. you look at the posterior form?
          • do they all take the form $g(\mathbb{E}_Q[a]) \le e^{D_\alpha(Q\,\|\,P)}\,\mathbb{E}_P[g(a)]$ for some function $g$?
  • ELBO is just taking $\mathbb{E}_Q[a] \le \log\mathbb{E}_P[\exp a] + D(Q \,\|\, P)$ and plugging in the log likelihood? (see the numerical check after this list)
    • yuppp: $\log p(x) = \log\mathbb{E}_{p(z)}[p(x|z)] \ge \mathbb{E}_{q(z)}[\log p(x|z)] - D(q \,\|\, p)$
    • or (for SUB) rearranging, $\log\frac{1}{\mathbb{E}_P[\exp a]} \le D(Q \,\|\, P) - \mathbb{E}_Q[a]$
      • so $\log\frac{1}{p(x)} = \log\frac{1}{\mathbb{E}_{p(z)}[p(x|z)]} \le D(q \,\|\, p) + \mathbb{E}_{q(z)}\left[\log\frac{1}{p(x|z)}\right]$
  • and could maybe present it alternatively as $\mathbb{E}_Q[\log A] \le \log\mathbb{E}_P[A] + D(Q \,\|\, P)$
    • and/or $\mathbb{E}_Q\left[\log\frac{1}{L}\right] \ge \log\frac{1}{\mathbb{E}_P[L]} - D(Q \,\|\, P)$
    • and/or $\mathbb{E}_P[L] \ge \exp\left(-\mathbb{E}_Q\left[\log\frac{1}{L}\right] - D(Q \,\|\, P)\right)$
      • (which is worse than if we had $\mathbb{E}_Q[L]\,e^{-D(Q\,\|\,P)}$)
    • (but the current formulation still seems good as the main one; e.g. for PAC-Bayes bounds a is indeed the core thing)
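A tiny numerical check of the ELBO reading above, using a hypothetical two-value latent $z$ (the specific numbers are arbitrary):

```python
import numpy as np

# Check  log p(x) = log E_{p(z)}[p(x|z)]  >=  E_{q(z)}[log p(x|z)] - D(q || p).
p_z = np.array([0.7, 0.3])           # prior p(z) over z in {0, 1}
p_x_given_z = np.array([0.2, 0.9])   # likelihood p(x|z) for the observed x
q_z = np.array([0.4, 0.6])           # some variational posterior q(z)

log_px = np.log(np.sum(p_z * p_x_given_z))                              # log evidence
elbo = np.sum(q_z * np.log(p_x_given_z)) - np.sum(q_z * np.log(q_z / p_z))
print(log_px, elbo)   # the ELBO is always the smaller of the two
```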

#to-think

  • make it make a little more sense to you that you’re adding KL (which is dimensionless) to a dimensional quantity?
    • maybe that one’s really just beacuse you should be using the non-logged version of KL, i.e. exp(EQ[a])D1((Q)Q(P)P)EP[expa]

χ²-divergence

We get

$$\left(\mathbb{E}_Q[a] - \mathbb{E}_P[a]\right)^2 \le \chi^2(Q \,\|\, P)\,\mathrm{Var}_P[a],$$

but I haven’t yet found an intuitive proof for this.

#to-think is something like $\mathbb{E}_Q[a]^2 \le \chi^2(Q \,\|\, P)\,\mathbb{E}_P[a^2]$ true?

  • then let $a(x) := b(x) - \mathbb{E}_P[b(x)]$, giving
    • LHS $= \mathbb{E}_{x\sim Q}[a(x)]^2 = \mathbb{E}_{x\sim Q}\bigl[b(x) - \mathbb{E}_{x\sim P}[b(x)]\bigr]^2 = \bigl(\mathbb{E}_Q[b(x)] - \mathbb{E}_P[b(x)]\bigr)^2$
    • RHS $= \chi^2(Q \,\|\, P)\,\mathbb{E}_{x\sim P}\bigl[(b(x) - \mathbb{E}_{x\sim P}[b(x)])^2\bigr] = \chi^2(Q \,\|\, P)\,\mathrm{Var}_P[b(x)]$
    • ok nice that sounds very promising actually
    • and it suggests that a multiplicative version is cleaner in many cases?

As concentration bounds

These variational representations are in a sense a generalization of concentration bounds. Instead of only bounding the probability that $a$ exceeds a certain value, they bound $\mathbb{E}_Q[a]$ over any distribution $Q$ that is not too far from the prior $P$.

Formally, if we want to bound $p := \Pr_P[a \ge v]$ for some threshold $v$, we can let $Q$ be the conditional distribution $P|_{a \ge v}$, which makes $D_f(Q \,\|\, P)$ a (decreasing) function of $p$, and write

$$D_f(Q \,\|\, P) \ge \mathbb{E}_Q[a] - \mathbb{E}_P[f^*(a)] = \mathbb{E}_P[a \mid a \ge v] - \mathbb{E}_P[f^*(a)] \ge v - \mathbb{E}_P[f^*(a)],$$

which then gives an upper bound on p.

Information divergence

We have

$$\ln\frac{1}{p} = D(Q \,\|\, P) \ge sv - \ln\mathbb{E}_P[e^{sa}],$$

that is, for $a \sim P$,

$$\Pr[a \ge v] \le \frac{\mathbb{E}[e^{sa}]}{e^{sv}},$$

which is the moment-generating function method (also known as the Chernoff bound).
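For example, for $a \sim \mathcal{N}(0, 1)$ we have $\mathbb{E}[e^{sa}] = e^{s^2/2}$, and choosing $s = v$ gives $\Pr[a \ge v] \le e^{-v^2/2}$.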

χ²-divergence

Let

$$\mu := \mathbb{E}_P[a] \qquad \sigma^2 := \mathrm{Var}_P[a]$$

and let’s rewrite v as μ+tσ for t0. Then we have

$$\frac{1}{p} - 1 = \chi^2(Q \,\|\, P) \ge \frac{(\mathbb{E}_Q[a] - \mathbb{E}_P[a])^2}{\mathrm{Var}_P[a]} \ge \frac{(v - \mu)^2}{\sigma^2} = t^2,$$

that is, for $a \sim P$,

$$\Pr[a \ge \mu + t\sigma] \le \frac{1}{1 + t^2},$$

which is Chebyshev’s inequality.

Total variation distance

This one requires a bit more adaptation (and because of this is perhaps more of a hairpull). We have

$$1 - p = d_{\mathrm{TV}}(P, Q) \ge \mathbb{E}_{x\sim Q}[a(x)] - \mathbb{E}_{x\sim P}[a(x)],$$

but this only holds if $a(x) \in [0,1]$. So we perform the change of variables

$$a(x) \mapsto \frac{\min(a(x), v)}{v},$$

which means that for any $a \ge 0$ we have

$$1 - p \ge \mathbb{E}_Q\left[\frac{\min(a, v)}{v}\right] - \mathbb{E}_P\left[\frac{\min(a, v)}{v}\right] = 1 - \frac{\mathbb{E}_P[\min(a, v)]}{v} \ge 1 - \frac{\mathbb{E}_P[a]}{v},$$

so for $a \sim P$,

$$\Pr[a \ge v] \le \frac{\mathbb{E}[a]}{v},$$

which is Markov’s inequality.

See also

#to-think are most/all of these basically versions of Hölder's inequality, the same way that the Cauchy–Schwarz inequality can be understood in terms of weighting one sequence by the other? see SRLL p.128

  1. We can do slightly better in general by letting $a(\cdot)$ be the identity, so that the information divergence is that of $a$ itself and not that of the underlying variable $x$ (which is better, by the data processing inequality for divergences). This is also why we can afford to be a bit sloppy about what exactly the subscripts mean in $\mathbb{E}_Q[a]$ and $\mathbb{E}_P[e^a]$.

  2. By the change of variables $a \mapsto a \ln b$.