Suppose you have some energy function $E(\cdot)$ and you want to find a distribution $q$ which strikes a balance between minimizing energy and being reasonably spread out. Then a reasonable choice of objective is to subtract the entropy from the expected energy, giving

$$L(q) := \mathbb{E}_{q(z)}[E(z)] - H(q(z)).$$

This is called the variational free energy: allegedly, in physics, it represents the part of the energy of system q which is actually available for doing useful work.
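As a concrete illustration, here is a minimal numerical sketch in Python/NumPy; the grid, the quadratic energy $E(z) = z^2/2$, and the Gaussian family of candidate distributions are all assumptions chosen just for illustration:

```python
import numpy as np

# Discretize z on a grid and pick an arbitrary energy (here E(z) = z^2 / 2).
z = np.linspace(-6, 6, 1201)
dz = z[1] - z[0]
E = 0.5 * z**2

def free_energy(q):
    """Variational free energy L(q) = E_q[E(z)] - H(q) for a density q on the grid."""
    expected_energy = np.sum(q * E) * dz
    entropy = -np.sum(q * np.log(np.clip(q, 1e-300, None))) * dz
    return expected_energy - entropy

def gaussian(mean, std):
    """A Gaussian density, normalized numerically on the grid."""
    q = np.exp(-0.5 * ((z - mean) / std) ** 2)
    return q / (np.sum(q) * dz)

# A very peaked q has low expected energy but pays for its low entropy;
# a very spread-out q has high entropy but pays in expected energy.
# The free energy is lowest somewhere in between (here at std = 1.0).
for std in (0.2, 1.0, 3.0):
    print(std, free_energy(gaussian(0.0, std)))
```

For this particular energy the best of the three candidates is std = 1.0, which matches the closed-form minimizer derived next.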

Minimizing the free energy

Let $q^*$ be the distribution which minimizes the free energy. Since probabilities must sum to 1, there must be a constant $\lambda$ such that for all $z$,

$$0 = \frac{d}{dq(z)}\left(L(q) + \lambda \int_z q(z)\,dz\right) = E(z) + \log q(z) + 1 + \lambda,$$

so the unique solution $q^*$ is exponentially sensitive to the energy:

$$q^*(z) = \exp(-E(z) - \lambda - 1) \propto \exp(-E(z)),$$

and more precisely given by $q^*(z) = \frac{\exp(-E(z))}{Z}$, where

$$Z := \int \exp(-E(z))\,dz$$

is the partition function.
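Continuing the sketch above (same grid, same assumed quadratic energy), $Z$ and $q^*$ can be approximated directly:

```python
# Approximate Z and q*(z) = exp(-E(z)) / Z on the grid from the sketch above.
unnormalized = np.exp(-E)
Z = np.sum(unnormalized) * dz      # numerical approximation of the integral
q_star = unnormalized / Z

print("Z:", Z)                     # for E(z) = z^2 / 2 this is about sqrt(2 * pi)
print("L(q*):", free_energy(q_star))
```

For this energy $q^*$ is just the standard Gaussian, and its free energy comes out to $-\log Z$, as the identity below makes clear.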

Interestingly, minimizing the original loss $L(q)$ is equivalent to minimizing the information divergence between $q$ and $q^*$:

$$L(q) = \mathbb{E}_{q(z)}[E(z)] - H(q(z)) = \mathbb{E}_{q(z)}[-\log q^*(z) - \log Z] + \mathbb{E}_{q(z)}[\log q(z)] = D\big(q(z) \,\big\|\, q^*(z)\big) + \log\frac{1}{Z}.$$
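This identity is easy to check numerically. A quick sketch, reusing the grid, free_energy, gaussian, Z, and q_star defined above, with an arbitrary suboptimal q:

```python
def kl(q, p):
    """Numerical D(q || p) for two densities on the grid."""
    ratio = np.clip(q, 1e-300, None) / np.clip(p, 1e-300, None)
    return np.sum(q * np.log(ratio)) * dz

q = gaussian(0.5, 0.7)                   # an arbitrary candidate distribution
print(free_energy(q))                    # L(q)
print(kl(q, q_star) + np.log(1.0 / Z))   # D(q || q*) + log(1/Z): same value
print(free_energy(q_star), -np.log(Z))   # at q = q*, both sides reduce to -log Z
```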

Non-uniform priors

If instead we encourage q to be spread out with respect to some reference distribution (or “prior”) p, i.e. if we try to minimize

$$L(q) := \mathbb{E}_{q(z)}[E(z)] + D\big(q(z) \,\big\|\, p(z)\big),$$

then we can reduce to the previous case by folding p(z) into the energy term:

$$\mathbb{E}_{q(z)}[E(z)] + D\big(q(z) \,\big\|\, p(z)\big) = \mathbb{E}_{q(z)}[E(z) - \log p(z)] - H(q(z)).$$

This means that the minimizer $q^*$ is given by $q^*(z) = \frac{p(z)\exp(-E(z))}{Z}$, where

$$Z := \int p(z)\exp(-E(z))\,dz$$

and we still have $L(q) = D\big(q(z)\,\big\|\,q^*(z)\big) + \log\frac{1}{Z}$.
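The same numerical check goes through with a prior. A minimal sketch, with an assumed Gaussian prior p and reusing the grid, energy, gaussian, and kl helpers from above:

```python
p = gaussian(1.0, 0.5)                 # an assumed reference distribution ("prior")

# Minimizer with a prior: q*(z) = p(z) exp(-E(z)) / Z.
unnormalized = p * np.exp(-E)
Z_p = np.sum(unnormalized) * dz
q_star_p = unnormalized / Z_p

def free_energy_with_prior(q):
    """L(q) = E_q[E(z)] + D(q || p), equivalently E_q[E(z) - log p(z)] - H(q)."""
    return np.sum(q * E) * dz + kl(q, p)

q = gaussian(0.0, 1.0)                        # an arbitrary candidate
print(free_energy_with_prior(q))
print(kl(q, q_star_p) + np.log(1.0 / Z_p))    # same value up to discretization error
```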

In particular, if $E(z)$ represents a negative log likelihood $\log\frac{1}{p(x|z)}$, then $L(q)$ is precisely the surprise upper bound, and $q^*(z)$ is given by the posterior $p(z|x)$.
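As a toy instance of that last point (the specific model below, a Gaussian prior with a Gaussian likelihood and an arbitrary observation, is an assumption chosen purely for illustration; it reuses the grid and helpers from the earlier sketches):

```python
# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, sigma^2), observed x = 1.2.
x_obs, sigma = 1.2, 0.5
p_prior = gaussian(0.0, 1.0)
E_bayes = 0.5 * ((x_obs - z) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi))  # -log p(x|z)

unnormalized = p_prior * np.exp(-E_bayes)
evidence = np.sum(unnormalized) * dz              # Z = p(x), the model evidence
posterior = unnormalized / evidence               # q*(z) = p(z|x)

def surprise_bound(q):
    """E_q[-log p(x|z)] + D(q || p(z)), which is >= -log p(x)."""
    return np.sum(q * E_bayes) * dz + kl(q, p_prior)

print("surprise -log p(x):", -np.log(evidence))
print("bound at posterior:", surprise_bound(posterior))   # equal, up to discretization
print("bound at prior:    ", surprise_bound(p_prior))     # strictly larger
```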