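Suppose you have some energy function $E(x)$ and you want to find a distribution $q$ which strikes a balance between having low energy and being reasonably spread out. Then a reasonable choice of objective is to subtract the entropy $H[q] = -\sum_x q(x) \log q(x)$ from the expected energy, giving
$$F[q] = \sum_x q(x)\,E(x) - H[q].$$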
This is called the variational free energy: allegedly, in physics, it represents the part of the energy of a system which is actually available for doing useful work.
Minimizing the free energy
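Let $q^*$ be the distribution which minimizes the free energy $F[q]$. Since probabilities must sum to $1$, there must be a constant $\lambda$ (a Lagrange multiplier) such that for all $x$,
$$\frac{\partial F}{\partial q(x)}\bigg|_{q = q^*} = E(x) + \log q^*(x) + 1 = \lambda,$$
so the unique solution $q^*$ is exponentially sensitive to the energy:
$$q^*(x) \propto e^{-E(x)},$$
and more precisely given by $q^*(x) = e^{-E(x)} / Z$, where
$$Z = \sum_x e^{-E(x)}$$
is the partition function.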
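As a quick numerical sanity check of this result, the sketch below compares the free energy of the distribution proportional to $e^{-E(x)}$ against many random distributions on a small finite state space. The energy values, the helper names `free_energy` and `boltzmann`, and the use of NumPy are illustrative assumptions, not part of the derivation itself.

```python
import numpy as np

def free_energy(q, energy):
    """F[q] = sum_x q(x) E(x) - H[q], treating 0 * log 0 as 0."""
    q = np.asarray(q, dtype=float)
    energy = np.asarray(energy, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * (energy[mask] + np.log(q[mask]))))

def boltzmann(energy):
    """Return (q_star, log_Z) with q_star(x) = exp(-E(x)) / Z."""
    energy = np.asarray(energy, dtype=float)
    shift = np.max(-energy)              # subtract the max for numerical stability
    weights = np.exp(-energy - shift)
    log_z = float(np.log(weights.sum()) + shift)
    return weights / weights.sum(), log_z

energy = np.array([0.5, 1.0, 3.0, 0.1])  # arbitrary example energies
q_star, log_z = boltzmann(energy)

# Sample random distributions and measure how much worse they do than q_star.
rng = np.random.default_rng(0)
random_qs = rng.dirichlet(np.ones(len(energy)), size=1000)
gaps = np.array([free_energy(q, energy) for q in random_qs]) - free_energy(q_star, energy)

print("F[q*]  =", free_energy(q_star, energy))
print("-log Z =", -log_z)
print("smallest gap over random q:", gaps.min())
```

Under these assumptions, every gap should come out non-negative, and the free energy at the minimizer should match $-\log Z$ up to floating-point error.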
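Interestingly, minimizing the original loss is equivalent to minimizing the information divergence (KL divergence) between $q$ and $q^*$. Indeed, substituting $E(x) = -\log q^*(x) - \log Z$ into the free energy gives
$$F[q] = \sum_x q(x) \log \frac{q(x)}{q^*(x)} - \log Z = D(q \,\|\, q^*) - \log Z.$$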
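The identity between the free energy and the divergence can be checked numerically for an arbitrary test distribution. This is a minimal sketch: the energies, the test distribution `q`, and the helper `kl_divergence` are made up for illustration.

```python
import numpy as np

def kl_divergence(q, p):
    """D(q || p) = sum_x q(x) log(q(x) / p(x)), assuming p > 0 wherever q > 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

energy = np.array([2.0, 0.0, 1.0])       # arbitrary example energies
z = np.exp(-energy).sum()                # partition function
q_star = np.exp(-energy) / z             # minimizer of the free energy

q = np.array([0.2, 0.5, 0.3])            # any distribution with full support

# Left-hand side: expected energy minus entropy.
lhs = float(np.sum(q * energy) + np.sum(q * np.log(q)))
# Right-hand side: divergence to the minimizer, shifted by -log Z.
rhs = kl_divergence(q, q_star) - float(np.log(z))

print(lhs, rhs)
```

Since the divergence is non-negative and vanishes only at $q = q^*$, the same computation also shows that the free energy can never drop below $-\log Z$.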
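If instead we encourage $q$ to be spread with respect to some reference distribution (or “prior”) $p$, i.e. if we try to minimize
$$F[q] = \sum_x q(x)\,E(x) + D(q \,\|\, p),$$
then we can reduce to the previous case by folding $p$ into the energy term:
$$F[q] = \sum_x q(x)\,\bigl(E(x) - \log p(x)\bigr) - H[q].$$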
This means that the minimizer is given by $q^*(x) = p(x)\,e^{-E(x)} / Z$, where
$$Z = \sum_x p(x)\,e^{-E(x)},$$
and we still have $F[q] = D(q \,\|\, q^*) - \log Z$.
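To make the folding step concrete, the sketch below computes the minimizer in two ways: directly from the prior-weighted exponential of the negative energy, and by applying the prior-free formula to the folded energy $E(x) - \log p(x)$. The energies, the prior, and the helper `objective` are illustrative assumptions.

```python
import numpy as np

energy = np.array([1.0, 0.2, 2.5, 0.7])  # arbitrary example energies
prior = np.array([0.1, 0.4, 0.3, 0.2])   # reference distribution p, strictly positive

# Route 1: prior-weighted formula q*(x) = p(x) exp(-E(x)) / Z.
unnormalized = prior * np.exp(-energy)
z = unnormalized.sum()
q_direct = unnormalized / z

# Route 2: fold the prior into the energy, then use the prior-free formula.
folded_energy = energy - np.log(prior)
q_folded = np.exp(-folded_energy) / np.exp(-folded_energy).sum()

def objective(q):
    """Expected energy plus divergence from the prior, for q with full support."""
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * energy) + np.sum(q * np.log(q / prior)))

f_min = objective(q_direct)
print(np.allclose(q_direct, q_folded), f_min, -np.log(z))
```

Both routes should agree, and the objective at the minimizer should equal $-\log Z$ for the prior-weighted partition function.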
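In particular, if $E(x)$ represents a negative log likelihood $-\log p(y \mid x)$, then $F[q]$ is precisely the surprise upper bound, and $q^*$ is given by the posterior $p(x \mid y)$. This is because $Z = \sum_x p(x)\,p(y \mid x) = p(y)$, so $F[q] = D(q \,\|\, p(\cdot \mid y)) - \log p(y) \ge -\log p(y)$.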
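A toy Bayesian example may help here. The sketch assumes a hypothetical two-hypothesis model (a fair coin versus a coin that lands heads with probability 0.9) and an observation of three heads in a row; none of these numbers come from the text above.

```python
import numpy as np

prior = np.array([0.5, 0.5])             # p(x): fair coin vs. biased coin (assumed)
heads_prob = np.array([0.5, 0.9])        # p(heads | x) for each hypothesis (assumed)
likelihood = heads_prob ** 3             # p(y | x) for y = three heads in a row

energy = -np.log(likelihood)             # energy as negative log likelihood
evidence = float(np.sum(prior * likelihood))   # p(y), the partition function Z
surprise = -np.log(evidence)             # -log p(y)
posterior = prior * likelihood / evidence      # p(x | y) by Bayes' rule

def free_energy(q):
    """Expected negative log likelihood plus divergence from the prior."""
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * energy) + np.sum(q * np.log(q / prior)))

bound_at_prior = free_energy(prior)          # a loose upper bound on the surprise
bound_at_posterior = free_energy(posterior)  # the bound is tight at the posterior

print(surprise, bound_at_prior, bound_at_posterior)
```

In this sketch the free energy evaluated at the prior sits above the surprise, while at the posterior the two coincide, which is the sense in which the free energy acts as an upper bound.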