- the quantities i call “surprise” rn are more accurately just “rarity”
- but otoh if the rarity you give to outcomes is constant then i think you can be accurately described as “never surprised”
Maximizing the entropy of the distribution itself
Let $P$ be the maximum entropy distribution over some set $X$ (which we’ll assume is finite for simplicity). Then it’s trivial to show that the “surprise” of drawing some value $x \in X$ is never very large; in fact, it’s a constant:
\forall x:\log \f1{P(x)} = \H\p P,
(and $\H(P)$ is just $\log \abs X$). This follows directly from the fact that the maximum entropy distribution is the uniform distribution $P(x) = \f1{\abs X}$.
An alternative way to prove this (which will generalize better) is to study the maximization problem directly. Let’s introduce a dual variable $\lambda$ for the constraint that probabilities sum to $1$. Then for each $x \in X$, the gradient
\partf{}{P(x)}{\p{\H\p P - \lambda \sum_{x' \in X} P\p{x'}}}
&= \partf{}{P(x)}\p{P(x) \log \f1{P(x)}}-\lambda\\
%&= \partf{\p{\sum_y \p{P(y) \log \f1{P(y)}- \lambda P(y)}}}{P(x)}\\
&= \log \f1{P\p x} - 1 - \lambda
must be $0$, so the surprise $\log \f1{P(x)}=\lambda+1$ is the same for all $x$, and its average $\E_{\Bx \sim P}\b{\log \f1{P(\Bx)}}$ is precisely $\H\p P$ by definition, so we’re done.
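As a quick sanity check of the stationarity condition, here is a worked instance. The four-element set is a hypothetical choice made only for illustration; logarithms are natural (matching the $-1$ in the gradient), and the block reuses this note's own macros.

```latex
\begin{align*}
\abs X = 4:\quad P(x) &= \f{1}{4} \text{ for every } x \in X,\\
\log \f1{P(x)} &= \log 4 \approx 1.386,\\
\lambda &= \log 4 - 1 \approx 0.386,\\
\H\p P &= \sum_{x \in X} \f{1}{4} \log 4 = \log 4.
\end{align*}
```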
Maximizing the entropy of a downstream quantity
Now, suppose that we will put the values $x$ through some function $f$, and we choose an “input distribution” $D$ to maximize the entropy $\H_{\Bx \sim D}\b{f(\Bx)}$ of the output. Since we’re not maximizing the entropy of $D$ directly, some values of $x$ might be much rarer than others. In fact, if $f$ is not injective, $D$ will not even be uniquely determined, and it could give probability $0$ to some values $x$. But it’s still easy to see that for any $x$, the corresponding output $f(x)$ won’t be too rare: the surprise from getting $f(x)$ will be exactly the entropy of $f(\Bx)$.
Formally, let $Q \ce {}_{\Bx \sim D}\b{f(\Bx)}$ be the distribution of the output of $f$. We can see it as a function of the probabilities $D(x)$ if we write it as
Q(y) \ce \sum_{x:f(x)=y}D(x).
Then for each $x \in X$, the gradient
\partf{}{D(x)}\p{\H\p {Q} - \lambda \sum_{x' \in X} D\p{x'}}
&= \partf{}{D(x)}\p{\sum_y {Q}(y) \log \f1{Q(y)}} - \lambda\\
&= \partf{}{D(x)}\p{Q\p{f \p x} \log \f1{Q\p{f \p x}}} - \lambda\\
&= \partf{}{Q\p{f(x)}}\p{Q\p{f \p x} \log \f1{Q\p{f \p x}}} - \lambda\\
&= \log \f1{Q\p{f(x)}}-1 - \lambda
must be $0$, so the surprise $\log \f1{Q\p{f(x)}}$ is the same for all $x$. And (similarly to the previous section), we can observe that the average over $\Bx \sim D$ of this quantity is precisely $\H\p Q$:
\E_{\Bx \sim D}\b{\log \f1{Q\p{f(\Bx)}}} = \E_{\By \sim Q}\b{\log \f1{Q\p{\By}}} = \H\p Q
so we get the desired result
\forall x: \log \f1{Q\p{f(x)}} = \H\p Q.
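A small worked case may make this concrete. The three-element set, the function $f$, and the particular $D$ below are hypothetical choices for illustration, written with this note's macros.

```latex
\begin{align*}
&X = \{a, b, c\}, \qquad f(a) = f(b) = 0, \qquad f(c) = 1,\\
&Q(0) = D(a) + D(b), \qquad Q(1) = D(c),\\
&\H\p Q \text{ is maximal iff } Q(0) = Q(1) = \f{1}{2},
  \text{ e.g. } D(a) = \f{1}{2},\ D(b) = 0,\ D(c) = \f{1}{2},\\
&\log \f1{Q\p{f(a)}} = \log \f1{Q\p{f(b)}} = \log \f1{Q\p{f(c)}} = \log 2 = \H\p Q.
\end{align*}
```

Note that $b$ has probability $0$ under this particular $D$, yet its output $f(b)$ is no rarer than any other output.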
What’s perhaps a bit surprising is that something similar continues to hold if we move from a fixed function $f$ to a random function $\Bf$, with $D$ chosen to maximize the entropy of the output on average over the choice of $\Bf$. More precisely, let $Q_f \ce {}_{\Bx \sim D}\b{f(\Bx)}$ be the distribution of outputs over $\Bx \sim D$ for some fixed $f$, and assume we choose $D$ to maximize the average entropy of these distributions $Q_\Bf$:
\E_\Bf\b{\H\p{Q_\Bf}} = \E_\Bf\b{\H_{\Bx \sim D}\bco{\Bf(\Bx)}{\Bf}}.
We’ll show that on average over $\Bf$, the output $\Bf(x)$ won’t be too rare: the average surprise from getting $\Bf(x)$ (as a draw from $Q_\Bf$) will be exactly the average entropy of $Q_\Bf$.
If $D$ maximizes $\E_\Bf\b{\H\p{Q_\Bf}}$, then for each $x \in X$, the gradient
\partf{}{D(x)}\p{\E_\Bf\b{\H\p{Q_\Bf}} - \lambda \sum_{x' \in X} D\p{x'}}
&= \E_\Bf\b{\partf{}{D(x)}\p{\sum_y Q_\Bf(y) \log \f1{Q_\Bf(y)}}} - \lambda\\
&= \E_\Bf\b{\partf{}{D(x)}\p{Q_\Bf\p{\Bf \p x} \log \f1{Q_\Bf\p{\Bf \p x}}}} - \lambda\\
&= \E_\Bf\b{\log \f1{Q_\Bf\p{\Bf(x)}}}-1 - \lambda
must be $0$, so the average surprise $\E_\Bf\b{\log \f1{Q_\Bf\p{\Bf(x)}}}$ is the same for all $x$. And once again, we observe that the average over $\Bx \sim D$ of this quantity is the average entropy:
\E_{\Bx \sim D}\b{\E_\Bf\b{\log \f1{Q_\Bf\p{\Bf(\Bx)}}}}
&= \E_\Bf\b{\E_{\Bx \sim D}\b{\log \f1{Q_\Bf\p{\Bf(\Bx)}}}}\\
&= \E_\Bf\b{\E_{\By \sim Q_\Bf}\b{\log \f1{Q_\Bf\p{\By}}}}\\
&= \E_\Bf\b{\H\p{Q_\Bf}},
so we get the desired result
\forall x:\E_\Bf\b{\log \f1{Q_\Bf\p{\Bf(x)}}} = \E_\Bf\b{\H\p{Q_\Bf}}.
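To see that the equality holds only on average over $\Bf$, here is a worked case. All ingredients are hypothetical choices for illustration: $X = \{a, b, c\}$, and $\Bf$ equal to $f_1$ or $f_2$ with probability $\f{1}{2}$ each, where $f_1$ is injective and $f_2$ merges $a$ and $b$. The helper $h$ (binary entropy) is introduced here and is not part of the note; logarithms are natural.

```latex
\begin{align*}
&f_1(a) = 0,\ f_1(b) = 1,\ f_1(c) = 2, \qquad f_2(a) = f_2(b) = 0,\ f_2(c) = 1,\\
&\text{by symmetry write } D(a) = D(b) = \f{1-q}{2},\ D(c) = q,
  \quad h(q) \ce q \log \f1{q} + (1-q) \log \f1{1-q},\\
&\E_\Bf\b{\H\p{Q_\Bf}}
  = \f{1}{2}\p{h(q) + (1-q)\log 2} + \f{1}{2} h(q)
  = h(q) + \f{1-q}{2} \log 2,\\
&\text{maximized where } \log \f{1-q}{q} = \f{1}{2} \log 2,
  \text{ i.e. } q = \sqrt 2 - 1 \approx 0.414,\\
&\E_\Bf\b{\log \f1{Q_\Bf\p{\Bf(c)}}} = \log \f1{q} = \log\p{1 + \sqrt 2} \approx 0.881,\\
&\E_\Bf\b{\log \f1{Q_\Bf\p{\Bf(a)}}}
  = \f{1}{2} \log \f{2}{1-q} + \f{1}{2} \log \f1{1-q}
  = \log\p{1 + \sqrt 2} \approx 0.881,\\
&\H\p{Q_{f_1}} \approx 1.084, \qquad \H\p{Q_{f_2}} \approx 0.678,
  \qquad \E_\Bf\b{\H\p{Q_\Bf}} \approx 0.881.
\end{align*}
```

Under $f_1$ alone, the surprise of $f_1(a)$ is $\log \f{2}{1-q} \approx 1.228$, which differs from $\H\p{Q_{f_1}} \approx 1.084$; only the averages over $\Bf$ agree.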
Minimizing the divergence relative to some prior
The information divergence between distributions can be decomposed in a way that involves the negative entropy:
\D\pff{Q}{P} = \E_{\Bx \sim Q}\b{\log \f1{P(\Bx)}} - \H\p{Q},
so it should not be too surprising that we can get analogous results when minimizing the divergence of our output distributions $Q_f$ relative to some given prior $P_f$, as long as we measure surprise relative to $P_f$ as well. More generally, similar results hold for minimization of the variational free energy.
Formally, if $D$ minimizes $\E_\Bf\b{\D\pff{Q_\Bf}{P_\Bf}}$, then for each $x \in X$, the gradient
\partf{}{D(x)}\p{\E_\Bf\b{\D\pff{Q_\Bf}{P_\Bf}} - \lambda \sum_{x' \in X} D\p{x'}}
&= \E_\Bf\b{\partf{}{D(x)}\p{\sum_y Q_\Bf(y) \log \f{Q_\Bf(y)}{P_\Bf(y)}}} - \lambda\\
&= \E_\Bf\b{\partf{}{D(x)}\p{Q_\Bf\p{\Bf \p x} \log \f{Q_\Bf\p{\Bf \p x}}{P_\Bf\p{\Bf \p x}}}} - \lambda\\
&= \E_\Bf\b{\log \f{Q_\Bf\p{\Bf(x)}}{P_\Bf\p{\Bf\p x}}}+1 - \lambda
must be $0$, so the average relative surprise $\E_\Bf\b{\log \f{Q_\Bf\p{\Bf(x)}}{P_\Bf\p{\Bf\p x}}}$ is the same for all $x$, and its average over $\Bx \sim D$ is the average divergence $\E_\Bf \b{\D\pff{Q_\Bf}{P_\Bf}}$, so we get
\forall x: \E_\Bf\b{\log \f{Q_\Bf\p{\Bf(x)}}{P_\Bf\p{\Bf\p x}}} = \E_\Bf \b{\D\pff{Q_\Bf}{P_\Bf}}.
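A worked case with two priors may help here. Everything below is a hypothetical setup chosen for illustration: $X = \{a, b\}$, $\Bf$ equal to $f_1$ or $f_2$ with probability $\f{1}{2}$ each, and the two priors as listed. Because both functions are injective in this sketch, setting the gradient to zero gives a $D$ proportional to the geometric mean of the priors pulled back to $X$.

```latex
\begin{align*}
&X = \{a, b\}, \qquad f_1(a) = 0,\ f_1(b) = 1, \qquad f_2(a) = 1,\ f_2(b) = 0,\\
&P_{f_1}(0) = \f{4}{5},\ P_{f_1}(1) = \f{1}{5}, \qquad P_{f_2}(0) = P_{f_2}(1) = \f{1}{2},\\
&\text{both } f_i \text{ are injective, so } Q_{f_i}\p{f_i(x)} = D(x) \text{ and the minimizer satisfies}\\
&D(x) \propto \sqrt{P_{f_1}\p{f_1(x)}\, P_{f_2}\p{f_2(x)}}
  \ \Longrightarrow\ D(a) = \f{2}{3},\ D(b) = \f{1}{3},\\
&\E_\Bf\b{\log \f{Q_\Bf\p{\Bf(a)}}{P_\Bf\p{\Bf(a)}}}
  = \f{1}{2} \log \f{2/3}{4/5} + \f{1}{2} \log \f{2/3}{1/2}
  = \f{1}{2} \log \f{10}{9},\\
&\E_\Bf\b{\log \f{Q_\Bf\p{\Bf(b)}}{P_\Bf\p{\Bf(b)}}}
  = \f{1}{2} \log \f{1/3}{1/5} + \f{1}{2} \log \f{1/3}{1/2}
  = \f{1}{2} \log \f{10}{9},\\
&\E_\Bf\b{\D\pff{Q_\Bf}{P_\Bf}} = \f{1}{2} \log \f{10}{9} \approx 0.053.
\end{align*}
```

The relative surprise under a single $f_i$ can be negative (for instance $\log \f{2/3}{4/5} < 0$ for $a$ under $f_1$); it is the average over $\Bf$ that is constant in $x$ and equal to the average divergence.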