$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}{\sqrt} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. misc \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{\left.#1\right|_{#2}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) 
\newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) \newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{\f{\d #1}{\d #2}} \newcommand{\ds}[2]{\s{\d #1}{\d #2}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. 
sets \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. 
keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \DeclareMathOperator{\Der}{\mathrm{D}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} \DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Th}{\mathrm{Th}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. 
bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} \newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. 
typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

The entropy[^1] of a random variable is a quantitative notion of how uncertain we are about its outcome. More precisely, it’s the “average surprise”[^2] you get from samples, where surprise is measured as $\log\p{\f1{\text{probability of getting that outcome}}}$:

\[ \H[\BX] \ce \E_{x \sim \BX}\b{\log \frac{1}{\Pr[\BX=x]}}. \]

It’s the average “amount of information” you get from observing $\BX$: the number of bits you need to describe draws from $\BX$, amortized over many draws.

By extension, we define the entropy of a distribution $P$ as

\[ \H\p{P} \ce \E_{x \sim P}\b{\log \f{1}{P(x)}}. \]
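As a sanity check, here is a minimal numerical sketch (the distribution and the `entropy` helper are made up for illustration): the empirical average of the surprise $\log \f{1}{P(x)}$ over many draws converges to $\H(P)$.

```python
import math, random

def entropy(P):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

# Hypothetical example distribution:
P = {"a": 0.5, "b": 0.25, "c": 0.25}

# Average surprise over many draws ≈ entropy (the "amortized" description length):
draws = random.choices(list(P), weights=list(P.values()), k=100_000)
avg_surprise = sum(math.log2(1 / P[x]) for x in draws) / len(draws)

print(entropy(P))    # 1.5 bits exactly
print(avg_surprise)  # ≈ 1.5 bits
```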

Properties

  • Entropy ranges between $0$ (if $\BX$ is deterministic) and $\log |\CX|$ (if $\BX$ is uniform), where $\CX$ is the support of $\BX$.
  • It’s subadditive: $\H[\BX,\BY] \leq \H[\BX] + \H[\BY]$, with equality iff $\BX$ and $\BY$ are independent.
    • Intuitively, you can’t be more surprised by the combination of $\BX$ and $\BY$ than you were by seeing both outcomes separately.
    • The “correlated part” of $\BX$ and $\BY$ is counted once in $\H[\BX,\BY]$ but twice in $\H[\BX] + \H[\BY]$.

We will prove $\H[\BX] \leq \log |\CX|$ by looking at entropy deficit, and subadditivity by looking at mutual information.
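In the meantime, here is a quick numerical check of both properties on a made-up correlated pair (reusing the `entropy` helper from the sketch above):

```python
import math

def entropy(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

# Hypothetical correlated pair (X, Y): the two bits agree with probability 0.8.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
X = {0: 0.5, 1: 0.5}  # marginal of X
Y = {0: 0.5, 1: 0.5}  # marginal of Y

print(entropy(X), math.log2(len(X)))  # 1.0 = log|support|: uniform achieves the max
print(entropy(joint))                 # H[X,Y] ≈ 1.722
print(entropy(X) + entropy(Y))        # H[X] + H[Y] = 2.0: subadditivity, strict since correlated
```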

Binary entropy

In particular, let $H(p) \ce p\log\frac{1}{p} + (1-p)\log\frac{1}{1-p}$ denote the entropy of a Bernoulli with mean $p$. Then

  • $H(p)$ is concave,
  • $H(p)$ is symmetric around $\frac12$: $H(p) = H(1-p)$,
  • when $p\approx \frac12$, $H(p) = 1-\Theta\p{\p{p-\frac{1}{2}}^2}$,
  • when $p \le \frac12$, $H(p) = \Theta\p{p \log\frac{1}{p}}$.
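For the third point, write $p = \frac12 + \delta$ and Taylor-expand both logarithms; a standard computation gives (in bits)

\[ H\p{\tfrac{1}{2}+\delta} = 1 - \f{2}{\ln 2}\,\delta^2 + O\p{\delta^4}, \]

so the deficit from the maximum is indeed quadratic in $p - \frac12$. The fourth point follows since for $p \le \frac12$, the second term satisfies $(1-p)\log\f{1}{1-p} = O(p) = O\p{p\log\f1p}$, so the first term $p\log\f1p$ dominates.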

Conditional entropy

Similarly, we can define conditional entropy as the average surprise you get from $\BY$ if you’ve already seen $\BX$, or the number of bits you need to describe draws from $\BY$ to someone who knows $\BX$:

\[ \al{ \H\bco{\BY}{\BX} \ce&\ \E_{x \sim \BX}\b{\H[\at{\BY}{\BX=x}]}\\ =&\ \E_{(x,y) \sim (\BX,\BY)}\b{\log \frac{1}{\Pr\bco{\BY=y}{\BX=x}}} } \]

where $\at{\BY}{\BX=x}$ denotes the random variable obtained by conditioning $\BY$ on $\BX=x$. Confusingly, $\H\bco\BY\BX$ denotes a fixed value, in contrast with the conditional expectation $\E\bco{\BY}{\BX}$, which is a random variable depending on $\BX$.
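As a concrete sketch (made-up joint distribution, `entropy` helper as before), the two formulas above agree:

```python
import math

def entropy(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

# Hypothetical joint distribution of (X, Y):
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pX = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}

# First formula: average over x ~ X of the entropy of Y conditioned on X = x.
H_Y_given_X = sum(
    pX[x] * entropy({y: p / pX[x] for (a, y), p in joint.items() if a == x})
    for x in pX
)
print(H_Y_given_X)                   # ≈ 0.722
print(entropy(joint) - entropy(pX))  # same value, anticipating the chain rule below
```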

Chain rule

The conditional entropy gives a chain rule:

\[ \al{ \H[\BX,\BY] &= \E_{(x,y) \sim (\BX,\BY)}\b{\log\frac{1}{\Pr[\BX = x \and \BY=y]}}\\ &= \E_{(x,y) \sim (\BX,\BY)}\b{\log\frac{1}{\Pr[\BX = x]\Pr\bco{\BY=y}{\BX=x}}}\\ &= \H[\BX] + \H\bco\BY\BX, } \]

which immediately shows that adding data can only increase the entropy (since $\H\bco{\BY}{\BX} \ge 0$):

\[ \H[\BX,\BY] \ge \H[\BX]. \]

Mixtures and concavity

Together with subadditivity, this shows that conditioning can only decrease entropy:

\[ \al{ \H\b{\BY} &\ge \H\b{\BX,\BY} - \H[\BX]\tag{subadditivity}\\ &= \H\bco{\BY}{\BX},\tag{chain rule} } \]

with equality iff $\BX$ and $\BY$ are independent.

If we see $\BY$ as a mixture[^3] $\BY_{\BX}$ of the variables $\BY_x \ce \at{\BY}{\BX=x}$ weighted by $\BX$, then this is saying that entropy is strictly concave:

\[ \ob{\E_{x \sim \BX}\b{\H\b{\BY_x}}}^{\H\bco{\BY}{\BX}}_\text{average of entropies}\le \ob{\H\b{\BY_{\BX}}}^{\H\b{\BY}}_\text{entropy of mixture} \le \ob{\H\b{\BX,\BY_\BX}}^{\H\b{\BX,\BY}}_\text{entropy of ``disjoint mixture''}, \]

with equality respectively iff the variables $\BY_x$ are all identical, and iff they have disjoint supports.
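Here is the sandwich on a made-up mixture: $\BX$ is a fair coin choosing between two biased coins $\BY_0$ and $\BY_1$.

```python
import math

def entropy(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

# Hypothetical components of the mixture:
Y0 = {0: 0.9, 1: 0.1}  # Y | X = 0
Y1 = {0: 0.1, 1: 0.9}  # Y | X = 1
mix = {y: 0.5 * Y0[y] + 0.5 * Y1[y] for y in Y0}                # Y itself
joint = {(x, y): 0.5 * (Y0 if x == 0 else Y1)[y] for x in (0, 1) for y in (0, 1)}

print(0.5 * entropy(Y0) + 0.5 * entropy(Y1))  # H[Y|X] ≈ 0.469: average of entropies
print(entropy(mix))                           # H[Y]   = 1.0:   entropy of mixture
print(entropy(joint))                         # H[X,Y] ≈ 1.469: entropy of "disjoint mixture"
```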

#to-write

  • the “amount of information” thing makes sense? because it’s the ==information you gain== if you start from a prior of $\BX$’s distribution and then actually observe $\BX$? i.e. $\E_{\Bx \sim [\BX]}\b{\D\bpa{\at{\BX}{\BX = \Bx}}{\BX}}$
  • okay I’m now thoroughly KL-pilled, it’s the source of everything. $\D\ppa{Q}{P}$ is the information you’ve gained if you started from a prior of $P$ and somehow your prior is now $Q$
    • but I want to understand the sense in which $\D\ppa{Q}{P}$ is a lower bound on the “number of questions you must have asked in order to get there”?
      • i.e. the information $\BX$ gains you about $\BX$ is limited by the information that it gives about itself
      • i.e. $\E_{\Bx \sim [\BX]}\b{\D\bpa{\at{\BY}{\BX=\Bx}}{\BY}} \ge \H[\BX]$?
      • actually I believe you’ve just rediscovered $\I[\BX;\BY] \ge \I[\BX;\BX] = \H[\BX]$
        • darn information theory is so cool
      • and btw you can write this as $\E\b{\D\bpa{\cond{\BY}{\BX}}{\BY}}$
        • and this directly shows that it’s equal to $\I[\BX;\BY] = \H[\BY] - \H\bco{\BY}{\BX}$ I guess (see the numerical sketch after this list)
          • though maybe you should swallow the pill and not talk about entropy ever again!
          • note: I guess it also means $\I[\BX;\BY] = \I[\BY;\BY] - \I\bco{\BY;\BY}{\BX} = -\I[\BY;\BY;\BX]$?
            • cool story bro
            • wait is it the case that in general if you add a variable that’s already present it just flips the sign?
            • $\I[\BX;\BY;\BZ;\BZ] = \I\bco{\BX;\BY;\BZ}{\BZ} - \I\b{\BX;\BY;\BZ}$
              • and somehow I’m supposed to just know that $\I\bco{\BX;\BY;\BZ}{\BZ} = 0$?
              • I guess this just suggests that if any variable is a constant then the interaction is $0$, which would make a whole lot of sense tbh
              • very similar to Wick in that case!
        • really, entropy should be negative…
    • okay I’m this close to just replacing all occurrences of $\D$ with $\I$
      • and to have mutual information as a more core concept than entropy
      • and to literally just define $\H[\BX]$ as $\I[\BX;\BX]$
        • this would explain why you have to somehow bother to use two copies of $\BX$ in the definition of entropy!
          • (and same thing in basically all of the power entropies)
        • ultimately ==information is about something providing information about another thing==
          • there is no such thing as “the information of one thing”
          • and the information divergence is the core of what information means
          • and it’s also what makes information theory so nice: it provides all the nontrivial inequalities!
      • does $\I[\BX;\BX;\BX]$ make sense? it’s $\I\bco{\BX;\BX}{\BX} - \I[\BX;\BX] = 0-\H[\BX] = -\H[\BX]$ hmm maybe this is slightly cursed?
      • and $\I[\BX;\BX;\BX;\BX] = \I\bco{\BX;\BX;\BX}{\BX} - \I[\BX;\BX;\BX] = \H[\BX]$? so it keeps alternating, hmmm…
        • actually this is support for defining $\I[\BX] \ce -\H[\BX]$ instead of $\D[\BX]$
          • indeed, $\I[\BX;\BX] = \I\bco{\BX}{\BX} - \I\b{\BX}$
            • and I’m not sure what this first quantity $\I\bco{\BX}{\BX}$ is supposed to mean but it sure seems like it should be $0$?
              • if you set $\I[\BX] \ce \D[\BX]$ then this quantity is $\H[\BX] + \D[\BX] =$ entropy of the prior on $\BX$?
            • actually very confusing
          • and $\I[\BX] = \I\bco{}{\BX} - \I\b{}$
            • i.e. “how much does $\BX$ help you predict nothing”??
            • so we’d have to accept that $\I\bco{}{\BX} = \I[\BX] + C = C-\H[\BX]$
              • i.e. $\BX$ is actively detrimental to the task of predicting nothing? (???)
          • actually feels like we should have $\I\bco{}{\BX} = \I[]$ given that $\BX$ is independent of everything in the LHS (this is definitely a rule right?)
            • which would mean $\I[\BX] = 0$
            • but then $\I\bco{\BX}{\BX} = \I\b{\BX;\BX} = \H[\BX]$, even though $\I\bco{\BX}{\BX} = \E_{\Bx \sim [\BX]}\b{\I\b{\at{\BX}{\BX = \Bx}}}$ and $\at{\BX}{\BX=x}$ is literally a point mass no matter what $x$ is
              • actually that’s a general argument for $\I\bco{\BX}{\BX}$ being a constant $C$ (not clear which constant, but $0$ would seem like a reasonable choice here since we get zero whenever there is any deterministic constant in the expression?)
              • so $\I[\BX]$ should be $C-\H[\BX]$
          • to summarize:
            • given that $\I[\BX] = C - \H[\BX]$ is the only tenable position, we would have to have $\I\bco{}{\BX} = C' - \H[\BX]$
            • but then that violates the rule of “if $\BX$ is independent from the joint distribution of the things in the LHS then the information should be $0$”
          • okay I vote for $\I[\BX]$ just being undefined!! $\I[\BX;\BY]$ is clearly the core quantity here
            • and it seems like this makes things better in the case of continuous random variables as well
        • it’s in fact somewhat baffling that the interaction information is even well-defined (that it’s consistent and symmetric)
      • actually I think I’d be on board to keep $\D[\cdot]$ only for the deficit
  • KL divergence is to entropy the same as squared distance (!!) is to variance?
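A quick numerical check of the identity floated above, $\I[\BX;\BY] = \E_{\Bx \sim [\BX]}\b{\D\bpa{\at{\BY}{\BX=\Bx}}{\BY}} = \H[\BY] - \H\bco{\BY}{\BX}$ (made-up joint distribution, helpers as in the sketches above):

```python
import math

def entropy(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

def kl(Q, P):
    """D(Q || P) in bits, assuming supp(Q) ⊆ supp(P)."""
    return sum(q * math.log2(q / P[y]) for y, q in Q.items() if q > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pX = {0: 0.5, 1: 0.5}
pY = {0: 0.5, 1: 0.5}
Y_given = {x: {y: joint[(x, y)] / pX[x] for y in (0, 1)} for x in (0, 1)}

info_as_kl = sum(pX[x] * kl(Y_given[x], pY) for x in pX)  # E_x[ D(Y|X=x || Y) ]
H_Y_given_X = sum(pX[x] * entropy(Y_given[x]) for x in pX)
print(info_as_kl)                 # ≈ 0.278
print(entropy(pY) - H_Y_given_X)  # same: I[X;Y] = H[Y] - H[Y|X]
```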
[^1]: More precisely “Shannon entropy” or “information entropy”, to distinguish it from other notions of uncertainty based purely on its probability mass function, such as power entropies.

[^2]: Note that this is a pretty stupid notion of surprise. If I ask you to draw me numbers between $1$ and $1000$ and every time you do so I go “woah that’s crazy, this number only has a $1/1000$ chance of appearing”, then you would be justified in questioning my sanity.

[^3]: A convex combination of distributions.