$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}{\sqrt} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. misc \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{\left.#1\right|_{#2}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) 
\newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) \newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{\f{\d #1}{\d #2}} \newcommand{\ds}[2]{\s{\d #1}{\d #2}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. 
sets \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. 
keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \DeclareMathOperator{\Der}{\mathrm{D}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} \DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Th}{\mathrm{Th}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. 
bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} \newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. 
typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

The entropy[^1] of a random variable is a quantitative notion of how uncertain we are about its outcome. More precisely, it’s the “average surprise”[^2] you get from samples, where surprise is measured as $\log\p{\f1{\text{probability of getting that outcome}}}$:

\[ \H[\BX] \ce \E_{x \sim \BX}\b{\log \frac{1}{\Pr[\BX=x]}}. \]

It’s the average “amount of information” you get from observing $\BX$: the number of bits you need to describe draws from $\BX$, amortized over many draws.

By extension, we define the entropy of a distribution $P$ as

\[ \H\p{P} \ce \E_{x \sim P}\b{\log \f{1}{P(x)}}. \]
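As a sanity check, here is a minimal numerical sketch (the distribution and the `entropy` helper are made up for illustration): the empirical average of the surprise $\log \f{1}{P(x)}$ over many draws converges to $\H(P)$.

```python
import math, random

def entropy(P):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

# Hypothetical example distribution:
P = {"a": 0.5, "b": 0.25, "c": 0.25}

# Average surprise over many draws ≈ entropy (the "amortized" description length):
draws = random.choices(list(P), weights=list(P.values()), k=100_000)
avg_surprise = sum(math.log2(1 / P[x]) for x in draws) / len(draws)

print(entropy(P))    # 1.5 bits exactly
print(avg_surprise)  # ≈ 1.5 bits
```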

Properties

  • Entropy ranges between $0$ (if $\BX$ is deterministic) and $\log |\CX|$ (if $\BX$ is uniform), where $\CX$ is the support of $\BX$.
  • It’s subadditive: $\H[\BX,\BY] \leq \H[\BX] + \H[\BY]$, with equality iff $\BX$ and $\BY$ are independent.
    • Intuitively, you can’t be more surprised by the combination of $\BX$ and $\BY$ than you were by seeing both outcomes separately.
    • The “correlated part” of $\BX$ and $\BY$ is counted once in $\H[\BX,\BY]$ but twice in $\H[\BX] + \H[\BY]$.

We will prove $\H[\BX] \leq \log |\CX|$ by looking at entropy deficit, and subadditivity by looking at mutual information.
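In the meantime, here is a quick numerical check of both properties on a made-up correlated pair (reusing the `entropy` helper from the sketch above):

```python
import math

def entropy(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

# Hypothetical correlated pair (X, Y): the two bits agree with probability 0.8.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
X = {0: 0.5, 1: 0.5}  # marginal of X
Y = {0: 0.5, 1: 0.5}  # marginal of Y

print(entropy(X), math.log2(len(X)))  # 1.0 = log|support|: uniform achieves the max
print(entropy(joint))                 # H[X,Y] ≈ 1.722
print(entropy(X) + entropy(Y))        # H[X] + H[Y] = 2.0: subadditivity, strict since correlated
```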

Binary entropy

In particular, let $H(p) \ce p\log\frac{1}{p} + (1-p)\log\frac{1}{1-p}$ denote the entropy of a Bernoulli with mean $p$. Then

  • $H(p)$ is concave,
  • $H(p)$ is symmetric around $\frac12$: $H(p) = H(1-p)$,
  • when $p\approx \frac12$, $H(p) = 1-\Theta\p{\p{p-\frac{1}{2}}^2}$,
  • when $p \le \frac12$, $H(p) = \Theta\p{p \log\frac{1}{p}}$.
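For the third point, write $p = \frac12 + \delta$ and Taylor-expand both logarithms; a standard computation gives (in bits)

\[ H\p{\tfrac{1}{2}+\delta} = 1 - \f{2}{\ln 2}\,\delta^2 + O\p{\delta^4}, \]

so the deficit from the maximum is indeed quadratic in $p - \frac12$. The fourth point follows since for $p \le \frac12$, the second term satisfies $(1-p)\log\f{1}{1-p} = O(p) = O\p{p\log\f1p}$, so the first term $p\log\f1p$ dominates.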

Conditional entropy

Similarly, we can define conditional entropy as the average surprise you get from $\BY$ if you’ve already seen $\BX$, or the number of bits you need to describe draws from $\BY$ to someone who knows $\BX$:

\[ \al{ \H\bco{\BY}{\BX} \ce&\ \E_{x \sim \BX}\b{\H[\at{\BY}{\BX=x}]}\\ =&\ \E_{(x,y) \sim (\BX,\BY)}\b{\log \frac{1}{\Pr\bco{\BY=y}{\BX=x}}} } \]

where $\at{\BY}{\BX=x}$ denotes the random variable obtained by conditioning $\BY$ on $\BX=x$. Confusingly, $\H\bco\BY\BX$ denotes a fixed value, in contrast with the conditional expectation $\E\bco{\BY}{\BX}$, which is a random variable depending on $\BX$.
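As a concrete sketch (made-up joint distribution, `entropy` helper as before), the two formulas above agree:

```python
import math

def entropy(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

# Hypothetical joint distribution of (X, Y):
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pX = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}

# First formula: average over x ~ X of the entropy of Y conditioned on X = x.
H_Y_given_X = sum(
    pX[x] * entropy({y: p / pX[x] for (a, y), p in joint.items() if a == x})
    for x in pX
)
print(H_Y_given_X)                   # ≈ 0.722
print(entropy(joint) - entropy(pX))  # same value, anticipating the chain rule below
```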

Chain rule

The conditional entropy gives a chain rule:

\[ \al{ \H[\BX,\BY] &= \E_{(x,y) \sim (\BX,\BY)}\b{\log\frac{1}{\Pr[\BX = x \and \BY=y]}}\\ &= \E_{(x,y) \sim (\BX,\BY)}\b{\log\frac{1}{\Pr[\BX = x]\Pr\bco{\BY=y}{\BX=x}}}\\ &= \H[\BX] + \H\bco\BY\BX, } \]

which immediately shows that adding data can only increase the entropy (since $\H\bco{\BY}{\BX} \ge 0$):

\[ \H[\BX,\BY] \ge \H[\BX]. \]

Mixtures and concavity

Together with subadditivity, this shows that conditioning can only decrease entropy:

\[ \al{ \H\b{\BY} &\ge \H\b{\BX,\BY} - \H[\BX]\tag{subadditivity}\\ &= \H\bco{\BY}{\BX},\tag{chain rule} } \]

with equality iff $\BX$ and $\BY$ are independent.

If we see $\BY$ as a mixture[^3] $\BY_{\BX}$ of the variables $\BY_x \ce \at{\BY}{\BX=x}$ weighted by $\BX$, then this is saying that entropy is strictly concave:

\[ \ob{\E_{x \sim \BX}\b{\H\b{\BY_x}}}^{\H\bco{\BY}{\BX}}_\text{average of entropies}\le \ob{\H\b{\BY_{\BX}}}^{\H\b{\BY}}_\text{entropy of mixture} \le \ob{\H\b{\BX,\BY_\BX}}^{\H\b{\BX,\BY}}_\text{entropy of ``disjoint mixture''}, \]

with equality respectively iff the variables $\BY_x$ are all identical, and iff they have disjoint supports.
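Here is the sandwich on a made-up mixture: $\BX$ is a fair coin choosing between two biased coins $\BY_0$ and $\BY_1$.

```python
import math

def entropy(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

# Hypothetical components of the mixture:
Y0 = {0: 0.9, 1: 0.1}  # Y | X = 0
Y1 = {0: 0.1, 1: 0.9}  # Y | X = 1
mix = {y: 0.5 * Y0[y] + 0.5 * Y1[y] for y in Y0}                # Y itself
joint = {(x, y): 0.5 * (Y0 if x == 0 else Y1)[y] for x in (0, 1) for y in (0, 1)}

print(0.5 * entropy(Y0) + 0.5 * entropy(Y1))  # H[Y|X] ≈ 0.469: average of entropies
print(entropy(mix))                           # H[Y]   = 1.0:   entropy of mixture
print(entropy(joint))                         # H[X,Y] ≈ 1.469: entropy of "disjoint mixture"
```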

#to-write

  • the “amount of information” thing makes sense? because it’s the ==information you gain== if you start from a prior of $\BX$’s distribution and then actually observe $\BX$? i.e. $\E_{\Bx \sim [\BX]}\b{\D\bpa{\at{\BX}{\BX = \Bx}}{\BX}}$
  • okay I’m now thoroughly KL-pilled, it’s the source of everything. $\D\ppa{Q}{P}$ is the information you’ve gained if you started from a prior of $P$ and somehow your prior is now $Q$
    • but I want to understand the sense in which $\D\ppa{Q}{P}$ is a lower bound on the “number of questions you must have asked in order to get there”?
      • i.e. the information $\BX$ gains you about $\BX$ is limited by the information that it gives about itself
      • i.e. $\E_{\Bx \sim [\BX]}\b{\D\bpa{\at{\BY}{\BX=\Bx}}{\BY}} \ge \H[\BX]$?
      • actually I believe you’ve just rediscovered $\I[\BX;\BY] \ge \I[\BX;\BX] = \H[\BX]$
        • darn information theory is so cool
      • and btw you can write this as $\E\b{\D\bpa{\cond{\BY}{\BX}}{\BY}}$
        • and this directly shows that it’s equal to $\I[\BX;\BY] = \H[\BY] - \H\bco{\BY}{\BX}$ I guess (see the numerical sketch after this list)
          • though maybe you should swallow the pill and not talk about entropy ever again!
          • note: I guess it also means $\I[\BX;\BY] = \I[\BY;\BY] - \I\bco{\BY;\BY}{\BX} = -\I[\BY;\BY;\BX]$?
            • cool story bro
            • wait is it the case that in general if you add a variable that’s already present it just flips the sign?
            • $\I[\BX;\BY;\BZ;\BZ] = \I\bco{\BX;\BY;\BZ}{\BZ} - \I\b{\BX;\BY;\BZ}$
              • and somehow I’m supposed to just know that $\I\bco{\BX;\BY;\BZ}{\BZ} = 0$?
              • I guess this just suggests that if any variable is a constant then the interaction is $0$, which would make a whole lot of sense tbh
              • very similar to Wick in that case!
        • really, entropy should be negative…
    • okay I’m this close to just replacing all occurrences of $\D$ with $\I$
      • and to have mutual information as a more core concept than entropy
      • and to literally just define $\H[\BX]$ as $\I[\BX;\BX]$
        • this would explain why you have to somehow bother to use two copies of $\BX$ in the definition of entropy!
          • (and same thing in basically all of the power entropies)
        • ultimately ==information is about something providing information about another thing==
          • there is no such thing as “the information of one thing”
          • and the information divergence is the core of what information means
          • and it’s also what makes information theory so nice: it provides all the nontrivial inequalities!
      • does $\I[\BX;\BX;\BX]$ make sense? it’s $\I\bco{\BX;\BX}{\BX} - \I[\BX;\BX] = 0-\H[\BX] = -\H[\BX]$ hmm maybe this is slightly cursed?
      • and $\I[\BX;\BX;\BX;\BX] = \I\bco{\BX;\BX;\BX}{\BX} - \I[\BX;\BX;\BX] = \H[\BX]$? so it keeps alternating, hmmm…
        • actually this is support for defining $\I[\BX] \ce -\H[\BX]$ instead of $\D[\BX]$
          • indeed, $\I[\BX;\BX] = \I\bco{\BX}{\BX} - \I\b{\BX}$
            • and I’m not sure what this first quantity $\I\bco{\BX}{\BX}$ is supposed to mean but it sure seems like it should be $0$?
              • if you set $\I[\BX] \ce \D[\BX]$ then this quantity is $\H[\BX] + \D[\BX] =$ entropy of the prior on $\BX$?
            • actually very confusing
          • and $\I[\BX] = \I\bco{}{\BX} - \I\b{}$
            • i.e. “how much does $\BX$ help you predict nothing”??
            • so we’d have to accept that $\I\bco{}{\BX} = \I[\BX] + C = C-\H[\BX]$
              • i.e. $\BX$ is actively detrimental to the task of predicting nothing? (???)
          • actually feels like we should have $\I\bco{}{\BX} = \I[]$ given that $\BX$ is independent of everything in the LHS (this is definitely a rule right?)
            • which would mean $\I[\BX] = 0$
            • but then $\I\bco{\BX}{\BX} = \I\b{\BX;\BX} = \H[\BX]$, even though $\I\bco{\BX}{\BX} = \E_{\Bx \sim [\BX]}\b{\I\b{\at{\BX}{\BX = \Bx}}}$ and $\at{\BX}{\BX=x}$ is literally a point mass no matter what $x$ is
              • actually that’s a general argument for $\I\bco{\BX}{\BX}$ being a constant $C$ (not clear which constant, but $0$ would seem like a reasonable choice here since we get zero whenever there is any deterministic constant in the expression?)
              • so $\I[\BX]$ should be $C-\H[\BX]$
          • to summarize:
            • given that $\I[\BX] = C - \H[\BX]$ is the only tenable position, we would have to have $\I\bco{}{\BX} = C' - \H[\BX]$
            • but then that violates the rule of “if $\BX$ is independent from the joint distribution of the things in the LHS then the information should be $0$”
          • okay I vote for $\I[\BX]$ just being undefined!! $\I[\BX;\BY]$ is clearly the core quantity here
            • and it seems like this makes things better in the case of continuous random variables as well
        • it’s in fact somewhat baffling that the interaction information is even well-defined (that it’s consistent and symmetric)
      • actually I think I’d be on board to keep $\D[\cdot]$ only for the deficit
  • KL divergence is to entropy the same as squared distance (!!) is to variance?
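A quick numerical check of the identity floated above, $\I[\BX;\BY] = \E_{\Bx \sim [\BX]}\b{\D\bpa{\at{\BY}{\BX=\Bx}}{\BY}} = \H[\BY] - \H\bco{\BY}{\BX}$ (made-up joint distribution, helpers as in the sketches above):

```python
import math

def entropy(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

def kl(Q, P):
    """D(Q || P) in bits, assuming supp(Q) ⊆ supp(P)."""
    return sum(q * math.log2(q / P[y]) for y, q in Q.items() if q > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pX = {0: 0.5, 1: 0.5}
pY = {0: 0.5, 1: 0.5}
Y_given = {x: {y: joint[(x, y)] / pX[x] for y in (0, 1)} for x in (0, 1)}

info_as_kl = sum(pX[x] * kl(Y_given[x], pY) for x in pX)  # E_x[ D(Y|X=x || Y) ]
H_Y_given_X = sum(pX[x] * entropy(Y_given[x]) for x in pX)
print(info_as_kl)                 # ≈ 0.278
print(entropy(pY) - H_Y_given_X)  # same: I[X;Y] = H[Y] - H[Y|X]
```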
[^1]: More precisely “Shannon entropy” or “information entropy”, to distinguish it from other notions of uncertainty based purely on its probability mass function, such as power entropies.

[^2]: Note that this is a pretty stupid notion of surprise. If I ask you to draw me numbers between $1$ and $1000$ and every time you do so I go “woah that’s crazy, this number only has a $1/1000$ chance of appearing”, then you would be justified in questioning my sanity.

[^3]: A convex combination of distributions.