$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \newcommand{\Om}{\Omega} \newcommand{\om}{\omega} \newcommand{\Th}{\Theta} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}{\sqrt} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. miscer \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{ {\left.#1\right|_{#2}}} \newcommand{\pat}[2]{\p{\at{#1}{#2}}} \newcommand{\bat}[2]{\b{\at{#1}{#2}}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) 
\newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) \newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{ {\f{\d #1}{\d #2}}} \newcommand{\ds}[2]{ {\s{\d #1}{\d #2}}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. 
sets \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. 
keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \DeclareMathOperator{\Der}{\mathrm{D}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} \DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Thr}{\mathrm{Thr}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. 
bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} \newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. 
typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

The information divergence¹ measures the information gained when going from a prior distribution $P$ to a posterior distribution $Q$, i.e. how much you learned if you believed something was distributed according to $P$ and now believe it is distributed according to $Q$ based on evidence. Concretely, it measures (the logarithm of) how much more “confident” you now are in your current beliefs:

\[ \D\ppa{Q}{P} \ce \E_{x \sim Q}\b{\log\frac{Q(x)}{P(x)}}. \]
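For concreteness, here's a minimal Python sketch of this definition (assuming numpy; the helper name `kl` and the example distributions are mine):

```python
import numpy as np

def kl(q, p):
    """D(Q || P) = E_{x ~ Q}[log(Q(x)/P(x))] for discrete distributions,
    given as arrays of probabilities over the same finite sample space."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0                    # terms with Q(x) = 0 contribute nothing
    if np.any(p[mask] == 0):        # Q puts mass where P doesn't
        return np.inf
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# e.g. a slightly biased coin against a fair coin:
print(kl([0.6, 0.4], [0.5, 0.5]))  # ~0.0201 nats
```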

We can extend the notion to random variables:

\[ \D\bpa{\BX'}{\BX} \ce \E_{x \sim \BX'}\b{\log\frac{\Pr[\BX'=x]}{\Pr[\BX=x]}} \]

(note that this is only a function of the distributions of $\BX$ and $\BX'$ seen separately; we’re not thinking of them as jointly distributed).

Interpretations

The information divergence can be understood in three main ways, as a measure of

  • the information gained when updating from $P$ to $Q$,
  • the specificity of draws from $Q$ compared to $P$,
  • and the unlikelihood of $P$ as a model given that your observations suggest $Q$.

All of these interpretations ultimately rely on the fact that the information divergence adds up nicely when we compare several independent samples from $P$ and from $Q$. A good intuition is that as $n$ grows, $Q^n$ looks more and more like $P^n$ conditioned on some event of probability $\exp\p{-n \D\ppa{Q}{P}}$.

See Six (and a half) intuitions for KL divergence for more intuitions, in particular ones related to betting.

Information gained

The information divergence is the core of information theory because it defines what it means to gain information based on observations. In particular,

  • the mutual information $\I[\BX;\BY]$ is the average information we gain about $\BY$ when observing $\BX$ (and vice versa): $\I[\BX;\BY] = \E\b{\D\bpa{\cond{\BY}{\BX}}{\BY}} = \E\b{\D\bpa{\cond{\BX}{\BY}}{\BX}}$;
  • and the entropy $\H[\BX]$ is the average information we gain about $\BX$ when observing it: $\H[\BX] = \I[\BX;\BX] = \E\b{\D\bpa{\cond{\BX}{\BX}}{\BX}} = \E\b{\log \f 1 {[\BX](\BX)}}$.
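Both identities are easy to check numerically. Here's a sketch (assuming numpy, with a made-up joint pmf) computing $\I[\BX;\BY]$ as the average divergence from $\BY$ to $\cond{\BY}{\BX}$, and comparing it against the standard equivalent formula: the divergence of the joint distribution from the product of the marginals.

```python
import numpy as np

# Joint pmf of (X, Y) on {0,1} x {0,1} (made-up numbers for illustration):
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
px = joint.sum(axis=1)   # marginal of X
py = joint.sum(axis=0)   # marginal of Y

def kl(q, p):
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

# I[X;Y] as the average divergence from the marginal of Y to Y|X=x,
# weighted by Pr[X=x]:
mi_as_divergence = sum(px[x] * kl(joint[x] / px[x], py) for x in range(2))

# I[X;Y] computed directly as D( joint || product of marginals ):
mi_direct = kl(joint.ravel(), np.outer(px, py).ravel())

print(mi_as_divergence, mi_direct)  # both ~0.0863 nats
```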

Specificity of draws from $Q$

The prior $P$ should be thought of as a more uniform baseline, and the posterior $Q$ as taking more specific or weird values, and being the interesting object of study. In particular, the definition only makes sense if the support of $Q$ is contained in the support of $P$ (otherwise the divergence is infinite).

Asymptotic likelihood ratio

The information divergence $\D\ppa{Q}{P}$ quantifies the typical likelihood ratio between $Q$ and $P$ for large samples of independent draws from $Q$. More precisely, for any fixed tolerance $\delta>0$, if you draw a sample $\BS \sim Q^n$, then we have

\[ \log \f{Q^n(\BS)}{P^n(\BS)} \in n\p{\D\ppa{Q}{P} \pm o(1)} \]

except with probability at most $\delta$ (for $n$ large enough). Indeed, $\log \f{Q^n(\BS)}{P^n(\BS)}$ is a sum of $n$ i.i.d. terms, each with expectation $\D\ppa{Q}{P}$, so this is just the law of large numbers.

In particular, this means that the information divergence measures the amount of selection pressure you typically need to pick out a sample that roughly looks like it came from $Q^n$ out of samples that actually came from $P^n$ (here, “roughly” means we might give up on reproducing a $\delta$ fraction of the distribution of $Q^n$). If you want to match the distribution exactly, the relevant quantity is the max-divergence instead, no matter what $n$ is.
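Here's a quick simulation of this concentration (a sketch, assuming numpy): the per-sample log-likelihood ratio of a sample drawn from $Q$ hovers around $\D\ppa{Q}{P}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.5])
q = np.array([0.8, 0.2])
d = np.sum(q * np.log(q / p))   # D(Q || P), about 0.193 nats

n = 10_000
s = rng.choice(2, size=n, p=q)  # a sample S ~ Q^n
llr = np.sum(np.log(q[s] / p[s])) / n
print(llr, d)  # per-sample log-likelihood ratio, within O(1/sqrt(n)) of d
```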

Entropy deficit

If $P$ is uniform over some sample space $\CX$, then $\D\ppa{Q}{P}$ measures the difference between the entropies $\H(P)-\H(Q)$. Since uniform distributions maximize entropy, this represents how much entropy $Q$ is potentially “missing out” on, i.e. its entropy deficit.
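A quick numerical check of this identity (assuming numpy, with a made-up $Q$ over four values):

```python
import numpy as np

q = np.array([0.7, 0.1, 0.1, 0.1])
u = np.full(4, 0.25)  # uniform P over a 4-element sample space

H = lambda r: -np.sum(r[r > 0] * np.log(r[r > 0]))
d = np.sum(q * np.log(q / u))

print(d, np.log(4) - H(q))  # equal: D(Q || U) = log|X| - H(Q)
```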

Unlikelihood of $P$

$\D\ppa{Q}{P}$ is the “amount of evidence against $P$” that you get per sample: it’s the amortized log-unlikelihood that a large sample from $P$ ends up empirically distributed in a way that looks exactly like $Q$. Indeed, suppose $P,Q$ are distributions over the set $\set{x_1, \ldots, x_m}$; then the probability that we draw $n$ samples from $P$ and about $Q(x_i) n$ of them take value $x_i$ for each $i\in [m]$ is roughly

\[ \al{ \binom{n}{\approx Q(x_1)n, \dotsc, \approx Q(x_m)n} \prod P(x_i)^{Q(x_i) n} &= \ub{\binom{n}{\approx Q(x_1)n, \dotsc, \approx Q(x_m)n} \prod Q(x_i)^{Q(x_i) n}}_\text{probability that a sample from $Q$ ends up looking like $Q$} \prod\f{P(x_i)}{Q(x_i)}^{Q(x_i)n}\\ &\approx \prod\f{P(x_i)}{Q(x_i)}^{Q(x_i)n}\\ &= \exp\p{-\D\ppa{Q}{P}n}. } \]
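This can be checked against exact multinomial probabilities (a sketch assuming numpy, with $Q$ chosen so that the counts $Q(x_i)\,n$ are integers):

```python
import numpy as np
from math import lgamma

p = np.array([0.5, 0.5])
q = np.array([0.8, 0.2])
d = float(np.sum(q * np.log(q / p)))

for n in [10, 100, 1000]:
    counts = (n * q).astype(int)   # empirical counts exactly matching Q
    log_multinom = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    log_prob = log_multinom + float(np.sum(counts * np.log(p)))
    # Pr[n samples from P look exactly like Q] = exp(-n(D(Q||P) + o(1))):
    print(n, log_prob / n, -d)
```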

#to-write simplify notation to $p_i,q_i$?

Hypothesis testing

As a consequence, there exists a test for “detecting $Q$” based on a large enough number $n$ of samples, such that

  • if the samples are from $Q$, the test comes up positive with high probability in $n$ (the test is sensitive),
  • and if the samples are from $P$, the test comes up positive with probability at most $2^{-n\p{\D\ppa{Q}{P}-o(1)}}$ (the test is extremely specific).²

So if the test comes up positive, the belief that the samples came from $P$ as opposed to $Q$ becomes extremely untenable as $n$ grows.
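For instance (my own toy instantiation with $P = B(1/2)$ and $Q = B(0.8)$), take the test “the fraction of ones exceeds a threshold just below $0.8$” and compute its exact false-positive rate under $P$:

```python
from math import lgamma, exp, log

# P = B(1/2), Q = B(0.8): declare "Q" when the fraction of ones in
# n fair-coin flips exceeds a threshold just below E_Q = 0.8.
p, q = 0.5, 0.8
d = q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))  # D(Q || P)

def log_binom_pmf(n, k, p):
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

n = 2000
for frac in [0.70, 0.75, 0.79]:
    t = int(frac * n)
    # Exact false-positive probability: Pr_P[#ones >= t].
    false_pos = sum(exp(log_binom_pmf(n, k, p)) for k in range(t, n + 1))
    # Under Q the empirical fraction concentrates around 0.8 > frac, so the
    # test stays sensitive; the specificity exponent -log(fp)/n climbs
    # toward d = D(Q || P) as the threshold approaches 0.8.
    print(frac, -log(false_pos) / n, d)
```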

Learning theory: minimizing divergence maximizes likelihood of $P$

In learning theory, we can let $Q$ be the target real-world distribution, and $P$ be the distribution represented by a simplified generative model from a restricted class. Then, minimizing $\D\ppa{Q}{P}$ corresponds to picking the model $P$ that minimizes the information we still have to learn per sample. Since $Q$ is fixed, this is equivalent to minimizing the cross-entropy $\H\ppa{Q}{P}$. #to-write actually that’s an interpretation for the cross entropy not the information div

In terms of dynamics, the model knows less than nature, so it starts out more uncertain, and therefore $P$ is generally more uniform, and gives fairly low probabilities to everything. Minimizing the information divergence $\D\ppa{Q}{P}$ will incentivize $P$ to give higher probabilities to values that are frequent in $Q$.

If instead we define $Q$ to be uniform over the training data, then the $P$ that minimizes $\D\ppa{Q}{P}$ is the maximum likelihood model.
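Here's a sketch of that equivalence (assuming numpy; the grid scan over models is purely for illustration): the model minimizing the divergence from the empirical distribution is exactly the one maximizing the likelihood of the data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.choice(3, size=1000, p=[0.2, 0.3, 0.5])  # "training data"
q = np.bincount(data, minlength=3) / len(data)      # empirical distribution

def kl(q, p):
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

# Model class: all distributions over 3 values; scan a coarse grid and
# compare the two objectives.
best_kl, best_ll = None, None
for a in np.linspace(0.05, 0.9, 18):
    for b in np.linspace(0.05, 0.9, 18):
        if a + b >= 1:
            continue
        p = np.array([a, b, 1 - a - b])
        score_kl = kl(q, p)                  # divergence to minimize
        score_ll = np.sum(np.log(p[data]))   # log-likelihood to maximize
        if best_kl is None or score_kl < best_kl[0]:
            best_kl = (score_kl, p)
        if best_ll is None or score_ll > best_ll[0]:
            best_ll = (score_ll, p)

print(best_kl[1], best_ll[1])  # the same model wins both criteria
```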

#to-write explicitly model Q as P conditioned on E?

Properties

Nonnegativity

As a ratio divergence, it is always nonnegative, and is $0$ iff $P$ and $Q$ are identical.

Asymptotics

Using its nonnegative form, we get

\[ \D\ppa{Q}{P} = \Theta\p{\sum_x \begin{cases} P(x)\p{\f{Q(x)}{P(x)}-1}^2 & \text{if $\f{Q(x)}{P(x)} \le O(1)$}\\ Q(x)\log \frac{Q(x)}{P(x)} & \text{otherwise} \end{cases} } \]

(change $\sum_x$ to $\int_x$ if $P,Q$ are continuous probability distributions).

Additivity

Independent priors and posteriors

If the priors and posteriors are both product distributions, then the divergence can be decomposed:

\[ \al{ \D\ppa{Q_1 \dotsm Q_n}{P_1 \dotsm P_n} &= \E_{\Bx \sim Q_1 \dotsm Q_n}\b{\log\frac{Q_1(\Bx_1) \dotsm Q_n(\Bx_n)}{P_1(\Bx_1) \dotsm P_n(\Bx_n)}}\\ &= \E_{\Bx \sim Q_1 \dotsm Q_n}\b{\log\frac{Q_1(\Bx_1)}{P_1(\Bx_1)} + \dotsb + \log\f{Q_n(\Bx_n)}{P_n(\Bx_n)}}\\ &= \D\ppa{Q_1}{P_1} + \dotsb + \D\ppa{Q_n}{P_n}. } \]

This situation mostly occurs when we believe these $n$ values $\Bx_1, \ldots, \Bx_n$ are independent samples from the same random process, and we want to update our belief about the random process but not about the values themselves. When that’s the case, the additivity simplifies to

\[ \D\ppa{Q^n}{P^n} = n\D\ppa{Q}{P}. \]

Independent priors

If the prior variables $\BX$ and $\BY$ are independent, then we have superadditivity (this generalizes the superadditivity of entropy deficit):

\[ \D\bpa{\BX',\BY'}{\BX,\BY} \geq \D\bpa{\BX'}{\BX}+\D\bpa{\BY'}{\BY}. \]

Intuitively, this is because you can notice more ways in which $\BX',\BY'$ differ from $\BX,\BY$ if you get to look at them together and cross-check. Formally, we have

\[ \al{ &\D\bpa{\BX',\BY'}{\BX,\BY} - \D\bpa{\BX'}{\BX}-\D\bpa{\BY'}{\BY}\\ &\qquad= \E_{(x,y) \sim (\BX',\BY')}\b{\log\frac{\Pr[\BX'=x\and\BY'=y]\Pr[\BX=x]\Pr[\BY=y]}{\Pr[\BX=x\and\BY=y]\Pr[\BX'=x]\Pr[\BY'=y]}}\\ &\qquad= \E_{(x,y) \sim (\BX',\BY')}\b{\log\frac{\Pr[\BX'=x\and\BY'=y]}{\Pr[\BX'=x]\Pr[\BY'=y]}}\\ &\qquad= \I\b{\BX';\BY'}\\ &\qquad\ge 0, } \]

where the second equality uses the independence of $\BX$ and $\BY$.
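A numerical sanity check of this identity (assuming numpy; independent uniform prior bits and a made-up correlated posterior):

```python
import numpy as np

def kl(q, p):
    q, p = np.asarray(q, float).ravel(), np.asarray(p, float).ravel()
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

# Prior: (X, Y) independent uniform bits. Posterior: correlated bits.
prior = np.outer([0.5, 0.5], [0.5, 0.5])
post = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
px, py = post.sum(axis=1), post.sum(axis=0)  # marginals of X', Y'

gap = kl(post, prior) - kl(px, [0.5, 0.5]) - kl(py, [0.5, 0.5])
mi = kl(post, np.outer(px, py))              # I[X';Y']
print(gap, mi)                               # equal, and nonnegative
```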

I’m not sure about the case when $\BX'$ and $\BY'$ are independent, but it seems less natural anyway: why would the posterior be independent if the prior isn’t?

#to-write

  • also rewrite in distribution form?
  • no actually maybe it’s better to write it like $\D\bpa{\at{(\BX,\BY)}{\BE}}{\BX,\BY}$?
  • ohh hold on this is where “total correlation” comes in right?
    • can you also write that one as a sum of mutual information? or of interaction informations?
      • is it the sum of all the interaction informations
      • is I[X1,X2] + I[X3;X2 X1] + I[X4;X3 X1,X2] etc?
    • ==okay this is highly relevant to my plight with the small-read local lemma!==
  • also explain typical situation: were from a common process but now have observed something about the values in aggregate
  • link to the information theoretic proof of the Chernoff bound

General case

In general, though, the quantities $\D\bpa{\BX',\BY'}{\BX,\BY}$ and $\D\bpa{\BX'}{\BX}+\D\bpa{\BY'}{\BY}$ are not comparable: information divergence is neither subadditive nor superadditive. For example, even when $\BX$ and $\BY$ are independent, we can have

\[ \left\{ \al{ &\D\bpa{\BX'}{\BX} = \D\bpa{\BY'}{\BY} = 0\\ &\D\bpa{\BX',\BY'}{\BX,\BY} > 0. } \right. \]

On the other hand, by the section on conditional divergence below, we know that $\D\bpa{\BX'}{\BX}, \D\bpa{\BY'}{\BY} \le \D\bpa{\BX',\BY'}{\BX,\BY}$, so

\[ \D\bpa{\BX'}{\BX} + \D\bpa{\BY'}{\BY} \le 2\D\bpa{\BX',\BY'}{\BX,\BY}, \]

and the second example below approximately saturates this bound.³

#figure redo the figure with the new notation

Triangle inequality

Sadly, the information divergence doesn’t satisfy the triangle inequality.

In general, $\D\ppa{Q}{P} + \D\ppa{R}{Q}$ can be much smaller than $\D\ppa{R}{P}$. For example:

  • “tweak regime”:
    • if $P \ce B\p{\f12}$, $Q \ce B\p{\f{1+\eps}{2}}$, $R \ce B\p{\f{1+2\eps}{2}}$
    • then
      • $\D\ppa{Q}{P} \sim \D\ppa{R}{Q} \sim \f{\eps^2}{2}$
      • $\D\ppa{R}{P} \sim 2\eps^2$
  • “zoom regime”:
    • if the distributions are
      • $P$ uniform over $[D]$
      • $Q$ gives $\f{1}{\log D}$ probability to the value $1$ and is otherwise uniform
      • $R$ is $1$ deterministically
    • then
      • $\D\ppa{Q}{P} \le \f{1}{\log D} \log \p{\f{D}{\log D}} \le 1$
      • $\D\ppa{R}{Q} = \log \log D$
      • $\D\ppa{R}{P} = \log D$
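Plugging numbers into the zoom regime (a sketch assuming numpy, with $D = 2^{20}$) makes the gap concrete:

```python
import numpy as np

# "Zoom regime" with D = 2^20: P uniform over [D], Q puts 1/log(D) on the
# value 1 and is uniform elsewhere, R is deterministically 1.
D = 2 ** 20
q1 = 1 / np.log(D)
rest = (1 - q1) / (D - 1)

d_qp = q1 * np.log(q1 * D) + (D - 1) * rest * np.log(rest * D)
d_rq = np.log(1 / q1)   # = log log D, since R is a point mass on 1
d_rp = np.log(D)

print(d_qp + d_rq, d_rp)  # ~3.4 vs ~13.9: far from the triangle inequality
```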

#to-think I think in general, for any distributions $P$ and $R$, there is a distribution $Q$ such that

\[ \D\ppa{Q}{P} + \D\ppa{R}{Q} \le (1-\Omega(1))\D\ppa{R}{P}. \]

Examples

Binary divergence

Let

\[ D\ppa{q}{p} \ce q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p} \]

be the divergence between two Bernoullis $B(q)$ and $B(p)$.

Just like the general case, $D\ppa{q}{p}$ is nonnegative with $D\ppa{q}{p} =0$ iff $q=p$, and convex over $[0,1]^2$ (strictly convex in $q$ for each fixed $p$). More precisely,

\[ \frac{d^2}{dq^2}D\ppa{q}{p} = \frac{1}{q(1-q)}. \]

Assuming wlog that $p\le 1/2$, the asymptotics are

\[ D\ppa{q}{p} = \begin{cases} \Theta\p{p\p{\f{q}{p}-1}^2} & \text{if $0 \le q \le 2p$}\\ \Theta\p{q\log \frac{q}{p}} & \text{if $q \ge 2p$,} \end{cases} \]

(where the hidden constant tends to $\f{1}{2(1-p)}$ as $q \to p$). In particular, if $p=\f12$ (unbiased) and $q = \f{1+\rho}{2}$ (bias $\rho$), we have

\[ D\ppa{q}{p} = \Theta\p{\rho^2}. \]
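These asymptotics are easy to eyeball numerically (a sketch assuming numpy):

```python
import numpy as np

def bin_kl(q, p):
    """Divergence between Bernoullis B(q) and B(p), in nats."""
    out = 0.0
    if q > 0: out += q * np.log(q / p)
    if q < 1: out += (1 - q) * np.log((1 - q) / (1 - p))
    return out

# Quadratic regime: for p = 1/2 and bias rho, D((1+rho)/2 || 1/2) ~ rho^2/2.
for rho in [0.1, 0.01, 0.001]:
    print(rho, bin_kl((1 + rho) / 2, 0.5), rho ** 2 / 2)

# Linear-times-log regime: for q >> p, D(q || p) = Theta(q log(q/p)).
p = 1e-4
for q in [0.01, 0.1, 0.5]:
    print(q, bin_kl(q, p), q * np.log(q / p))
```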

#figure

#to-write

  • expected score for logarithmic scoring rule
  • full second derivative matrix
  • $\D\ppa{q}{p} \ge 2(q-p)^2$ (I believe? yup seems right since the second derivative is $\ge 4$)

Normals

#to-write

Conditional divergence

Similar to entropy (but unlike most ratio divergences), we can define the conditional information divergence by taking an average of the divergences of the conditional distributions:

\[ \al{ \D\bpaco{\BY'}{\BX'}{\BY}{\BX} \ce&\ \E_{x \sim \BX'}\b{\D\bpa{\at{\BY'}{\BX' = x}}{\at{\BY}{\BX=x}}}\\ =&\ \E_{(x,y) \sim (\BX',\BY')}\b{\log\frac{\Pr[\BY'=y \mid \BX'=x]}{\Pr[\BY=y \mid \BX = x]}}, } \]

where $\at{\BY}{\BX=x}$ denotes the random variable obtained by conditioning $\BY$ on $\BX=x$.

Chain rule

Similar to conditional entropy, this gives a chain rule:

\[ \al{ \D\bpa{\BX',\BY'}{\BX,\BY} &= \E_{(x,y) \sim (\BX',\BY')}\b{\log\frac{\Pr[\BX' = x \and \BY'=y]}{\Pr[\BX = x \and \BY=y]}}\\ &= \E_{(x,y) \sim (\BX',\BY')}\b{\log\frac{\Pr[\BX' = x]\Pr[\BY'=y\mid \BX'=x]}{\Pr[\BX = x] \Pr[\BY=y \mid \BX=x]}}\\ &= \D\bpa{\BX'}{\BX} + \D\bpaco{\BY'}{\BX'}{\BY}{\BX}, } \]

which immediately shows that observing an additional variable can only increase the divergence, since the conditional term is nonnegative.
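The chain rule is also easy to verify numerically (a sketch assuming numpy, with made-up joint pmfs):

```python
import numpy as np

def kl(q, p):
    q, p = np.asarray(q, float).ravel(), np.asarray(p, float).ravel()
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

# Joint pmfs of (X, Y) and (X', Y') on {0,1} x {0,1} (made-up numbers):
prior = np.array([[0.25, 0.25],
                  [0.30, 0.20]])
post = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

px, qx = prior.sum(axis=1), post.sum(axis=1)  # marginals of X, X'
# Conditional divergence: average over x ~ X' of D(Y'|x || Y|x).
cond = sum(qx[x] * kl(post[x] / qx[x], prior[x] / px[x]) for x in range(2))

print(kl(post, prior), kl(qx, px) + cond)     # equal, by the chain rule
```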

Mixtures and convexity

Since information divergence is neither subadditive nor superadditive, $\D\bpaco{\BY'}{\BX'}{\BY}{\BX}$ is sometimes bigger and sometimes smaller than $\D\bpa{\BY'}{\BY}$. However, like any ratio divergence, when $\BX$ and $\BX'$ are the same variable we have

\[ \D\bpa{\BY'}{\BY} \le \D\bpa{\BX,\BY'}{\BX,\BY} = \D\bpaco{\BY'}{\BX}{\BY}{\BX}. \]

That is, conditioning on the same variable can only increase the divergence, which is equivalent to saying that the information divergence is convex.

See also

  1. also known as Kullback–Leibler divergence and relative entropy 

  2. Intuitively, this is because a typical sample $x_1, \ldots, x_n$ from $Q$ is about $2^{\D\ppa{Q}{P}n}$ times more likely according to $Q$ than $P$, so the set of values on which the test has to come up positive in order to have decent sensitivity will have very small weight $\approx 2^{-\D\ppa{Q}{P}n}$ when measured according to $P$

  3. And maybe it even saturates it optimally as a function of $\D\bpa{\BX',\BY'}{\BX,\BY}$? #to-think