$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \newcommand{\Om}{\Omega} \newcommand{\om}{\omega} \newcommand{\Th}{\Theta} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}[1]{ {\sqrt{#1}}} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. miscer \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{ {\left.#1\right|_{#2}}} \newcommand{\pat}[2]{\p{\at{#1}{#2}}} \newcommand{\bat}[2]{\b{\at{#1}{#2}}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) 
\newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) \newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{ {\f{\d #1}{\d #2}}} \newcommand{\ds}[2]{ {\s{\d #1}{\d #2}}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. 
sets of numbers \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. 
keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} \newcommand{\zpmo}{\set{0,\pm 1}} \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \newcommand{\Der}[1]{\operatorname{D}_{#1}\mathopen{}} \newcommand{\Exp}[1]{\operatorname{E}_{#1}\mathopen{}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} \DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Thr}{\mathrm{Thr}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. 
bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} \newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. 
typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

Ratio divergences1 are a (generally asymmetric) notion of statistical distance between a prior distribution $P$ and a posterior distribution $Q$, which depends on the typical values of their density ratio $\f{Q}{P}$. Concretely, for any convex function $f: \R_{\ge 0} \to \R$ with $f(1)=0$, we can define the $f$-divergence of $Q$ relative to $P$ as

\[ D_f\ppa Q P \ce \E_{x \sim P}\b{f\p{\f{Q(x)}{P(x)}}}. \]

By extension, we can define the divergence between two random variables as

\[ D_f\bpa{\BX'}{\BX} \ce \E_{x \sim \BX}\b{f\p{\frac{\Pr[\BX'=x]}{\Pr[\BX=x]}}}. \]

Note that this is only a function of the individual distributions of $\BX$ and $\BX'$; we’re not thinking of them as jointly distributed.
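To make the definition concrete, here is a minimal numerical sketch (assuming NumPy; `f_divergence` and the lambdas are my own names, not a library API) that evaluates $D_f\ppa{Q}{P}$ for small discrete distributions, using a couple of the $f$'s tabulated below and previewing the "tweak" and "zoom-in" situations discussed later:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(Q ‖ P) = E_{x~P}[ f(Q(x)/P(x)) ] for discrete distributions
    given as arrays of probabilities (assumes P(x) > 0 everywhere)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    t = q / p                      # relative density Q/P
    return float(np.sum(p * f(t)))

# a few of the f's from the examples table below
kl   = lambda t: t * np.log(t, where=t > 0, out=np.zeros_like(t))  # t log t, with 0 log 0 = 0
chi2 = lambda t: (t - 1) ** 2
tv   = lambda t: 0.5 * np.abs(t - 1)

# "tweak by eps": Q/P is 1+eps on half of P's mass and 1-eps on the other half
eps = 0.1
P = np.array([0.5, 0.5])
Q = np.array([0.5 * (1 + eps), 0.5 * (1 - eps)])
print(f_divergence(P, Q, chi2))   # eps^2  = 0.01
print(f_divergence(P, Q, tv))     # eps/2  = 0.05

# "zoom in D times": Q puts all its mass on a 1/D fraction of P's support
D = 10
P = np.full(D, 1 / D)
Q = np.zeros(D); Q[0] = 1.0
print(f_divergence(P, Q, chi2))   # D - 1  = 9
print(f_divergence(P, Q, kl))     # log D  ≈ 2.30
```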

Intuition

Crucially, just like all relative densities, $\f{Q}{P}$ has mean

\[ \E_{x \sim P}\b{\f{Q(x)}{P(x)}} = \sum_x P(x) \f{Q(x)}{P(x)} = \sum_x Q(x) = 1, \]

so ratio divergences are nonnegative by the convexity of $f$:

\[ \al{ D_f\ppa{Q}{P} &=\E_{x \sim P}\b{f\p{\f{Q(x)}{P(x)}}}\\ &\ge f\p{\E_{x \sim P}\b{\f{Q(x)}{P(x)}}}\tag{convexity}\\ &= f\p{1}\\ &= 0. } \]

In addition, as long as $f$ is strictly convex at $1$, we have $D_f\ppa{Q}{P}=0$ if and only if $\f{Q(x)}{P(x)}=1$ all the time, i.e. $P$ and $Q$ are identical.

So ratio divergences represent some sort of “variance”2 of the relative density $\f{Q}{P}$, and the second derivative of $f$ determines how quickly the divergence grows as $\f{Q}{P}$ deviates from $1$. We can characterize the asymptotic behavior of most ratio divergences by looking at the behavior of $f$ in two regimes:

| Regime | Relative density | Intuitive description |
|---|---|---|
| “tweak” | $\f{Q}{P} \le O(1)$ | $Q$ is obtained by slightly “tweaking” $P$ up or down |
| “zoom-in” | $\f{Q}{P} > \omega(1)$ | $Q$ “zooms in” on a small part of the support of $P$, giving it much higher probability |
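To make the “variance” analogy above concrete: for the $\chi^2$-divergence defined below it is exact, since $\f{Q}{P}$ has mean $1$ under $P$,

\[ \chi^2\ppa{Q}{P} = \E_{x \sim P}\b{\p{\f{Q(x)}{P(x)} - 1}^2} = \Var_{x \sim P}\b{\f{Q(x)}{P(x)}}. \]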

Examples

Some ratio divergences are (squares of) distances that are symmetric and satisfy the triangle inequality, but most (including the information divergence) are asymmetric and don’t satisfy the triangle inequality.

| Name | Notation | $f(t)$ |
|---|---|---|
| total variation distance | $\dTV(P,Q)$ | $\f12\abs{t-1}$ |
| (squared) Jensen–Shannon distance | $\dJS^2(P,Q)$ | $\frac{1}{2}\p{\log\f{2}{1+t} + t \log\f{2}{1+\f1t}}$ |
| (squared) Hellinger distance | $\dHel^2(P,Q)$ | $\f12\p{\sqrt{t}-1}^2$ |
| information divergence | $\D\ppa{Q}{P}$ | $t \log t$ |
| $\chi^2$-divergence | $\chi^2\ppa{Q}{P}$ | $(t-1)^2$ |
| power divergences ($\alpha>1$)3 | $D_\alpha\ppa{Q}{P}$ | $t^\alpha-1$ |

Of these, the most useful are $\dTV(P,Q)$, $\D\ppa{Q}{P}$ and $\chi^2\ppa{Q}{P}$.

Concretely, it’s useful to remember the behavior of each divergence in two representative situations:

  • tweak by $\eps$: $\f{Q(x)}{P(x)}$ is always either $1+ \eps$ or $1-\eps$;
  • zoom in $D$ times: $\f{Q(x)}{P(x)}$ is always either $D$ or $0$.4
| Divergence | Tweak by $\eps$ | Zoom in $D$ times |
|---|---|---|
| $\dTV(P,Q)$ | $\eps/2$ | $1-\f1D$ |
| $\dJS^2(P,Q)$ | $\Theta\p{\eps^2}$ | $1 - \Theta\p{\f{\log D}{D}}$ |
| $\dHel^2(P,Q)$ | $\Theta\p{\eps^2}$ | $1-\sqrt{\f1D}$ |
| $\D\ppa{Q}{P}$ | $\Theta\p{\eps^2}$ | $\log D$ |
| $\chi^2\ppa{Q}{P}$ | $\eps^2$ | $D-1$ |
| $D_\alpha\ppa{Q}{P}$ | $\Theta_\alpha\p{\eps^2}$ | $D^{\alpha-1} - 1$ |
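For instance, here is a quick check of the $\chi^2$ row. In the “tweak” situation, the ratio has to be $1+\eps$ on half of $P$’s mass and $1-\eps$ on the other half (so that it averages to $1$); in the “zoom-in” situation, it is $D$ on a $\f1D$ fraction of $P$’s mass and $0$ elsewhere. So

\[ \al{ \chi^2\ppa{Q}{P} &= \f12 \eps^2 + \f12 \eps^2 = \eps^2, \tag{tweak}\\ \chi^2\ppa{Q}{P} &= \f1D (D-1)^2 + \p{1 - \f1D}(0-1)^2 = D - 1. \tag{zoom-in} } \]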

Two forms

Depending on what you want to do with them, you can put ratio divergences into two canonical forms:

  • the nonnegative form, which will allow us to characterize the asymptotics of each divergence and is convenient in the “tweak” regime;
  • the posterior form, which rewrites the divergence as an expectation over the posterior $Q$ and is convenient in the “zoom-in” regime.

Both forms require “tilting” $f$, i.e. adding some multiple of $t-1$ to it. Since $\f{Q}{P}$ has mean $1$, this doesn’t change the overall value of the divergence.
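Indeed, for any constant $c$,

\[ \E_{x \sim P}\b{c\p{\f{Q(x)}{P(x)} - 1}} = c \cdot (1 - 1) = 0, \]

so adding $c(t-1)$ to $f$ leaves $D_f\ppa{Q}{P}$ unchanged.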

Nonnegative form

To make $f$ nonnegative, we subtract its tangent at $t=1$:

\[ f_+(t) \ce f(t) - f'(1)(t-1). \]
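(To see that $f_+$ is indeed nonnegative, assuming $f$ is differentiable at $1$, or replacing $f'(1)$ by a subgradient otherwise: by convexity, $f$ lies above its tangent at $t=1$,

\[ f(t) \ge f(1) + f'(1)(t-1) = f'(1)(t-1), \]

so $f_+(t) \ge 0$, with equality at $t=1$.)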

Nonnegativity is very convenient for estimating the value of $D_f\ppa{Q}{P}$ asymptotically.

For example,

  • for information divergence, we transform $f(t) = t \log t$ (which is negative between $0$ and $1$) into $f_+(t) = t \log t - (t-1)$;
  • similarly, for power divergences, we transform $f(t) = t^\alpha - 1$ (which has $f'(1)=\alpha$) into $f_+(t) = t^\alpha - 1 - \alpha(t-1)$;
  • for total variation distance, we can just keep $f(t) = \f12\abs{t-1}$, but sometimes it’s more convenient to tilt it to the right, giving $f_+(t) = \f12\abs{t-1} - \f12(t-1) = (1-t)^+$.


Here are the nonnegative versions of our examples as well as their growth rates in the “tweak” and “zoom-in” regimes. In another note, we use them to deduce inequalities between divergences.

| Divergence | $f_+(t)$ | Tweak: $t \le O(1)$ | Zoom-in: $t > \omega(1)$ |
|---|---|---|---|
| $\dTV(P,Q)$ | $\f12\abs{t-1}$ (original) | $\Theta(\abs{t-1})$ | $\Theta(t)$ |
| $\dTV(P,Q)$ | $(1-t)^+$ (tilted) | $O(\abs{t-1})$5 | $0$ |
| $\dJS^2(P,Q)$ | $\frac{1}{2}\p{\log\f{2}{1+t} + t \log\f{2}{1+\f1t}}$ | $\Theta\p{(t-1)^2}$ | $\Theta(t)$ |
| $\dHel^2(P,Q)$ | $\f12\p{\sqrt{t}-1}^2$ | $\Theta\p{(t-1)^2}$ | $\Theta(t)$ |
| $\D\ppa{Q}{P}$ | $t \log t - (t-1)$ | $\Theta\p{(t-1)^2}$ | $\Theta(t \log t)$ |
| $\chi^2\ppa{Q}{P}$ | $(t-1)^2$ | $\Theta\p{(t-1)^2}$ | $\Theta\p{t^2}$ |
| $D_\alpha\ppa{Q}{P}$ | $t^\alpha - 1 - \alpha(t - 1)$ | $\Theta_\alpha\p{(t-1)^2}$ | $\Theta(t^\alpha)$ |
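As a sanity check of the “tweak” column, take information divergence: Taylor-expanding $f_+(t) = t \log t - (t-1)$ around $t=1$ (with natural logarithms for concreteness) gives

\[ f_+(t) = \f{(t-1)^2}{2} + O\p{\abs{t-1}^3}, \]

matching $\Theta\p{(t-1)^2}$; in the zoom-in regime, the $t \log t$ term dominates, giving $\Theta(t \log t)$.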

Posterior form

In the “zoom-in” regime, most draws from $P$ land on points where $Q(x) \ll P(x)$, which are not very informative. So it can be more intuitive to rephrase the divergence as an expectation over $Q$ instead of $P$.

To do this, we first need to make sure that points with $Q(x)=0$ contribute nothing to the expectation, so we tilt $f$ until it vanishes at $0$:

\[ f_0(t) \ce f(t) + f(0)(t-1) \]

For example, for the $\chi^2$-divergence, we obtain

\[ \al{ f_0(t) &= (t-1)^2 + 1(t-1)\\ &= t^2-t. } \]

Then we need to make a $\f{Q(x)}{P(x)}$ factor appear inside $\E_P[\cdot]$. It often appears readily given that $f_0(0)=0$:

\[ \al{ \chi^2\ppa{Q}{P} &= \E_{x \sim P}\b{\p{\f{Q(x)}{P(x)}}^2 - \f{Q(x)}{P(x)}}\\ &= \E_{x \sim P}\b{\f{Q(x)}{P(x)}\p{\f{Q(x)}{P(x)}-1}}\\ &= \E_{x \sim Q}\b{\f{Q(x)}{P(x)}-1}. } \]

To see why this is useful, suppose that $Q$ zooms in $D$ times on a subset of $P$ (i.e. $\f{Q(x)}{P(x)} = D$ whenever $Q(x) \ne 0$). Then we can immediately see that

\[ \chi^2\ppa{Q}{P} = \E\b{D-1} = D-1. \]

Note how this is linear in the typical value of $\f{Q(x)}{P(x)}$, rather than quadratic as we might have expected by looking at $f$ in the original form.
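More generally, the same trick works for any $f_0$ with $f_0(0)=0$: the points where $Q(x)=0$ contribute nothing, and on the rest we can reweight the expectation from $P$ to $Q$,

\[ D_f\ppa{Q}{P} = \E_{x \sim P}\b{f_0\p{\f{Q(x)}{P(x)}}} = \E_{x \sim Q}\b{\f{P(x)}{Q(x)}\, f_0\p{\f{Q(x)}{P(x)}}}, \]

so the $\E_{x \sim Q}[\cdot]$ column in the table below is just $\s{f_0(t)}{t}$ evaluated at $t = \f{Q(x)}{P(x)}$.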

| Divergence | $f_0(t)$ | $\E_{x \sim Q}[\cdot]$ |
|---|---|---|
| $\dTV(P,Q)$ | $(t-1)^+$ | $\p{1-\f{P(x)}{Q(x)}}^+$ |
| $\dJS^2(P,Q)$ | $t - \f12\p{\log(t+1) + t \log\p{1+\f1t}}$ | $1 - \f12\p{\f{P(x)}{Q(x)}\log\p{\f{Q(x)}{P(x)}+1} + \log\p{1+\f{P(x)}{Q(x)}}}$ |
| $\dHel^2(P,Q)$ | $t - \sqrt{t}$ | $1-\sqrt{\f{P(x)}{Q(x)}}$ |
| $\D\ppa{Q}{P}$ | $t \log t$ | $\log\f{Q(x)}{P(x)}$ |
| $\chi^2\ppa{Q}{P}$ | $t^2-t$ | $\f{Q(x)}{P(x)}-1$ |
| $D_\alpha\ppa{Q}{P}$ | $t^\alpha-t$ | $\p{\f{Q(x)}{P(x)}}^{\alpha-1} - 1$ |
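Here is a small numerical sketch (assuming NumPy; variable names are mine) checking that the posterior form agrees with the original definition for the $\chi^2$ and information divergences on a random pair of distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
P = rng.random(n); P /= P.sum()        # prior
Q = rng.random(n); Q /= Q.sum()        # posterior
t = Q / P                              # relative density Q/P

# chi^2-divergence: prior form E_P[(t-1)^2] vs posterior form E_Q[t - 1]
prior_form     = np.sum(P * (t - 1) ** 2)
posterior_form = np.sum(Q * (t - 1))
print(prior_form, posterior_form)      # should match

# information divergence: prior form E_P[t log t] vs posterior form E_Q[log t]
prior_form     = np.sum(P * t * np.log(t))
posterior_form = np.sum(Q * np.log(t))
print(prior_form, posterior_form)      # should match
```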

Conditional divergence

The intuitive definition fails

It would be natural to define the conditional divergence as the expected divergence of the conditional distributions

\[ D_f\bpaco{\BY'}{\BX'}{\BY}{\BX} \sr{?}{\ce} \E_x\b{D_f\bpa{\at{\BY'}{\BX'=x}}{\at{\BY}{\BX=x}}}. \]

However, it’s not clear whether the expectation should be over $\BX$ or $\BX'$:

  • since the divergence itself is an expectation over the prior, it would be natural to let $x \sim \BX$ so that the combined expectation is over $(x,y) \sim (\BX,\BY)$;
  • but information divergence (which is naturally in posterior view) only gets a chain rule if $x$ is drawn from the posterior $\BX'$.

And if we try to circumvent the question by requiring that $\BX$ and $\BX'$ are the same variable, the conditional divergence just becomes the joint divergence:

\[ \al{ D_f\bpaco{\BY'}{\BX}{\BY}{\BX} &=\E_{x \sim \BX}\b{D_f\bpa{\at{\BY'}{\BX=x}}{\at{\BY}{\BX=x}}}\\ &= \E_{(x,y) \sim (\BX,\BY)}\b{f\p{\f{\Pr\bco{\BY'=y}{\BX=x}}{\Pr\bco{\BY=y}{\BX=x}}}}\\ &= \E_{(x,y) \sim (\BX,\BY)}\b{f\p{\f{\Pr\b{\BX=x \and \BY'=y}}{\Pr\b{\BX=x \and \BY=y}}}}\\ &= D_f\bpa{\BX,\BY'}{\BX,\BY}. } \]

Chain rule

Instead, perhaps the right way to define conditional divergence is to give it whatever value we need to get something like the chain rule for entropy. First, let’s define

\[ \phi(x,y) \ce \f{\Pr\b{\BX'=x \and \BY'=y}}{\Pr\b{\BX=x \and \BY=y}} \]

for convenience and verify that adding data can only increase the divergence:

\[ \al{ D_f\bpa{\BX',\BY'}{\BX,\BY} &= \E_{(x,y) \sim (\BX,\BY)}\b{f\p{\phi(x,y)}}\\ &= \E_{x \sim \BX}\b{\E_{y \sim \at{\BY}{\BX=x}}\b{f\p{\phi(x,y)}}}\\ &\ge \E_{x \sim \BX}\b{f\p{\E_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}}}\tag{convexity}\\ &= \E_{x \sim \BX}\b{f\p{\f{\Pr\b{\BX'=x}}{\Pr\b{\BX=x}}}}\\ &= D_f\bpa{\BX'}{\BX}. } \]

Assuming $f$ is strictly convex everywhere, equality holds iff the conditionals $\at{\BY}{\BX=x}$ and $\at{\BY'}{\BX'=x}$ are identical for each $x$ such that $\Pr\b{\BX'=x} \ne 0$, or equivalently, iff $\BY$ and $\BY'$ can be obtained by applying the same stochastic map to both $\BX$ and $\BX'$.

The gap is

\[ \E_{x \sim \BX}\b{\E_{y \sim \at{\BY}{\BX=x}}\b{f\p{\phi(x,y)}}-f\p{\E_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}}}, \]

where the contents of $\E_{x \sim \BX}\b{\cdot}$ look an awful lot like a ratio divergence, except that:

  • $\phi(x,y)$ doesn’t necessarily have mean $1$ over $y \sim \at{\BY}{\BX=x}$ for every $x$,
  • and therefore to “correct for that” we’re explicitly subtracting $f\p{\E_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}}$ instead of relying on the fact that $f(1)=0$ like we did in our definition of $D_f$.

Let’s denote this “generalized” divergence as $D^f_{y \sim \at{\BY}{\BX=x}}\b{\cdot}$ and bravely define

\[ D_f\bpaco{\BY'}{\BX'}{\BY}{\BX} \ce \E_{x \sim \BX}\b{D^f_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}} \]

in order to get the chain rule

\[ D_f\bpa{\BX',\BY'}{\BX,\BY} = D_f\bpa{\BX'}{\BX} + D_f\bpaco{\BY'}{\BX'}{\BY}{\BX}. \]
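Spelled out, this generalized divergence is

\[ D^f_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)} = \E_{y \sim \at{\BY}{\BX=x}}\b{f\p{\phi(x,y)}} - f\p{\E_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}}, \]

i.e. exactly the quantity inside $\E_{x \sim \BX}\b{\cdot}$ in the gap above, so the chain rule holds by construction.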

Properties

  • When $\BX$ and $\BX'$ are the same variable, $D^f_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)} = D_f\bpa{\at{\BY'}{\BX=x}}{\at{\BY}{\BX=x}}$, so the conditional divergence coincides with the intuitive definition, though both are equal to the joint divergence $D_f\bpa{\BX,\BY'}{\BX,\BY}$.
  • For information divergence, we can verify that $\E_{x \sim \BX}\b{D^f_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}} = \E_{(x,y) \sim (\BX',\BY')}\b{\log\frac{\Pr\bco{\BY'=y}{\BX'=x}}{\Pr\bco{\BY=y}{\BX = x}}}$, so it coincides with the intuitive definition if we make the choice to sample $x$ from the posterior $\BX'$.6
  • For $\chi^2$-divergence, the chain rule we get is exactly the chain rule of variance applied to the random variable $\phi(\BX,\BY)$.
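Concretely, here is a sketch of the last point: since $\E_{(x,y) \sim (\BX,\BY)}\b{\phi(x,y)} = 1$, with $f(t) = (t-1)^2$ the three terms of the chain rule are exactly the three terms of the law of total variance for $\phi(\BX,\BY)$,

\[ \ub{\Var\b{\phi(\BX,\BY)}}_{\chi^2\bpa{\BX',\BY'}{\BX,\BY}} = \ub{\Var\b{\E\bco{\phi(\BX,\BY)}{\BX}}}_{\chi^2\bpa{\BX'}{\BX}} + \ub{\E\b{\Var\bco{\phi(\BX,\BY)}{\BX}}}_{\chi^2\bpaco{\BY'}{\BX'}{\BY}{\BX}}, \]

where all moments are taken over $(\BX,\BY)$.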

Mixtures and convexity

In general, $D_f\bpaco{\BY'}{\BX'}{\BY}{\BX}$ is sometimes bigger and sometimes smaller than $D_f\bpa{\BY'}{\BY}$. But when $\BX$ and $\BX'$ are the same variable, since adding data can only increase the divergence, we have

\[ D_f\bpa{\BY'}{\BY} \le D_f\bpa{\BX,\BY'}{\BX,\BY} = D_f\bpaco{\BY'}{\BX}{\BY}{\BX}, \]

so conditioning on the same variable can only increase the divergence.

If we see $\BY$ and $\BY'$ as mixtures $\BY_{\BX}$ and $\BY'_\BX$ of the variables $\BY_x \ce \at{\BY}{\BX=x}$ and $\BY'_x \ce \at{\BY'}{\BX=x}$ weighted by $\BX$, then this is saying that ratio divergences are convex:

\[ \ub{D_f\bpa{\BY'_\BX}{\BY_\BX}}_\text{divergence of mixtures} \le \ub{\E_{x \sim \BX}\b{D_f\bpa{\BY'_x}{\BY_x}}}_\text{average of divergences}. \]

Assuming $f$ is strictly convex everywhere, I believe equality holds iff the relative density function $\f{\Pr\b{\BY'_x = y}}{\Pr\b{\BY_x= y}}$ is the same for each $x$ such that $\Pr\b{\BX=x} \ne 0$.


  1. also known as $f$-divergences or $\fi$-divergences 

  2. And in fact, the $\chi^2$-divergence (defined below) is precisely the variance of the quantity $\f{Q(x)}{P(x)}$ over $x \sim P$

  3. In Power entropies and divergences, we define the power divergences without the $-1$ and raise them to the power $\f{1}{\alpha-1}$

  4. This happens when $Q$ is obtained by conditioning $P$ on some event $E$ that depends only on $x$ and has probability $P(E) = \f{1}{D}$. If $E$ is allowed to depend on variables other than $x$, then I think $D_f\ppa{Q}{P}$ will only be smaller in comparison. 

  5. You could also morally have written $\Theta\p{(t-1)^2}$ here since for any $C \ge 1$, the total variation distance is within a constant factor of a hypothetical $f$-divergence with $f(t) = \begin{cases}\abs{t-1}& \text{if } t \le C\\0&\text{otherwise}\end{cases}$ (even though such $f$ wouldn’t be convex). 

  6. My guess is that information divergence is the only divergence for which this happens.