$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \newcommand{\Om}{\Omega} \newcommand{\om}{\omega} \newcommand{\Th}{\Theta} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}[1]{ {\sqrt{#1}}} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. miscer \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{ {\left.#1\right|_{#2}}} \newcommand{\pat}[2]{\p{\at{#1}{#2}}} \newcommand{\bat}[2]{\b{\at{#1}{#2}}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) 
\newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) \newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{ {\f{\d #1}{\d #2}}} \newcommand{\ds}[2]{ {\s{\d #1}{\d #2}}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. 
sets of numbers \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. 
keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} \newcommand{\zpmo}{\set{0,\pm 1}} \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \newcommand{\Der}[1]{\operatorname{D}_{#1}\mathopen{}} \newcommand{\Exp}[1]{\operatorname{E}_{#1}\mathopen{}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} \DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Thr}{\mathrm{Thr}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. 
bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} \newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. 
typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

Ratio divergences1 are a (generally asymmetric) notion of statistical distance between a prior distribution $P$ and a posterior distribution $Q$, which depends on the typical values of their density ratio $\f{Q}{P}$. Concretely, for any convex function $f: \R_{\ge 0} \to \R$ with $f(1)=0$, we can define the $f$-divergence of $Q$ relative to $P$ as

\[ D_f\ppa Q P \ce \E_{x \sim P}\b{f\p{\f{Q(x)}{P(x)}}}. \]

By extension, we can define the divergence between two random variables as

\[ D_f\bpa{\BX'}{\BX} \ce \E_{x \sim \BX}\b{f\p{\frac{\Pr[\BX'=x]}{\Pr[\BX=x]}}}. \]

Note that this is only a function of the individual distributions of $\BX$ and $\BX'$; we’re not thinking of them as jointly distributed.
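To make the definition concrete, here is a minimal numerical sketch (assuming NumPy; `f_divergence` and the lambdas are my own names, not a library API) that evaluates $D_f\ppa{Q}{P}$ for small discrete distributions, using a couple of the $f$'s tabulated below and previewing the "tweak" and "zoom-in" situations discussed later:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(Q ‖ P) = E_{x~P}[ f(Q(x)/P(x)) ] for discrete distributions
    given as arrays of probabilities (assumes P(x) > 0 everywhere)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    t = q / p                      # relative density Q/P
    return float(np.sum(p * f(t)))

# a few of the f's from the examples table below
kl   = lambda t: t * np.log(t, where=t > 0, out=np.zeros_like(t))  # t log t, with 0 log 0 = 0
chi2 = lambda t: (t - 1) ** 2
tv   = lambda t: 0.5 * np.abs(t - 1)

# "tweak by eps": Q/P is 1+eps on half of P's mass and 1-eps on the other half
eps = 0.1
P = np.array([0.5, 0.5])
Q = np.array([0.5 * (1 + eps), 0.5 * (1 - eps)])
print(f_divergence(P, Q, chi2))   # eps^2  = 0.01
print(f_divergence(P, Q, tv))     # eps/2  = 0.05

# "zoom in D times": Q puts all its mass on a 1/D fraction of P's support
D = 10
P = np.full(D, 1 / D)
Q = np.zeros(D); Q[0] = 1.0
print(f_divergence(P, Q, chi2))   # D - 1  = 9
print(f_divergence(P, Q, kl))     # log D  ≈ 2.30
```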

Intuition

Crucially, just like all relative densities, $\f{Q}{P}$ has mean

\[ \E_{x \sim P}\b{\f{Q(x)}{P(x)}} = \sum_x P(x) \f{Q(x)}{P(x)} = \sum_x Q(x) = 1, \]

so ratio divergences are nonnegative by the convexity of $f$:

\[ \al{ D_f\ppa{Q}{P} &=\E_{x \sim P}\b{f\p{\f{Q(x)}{P(x)}}}\\ &\ge f\p{\E_{x \sim P}\b{\f{Q(x)}{P(x)}}}\tag{convexity}\\ &= f\p{1}\\ &= 0. } \]

In addition, as long as $f$ is strictly convex at $1$, we have $D_f\ppa{Q}{P}=0$ if and only if $\f{Q(x)}{P(x)}=1$ all the time, i.e. $P$ and $Q$ are identical.

So ratio divergences represent some sort of “variance”2 of the relative density $\f{Q}{P}$, and the second derivative of $f$ determines how quickly the divergence grows as $\f{Q}{P}$ deviates from $1$. We can characterize the asymptotic behavior of most ratio divergences by looking at the behavior of $f$ in two regimes:

| Regime | Relative density | Intuitive description |
|---|---|---|
| “tweak” | $\f{Q}{P} \le O(1)$ | $Q$ is obtained by slightly “tweaking” $P$ up or down |
| “zoom-in” | $\f{Q}{P} > \omega(1)$ | $Q$ “zooms in” on a small part of the support of $P$, giving it much higher probability |
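To make the “variance” analogy above concrete: for the $\chi^2$-divergence defined below it is exact, since $\f{Q}{P}$ has mean $1$ under $P$,

\[ \chi^2\ppa{Q}{P} = \E_{x \sim P}\b{\p{\f{Q(x)}{P(x)} - 1}^2} = \Var_{x \sim P}\b{\f{Q(x)}{P(x)}}. \]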

Examples

Some ratio divergences are (squares of) distances that are symmetric and satisfy the triangle inequality, but most (including the information divergence) are asymmetric and don’t satisfy the triangle inequality.

| Name | Notation | $f(t)$ |
|---|---|---|
| total variation distance | $\dTV(P,Q)$ | $\f12\abs{t-1}$ |
| (squared) Jensen–Shannon distance | $\dJS^2(P,Q)$ | $\frac{1}{2}\p{\log\f{2}{1+t} + t \log\f{2}{1+\f1t}}$ |
| (squared) Hellinger distance | $\dHel^2(P,Q)$ | $\f12\p{\sqrt{t}-1}^2$ |
| information divergence | $\D\ppa{Q}{P}$ | $t \log t$ |
| $\chi^2$-divergence | $\chi^2\ppa{Q}{P}$ | $(t-1)^2$ |
| power divergences ($\alpha>1$)3 | $D_\alpha\ppa{Q}{P}$ | $t^\alpha-1$ |

Of these, the most useful are $\dTV(P,Q)$, $\D\ppa{Q}{P}$ and $\chi^2\ppa{Q}{P}$.

Concretely, it’s useful to remember the behavior of each divergence in two representative situations:

  • tweak by $\eps$: $\f{Q(x)}{P(x)}$ is always either $1+ \eps$ or $1-\eps$;
  • zoom in $D$ times: $\f{Q(x)}{P(x)}$ is always either $D$ or $0$.4
| Divergence | Tweak by $\eps$ | Zoom in $D$ times |
|---|---|---|
| $\dTV(P,Q)$ | $\eps/2$ | $1-\f1D$ |
| $\dJS^2(P,Q)$ | $\Theta\p{\eps^2}$ | $1 - \Theta\p{\f{\log D}{D}}$ |
| $\dHel^2(P,Q)$ | $\Theta\p{\eps^2}$ | $1-\sqrt{\f1D}$ |
| $\D\ppa{Q}{P}$ | $\Theta\p{\eps^2}$ | $\log D$ |
| $\chi^2\ppa{Q}{P}$ | $\eps^2$ | $D-1$ |
| $D_\alpha\ppa{Q}{P}$ | $\Theta_\alpha\p{\eps^2}$ | $D^{\alpha-1} - 1$ |
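For instance, here is a quick check of the $\chi^2$ row. In the “tweak” situation, the ratio has to be $1+\eps$ on half of $P$’s mass and $1-\eps$ on the other half (so that it averages to $1$); in the “zoom-in” situation, it is $D$ on a $\f1D$ fraction of $P$’s mass and $0$ elsewhere. So

\[ \al{ \chi^2\ppa{Q}{P} &= \f12 \eps^2 + \f12 \eps^2 = \eps^2, \tag{tweak}\\ \chi^2\ppa{Q}{P} &= \f1D (D-1)^2 + \p{1 - \f1D}(0-1)^2 = D - 1. \tag{zoom-in} } \]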

Two forms

Depending on what you want to do with them, you can put ratio divergences into two canonical forms:

  • the nonnegative form, which will allow us to characterize the asymptotics of each divergence and is convenient in the “tweak” regime;
  • the posterior form, which rewrites the divergence as an expectation over the posterior $Q$ and is convenient in the “zoom-in” regime.

Both forms require “tilting” $f$, i.e. adding some multiple of $t-1$ to it. Since $\f{Q}{P}$ has mean $1$, this doesn’t change the overall value of the divergence.
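Indeed, for any constant $c$,

\[ \E_{x \sim P}\b{c\p{\f{Q(x)}{P(x)} - 1}} = c \cdot (1 - 1) = 0, \]

so adding $c(t-1)$ to $f$ leaves $D_f\ppa{Q}{P}$ unchanged.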

Nonnegative form

To make $f$ nonnegative, we subtract its tangent at $t=1$:

\[ f_+(t) \ce f(t) - f'(1)(t-1). \]
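(To see that $f_+$ is indeed nonnegative, assuming $f$ is differentiable at $1$, or replacing $f'(1)$ by a subgradient otherwise: by convexity, $f$ lies above its tangent at $t=1$,

\[ f(t) \ge f(1) + f'(1)(t-1) = f'(1)(t-1), \]

so $f_+(t) \ge 0$, with equality at $t=1$.)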

Nonnegativity is very convenient for estimating the value of $D_f\ppa{Q}{P}$ asymptotically.

For example,

  • for information divergence, we transform $f(t) = t \log t$ (which is negative between $0$ and $1$) into $f_+(t) = t \log t - (t-1)$;
  • similarly, for power divergences, we transform $f(t) = t^\alpha - 1$ (which has $f'(1)=\alpha$) into $f_+(t) = t^\alpha - 1 - \alpha(t-1)$;
  • for total variation distance, we can just keep $f(t) = \f12\abs{t-1}$, but sometimes it’s more convenient to tilt it to the right, giving $f_+(t) = \f12\abs{t-1} - \f12(t-1) = (1-t)^+$.


Here are the nonnegative versions of our examples as well as their growth rates in the “tweak” and “zoom-in” regimes. In another note, we use them to deduce inequalities between divergences.

| Divergence | $f_+(t)$ | Tweak: $t \le O(1)$ | Zoom-in: $t > \omega(1)$ |
|---|---|---|---|
| $\dTV(P,Q)$ | $\f12\abs{t-1}$ (original) | $\Theta(\abs{t-1})$ | $\Theta(t)$ |
| $\dTV(P,Q)$ | $(1-t)^+$ (tilted) | $O(\abs{t-1})$5 | $0$ |
| $\dJS^2(P,Q)$ | $\frac{1}{2}\p{\log\f{2}{1+t} + t \log\f{2}{1+\f1t}}$ | $\Theta\p{(t-1)^2}$ | $\Theta(t)$ |
| $\dHel^2(P,Q)$ | $\f12\p{\sqrt{t}-1}^2$ | $\Theta\p{(t-1)^2}$ | $\Theta(t)$ |
| $\D\ppa{Q}{P}$ | $t \log t - (t-1)$ | $\Theta\p{(t-1)^2}$ | $\Theta(t \log t)$ |
| $\chi^2\ppa{Q}{P}$ | $(t-1)^2$ | $\Theta\p{(t-1)^2}$ | $\Theta\p{t^2}$ |
| $D_\alpha\ppa{Q}{P}$ | $t^\alpha - 1 - \alpha(t - 1)$ | $\Theta_\alpha\p{(t-1)^2}$ | $\Theta(t^\alpha)$ |
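As a sanity check of the “tweak” column, take information divergence: Taylor-expanding $f_+(t) = t \log t - (t-1)$ around $t=1$ (with natural logarithms for concreteness) gives

\[ f_+(t) = \f{(t-1)^2}{2} + O\p{\abs{t-1}^3}, \]

matching $\Theta\p{(t-1)^2}$; in the zoom-in regime, the $t \log t$ term dominates, giving $\Theta(t \log t)$.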

Posterior form

In the “zoom-in” regime, most draws from $P$ land on points where $Q(x) \ll P(x)$, which are not very informative. So it can be more intuitive to rephrase the divergence as an expectation over $Q$ instead of $P$.

To do this, we first need to make sure that points with $Q(x)=0$ contribute nothing to the expectation, so we tilt $f$ until it vanishes at $0$:

\[ f_0(t) \ce f(t) + f(0)(t-1) \]

For example, for the $\chi^2$-divergence, we obtain

\[ \al{ f_0(t) &= (t-1)^2 + 1(t-1)\\ &= t^2-t. } \]

Then we need to make a $\f{Q(x)}{P(x)}$ factor appear inside $\E_P[\cdot]$. It often appears readily given that $f_0(0)=0$:

\[ \al{ \chi^2\ppa{Q}{P} &= \E_{x \sim P}\b{\p{\f{Q(x)}{P(x)}}^2 - \f{Q(x)}{P(x)}}\\ &= \E_{x \sim P}\b{\f{Q(x)}{P(x)}\p{\f{Q(x)}{P(x)}-1}}\\ &= \E_{x \sim Q}\b{\f{Q(x)}{P(x)}-1}. } \]

To see why this is useful, suppose that $Q$ zooms in $D$ times on a subset of $P$ (i.e. $\f{Q(x)}{P(x)} = D$ whenever $Q(x) \ne 0$). Then we can immediately see that

\[ \chi^2\ppa{Q}{P} = \E\b{D-1} = D-1. \]

Note how this is linear in the typical value of $\f{Q(x)}{P(x)}$, rather than quadratic as we might have expected by looking at $f$ in the original form.
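More generally, the same trick works for any $f_0$ with $f_0(0)=0$: the points where $Q(x)=0$ contribute nothing, and on the rest we can reweight the expectation from $P$ to $Q$,

\[ D_f\ppa{Q}{P} = \E_{x \sim P}\b{f_0\p{\f{Q(x)}{P(x)}}} = \E_{x \sim Q}\b{\f{P(x)}{Q(x)}\, f_0\p{\f{Q(x)}{P(x)}}}, \]

so the $\E_{x \sim Q}[\cdot]$ column in the table below is just $\s{f_0(t)}{t}$ evaluated at $t = \f{Q(x)}{P(x)}$.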

| Divergence | $f_0(t)$ | $\E_{x \sim Q}[\cdot]$ |
|---|---|---|
| $\dTV(P,Q)$ | $(t-1)^+$ | $\p{1-\f{P(x)}{Q(x)}}^+$ |
| $\dJS^2(P,Q)$ | $t - \f12\p{\log(t+1) + t \log\p{1+\f1t}}$ | $1 - \f12\p{\f{P(x)}{Q(x)}\log\p{\f{Q(x)}{P(x)}+1} + \log\p{1+\f{P(x)}{Q(x)}}}$ |
| $\dHel^2(P,Q)$ | $t - \sqrt{t}$ | $1-\sqrt{\f{P(x)}{Q(x)}}$ |
| $\D\ppa{Q}{P}$ | $t \log t$ | $\log\f{Q(x)}{P(x)}$ |
| $\chi^2\ppa{Q}{P}$ | $t^2-t$ | $\f{Q(x)}{P(x)}-1$ |
| $D_\alpha\ppa{Q}{P}$ | $t^\alpha-t$ | $\p{\f{Q(x)}{P(x)}}^{\alpha-1} - 1$ |
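Here is a small numerical sketch (assuming NumPy; variable names are mine) checking that the posterior form agrees with the original definition for the $\chi^2$ and information divergences on a random pair of distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
P = rng.random(n); P /= P.sum()        # prior
Q = rng.random(n); Q /= Q.sum()        # posterior
t = Q / P                              # relative density Q/P

# chi^2-divergence: prior form E_P[(t-1)^2] vs posterior form E_Q[t - 1]
prior_form     = np.sum(P * (t - 1) ** 2)
posterior_form = np.sum(Q * (t - 1))
print(prior_form, posterior_form)      # should match

# information divergence: prior form E_P[t log t] vs posterior form E_Q[log t]
prior_form     = np.sum(P * t * np.log(t))
posterior_form = np.sum(Q * np.log(t))
print(prior_form, posterior_form)      # should match
```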

Conditional divergence

The intuitive definition fails

It would be natural to define the conditional divergence as the expected divergence of the conditional distributions

\[ D_f\bpaco{\BY'}{\BX'}{\BY}{\BX} \sr{?}{\ce} \E_x\b{D_f\bpa{\at{\BY'}{\BX'=x}}{\at{\BY}{\BX=x}}}. \]

However, it’s not clear whether the expectation should be over $\BX$ or $\BX'$:

  • since the divergence itself is an expectation over the prior, it would be natural to let $x \sim \BX$ so that the combined expectation is over $(x,y) \sim (\BX,\BY)$;
  • but information divergence (which is naturally in posterior view) only gets a chain rule if $x$ is drawn from the posterior $\BX'$.

And if we try to circumvent the question by requiring that $\BX$ and $\BX'$ are the same variable, the conditional divergence just becomes the joint divergence:

\[ \al{ D_f\bpaco{\BY'}{\BX}{\BY}{\BX} &=\E_{x \sim \BX}\b{D_f\bpa{\at{\BY'}{\BX=x}}{\at{\BY}{\BX=x}}}\\ &= \E_{(x,y) \sim (\BX,\BY)}\b{f\p{\f{\Pr\bco{\BY'=y}{\BX=x}}{\Pr\bco{\BY=y}{\BX=x}}}}\\ &= \E_{(x,y) \sim (\BX,\BY)}\b{f\p{\f{\Pr\b{\BX=x \and \BY'=y}}{\Pr\b{\BX=x \and \BY=y}}}}\\ &= D_f\bpa{\BX,\BY'}{\BX,\BY}. } \]

Chain rule

Instead, perhaps the right way to define conditional divergence is to give it whatever value we need to get something like the chain rule for entropy. First, let’s define

\[ \phi(x,y) \ce \f{\Pr\b{\BX'=x \and \BY'=y}}{\Pr\b{\BX=x \and \BY=y}} \]

for convenience and verify that adding data can only increase the divergence:

\[ \al{ D_f\bpa{\BX',\BY'}{\BX,\BY} &= \E_{(x,y) \sim (\BX,\BY)}\b{f\p{\phi(x,y)}}\\ &= \E_{x \sim \BX}\b{\E_{y \sim \at{\BY}{\BX=x}}\b{f\p{\phi(x,y)}}}\\ &\ge \E_{x \sim \BX}\b{f\p{\E_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}}}\tag{convexity}\\ &= \E_{x \sim \BX}\b{f\p{\f{\Pr\b{\BX'=x}}{\Pr\b{\BX=x}}}}\\ &= D_f\bpa{\BX'}{\BX}. } \]

Assuming $f$ is strictly convex everywhere, equality holds iff the conditionals $\at{\BY}{\BX=x}$ and $\at{\BY'}{\BX'=x}$ are identical for each $x$ such that $\Pr\b{\BX'=x} \ne 0$, or equivalently, iff $\BY$ and $\BY'$ can be obtained by applying the same stochastic map to both $\BX$ and $\BX'$.

The gap is

\[ \E_{x \sim \BX}\b{\E_{y \sim \at{\BY}{\BX=x}}\b{f\p{\phi(x,y)}}-f\p{\E_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}}}, \]

where the contents of $\E_{x \sim \BX}\b{\cdot}$ look an awful lot like a ratio divergence, except that:

  • $\phi(x,y)$ doesn’t necessarily have mean $1$ over $y \sim \at{\BY}{\BX=x}$ for every $x$,
  • and therefore to “correct for that” we’re explicitly subtracting $f\p{\E_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}}$ instead of relying on the fact that $f(1)=0$ like we did in our definition of $D_f$.

Let’s denote this “generalized” divergence as $D^f_{y \sim \at{\BY}{\BX=x}}\b{\cdot}$ and bravely define

\[ D_f\bpaco{\BY'}{\BX'}{\BY}{\BX} \ce \E_{x \sim \BX}\b{D^f_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}} \]

in order to get the chain rule

\[ D_f\bpa{\BX',\BY'}{\BX,\BY} = D_f\bpa{\BX'}{\BX} + D_f\bpaco{\BY'}{\BX'}{\BY}{\BX}. \]
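Spelled out, this generalized divergence is

\[ D^f_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)} = \E_{y \sim \at{\BY}{\BX=x}}\b{f\p{\phi(x,y)}} - f\p{\E_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}}, \]

i.e. exactly the quantity inside $\E_{x \sim \BX}\b{\cdot}$ in the gap above, so the chain rule holds by construction.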

Properties

  • When $\BX$ and $\BX'$ are the same variable, $D^f_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)} = D_f\bpa{\at{\BY'}{\BX=x}}{\at{\BY}{\BX=x}}$, so the conditional divergence coincides with the intuitive definition, though both are equal to the joint divergence $D_f\bpa{\BX,\BY'}{\BX,\BY}$.
  • For information divergence, we can verify that $\E_{x \sim \BX}\b{D^f_{y \sim \at{\BY}{\BX=x}}\b{\phi(x,y)}} = \E_{(x,y) \sim (\BX',\BY')}\b{\log\frac{\Pr\bco{\BY'=y}{\BX'=x}}{\Pr\bco{\BY=y}{\BX = x}}}$, so it coincides with the intuitive definition if we make the choice to sample $x$ from the posterior $\BX'$.6
  • For $\chi^2$-divergence, the chain rule we get is exactly the chain rule of variance applied to the random variable $\phi(\BX,\BY)$.
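Concretely, here is a sketch of the last point: since $\E_{(x,y) \sim (\BX,\BY)}\b{\phi(x,y)} = 1$, with $f(t) = (t-1)^2$ the three terms of the chain rule are exactly the three terms of the law of total variance for $\phi(\BX,\BY)$,

\[ \ub{\Var\b{\phi(\BX,\BY)}}_{\chi^2\bpa{\BX',\BY'}{\BX,\BY}} = \ub{\Var\b{\E\bco{\phi(\BX,\BY)}{\BX}}}_{\chi^2\bpa{\BX'}{\BX}} + \ub{\E\b{\Var\bco{\phi(\BX,\BY)}{\BX}}}_{\chi^2\bpaco{\BY'}{\BX'}{\BY}{\BX}}, \]

where all moments are taken over $(\BX,\BY)$.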

Mixtures and convexity

In general, $D_f\bpaco{\BY'}{\BX'}{\BY}{\BX}$ is sometimes bigger and sometimes smaller than $D_f\bpa{\BY'}{\BY}$. But when $\BX$ and $\BX'$ are the same variable, since adding data can only increase the divergence, we have

\[ D_f\bpa{\BY'}{\BY} \le D_f\bpa{\BX,\BY'}{\BX,\BY} = D_f\bpaco{\BY'}{\BX}{\BY}{\BX}, \]

so conditioning on the same variable can only increase the divergence.

If we see $\BY$ and $\BY'$ as mixtures $\BY_{\BX}$ and $\BY'_\BX$ of the variables $\BY_x \ce \at{\BY}{\BX=x}$ and $\BY'_x \ce \at{\BY'}{\BX=x}$ weighted by $\BX$, then this is saying that ratio divergences are convex:

\[ \ub{D_f\bpa{\BY'_\BX}{\BY_\BX}}_\text{divergence of mixtures} \le \ub{\E_{x \sim \BX}\b{D_f\bpa{\BY'_x}{\BY_x}}}_\text{average of divergences}. \]

Assuming $f$ is strictly convex everywhere, I believe equality holds iff the relative density function $\f{\Pr\b{\BY'_x = y}}{\Pr\b{\BY_x= y}}$ is the same for each $x$ such that $\Pr\b{\BX=x} \ne 0$.


  1. also known as $f$-divergences or $\fi$-divergences 

  2. And in fact, the $\chi^2$-divergence (defined below) is precisely the variance of the quantity $\f{Q(x)}{P(x)}$ over $x \sim P$

  3. In Power entropies and divergences, we define the power divergences without the $-1$ and raise them to the power $\f{1}{\alpha-1}$

  4. This happens when $Q$ is obtained by conditioning $P$ on some event $E$ that depends only on $x$ and has probability $P(E) = \f{1}{D}$. If $E$ is allowed to depend on variables other than $x$, then I think $D_f\ppa{Q}{P}$ will only be smaller in comparison. 

  5. You could also morally have written $\Theta\p{(t-1)^2}$ here since for any $C \ge 1$, the total variation distance is within a constant factor of a hypothetical $f$-divergence with $f(t) = \begin{cases}\abs{t-1}& \text{if } t \le C\\0&\text{otherwise}\end{cases}$ (even though such $f$ wouldn’t be convex). 

  6. My guess is that information divergence is the only divergence for which this happens.