$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \newcommand{\Om}{\Omega} \newcommand{\om}{\omega} \newcommand{\Th}{\Theta} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}[1]{ {\sqrt{#1}}} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. miscer \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. 
two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{ {\left.#1\right|_{#2}}} \newcommand{\pat}[2]{\p{\at{#1}{#2}}} \newcommand{\bat}[2]{\b{\at{#1}{#2}}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) 
\newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) 
\newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{ {\f{\d #1}{\d #2}}} \newcommand{\ds}[2]{ {\s{\d #1}{\d #2}}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. 
sets of numbers \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... 
information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} \newcommand{\zpmo}{\set{0,\pm 1}} \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \newcommand{\Der}[1]{\operatorname{D}_{#1}\mathopen{}} \newcommand{\Exp}[1]{\operatorname{E}_{#1}\mathopen{}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} 
\DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Thr}{\mathrm{Thr}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. 
bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} 
\newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. 
typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

Have you ever wondered…

  • what the square brackets mean when we write things like $\Pr[\cdot]$ or $\E[\cdot]$?
  • in $\E\bco{\BY}{\BX}$, does the argument $\cond{\BY}{\BX}$ have its own independent meaning?
  • whether conditional entropy might be defined the wrong way?

Then you’ve come to the wrong place, because this note is confusing as heck! We’ll take the thought “any expression can become random… including random expressions” and run with it.

Basics

Distributions

Distributions are essentially weighted sets: sets only tell us what values are possible, while distributions also tell us how likely each value is. Formally, for a set $\CX$, a distribution $P$ can be written as

\[ P = \setco{(x,p_x)}{x \in \CX} \]

where the probability weights $p_x \in [0,1]$ sum to $1$. We can also see $P$ as a function $\CX \to [0,1]$ in the usual way that functions are defined as sets of pairs: the probability of $x$ is then written $P(x)$.

Let $\CU_\CX$ denote the uniform distribution over $\CX$, which (when $\CX$ is a finite set) is the distribution that gives each value the same probability: $\CU_\CX(x) \ce \f{1}{\abs\CX}$ for all $x \in \CX$.
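For concreteness, this “weighted set” view is easy to play with in code. Here’s a minimal Python sketch (the dict representation and the name `uniform` are our own choices for illustration, not part of the notation):

```python
from fractions import Fraction

def uniform(values):
    """The uniform distribution U_X over a finite set: each value gets weight 1/|X|."""
    values = list(values)
    return {x: Fraction(1, len(values)) for x in values}

# A distribution P over X = {1, 2, 3} is literally a weighted set {(x, p_x)}:
P = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
assert sum(P.values()) == 1  # the probability weights sum to 1

# Seen as a function X -> [0, 1], "P(x)" is just a lookup:
assert P[1] == Fraction(1, 2)
assert uniform([1, 2, 3])[2] == Fraction(1, 3)
```

Using exact `Fraction` weights sidesteps floating-point noise when checking that weights sum to $1$.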

#to-think how to write the support of a distribution? or should we just write things like $f \in [\Bf^*]$ without specifying?

Random comprehensions

Distributions are nice, but they’re not very convenient if we actually want to do things with their values. Sometimes, we might want to talk about, e.g., what $f(x)$ would typically look like if we take a random draw $x$ from $P$. Instead of doing some sort of sum over all of $\CX$ involving the probability weights $P(x)$, it would be nice to talk about $f(x)$ directly as a random thing.

If we were talking about (unweighted) sets, we could use set comprehensions for this:

\[ \setco{f(x)}{x \in \CX}. \]

So let’s analogously write random comprehensions:

\[ \BX \sim P:f(\BX). \]

A random comprehension has two parts:

  • $\BX \sim P$ declares a random symbol $\BX$: it says “consider $\BX$ to be a random draw from $P$”;
  • $f(\BX)$ is a random expression: it says “take that $\BX$, perform such-and-such transformation, and look at the result”.

We’ll always write random symbols in bold font (e.g. $\BX, \Bx$), and often in uppercase for clarity.

Note that in this view, “$\BX$”, “$f(\BX)$” and “$\BX \sim P:f(\BX)$” don’t exist as mathematical objects the same way that the distribution $P$ does. They’re just grammar that we use to describe operations on random values in a more concise manner than if we manipulated the distributions directly. To help make this point, we’ll generally avoid the term “random variable”.

Informally, it’s convenient to declare a random symbol globally then talk about one or more random expressions: e.g. if we say “let $\BX \sim P$, and consider the mutual information between $f(\BX)$ and $g(\BX)$” then under the hood we’re actually making claims about the random comprehension $\BX \sim P: (f(\BX), g(\BX))$.

Capturing the distribution

Since random expressions are not themselves real mathematical objects, if we want to do anything useful with them, we eventually need to be able to get a distribution back. We do that using square brackets: if $\BX \sim P$, then $\b{f(\BX)}$ denotes “the distribution of $f(x)$, assuming $x$ was chosen at random according to $P$”.

Formally, we’ll write this distribution as $_{\BX \sim P}\b{f(\BX)}$, which is defined as

\[ _{\BX \sim P}\b{f(\BX)}(y) \ce \sum_{x: f(x)=y} P(x). \]

But in the language of random comprehensions, we’ll just write it as $\BX \sim P: [f(\BX)]$.

As a spoiler, we’ll later see that this operation of “capturing the distribution” of a random expression is precisely what the square brackets are doing in e.g. $\Pr[\BX=x]$ or $\E[\BX]$.
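In code, “capturing the distribution” is just computing a pushforward; here’s a sketch in the dict-based representation (with `capture` being our own name for the bracket operation):

```python
from fractions import Fraction

def capture(P, f):
    """[f(X)] for X ~ P: the pushforward Q(y) = sum of P(x) over all x with f(x) = y."""
    Q = {}
    for x, p in P.items():
        y = f(x)
        Q[y] = Q.get(y, 0) + p
    return Q

# X ~ uniform on {-2, -1, 0, 1, 2}; capturing f(X) = X^2 merges values with equal squares.
P = {x: Fraction(1, 5) for x in [-2, -1, 0, 1, 2]}
Q = capture(P, lambda x: x * x)
assert Q == {4: Fraction(2, 5), 1: Fraction(2, 5), 0: Fraction(1, 5)}
```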

Working with random expressions

Aliases

We can define new random symbols as aliases for random expressions, just like we would do normal values: e.g. $\BY \ce f(\BX)$. If we then talk about some expression $g(\BY)$ based on $\BX$, this should be interpreted as the random comprehension $\BX \sim P: g(f(\BX))$.

Random events

Random events are random expressions that yield truth values $\tf$, such as “$\BX = 3$” or “$\BX \in S$” for some set $S \sse \CX$. We can denote events in two ways:

  • as functions: e.g. $E(x) \ce (x = 3)$, so that $E(\BX)$ is equivalent to $\BX=3$;
  • as random symbols: e.g. $\BE \ce (\BX = 3)$ (in which case $\BE$ itself should be in bold font).

Joint distributions

If $R$ is a distribution on the set of pairs $\CX \times \CY$, then we can write $(\BX,\BY) \sim R$ to describe a joint distribution on $\BX$ and $\BY$. Formally, writing $(\BX,\BY) \sim R$ is equivalent to

  • writing $\BZ \sim R$ (where $\BZ$ is a random pair in $\CX \times \CY$)
  • then defining the aliases $\BX \ce \BZ_1$ and $\BY \ce \BZ_2$,

and the same applies more generally to tuples.

More generally, we say that random expressions are jointly distributed if their randomness comes from the same declaration: for example, if $\BX \sim P$, then $f(\BX)$ and $g(\BX)$ are jointly distributed, and we can think about the ways they depend on each other.

If we have two distributions $P$ and $Q$ over $\CX$ and $\CY$, we can form the product distribution $P \times Q$, defined as $(P \times Q)(x,y) \ce P(x)Q(y)$. In the other direction, given a joint distribution $R$ we can use our notation to marginalize it: for example, $(\BX,\BY) \sim R : \b{\BX}$ gives the marginal for the first element of the pair.
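Both directions are mechanical computations; a sketch in the same dict-based representation (the function names are our own):

```python
def product(P, Q):
    """The product distribution: (P x Q)(x, y) = P(x) * Q(y)."""
    return {(x, y): p * q for x, p in P.items() for y, q in Q.items()}

def marginal_first(R):
    """(X, Y) ~ R : [X] -- marginalize by summing out the second coordinate."""
    M = {}
    for (x, y), r in R.items():
        M[x] = M.get(x, 0) + r
    return M

P = {0: 0.5, 1: 0.5}
Q = {"a": 0.25, "b": 0.75}
R = product(P, Q)
assert R[(1, "b")] == 0.375
assert marginal_first(R) == P  # marginalizing the product recovers the factor
```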

Equality of random expressions

There are two ways that two random expressions $\BX$ and $\BY$ can be equal.

They can be “equal in distribution”, i.e. their distributions are identical. We can write this as $[\BX] = [\BY]$. For example, if $\Bomega \sim \CU_{[0, 2\pi]}$, then $[\cos(\Bomega)] = [\cos(\Bomega+\theta)]$ for any offset $\theta$, even though for most values of $\Bomega$ these expressions would be different (as long as $\theta \not\in 2\pi\Z$). Note that for this notion to work out, the expressions $\BX$ and $\BY$ don’t need to be jointly distributed: for example, it’s completely fine to write $_{\BX \sim \CU_{[-1,0]}}\b{-2\BX} = {}_{\BY \sim \CU_{[0,2]}}\b{2-\BY}$.

More strongly, $\BX$ and $\BY$ can be “equal in probability”, i.e. they are equal with probability $1$. We can write this as $[\BX = \BY](\true) = 1$ (which we’ll see later is equivalent to $\Pr[\BX = \BY] = 1$). Note that $\BX$ and $\BY$ must be jointly distributed for this to make sense.
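As a quick sanity check, here is a discrete analogue of the equal-in-distribution example above (using a `capture` helper of our own to play the role of the brackets):

```python
def capture(P, f):
    """[f(X)] for X ~ P: sum the weights of all x mapping to the same value."""
    Q = {}
    for x, p in P.items():
        y = f(x)
        Q[y] = Q.get(y, 0) + p
    return Q

P = {-1: 0.5, 0: 0.5}   # X ~ uniform on {-1, 0}
Q = {0: 0.5, 2: 0.5}    # Y ~ uniform on {0, 2}

# X and Y are not jointly distributed, and -2X and 2-Y are different expressions,
# yet [-2X] = [2-Y]: both are uniform on {0, 2}.
assert capture(P, lambda x: -2 * x) == capture(Q, lambda y: 2 - y)
```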

Computing statistics

Now that we have a nice way to manipulate probability distributions, it would be nice if we could get more information about them.

Principles

Suppose $\BX \sim P$ takes real values, and we would like to know the mean of $f(\BX)$. Thanks to our judicious notational choices, we can just denote it as $\E[f(\BX)]$, where

  • $\b{f(\BX)}$ captures the randomness of $f(\BX)$ and returns its distribution $Q$;
  • $\E$ is an operator on distributions that takes $Q$ and computes the mean from it.

This operator is defined as $\E(Q) \ce \sum_y Q(y)\, y$ for discrete distributions and $\E(Q) \ce \int_y y \d Q(y)$ for continuous distributions. And the notation $\E_{\BX \sim P}\b{f(\BX)}$ can similarly be understood as applying the operator $\E$ to the distribution $_{\BX \sim P}\b{f(\BX)}$.
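The two-step reading — brackets produce a distribution, then $\E$ acts on it — can be spelled out directly in code (a sketch; `capture` and `E` are our own names):

```python
def capture(P, f):
    """[f(X)] for X ~ P."""
    Q = {}
    for x, p in P.items():
        y = f(x)
        Q[y] = Q.get(y, 0) + p
    return Q

def E(Q):
    """The mean as an operator on discrete distributions: E(Q) = sum_y Q(y) * y."""
    return sum(q * y for y, q in Q.items())

# E[f(X)] is literally E applied to the captured distribution [f(X)]:
P = {1: 0.5, 2: 0.25, 3: 0.25}
assert E(capture(P, lambda x: x)) == 1.75
assert E(capture(P, lambda x: x * x)) == 0.5 * 1 + 0.25 * 4 + 0.25 * 9
```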

In general, let’s call any operator on distributions that returns a single number a statistic. Even though statistics are only defined on distributions, they automatically work on random expressions, as long as you surround the random expression with brackets.

Types of statistics

There are two main types of statistics: those that are happy to be given values from any set, and those that expect values of a specific type (e.g. $\R$, $\R^n$, $\tf$).

  • Some examples of generic statistics:
    • Entropy: $\H(P) \ce \E_{\Bx \sim P}\b{\log \f{1}{P(\Bx)}}$.
    • Entropy deficit: $\D(P) \ce \E_{\Bx \sim P}\b{\log \frac{P(\Bx)}{\CU_\CX(\Bx)}} = \E_{\Bx \sim P}\b{\log \p{P(\Bx)\cdot\#\CX}}$
      • where $\CX$ is the domain of $P$.
    • Ratio divergences: $D_f\ppa{Q}{P} \ce \E_{\Bx \sim P}\b{f\p{\f{Q(\Bx)}{P(\Bx)}}}$
      • and in particular, the information divergence $\D\ppa{Q}{P} \ce \E_{\Bx \sim Q}\b{\log \f{Q(\Bx)}{P(\Bx)}}$.
      • (This one is a bit special because it takes in two distributions but I like to use the unorthodox shorthand $D_f\bpa{\BX'}{\BX} \ce D_f\ppa{\b{\BX'}}{\b{\BX}}$.)
  • Some examples of statistics that require values of a specific type:
    • Probability: $\Pr(P) \ce P(\true)$ (requires $P$ to output truth values).
    • Expectation: see earlier.
    • Variance: $\Var(P) \ce \E_{\Bx \sim P}\b{\p{\Bx - \E(P)}^2}$.
    • The mutual information is kind of an edge case: while $\I[\BX;\BY] \ce \D\ppa{[\BX]\times[\BY]}{[(\BX,\BY)]}$ doesn’t require $\BX$ and $\BY$ to return values of a certain type, if we understand it as an operator on the distribution $R \ce \b{(\BX,\BY)}$, then $\I(R) \ce \D\ppa{ {}_{(\Bx,\By) \sim R}[\Bx] \times {}_{(\Bx,\By) \sim R}[\By]}{R}$ does require $R$ to be a distribution over pairs.

#to-write change all of these from $\Bx$ to $\BX$ for consistency?

Even though we chose to define them as operators on distributions to make a pedagogical point, most of the statistics in the latter category are simpler to define by using random expressions directly rather than by going through distributions.

Note: The operator “$\1(\cdot)$”, which is often used to turn a random event into a 0/1 random variable, does not inherently operate on distributions: it’s just a function that takes a truth value ($\true$ or $\false$) and returns $0$ or $1$. For example, we can perfectly well write $\1\p{x=3}$ to denote “the value which is $1$ if $x=3$ and $0$ otherwise” even if $x$ is deterministic. Because of that, we shouldn’t use brackets for it, since it’s not capturing randomness but just doing data processing.
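Several of the statistics above become one-liner operators on the dict-based distributions (a sketch; entropy here is in bits, and all the names are our own):

```python
import math

def E(Q):
    """Mean of a discrete distribution."""
    return sum(q * y for y, q in Q.items())

def Pr(Q):
    """Probability: requires Q to be a distribution over truth values."""
    return Q.get(True, 0)

def Var(Q):
    """Var(Q) = E[(y - E(Q))^2], computed directly from the distribution."""
    m = E(Q)
    return sum(q * (y - m) ** 2 for y, q in Q.items())

def H(Q):
    """Entropy (in bits): E_{y ~ Q}[log(1 / Q(y))]."""
    return sum(q * math.log2(1 / q) for q in Q.values() if q > 0)

fair_bit = {0: 0.5, 1: 0.5}
assert H(fair_bit) == 1.0
assert Var(fair_bit) == 0.25
assert Pr({True: 0.3, False: 0.7}) == 0.3
```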

Higher-level randomness

Hierarchy in randomness

There are many situations where we find ourselves groping for “random random expressions”. And not only that, but there is a sense of hierarchy to it:

  • You might have a random object and want to perform random experiments on it, adding some “inner randomness”.
  • Or, the other way around, you might have a computation that involves a random experiment and want to randomize one parameter of it, adding some “outer randomness”.

As an example of the former situation, suppose that you wanted to learn some function $f$ over an input distribution $\CD$, and you use a sample $\BS \sim \CD^m$ of $m$ data points from $\CD$ to learn some best fit $\HAT{f}_\BS$. Now, you want to know how likely it is that your learned $\HAT{f}_\BS$ is in fact close to $f$ on the whole distribution $\CD$, not just the sample $\BS$, so you want to make sure that on the average $\Bx \sim \CD$, the squared error $\p{\HAT{f}_\BS(\Bx) - f(\Bx)}^2$ is not too large. In this last expression, there are two sources of randomness: the sample $\BS$ (the outer randomness) and the input $\Bx$ (the inner randomness).

Crucially, they’re not on the same level! In $\HAT{f}_\BS(\Bx)$, there is a sense in which the randomness of $\Bx$ “comes after” that of $\BS$. Or put another way, we want to see it as “draw a random $\BS$, and for each fixed outcome of $\BS$, consider a random $\Bx$” rather than “draw a random $\Bx$, and for each fixed outcome of $\Bx$, consider a random $\BS$”. In our random comprehension notation, this is akin to allowing the random expression on the right-hand side to itself be random: for example, we’ll represent the squared error $\p{\HAT{f}_\BS(\Bx) - f(\Bx)}^2$ as

\[ \BS \sim \CD^m:\Bx \sim \CD : \p{\HAT{f}_\BS(\Bx) - f(\Bx)}^2. \]

More generally, it feels like we should be able to take any computation and make it “more random” by taking one part and replacing it by a random symbol, even if the computation was itself already random. That is, the notation should allow us to make any random expression such as $f(x,\BY)$ more random by randomizing $x$ in an “outer loop”:

\[ \BX \sim P: \BY \sim Q: f(\BX,\BY). \]

Of course, we can keep adding more levels, e.g.

\[ \BX \sim P : \BY \sim Q : \BZ \sim R: f(\BX,\BY,\BZ). \]

Capturing the innermost level

When working with higher-level random expressions, we want to work (e.g. compute probabilities or averages) with the innermost randomness first. For example, in our machine learning example, we should be able to write the probability that your model has a mean error greater than $\eps$ as

\[ \Pr\b{\E\b{\p{\HAT{f}_\BS(\Bx)-f(\Bx)}^2} > \eps}, \]

where the expectation is over the randomness of $\Bx$ only and the probability is over the randomness of $\BS$ only.

And for all we know, the distribution of the inner symbols might even depend on the values of the outer random symbols! E.g. we could define a family of distributions $Q_x$ and consider the expression

\[ \BX \sim P: \BY \sim Q_\BX: f(\BX,\BY), \]

in which case it would make no sense to do things like “average $f(\BX,\BY)$ over $\BX$ only”, expecting to get an expression that depends on $\BY$.

This means that the rule for capturing distributions of higher-level random expressions is to capture only the innermost symbol: e.g. the right way to resolve

\[ \BX \sim P: \BY \sim Q: [f(\BX,\BY)] \]

is as

\[ \BX \sim P: {}_{\BY \sim Q}\b{f(\BX,\BY)}, \]

where ${}_{\BY \sim Q}\b{f(\BX,\BY)}$ is now a distribution whose probabilities are random over $\BX$ only:

\[ {}_{\BY \sim Q}\b{f(\BX,\BY)}(z) = \sum_{y: f(\BX,y)=z} Q(y). \]

This makes sense given our intuition that “we should be allowed to take any computation and make it random”. Indeed, if we started out with the computation $\E[f(x,\BY)]$ and decided to make the “$x$” part random by introducing a random symbol $\BX \sim P$, then the meaning of $\E[f(\BX,\BY)]$ ought to be “the value $\E[f(x,\BY)]$, for a random $x$”, not “the expectation of $f(\BX,\BY)$ over the randomness of both $\BX$ and $\BY$, assuming they’re independent” nor “the value $\E[f(\BX,y)]$, for a random $y$”.

If we bracket this expression a second time, i.e. $\b{\b{f(\BX,\BY)}}$, then this resolves to a distribution over distributions $_{\BX \sim P}\b{_{\BY \sim Q}\b{f(\BX,\BY)}}$. The probability that it gives to some distribution $R$ is the probability that if you pick a random $x$ from $P$, the distribution $_{\BY \sim Q}\b{f(x,\BY)}$ happens to be exactly $R$:

\[ \al{ _{\BX \sim P}\b{_{\BY \sim Q}\b{f(\BX,\BY)}}(R) &= \sum_{x:R = \b{\BY \sim Q: f(\BX,\BY)}} P(x)\\ &= \sum_{x:\forall z,R(z) = \sum_{y: f(x,y)=z} Q(y)} P(x). } \]
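To see the “capture only the innermost level” rule in running code, we can represent a twice-bracketed expression as a distribution whose values are themselves (frozen) distributions — a sketch with helper names of our own:

```python
def capture(P, f):
    """[f(X)] for X ~ P."""
    Q = {}
    for x, p in P.items():
        y = f(x)
        Q[y] = Q.get(y, 0) + p
    return Q

def freeze(Q):
    """Make a distribution hashable, so it can itself be a value of an outer distribution."""
    return frozenset(Q.items())

P = {0: 0.5, 1: 0.5}  # outer: X ~ P
Q = {0: 0.5, 1: 0.5}  # inner: Y ~ Q
f = lambda x, y: x + y

# [[f(X, Y)]]: for each fixed outcome x, capture the inner distribution [f(x, Y)];
# then capture over the outer randomness of X.
outer = capture(P, lambda x: freeze(capture(Q, lambda y: f(x, y))))

# With prob 1/2 (x = 0) the inner distribution is uniform on {0, 1};
# with prob 1/2 (x = 1) it is uniform on {1, 2}.
assert outer == {freeze({0: 0.5, 1: 0.5}): 0.5, freeze({1: 0.5, 2: 0.5}): 0.5}
```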

Resolving ambiguity

In general, if we define random symbols like $\BX \sim P$ and $\BY \sim Q$ in text and then use expressions like $f(\BX,\BY)$, there can be ambiguity over what hierarchy is implied. It could be that $\BX$ is the outer symbol, that $\BY$ is, or that they “share a level”, in which case we should consider them as jointly distributed but independent:

\[ \al{ \BX \sim P: \BY \sim Q: f(\BX, \BY)&\\ \BY \sim Q: \BX \sim P: f(\BX, \BY)&\\ \BX \sim P, \BY \sim Q: f(\BX, \BY)& \quad \text{(understood as $(\BX, \BY) \sim P \times Q: f(\BX,\BY)$)}. } \]

And even if the relative order of the symbols is clear from the context, we might sometimes deal with expressions where one of the symbols “dropped out”. For example, suppose we’re dealing with the expectation

\[ \BX \sim P: \BY \sim Q: \E[\BX+\BY-\BY]. \]

At this stage, the expectation should clearly be over $\BY$, keeping $\BX$ fixed. But as the computation progresses, $\BY$ disappears, and we’d like to write

\[ \E[\BX + \BY - \BY] = \E[\BX], \]

but given that the rule is “always capture the innermost level”, the former is a random expression in $\BX$, whereas the latter is a fixed value!

To deal with such situations, we can specify explicitly which level should be captured. For example,

  • in the first situation, we might
    • write $\E_\BX[f(\BX,\BY)]$ to clarify that $\BX$ was on its own at the innermost level (this yields a random expression in $\BY$ only)
      • and vice versa;
    • write $\E_{\BX,\BY}[f(\BX,\BY)]$ to clarify that $\BX$ and $\BY$ were on the same level (this yields a deterministic value);
  • in the second situation, we could write $\E[\BX + \BY - \BY] = \E_\BY[\BX] = \BX$ (in which case all three expressions are random in $\BX$ only).

Or in any of these situations, one could just have chosen to write the declarations explicitly:

\[ \al{ &\E_{\BX \sim P}[f(\BX,\BY)]\\ &\E_{\BX\sim P, \BY \sim Q}[f(\BX,\BY)]\\ &\E_{\BY \sim Q}[\BX + \BY - \BY] = \E_{\BY \sim Q}[\BX] = \BX. } \]

Note: When the hierarchy is clear, this notation should not be used to force the brackets to capture a level that is further out than some symbol in the expression. For example, if the hierarchy is $\BX \sim P: \BY \sim Q: f(\BX,\BY)$, then talking about $\E_\BX[f(\BX,\BY)]$ makes no sense, since it would capture the outer $\BX$ while the inner $\BY$ remains. So when the hierarchy is clear, the semantics of this subscript is just to ask the brackets to pretend that the symbol was present inside.

Examples

  • We can equivalently define the entropy of a random variable $\BX \sim P$ as $\H[\BX] \ce \E\b{\log\f{1}{[\BX](\BX)}}$. At first sight it looks like there might be a naming collision, but it actually resolves normally, in the following way:
    • in $[\BX]$, $\BX$ is the only random symbol, so it gets captured as $_{\BX \sim P}[\BX]$, which evaluates to the deterministic value $P$;
    • the whole expression is now $\E\b{\log \f{1}{P(\BX)}}$;
    • in $\b{\log \f{1}{P(\BX)}}$, there is again only one random symbol $\BX$, which gets captured as $_{\BX \sim P}\b{\log \f{1}{P(\BX)}}$, giving some distribution $Q$ (whose values are the negative log-probabilities);
    • then we apply the expectation operator to $Q$, giving us the entropy of $P$.
  • Similarly, we can define the variance simply as $\Var[\BX] \ce \E\b{\p{\BX-\E[\BX]}^2}$ without worrying about the two occurrences of $\BX$ interfering.
    • And more generally, Wick products (e.g. $\inner{\BX} \ce \BX - \E[\BX]$) and cumulants.
  • Coming back to equality of random expressions, if $\BX$ and $\BX'$ are two level-$l$ random expressions over the same $l$ levels, then both $\BX=\BX'$ and $[\BX]=[\BX']$ are level-$(l-1)$ random events, and only the innermost level dropped.
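The entropy and variance examples above can be checked numerically with a small dict-based sketch (all helper names are assumptions for illustration). The point is that the inner $[\BX]$ resolves to the deterministic value $P$, after which the remaining expectation captures $\BX$ itself:

```python
import math

def E(dist, f=lambda v: v):
    return sum(p * f(v) for v, p in dist.items())

def brackets(dist):
    # [X]: capturing the randomness of X ~ dist evaluates to dist itself
    return dist

# H[X] := E[ log 1/[X](X) ]
P = {"a": 0.25, "b": 0.75}
H = E(P, lambda x: math.log2(1 / brackets(P)[x]))

# Variance via the same no-collision mechanism: Var[X] := E[(X - E[X])^2]
PX = {0: 0.5, 1: 0.5}
Var = E(PX, lambda x: (x - E(PX)) ** 2)
```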

Gluing, hacking and splitting

#to-write

Mixtures

#to-write to make this work you need to “merge” symbols into a joint space?

Let $\BX \sim P$, and let $Q_x$ be a family of distributions (one for each $x \in \CX$). If we define the random symbols $\BY_x \sim Q_x$ for each $x$, then we can consider the mixture $\BY_\BX$.
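As a quick numerical sketch (the dict encoding and the name `mixture` are assumptions), the marginal distribution of the mixture $\BY_\BX$ is $\sum_x P(x)\,Q_x(y)$:

```python
P = {0: 0.4, 1: 0.6}                 # X ~ P, the "random index"
Q = {0: {"a": 1.0},                  # the family x -> Q_x
     1: {"a": 0.5, "b": 0.5}}

def mixture(P, Q):
    """Marginal of Y_X: draw x from P, then draw y from Q_x."""
    out = {}
    for x, px in P.items():
        for y, qy in Q[x].items():
            out[y] = out.get(y, 0.0) + px * qy
    return out

marginal = mixture(P, Q)   # the distribution [Y_X]
```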

  • mixtures
    • let $\BY_x$ be random variables, and $\BX$ be a “random index”, then let’s call
      • the mapping $x \mapsto \BY_x$ a stochastic map
      • $\BY_\BX$ a mixture of the variables $\BY_x$ or a stochastic transformation of $\BX$
  • $\b{\BY_\BX} = \E\b{\b{\BY_\BX}}$?? this makes no sense
  • I guess you’d have to use an actual symbol for this?
  • #to-think actually maybe I want notation for defining a distribution and drawing from it immediately?
  • is this even a stochastic map and not a random function?
  • $\CP$ returns distributions, just like $\CN(0, \sigma^2)$ and $\CU_{S}$!!
    • maybe want to rewrite $\CU_S$ as $\CU(S)$
    • and stop using $\CX$ for the domain?
  • then mixtures are like “apply $\CP(\BX)$
    • yup, would indeed be nice to have a symbol for “draw from this distribution”
    • $\tld{\CP(\theta)}$? $\sim \CP(\theta)$? $]\CP(\theta)[$?
      • and generally $]P[$
    • but otoh it seems like $]\CP(\BX)[$ would have separate randomness
      • i.e. it seems like $(\BX, ]\CP(\BX)[)$ would just be $\BX \sim P: \BY \sim \CP(\BX): (\BX,\BY)$, as opposed to being jointly distributed?
      • if you wanted to merge, couldn’t you just write things like $_{\BX,\BY}\b{(\BX,\BY)}$? hmm why is this all so messed up :’)
      • and maybe given that it’s easy to make the separate version, you could just assume that they are jointly distributed by default?
        • but joint with whom?? just with the innermost level?
    • or just $_{\BY \sim \CP(\BX)}(\BY)$ instead of brackets? certainly less nice-looking but could be useful if you want to mention it twice?
      • and makes sense that it would be joint with the innermost level because you don’t want it to cause issues when writing things like $_{\BX \sim P}[\BX+{}_{\BY \sim Q}(\BY)]$, which should be interpreted as $_{\BX \sim P,\BY \sim Q}\b{\BX+\BY}$?
      • actually seems pretty useless in light of this?
  • maybe just say $\BX \sim P, \BY \sim \CP(\BX)$ is a mixture man
    • and $\cond{\BY}{\BX}$ is the stochastic map??? but then it’s just $\BX \sim P: \BY \sim \CP(\BX)$ no?
  • whereas $\Bf(\BX)$ is a random function, and is just a perfectly normal thing?
  • and $\BY_\BX$ is not that special; $\BX$ is a random index and $\BY$ is a random vector / indexable thing
    • can be joint or either order, no big deal
    • no different from a random function (from indices to values)

#to-think

  • ==potentially critical: how to make the difference between a random function and a stochastic map?==
    • seems like you’d expect $\Bf(x) = \Bf(x)$ to be true all the time, so in that sense more of a random function
    • whereas a stochastic map might be more like a function that returns distributions?
      • in that way, proto-conditionals like $\at{\BY}{\BX=x}$ are more like stochastic maps (from $x$), since they create a random variable every time they’re called?
  • is it clear that $\E_{\Bx_{i-1} \sim \b{\BX_{i-1}}}\b{\D\bpa{\at{s_i(\BY_{i-1})}{\BX_{i-1} = \Bx_{i-1}}}{\at{\BY_i}{\BX_i = \tld\Bf_i(\Bx_{i-1})}}}$ can be written as $\E\b{\D\bco{\para{s_i(\BY_{i-1})}{\at{\BY_i}{\BX_i = \tld\Bf_i(\BX_{i-1})}}}{\BX_{i-1}}}$?
    • the issue with $\at{\BY_i}{\BX_i = \tld\Bf_i(\BX_{i-1})}$ not always being well-defined even when the above is very well defined perhaps points at a deeper issue and a connection to the stochastic vs random nature?
    • and on the other hand maybe it also points at the former being unreasonable in some way?

Restrictions

  • motivate in the closest analogue to the motivation for conditionals as possible
  • rename to “zoom”
  • “the distribution Y would have if E had probability 1”
  • want: $\Pr[\BY = y \and \BE] = \Pr[\at{\BY}{\BE} = y] \Pr[\BE]$?
    • or $[\BY,\BE](y, \true)$ =

guess: $\at{\BY}{\BE}$ means $\BY' \sim \set{\p{y, \f{\b{(\BY,\BE)}(y,\r{true})}{\b{\BE}(\r{true})}}}: \BY'$

  • if $\BE$ is a random event, $\at{\BY}{\BE}$ denotes the “restriction” of $\BY$ to $\BE$: the random variable whose distribution is the one $\BY$ has after you condition on the event $\BE$ being true
    • its distribution is described by: $\b{\at{\BY}{\BE}}(y) \ce \f{\b{(\BY,\BE)}(y,\r{true})}{\b{\BE}(\r{true})} = \f{\Pr[\BY = y \and \BE]}{\Pr\b{\BE}}$
    • can generalize to arbitrary factors (instead of just random events) (e.g. in the context of undirected graphical models), where $\BE$ takes any value between $0$ and $1$
      • then $\b{\at{\BY}{\BE}}(y) \ce \f{\E[1[\BY=y]\cdot\BE]}{\E[\BE]}$
  • but importantly, if $\BY$ and $\BZ$ are jointly distributed, then so are their restrictions $\at{\BY}{\BE}$ and $\at{\BZ}{\BE}$ on the same event $\BE$
    • the same way that two transformations $f(\BX)$ and $g(\BX)$ of the same random variables are still jointly distributed
    • the right way to see them is as the two components of $\at{\p{\BY,\BZ}}{\BE}$
  • in $\at{\BY}{\BE}$, $\BY$ and $\BE$ are not taken for their values; their randomness is being caught!
    • $\at{\BY}{\BE}$ depends on the overall (joint) distribution of $\BY$ and $\BE$
    • so at first sight it might have been more honest to denote it as something like $\r{cond}\b{\BY,\BE}$
      • but that would have been confusing too because as mentioned before, we want $\at{\BY}{\BE}$ and $\at{\BZ}{\BE}$ to be jointly distributed if $\BY$ and $\BZ$ are
    • in particular, $\at{\BY}{\BE}$ can’t be thought of as a transformation of the form $f(\BY,\BE)$ (where $f$ is some function)
      • because it’s not like you could have plugged in an outcome $y$ of $\BY$ into it to get $\at{y}{\BE}$, or plugged in an outcome for $\BE$ like $\at{\BY}{\r{true}}$
        • neither of these makes any sense
      • instead, the expression “$\at{\BY}{\BE}$” as a whole is a completely new random variable
  • when we write an expectation/probability/etc conditioned on an event $\BE$, we really mean the expectation/probability/etc of the corresponding restriction
    • e.g. $\E\bco{\BY}{\BX=x}$ is a “shorthand” for $\E\b{\at{\BY}{\BX=x}}$
    • and maybe we should avoid this superfluous and possibly confusing notation? :)
  • example: $\at{\BY}{\BX=x}$
    • e.g. $\Pr\b{\at{\BY}{\BX=x} = y} = \Pr\b{\at{(\BY=y)}{\BX=x}} = \Pr\bco{\BY=y}{\BX=x}$
      • here, $\at{(\BY=y)}{\BX=x}$ is a random event
      • and $\Pr\bco{\BY=y}{\BX=x}$ is just a shorthand for $\Pr\b{\at{(\BY=y)}{\BX=x}}$
  • when dealing with higher-level expressions like $\setco{f(x,\BY)}{x \gets \BX} = \setco{\setco{f(x,y)}{x \gets \BX}}{y \gets \BY}$, we can only restrict on the inner variable (i.e. $\BX$)
    • for example, $\at{\setco{f(x,\BY)}{x \gets \BX}}{\BX \in S} = \setco{\at{\setco{f(x,y)}{x \gets \BX}}{\BX \in S}}{y \gets \BY}$
      • checking through the “definition”: $\b{\at{\setco{f(x,\BY)}{x \gets \BX}}{\BX \in S}}(z) = \f{\b{\p{\setco{f(x,\BY)}{x \gets \BX}, \BX \in S}}(z, \r{true})}{\b{\BX \in S}(\r{true})}$
      • the numerator is a random variable over $\BY$ (the brackets caught only $\BX$)
      • #to-think … but aren’t the distributions of $\BX$ and $\BY$ supposed to become joint in the $(\cdot,\cdot)$?
    • on the other hand, if you try to talk about $\at{\setco{f(x,\BY)}{x \gets \BX}}{\BY \in S}$
      • checking through the “definition”: $\b{\at{\setco{f(x,\BY)}{x \gets \BX}}{\BY \in S}}(z) = \f{\b{\p{\setco{f(x,\BY)}{x \gets \BX}, \BY \in S}}(z, \r{true})}{\b{\BY \in S}(\r{true})}$
      • in the numerator, we’re trying to catch the randomness of $\BY$ and not $\BX$, which is not what the brackets are made for
      • and indeed, it’s not clear what type $\at{\setco{f(x,\BY)}{x \gets \BX}}{\BY \in S}$ might have: intuitively, it would be in $\CD(\CE(\CZ))$, since each value $z$ in the above is a random expression depending on $\BX$ but random expressions were never meant to be used as values
  • #to-think how do you make “$\at{\BY_i}{\BX_i=\tld{f}_i(\BX_{i-1})}$” have the same type as “$\cond{\BY_i}{\BX_i}$”?
    • feels like there’s an odd parallel between this and
      • $\D\ppa{Q}{P}$ where $Q$ is used both inside the expression and as the variable average over
      • $\inner{\BX}$ where $\BX$ is used both for the value and the moment
        • actually this one is stronger?
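The “definition” $\b{\at{\BY}{\BE}}(y) \ce \f{\b{(\BY,\BE)}(y,\r{true})}{\b{\BE}(\r{true})}$ above transcribes directly into code, starting from a joint distribution over $(y, e)$ pairs (the dict encoding and the name `restrict` are assumptions for illustration):

```python
def restrict(joint):
    """joint maps (y, e) -> prob, with e a boolean; returns the dist of Y|E."""
    pE = sum(p for (y, e), p in joint.items() if e)          # [E](true)
    return {y: p / pE for (y, e), p in joint.items() if e}   # renormalize

# Y uniform on {1, 2, 3, 4}, E = "Y is even"
joint = {(y, y % 2 == 0): 0.25 for y in [1, 2, 3, 4]}
restricted = restrict(joint)   # the distribution of Y|E
```

And the wanted identity $\Pr[\BY = y \and \BE] = \Pr[\at{\BY}{\BE} = y] \Pr[\BE]$ holds by construction.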

Conditionals

  • whereas zooms cut the randomness in half and keep only one part, splits keep both parts?
  • split into marginal and conditional
  • define as unique inverse of mixtures

  • suppose $\BX$ and $\BY$ are jointly distributed
  • then $\cond{\BY}{\BX}$ denotes the “conditional of $\BY$ given $\BX$”
    • meaning: $\setco{\at{\BY}{\BX=x}}{x \gets \BX}$
      • i.e. for each outcome $x$ of $\BX$, give me the restriction of $\BY$ to the event $\BX=x$
  • it’s a level-$2$ random variable: if $\BY$ takes values over $\CY$, then this is in $\CE(\CE(\CY))$
    • it’s jointly distributed with $\BX$ on the outside
    • and jointly distributed with e.g. $\at{\BZ}{\BX=x}$ on the inside
  • $\cond{\BX}{\BX}$ is not quite the same thing as $\BX$: it’s $\BX$ but where we replaced each outcome $x$ by a random variable that has value $x$ with probability $1$
    • same thing for $\cond{f(\BX)}{\BX}$ more generally
  • applying brackets
    • we have $\bco{\BY}{\BX} = \b{\setco{\at{\BY}{\BX=x}}{x \gets \BX}} = \setco{\b{\at{\BY}{\BX=x}}}{x \gets \BX}$
      • this is now in $\CE(\CD(\CY))$
    • and $\b{\bco{\BY}{\BX}}$ is in $\CD(\CD(\CY))$
      • where $\b{\bco{\BY}{\BX}}(x)(y) = \Pr\bco{\BY=y}{\BX=x}$
  • therefore $\E\bco{\BY}{\BX} = \setco{\E\b{\at{\BY}{\BX=x}}}{x \gets \BX}$
    • in particular, $\E\bco{\BY}{\BX}$ is a random variable depending on $\BX$
    • also (assuming $f$ is deterministic), $\E\bco{f(\BX)}{\BX}$ is the same as just $f(\BX)$
  • what people traditionally call the “conditional entropy” isn’t $\H\bco{\BY}{\BX}$ (which is a random variable depending on $\BX$) but instead $\E\b{\H\bco{\BY}{\BX}}$
    • I think it makes sense to call this random variable $\H\bco{\BY}{\BX}$ conditional entropy, because it is the entropy of the conditional, and it is a conditional itself: it’s $\cond{\H[\at{\BY}{\BX=\Bx}]}{\Bx \gets \BX}$
      • actually I’m not sure this expression quite makes sense since $\BX=\Bx$ is arguably always true (depending on what you think the precise semantics of $\Bx \gets \BX$ are) but you get the gist
        • #to-think find a way to write it nicely
    • so chain rule of entropy is more properly written as $\H\b{\BX,\BY} = \H\b{\BX} + \E\b{\H\bco{\BY}{\BX}}$
    • similarly, the “consistent conditioning” property of Ratio divergences is more properly written as $D_f\bpa{\BX,\BY'}{\BX,\BY} = \E\b{D_f\bpaco{\BY'}{\BX}{\BY}{\BX}}$ (where $D_f\bpaco{\BY'}{\BX}{\BY}{\BX}$ is a random variable depending on $\BX$)
      • and maybe it even makes it clear that there’s no super natural answer to the question of “what variable should you be averaging over when defining the conditional divergence”!
  • note that e.g. $\D\bco{\pat{\BX_S}{E_i(\BX)}}{\pat{\BX_{\comp{S}}}{E_i(\BX)}}$ takes the same values as $\D\bco{\pat{\BX_S}{E_i(\BX)}}{\BX_{\comp{S}}}$ (assuming the latter makes sense which I think it does?), but not with the same probabilities
  • #to-think way to make sense of notations like $\D\bpa{\Bg_i}{\Bf_i}$ where $\Bf_i$ and $\Bg_i$ are stochastic maps (i.e. lost the external randomness but kept the dependence)?
    • and actually the result would itself be a function…
    • and something like $\E_{P}\p{\D\bpa{\Bg_i}{\Bf_i}}$ to take the average of that particular function over a distribution?
      • arguably you should just write this as $\E_{\Bx \sim P}\b{\D\bpa{\Bg_i(\Bx)}{\Bf_i(\Bx)}}$, not even that much longer
      • because otherwise there could be ambiguity?
  • #to-think is the equality $\Cov[f(\BX), g(\BY)] = \Cov\b{f(\BX), \E\bco{g(\BY)}{\BX}}$ true, and does it make sense?
    • in particular, it requires that $f(\BX)$ and $\E\bco{g(\BY)}{\BX}$ be able to cohabit in the same space
    • yeah seems fair, since both are expressions quantified by $\BX$ (or whatever underlies $\BX,\BY$)
  • #to-think make sense of $\cond{\BY_i}{\BX_i} = \at{\BY_i}{\BX_{i-1} = \cond{\BX_{i-1}}{\BX_i}}$ (provided that we have the Markov property $\BX_i \leftarrow \BX_{i-1} \rightarrow \BY_i$)
  • #to-think make sense of $\E_{\BX_{i-1}}\b{\E_{\cond{\cdot}{\BX_{i-1}}}\b{\norm{\pco{s_i(\BY_{i-1})}{\BX_{i-1}} - \p{\at{\BY_i}{\BX_i = \tld{f}_i(\BX_{i-1})}}}^2}}$
    • i.e. for each fixed value $x_{i-1}$ of $\BX_{i-1}$, look at the distribution of $s_i(\BY_{i-1})$ conditioned on $\BX_{i-1} = x_{i-1}$ and $\BY_i$ conditioned on $\BX_i = \tld{f}_i(\BX_{i-1})$
    • (and hopefully the subscripts for the $\E$’s would be superfluous)
    • actually maybe just $\E\b{\E\bco{\norm{s_i(\BY_{i-1}) - \at{\BY_i}{\BX_i = \tld{f}_i(\BX_{i-1})}}^2}{\BX_{i-1}}}$
    • but also I’m not sure that this was a case that made that much sense?
      • like, why did you want to condition on $\BX_{i-1}$, exactly?
      • isn’t it supposed to be the case that $\E\b{\E\bco{\BY}{\BX}}$ is just $\E\b{\BY}$?
        • if putting a restriction inside the $\BY$ part breaks this property, this is an extremely bad sign!
        • probably good to investigate whether restrictions are well-defined using examples like these
    • still, would be good to have some notation analogous to $\E_{\cond{\cdot}{\BX_{i-1}}}$? or think in general about what you might need
    • #to-think also!!
      • relate to thoughts about stochastic maps vs random functions above
  • #to-think should we talk about $\D\b{\at{(\cond{\BX_S}{\BX_{\comp{S}}})}{E_i}}$ or $\D\b{\cond{\at{\BX_S}{\BE_i}}{\at{\BX_{\comp{S}}}{\BE_i}}}$?
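The meaning of $\cond{\BY}{\BX}$ as “for each outcome $x$ of $\BX$, the restriction of $\BY$ to $\BX = x$” can be sketched numerically (dict encoding and helper names assumed), and in particular used to check the tower property $\E\b{\E\bco{\BY}{\BX}} = \E[\BY]$ and the chain rule $\H[\BX,\BY] = \H[\BX] + \E\b{\H\bco{\BY}{\BX}}$ mentioned above:

```python
import math

def E(dist, f=lambda v: v):
    return sum(p * f(v) for v, p in dist.items())

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}  # (x, y) -> prob

marg_X = {}
for (x, y), p in joint.items():
    marg_X[x] = marg_X.get(x, 0.0) + p

# [Y|X]: for each outcome x of X, the restriction of Y to the event X = x
cond = {x: {y: p / marg_X[x] for (x2, y), p in joint.items() if x2 == x}
        for x in marg_X}

# E[Y|X] is a random variable depending on X...
E_Y_given_X = {x: E(cond[x]) for x in marg_X}
# ...and taking one more expectation recovers E[Y] (tower property)
tower = E(marg_X, lambda x: E_Y_given_X[x])

# chain rule: H[X, Y] = H[X] + E[ H[Y|X] ]
chain = H(marg_X) + E(marg_X, lambda x: H(cond[x]))
```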

to drop:

Appendix: sanity checks

(where the word “sanity” is to be interpreted more literally than usual, i.e. questioning whether I might actually be deranged)

#to-think dude, what if you actually read a textbook on this and discovered that random variables are just functions from a sample space $\Omega$?

  • does that answer your questions about whether to “merge the randomness”?
  • e.g. if $\BX: \Omega \to \R$, $\BY: \Omega' \to \R$
    • for constants $a,b$, it makes sense to define the functions
      • $\BX a: \Omega \to \R$ where $x \mapsto xa$
      • $b\BY: \Omega' \to \R$ where $y \mapsto by$
    • but then $\BX\BY$ is still ambiguous:
      • if we “use the $\BX a$ rule first”, this would mean $x \mapsto x\BY$:
        • i.e. a random variable (based on $\BX$) whose values are the random variables $x\BY$
        • i.e. $\setco{\setco{xy}{y \sim \BY}}{x \sim \BX}$
        • i.e. $\BX\BY: \Omega \to (\Omega' \to \R)$
        • and the $\b{\cdot}$ operator would just catch the “$\Omega' \to$” part
      • but could do the reverse…
      • or could just multiply them as functions the usual way:
        • $\BX\BY: \Omega \times \Omega' \to \R$
        • (or if $\Omega = \Omega'$, could just be $\Omega \to \R$)
        • but this would never allow you to “just capture one level of randomness”
  • tbh it’s kind of nice that in this note
    • there is only one type of formal mathematical object: distributions
      • the rest is grammar!
    • there is no useless “sample space” object that you sort of need to redefine every time you introduce a new variable
  1. The technical term is “almost surely”, to account for edge cases with continuous random variables where they technically could be equal but it happens with probability $0$.