Random expressions
Have you ever wondered…
- what do the square brackets mean when we write things like $\Pr[\cdot]$ or $\E[\cdot]$?
- in $\E\bco{\By}{\Bx}$, does the argument $\co{\By}{\Bx}$ have its own independent meaning?
- whether conditional entropy might be defined the wrong way?
Then you’ve come to the wrong place, because this note is confusing as heck! We’ll take the thought “any expression can become random… including random expressions” and run with it.
Basics
Distributions
Distributions are essentially weighted sets: sets only tell us what values are possible, while distributions also tell us how likely each value is. Formally, for a set $X$, a distribution $P$ can be written as
$$P = \setco{(x, p_x)}{x \in X}$$
where the probability weights $p_x \in [0,1]$ sum to $1$. We can also see $P$ as a function $X \to [0,1]$ in the usual way that functions are defined as sets of pairs: the probability of $x$ is then written $P(x)$.
Some useful notation:
- Let $\tri(X)$ be the set of distributions over $X$.
- Let $\supp(P) \ce \setco{x}{P(x) \ne 0}$ denote the support of distribution $P$, i.e. the set of points to which $P$ gives nonzero probability.
- Let $\CU(X)$ denote the uniform distribution over $X$, which (when $X$ is a finite set) is the distribution that gives each value the same probability: $\CU(X)(x) \ce \f{1}{\abs X}$ for all $x \in X$.
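Here is a minimal sketch of these definitions in Python, assuming $X$ is finite and a distribution is stored as a dict from values to probabilities. The names `supp` and `uniform` are illustrative choices, not notation from this note.

```python
from fractions import Fraction

def supp(P):
    """Support of P: the set of values with nonzero probability."""
    return {x for x, p in P.items() if p != 0}

def uniform(X):
    """Uniform distribution over a finite, non-empty set X."""
    X = set(X)
    return {x: Fraction(1, len(X)) for x in X}

# A distribution over X = {"a", "b", "c"}, stored as value -> weight.
P = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}
U = uniform({"a", "b", "c"})

assert sum(P.values()) == 1   # the weights sum to 1
print(supp(P))                # the two values with nonzero weight
print(U["a"])                 # 1/3
```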
Random comprehensions
Distributions are nice, but they’re not very convenient if we actually want to do things with their values. Sometimes, we want to talk about e.g. what $f(x)$ typically looks like if we take a random draw $x$ from $P$. Instead of doing some sort of sum over all of $X$ involving the probability weights $P(x)$, it would be nice to talk about $f(x)$ directly as a random thing.
When we talk about (unweighted) sets, we can use set comprehensions for this:
$$\setco{f(x)}{x \in X}$$
So let’s analogously write random comprehensions:
$$\Bx \sim P : f(\Bx)$$
A random comprehension has two parts:
- $\Bx \sim P$ declares a random symbol $\Bx$: it means “consider $\Bx$ to be a random draw from $P$”;
- $f(\Bx)$ is a random expression: it means “take that $\Bx$, perform such-and-such transformation, and look at the result”.
We will always write random symbols in bold font (e.g. $\BA, \Ba$).
Note that in this view, “$\Bx$”, “$f(\Bx)$” and “$\Bx \sim P:f(\Bx)$” don’t exist as mathematical objects the same way that the distribution $P$ does. They’re just grammar that we use to describe operations on random values in a more concise manner than if we manipulated the distributions directly. To make this point, we’ll generally avoid the term “random variable”.
Informally, it’s convenient to declare a random symbol globally then talk about one or more random expressions: e.g. if we say “let $\Bx \sim P$, and consider the correlation between $f(\Bx)$ and $g(\Bx)$” then under the hood we’re actually making claims about the random comprehension $\Bx \sim P: (f(\Bx), g(\Bx))$.
Capturing the distribution
Since random expressions are not themselves real mathematical objects, if we want to do anything useful with them, we eventually need to be able to get a distribution back. We do that using square brackets: if $\Bx \sim P$, then $\b{f(\Bx)}$ denotes “the distribution of $f(x)$, assuming $x$ was chosen at random according to $P$”.
Formally, we’ll write this distribution as $_{\Bx \sim P}\b{f(\Bx)}$, which is defined as
$${}_{\Bx \sim P}\b{f(\Bx)}(y) \ce \sum_{x \in X : f(x) = y} P(x)$$
and we’ll also use the shorthand $_P\p{f} \ce {}_{\Bx \sim P}\b{f(\Bx)}$ in cases where the function $f$ is already explicitly defined. But in the language of random comprehensions, we’ll just write it as $\Bx \sim P: [f(\Bx)]$.
As a spoiler, we’ll later see that this operation of “capturing the distribution” of a random expression is precisely what the square brackets are doing in expressions like “$\Pr[\Bx=x]$” or “$\E[\Bx]$”.
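As a rough illustration, capturing a distribution can be sketched in Python for finite distributions stored as dicts. The function name `capture` is a hypothetical stand-in for the square brackets.

```python
from collections import defaultdict
from fractions import Fraction

def capture(P, f):
    """Distribution of f(x) when x is drawn from P (a dict value -> probability)."""
    Q = defaultdict(Fraction)
    for x, p in P.items():
        Q[f(x)] += p  # all x with the same image pool their probability
    return dict(Q)

# x ~ U({-1, 0, 1}); capture the distribution of x^2.
P = {x: Fraction(1, 3) for x in (-1, 0, 1)}
Q = capture(P, lambda x: x * x)
print(Q)  # 1 gets weight 2/3, 0 gets weight 1/3
```

The result is an ordinary distribution again, so it can be fed to anything that expects one.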
Working with random expressions
Aliases
We can define new random symbols as aliases for random expressions, just like we would for normal values: e.g. $\By \ce f(\Bx)$. If we then talk about some expression $g(\By)$ based on $\By$, this should be interpreted as the random comprehension $\Bx \sim P: g(f(\Bx))$.
Random events
Random events are just random expressions that yield truth values $\tf$, such as “$\Bx = 3$” or “$\Bx \in S$” for some set $S \sse X$. We can denote events two ways:
- as functions: e.g. $E(x) \ce (x = 3)$, so that $E(\Bx)$ is equivalent to $\Bx=3$;
- as random symbols: e.g. $\BE \ce (\Bx = 3)$ (in this case $\BE$ is itself in bold font).
Joint distributions
If $R$ is a distribution on the set of pairs $X \times Y$, then we can write $(\Bx,\By) \sim R$ to describe a joint distribution on $\Bx$ and $\By$. Formally, writing $(\Bx,\By) \sim R$ is equivalent to writing $\Bz \sim R$ (where $\Bz$ is a random pair in $X\x Y$) then defining the aliases $\Bx \ce \Bz_1$ and $\By \ce \Bz_2$. The same applies more generally to tuples.
More generally, we say that random expressions are jointly distributed if their randomness comes from the same declaration: for example, if $\Bx \sim P$, then $f(\Bx)$ and $g(\Bx)$ are jointly distributed, and we can think about the ways they depend on each other.
If we have two distributions $P \in \tri(X)$ and $Q \in \tri\p Y$, we can form the product distribution $P \times Q \in \tri\p{X \x Y}$, defined as $(P \times Q)(x,y) \ce P(x)Q(y)$. In the other direction, given a joint distribution $R$ we can use our notation to marginalize it: for example, $(\Bx,\By) \sim R : \b{\Bx}$ gives the marginal for the first element of the pair.
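A small sketch of products and marginals, again assuming finite distributions stored as dicts. `product` and `capture` are illustrative names: marginalizing is just capturing the distribution of one component of the pair.

```python
from collections import defaultdict
from fractions import Fraction

def product(P, Q):
    """Product distribution: (P x Q)(x, y) = P(x) * Q(y)."""
    return {(x, y): p * q for x, p in P.items() for y, q in Q.items()}

def capture(R, f):
    """Distribution of f(z) when z is drawn from R."""
    out = defaultdict(Fraction)
    for z, r in R.items():
        out[f(z)] += r
    return dict(out)

P = {0: Fraction(1, 4), 1: Fraction(3, 4)}
Q = {"h": Fraction(1, 2), "t": Fraction(1, 2)}
R = product(P, Q)

# Marginals: (x, y) ~ R : [x] and (x, y) ~ R : [y].
first = capture(R, lambda z: z[0])
second = capture(R, lambda z: z[1])
print(first == P, second == Q)  # True True
```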
Equality of random expressions
There are two ways that two random expressions $\Bx$ and $\By$ can be equal.
They can be “equal in distribution”, i.e. their distributions are identical. We can write this as $[\Bx] = [\By]$. For example, if $\Bom \sim \CU\p{[0, 2\pi]}$, then $[\cos(\Bom)] = [\cos(\Bom+\theta)]$ for any offset $\theta$, even though for almost all values of $\Bom$ these expressions would be different (as long as $\theta \not\in 2\pi\Z$). Note that for this notion to work out, the expressions $\Bx$ and $\By$ don’t need to be jointly distributed: for example, it’s completely fine to write $_{\Bx \sim \CU\p{[-1,0]}}\b{-2\Bx} = {}_{\By \sim \CU\p{[0,2]}}\b{2-\By}$ (and it is a true fact).
More strongly, $\Bx$ and $\By$ can be “equal in probability”, i.e. they are always[^1] equal. We can write this as $[\Bx = \By](\true) = 1$ (which we’ll see later is what is meant by $\Pr[\Bx = \By] = 1$). Note that $\Bx$ and $\By$ must be jointly distributed in order for this to make sense.
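To see the gap between the two notions, here is a tiny finite example (an illustration of ours, not taken from the text above): a fair bit and its flip have the same distribution, yet they are never equal. The helper `capture` is a hypothetical name for the bracket operation.

```python
from collections import defaultdict
from fractions import Fraction

def capture(P, f):
    """Distribution of f(x) when x is drawn from P."""
    out = defaultdict(Fraction)
    for x, p in P.items():
        out[f(x)] += p
    return dict(out)

P = {0: Fraction(1, 2), 1: Fraction(1, 2)}   # x is a fair bit

dist_x = capture(P, lambda x: x)             # [x]
dist_y = capture(P, lambda x: 1 - x)         # [y] for the alias y := 1 - x
same_value = capture(P, lambda x: x == 1 - x)  # [x = y], a distribution over truth values

print(dist_x == dist_y)          # True: equal in distribution
print(same_value.get(True, 0))   # 0: the event x = y never happens
```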
Computing statistics
Now that we have a nice way to manipulate probability distributions, it would be nice if we could get more information about them.
Principles
Suppose $\Bx \sim P$ takes real values, and we would like to know the mean of $f(\Bx)$. Thanks to our judicious notational choices, we can just denote it as $\E[f(\Bx)]$, where
- $\b{f(\Bx)}$ captures the randomness of $f(\Bx)$ and returns its distribution—name it $Q$;
- $\E$ is an operator on distributions that takes $Q$ and computes its mean.
This operator is defined as $\E(Q) \ce \sum_y Q(y)\, y$ for discrete distributions and $\E(Q) \ce \int_y y \d Q(y)$ for continuous distributions. And the notation $\E_{\Bx \sim P}\b{f(\Bx)}$ can be similarly understood as applying the operator $\E$ to the distribution $_{\Bx \sim P}\b{f(\Bx)}$.
In general, let’s call any operator on distributions that returns a single number a statistic. Even though they’re only defined on distributions, they automatically work on random expressions as long as you capture their distribution with brackets.
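The two-step reading of $\E[f(\Bx)]$ can be sketched as follows, assuming finite distributions stored as dicts. `capture` and `E` are illustrative names for the brackets and the expectation operator.

```python
from collections import defaultdict
from fractions import Fraction

def capture(P, f):
    """Step 1: the brackets turn the random expression f(x) into a distribution."""
    Q = defaultdict(Fraction)
    for x, p in P.items():
        Q[f(x)] += p
    return dict(Q)

def E(Q):
    """Step 2: a statistic, i.e. an operator that maps a distribution to a number."""
    return sum(q * y for y, q in Q.items())

die = {k: Fraction(1, 6) for k in range(1, 7)}   # x ~ U({1, ..., 6})

mean = E(die)                                    # E[x]
mean_square = E(capture(die, lambda x: x * x))   # E[x^2]
print(mean, mean_square)                         # 7/2 91/6
```

Note that `E` never sees the symbol `x`: it only receives the captured distribution.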
Types of statistics
There are two main types of statistics: those that are happy to be given values from any set and those that expect the values to be of a specific type (e.g. $\R$, $\R^n$, $\tf$).
- Some examples of generic statistics:
  - Entropy: $\H(P) \ce \E_{\Bx \sim P}\b{\log \f{1}{P(\Bx)}}$ works on an arbitrary distribution $P$.
  - Ratio divergences: $D^f\pff{Q}{P} \ce \E_{\Bx \sim P}\b{f\p{\f{Q(\Bx)}{P(\Bx)}}}$ can take arbitrary distributions $P,Q$ as long as $\supp\p Q \sse \supp \p P$
    - (and in particular, the information divergence $\D\pff{Q}{P} \ce \E_{\Bx \sim Q}\b{\log \f{Q(\Bx)}{P(\Bx)}}$).
    - Alternatively, as in Density functions#As a random variable, one can understand $\ff Q P$ to be a shortcut for the distribution ${}_{\Bx \sim P}\b{\f{Q(\Bx)}{P(\Bx)}}$.
- Some examples of statistics which require values of a specific type:
  - Truth values: $\Pr(P) \ce P(\true)$.
  - Real values or vectors:
    - expectation: see earlier;
    - variance: $\Var(P) \ce \E_{\Bx \sim P}\b{\p{\Bx - \E(P)}^2}$.
- The mutual information is kind of an edge case: while $\I[\Bx;\By] \ce \D\pff{[(\Bx,\By)]}{[\Bx]\x[\By]}$ doesn’t require $\Bx$ or $\By$ to return values of a certain type, if we understand it as an operator on the distribution $R \ce \b{(\Bx,\By)}$, then $\I(R) \ce \D_{(\Bx,\By)}\pff{R}{{}_{(\Bx,\By) \sim R}[\Bx] \x {}_{(\Bx,\By) \sim R}[\By]}$ does require $R$ to be a distribution over pairs.
Even though we chose to define them based on the distributions to make a pedagogical point, most of the statistics in the latter category are simpler to define by using the random expressions directly than by going through distributions.
Note: The operator “$\1(\cdot)$”, which is often used to turn a random event into a $\zo$ random variable, does not inherently work on the distributions: it’s just a function that takes a truth value ($\true$ or $\false$) and returns $0$ or $1$. For example, it’s perfectly okay to write $\1\p{x=3}$ to denote “the value which is $1$ if $x=3$ and $0$ otherwise” even if $x$ is deterministic. Because of that, we shouldn’t use brackets for it, since it’s not capturing randomness but just doing data processing.
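A few of these statistics, sketched as plain functions on finite distributions (dicts from values to probabilities). Assumptions of this sketch: logarithms are base 2, and the mutual information is computed in the standard way, as the information divergence of the joint from the product of its marginals. All function names are illustrative.

```python
import math

def entropy(P):
    """H(P) = E_{x~P}[log 1/P(x)]; works for values of any type."""
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

def kl(Q, P):
    """Information divergence E_{x~Q}[log Q(x)/P(x)]; needs supp(Q) inside supp(P)."""
    return sum(q * math.log2(q / P[x]) for x, q in Q.items() if q > 0)

def variance(P):
    """Var(P) = E_{x~P}[(x - E(P))^2]; needs numeric values."""
    mean = sum(p * x for x, p in P.items())
    return sum(p * (x - mean) ** 2 for x, p in P.items())

def mutual_information(R):
    """Needs R to be a distribution over pairs (x, y)."""
    Px, Py = {}, {}
    for (x, y), r in R.items():
        Px[x] = Px.get(x, 0.0) + r
        Py[y] = Py.get(y, 0.0) + r
    prod = {(x, y): px * py for x, px in Px.items() for y, py in Py.items()}
    return kl(R, prod)

coin = {0: 0.5, 1: 0.5}
copies = {(0, 0): 0.5, (1, 1): 0.5}                          # y is a copy of x
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(entropy(coin), variance(coin))
print(mutual_information(copies), mutual_information(independent))
```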
Higher-level randomness
Hierarchy in randomness
There are many situations where we find ourselves groping for “random random expressions”. And not only that, but there is a sense of hierarchy to it:
- You might have a random object and want to perform random experiments on it, adding some “inner randomness”.
- Or, the other way around, you might have a computation that involves a random experiment and want to randomize one parameter of it, adding some “outer randomness”.
As an example of the former situation, suppose that you want to learn some function $f$ over an input distribution $D$, and you use a sample $\BS \sim D^m$ of $m$ data points from $D$ to learn some best fit $\HAT{f}_\BS$. Now, you want to know how likely it is that your learned $\HAT{f}_\BS$ is in fact close to $f$ on the whole distribution $D$, not just the sample $\BS$, so you want to make sure that for an average input $\Bx \sim D$, the squared error $\p{\HAT{f}_\BS(\Bx) - f(\Bx)}^2$ is not too large. In this last expression, there are two sources of randomness: the sample $\BS$ (the outer randomness) and the input $\Bx$ (the inner randomness).
Crucially, they’re not on the same level! In $\HAT{f}_\BS(\Bx)$, there is a sense in which the randomness of $\Bx$ “comes after” $\BS$. Or put yet another way, we want to see it as “draw a random $\BS$, and for each fixed outcome of $\BS$, consider a random $\Bx$” rather than “draw a random $\Bx$, and for each fixed outcome of $\Bx$, consider a random $\BS$”. In our random comprehension notation, this is akin to allowing the random expression on the right hand side to itself be random: for example, we might represent the squared error $\p{\HAT{f}_\BS(\Bx) - f(\Bx)}^2$ as
$$\BS \sim D^m : \p{\Bx \sim D : \p{\HAT{f}_\BS(\Bx) - f(\Bx)}^2}$$
or for notational simplicity
$$\BS \sim D^m : \Bx \sim D : \p{\HAT{f}_\BS(\Bx) - f(\Bx)}^2$$
More generally, it feels like we should be able to take any computation and make it “more random” by taking one part and replacing it by a random symbol, even if the computation was itself already random. That is, the notation should allow us to make any random expression such as $f(x,\By)$ more random by randomizing $x$ in an “outer loop”:
$$\Bx \sim P : \By \sim Q : f(\Bx,\By)$$
Of course, we can keep adding more levels, e.g.
$$\Bx \sim P : \By \sim Q : \Bz \sim R : f(\Bx,\By,\Bz)$$
Capturing the innermost level
When a random expression has several levels of randomness, we want to work (e.g. compute probabilities or averages) with the innermost randomness first, so that’s what the brackets should capture. For example, in our machine learning example, we should be able to write the probability that your model has a mean error greater than $\eps$ as
$$\Pr\b{\E\b{\p{\HAT{f}_\BS(\Bx) - f(\Bx)}^2} > \eps}$$
where the expectation is over the randomness of $\Bx$ only and the probability is over the randomness of $\BS$ only.
And for all we know, the distribution of the inner symbols might even depend on the values of the outer random symbols! E.g. we could define a family of distributions $\CQ$ parameterized by $x$ and consider the expression
$$\Bx \sim P : \By \sim \CQ(\Bx) : f(\Bx,\By)$$
in which case it would make no sense to do things like “average $f(\Bx,\By)$ over $\Bx$ only”, expecting to get an expression that depends on $\By$.
So we’ll set the rule for capturing distributions of higher-level random expressions to be to capture only the randomness from the innermost symbol(s) involved in it: e.g. the right way to resolve the comprehension
$$\Bx \sim P : \By \sim Q : \b{f(\Bx,\By)}$$
is as
$$\Bx \sim P : {}_{\By \sim Q}\b{f(\Bx,\By)}$$
where ${}_{\By \sim Q}\b{f(\Bx,\By)}$ is now a distribution whose probabilities are random over $\Bx$ only:
$${}_{\By \sim Q}\b{f(\Bx,\By)}(z) = \sum_{y : f(\Bx,y) = z} Q(y)$$
This makes sense given our intuition that “we should be allowed to take any computation and make it random”. Indeed, if we started out with the computation $\E[f(x,\By)]$ and decided to make the “$x$” part random by introducing a random symbol $\Bx \sim P$, then the meaning of $\E[f(\Bx,\By)]$ ought to be “the value $\E[f(x,\By)]$, for a random $x$”, not “the expectation of $f(\Bx,\By)$ over the randomness of both $\Bx$ and $\By$, assuming they’re independent” nor “the value $\E[f(\Bx,y)]$, for a random $y$”.
If we bracket this expression a second time, i.e. $\b{\b{f(\Bx,\By)}}$, then this resolves to a distribution over distributions, $_{\Bx \sim P}\b{_{\By \sim Q}\b{f(\Bx,\By)}}$. The probability that it gives to some distribution $R$ is the probability that if you pick a random $x$ from $P$, the distribution $_{\By \sim Q}\b{f(x,\By)}$ happens to be exactly $R$:
$$\b{\b{f(\Bx,\By)}}(R) = \Pr_{\Bx \sim P}\b{{}_{\By \sim Q}\b{f(\Bx,\By)} = R}$$
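Bracketing twice can be sketched in Python by capturing the inner level for each fixed outer value, then capturing the outer level. This assumes finite distributions stored as dicts; `capture` and `freeze` are hypothetical helper names, and the particular $P$, $Q$ and $f$ are made up for the example.

```python
from collections import defaultdict
from fractions import Fraction

def capture(P, f):
    """Distribution of f(x) when x is drawn from P."""
    out = defaultdict(Fraction)
    for x, p in P.items():
        out[f(x)] += p
    return dict(out)

def freeze(Q):
    """Make a distribution hashable so that it can itself be an outcome."""
    return frozenset(Q.items())

P = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}  # outer level: x ~ P
Q = {0: Fraction(1, 2), 1: Fraction(1, 2)}                     # inner level: y ~ Q

def f(x, y):
    return (x * y) % 2

def inner(x):
    """First bracket: for a fixed x, the distribution of f(x, y) over y only."""
    return freeze(capture(Q, lambda y: f(x, y)))

# Second bracket: a distribution over distributions, random over x only.
outer = capture(P, inner)

always_zero = freeze({0: Fraction(1)})
fair_bit = freeze({0: Fraction(1, 2), 1: Fraction(1, 2)})
print(outer[always_zero], outer[fair_bit])  # 3/4 1/4
```

Even outcomes of `x` all lead to the same inner distribution, so their probabilities pool in the outer one.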
Resolving ambiguity
In general, if we define random symbols like $\Bx \sim P$ and $\By \sim Q$ in text and then use expressions like $\E\b{f(\Bx,\By)}$, there can be ambiguity over what hierarchy is implied. It could be that $\Bx$ is the outer symbol, that $\By$ is the outer symbol, or that they “share a level” and we should consider them as jointly distributed but independent. In other words, $f(\Bx,\By)$ could mean any of the following:
$$\Bx \sim P : \By \sim Q : f(\Bx,\By)$$
$$\By \sim Q : \Bx \sim P : f(\Bx,\By)$$
$$(\Bx,\By) \sim P \x Q : f(\Bx,\By)$$
and each of these possibilities gives a different meaning to $\E\b{f(\Bx,\By)}$.
Even if the relative order of the symbols is clear from the context, we might sometimes deal with expressions where one of the symbols “dropped out”. For example, suppose we’re dealing with the expectation
$$\E[\Bx + \By - \By]$$
At this stage, the expectation should clearly be over $\By$, keeping $\Bx$ fixed. But as the computation progresses, $\By$ disappears, and we’d like to write
$$\E[\Bx + \By - \By] = \E[\Bx]$$
but given that the rule is “always capture the innermost level”, the former is a random expression in $\Bx$, whereas (as written) the latter would be a fixed value!
To deal with such situations, we can specify explicitly which level should be captured. For example,
- in the first scenario, we could write
- $\E_\Bx[f(\Bx,\By)]$ to clarify that $\Bx$ was on its own at the innermost level (this yields a random expression in $\By$ only), or vice versa;
- $\E_{\Bx,\By}[f(\Bx,\By)]$ to clarify that $\Bx$ and $\By$ were on the same level (this yields a deterministic value);
- in the second scenario, we could write $\E[\Bx + \By - \By] = \E_\By[\Bx] = \Bx$ (in which case all three expressions are random in $\Bx$ only).
Of course, in any of these scenarios, one could just have chosen to write the declarations explicitly:
And most of the time this is the sane thing to do. But the reason why we need to introduce default rules for the order of capture is that in the next section, we’ll deal with higher-level randomness where the levels were not created by explicitly drawing a random variable from a distribution.
Note: When the hierarchy is clear, this notation should not be used to force the brackets to capture a level that is outermore than some variable in the expression. For example, if the hierarchy is $\Bx \sim P: \By \sim Q: f(\Bx,\By)$, then talking about $\E_\By[f(\Bx,\By)]$ makes no sense. So when the relative levels are clear, the semantics of this subscript is just to ask the brackets to pretend that the symbol was present inside, as in $\E_\By\b\Bx$.
Examples
- Previously, we defined the entropy as an operator $\H$ over distributions, which automatically defines $\H\b\Bx$ as $\H(P)$ for $P \ce \b\Bx$. We can equivalently define the latter as $\H[\Bx] = \E\b{\log\f{1}{[\Bx](\Bx)}}$. At first sight it looks like there might be a naming collision, but it actually resolves normally, in the following way:
- in $[\Bx]$, $\Bx$ is the only random symbol, so it gets captured as $_{\Bx \sim P}[\Bx]$, which evaluates to the deterministic value $P$;
- the whole expression is now $\E\b{\log \f{1}{P(\Bx)}}$;
- in $\b{\log \f{1}{P(\Bx)}}$, there is again only one random symbol $\Bx$, which gets captured as $_{\Bx \sim P}\b{\log \f{1}{P(\Bx)}}$, giving some distribution $Q$ (whose values are the negative log-probabilities);
- then we apply the expectation operator to $Q$, giving us the entropy of $P$.
- Similarly, we can define the variance simply as $\Var[\Bx] \ce \E\b{\p{\Bx-\E[\Bx]}^2}$ without worrying about the two occurrences of $\Bx$ interfering.
- And more generally, Wick products (e.g. $\inner{\Bx} \ce \Bx - \E[\Bx]$) and cumulants.
- Coming back to equality of random expressions, if $\Bx$ and $\Bx'$ are level-$l$ random expressions over the same $l$ levels, then both “$[\Bx]=\b{\Bx'}$” and “$\Pr\b{\Bx = \Bx'}=1$” are level-$(l-1)$ random events, since only the innermost level is dropped.
Manipulating the sample space
So far, the only way that we’ve manipulated the hierarchy of randomness was by using brackets to “pop” the innermost level off by turning it into a distribution. Now let’s look at operations which can transform the innermost level while keeping it random, or even split it into two levels of randomness.
Folding mixtures
Consider an expression like
$$\Bx \sim P : \By \sim \CQ\p\Bx : \By$$
where $\CQ\p\*$ is a parameterized family of distributions. If we consider the two levels of randomness as a whole, then the distribution of $\By$ could be called a mixture of the distributions $\CQ(x)$ weighted according to $P$, and we might say that we applied a stochastic map $\CQ$ to $\Bx$ in order to produce $\By$.
The distribution of the mixture in this case is given by $\E\b{\CQ\p\Bx}$ (where the expectation averages over the distributions as vectors). More generally, it can be useful to “fold” or “convolute” the two innermost levels of randomness into one. We’ll denote this by the notation $\mix{\*}$ (perhaps because we’re “rounding up” the levels of randomness together). Formally, for some random expression $\Be$, $\mix{\Be}$ is distributed according to $\E\b{\b\Be}$. For example, for the mixture described above, we have $\b{\mix{\By}} = \E\b{\b\By} = \E\b{\CQ(\Bx)}$, as desired.
Another example would be to consider the result of applying a random function
$$\Bx \sim P : \Bf \sim F : \Bf\p\Bx$$
In this case, the randomness of $\Bf$ doesn’t “depend on $\Bx$”, so we could have just seen $\Bx$ and $\Bf$ as jointly drawn from $P \x F$. But we still see the overall distribution as a mixture of the distributions $[\Bf(x)]$ for each possible value of $x$. And the folding operation gives us a random expression $\mix{\Bf\p\Bx}$ whose distribution is given by $\b{\mix{\Bf\p\Bx}} = {}_{\Bx \sim P,\Bf \sim F}\b{\Bf\p\Bx}$.
Note: I don’t think this $\mix\*$ notation has great use cases in practice, but I think it’ll be useful to have it when we talk about conditionals later on, because it’s more intuitive to think about folding the randomness than to have to parse complicated expressions like $\E\b{\b\Be}$.
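Folding can be sketched for finite distributions as a weighted average of the inner distributions, treating them as vectors. The function name `fold` and the coin example are assumptions of this sketch, not part of the note.

```python
from collections import defaultdict
from fractions import Fraction

def fold(P, family):
    """Mixture of the distributions family(x), weighted according to P."""
    M = defaultdict(Fraction)
    for x, p in P.items():
        for y, q in family(x).items():
            M[y] += p * q  # average the inner distributions as vectors
    return dict(M)

# Outer level: pick a coin. Inner level: flip the coin that was picked.
P = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
coins = {
    "fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "biased": {"H": Fraction(3, 4), "T": Fraction(1, 4)},
}

mixture = fold(P, lambda x: coins[x])
print(mixture["H"], mixture["T"])  # 5/8 3/8
```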
Zooming in on a part of the space
#to-write
- if $\BE$ is a random event, $\at{\BY}{\BE}$ denotes the “restriction” of $\BY$ to $\BE$: the random variable whose distribution is the one $\BY$ has after you condition on event $\BE$ being true
- notation mnemonic: it’s $\BY$ “at” $\BE$ in the sense of $\BY$ on the part of the space where $\BE$ holds
- its distribution is described by: $\b{\at{\BY}{\BE}}(y) \ce \f{\b{(\BY,\BE)}(y,\r{true})}{\b{\BE}(\r{true})} = \f{\Pr[\BY = y \and \BE]}{\Pr\b{\BE}}$
- can generalize to arbitrary factors (instead of just random events) (e.g. in the context of undirected graphical models), where $\BE$ takes any value between $0$ and $1$
- then $\b{\at{\BY}{\BE}}(y) \ce \f{\E[\1\p{\BY=y}\BE]}{\E[\BE]}$
- rules:
- only applies to the innermost level (clear if you look at the definition of $\bat\By\BE$ above)
- since it’s more of a modification of the sample space than an operation over a given random expression, should always apply it right before capturing the distribution
- but in text, might informally talk about $\at{\BY}{\BE}$ and $\at{\BZ}{\BE}$ as jointly distributed if $\BY$ and $\BZ$ were jointly distributed
- but the right way to see them is as the two components of $\at{\p{\BY,\BZ}}{\BE}$
- obvious notes:
- in $\at{\BY}{\BE}$, $\BY$ and $\BE$ are not taken for their values; their randomness is being caught!
- $\at{\BY}{\BE}$ depends on the overall (joint) distribution of $\BY$ and $\BE$
- so it might have been maximally honest to denote it as something like $\r{zoom}\b{\BY,\BE}$?
- when we write an expectation/probability/etc conditioned on an event $\BE$, we really mean the expectation/probability/etc of the corresponding restriction
- e.g. $\E\bco{\BY}{\BX=x}$ really means $\E\b{\at{\BY}{\BX=x}}$
- in fact it might make sense to use $\co\BY\BE$ to mean $\at\BY\BE$ more generally; it collides with our notation for conditionals, but it seems mostly fine since we’ll probably rarely want to condition on either value of a boolean expression, and if we want to do this we could just transform it into a $\zo$ value using the operator $\1\p\*$
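A minimal sketch of the restriction for finite distributions, assuming $\BY$ and $\BE$ are both given as functions of a common underlying draw `z`. The name `restrict` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def restrict(R, Y, E):
    """Distribution of Y(z) for z ~ R, after conditioning on the event E(z)."""
    p_event = sum(p for z, p in R.items() if E(z))
    if p_event == 0:
        raise ValueError("cannot condition on an event of probability 0")
    out = defaultdict(Fraction)
    for z, p in R.items():
        if E(z):
            out[Y(z)] += p / p_event  # Pr[Y = y and E] / Pr[E]
    return dict(out)

die = {k: Fraction(1, 6) for k in range(1, 7)}

# Y is "the roll is at least 4", restricted to the event "the roll is even".
cond = restrict(die, lambda z: z >= 4, lambda z: z % 2 == 0)
print(cond[True], cond[False])  # 2/3 1/3
```

Both `Y` and `E` are passed as functions rather than values, which mirrors the remark that their randomness is being caught rather than their values being used.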
Slicing up the space
#to-write
Zooming in on one value
- suppose $\Bx,\By$ jointly distributed and want to study conditionals like $\at\By{\Bx = x}$ for fixed values $x$
- let $\By|^\Bx_x$ be “shorthand” for $\at{\By}{\Bx = x}$
- tbh it doesn’t really need to be an arrow; could just be $\left.\By\right\lvert^\Bx_x$
- or $\By|^\Bx(x)$ to emphasize that it’s a “function”?
- i’m a bit uneasy with the idea of having “functions” returning random expressions; the types seem messed up
- that allows you to replace $x$ by things that are in the same level of randomness without it being interpreted as a single event
- which would make things like $\left.\BY_i\strut\right|^{\BX_i}_{\~f_i\p{\BX_{i-1}}}$ valid
- find a couple of VI apps?
- though i’d expect that ML-style notation will be king here
Conditionals
- let $\co{\By}{\Bx}$ be a shortcut for $\left.\By\strut\right|^\Bx_\Bx$
- both conditioned on $\Bx$ and evaluated at $\Bx$
- whereas the function version cuts the randomness in half and keeps only one part, this keeps both parts?
- in particular $\b{\bco\By\Bx}(Q) = \Pr_{\Bx' \sim \b{\Bx}}\b{\bco{\By}{\Bx=\Bx'} = Q}$
- it’s a level-$2$ random expression
- in particular, $\E\bco{\By}{\Bx}$ is a random variable depending on $\Bx$, giving the value $\E\bco{\By}{\Bx = x}$ for each value $x$ of $\Bx$
- outer level is jointly distributed with $\Bx$
- note that we really want this: it’s okay to talk about things like $\Bx\E\bco{\By}{\Bx}$ and it would be awkward to try and pass $\Bx$ through the $\E$?
- and it’s also eminently okay given where it came from: $\left.\By\strut\right\downarrow^\Bx_x$ is just a totally fine expression in $x$ that we should be allowed to want to randomize
- sort of an inverse of mixture:
- we have $\b{\mix{\co\By\Bx}} = \b\By$ (but not uniquely)
- and the joint $\p{\Bx,\pco\By\Bx}$ (or equivalently, $\co{\p{\Bx,\By}}{\Bx}$) is the unique random expression $\Be$ such that $\b{\mix{\Be}} = \b{\p{\Bx,\By}}$ and the first component is made entirely of point masses (i.e. $\Pr\b{\exists x:\b{\Be_1} = 1_x} = 1$)
- more precisely, $\b{\b\Be}$ is unique
- $\E\b{\E\bco\By\Bx} = \E\b\By$
- but this is a special property of $\E$ due to its linearity (?); this is not true for all operators
- e.g. $\Var[\BY] = \Var\b{\E\bco\BY\BX} + \E\b{\Var\bco\BY\BX}$
- $\cond{\Bx}{\Bx}$ is not quite the same thing as $\Bx$: it’s $\Bx$ but where we replaced each outcome $x$ by a random variable that has value $x$ with probability $1$
- corresponding (random) distribution: $\bco\Bx\Bx = 1_\Bx$
- same thing for $\cond{f(\Bx)}{\Bx}$ more generally: $\bco{f(\Bx)}\Bx = 1_{f\p\Bx}$
- so for example, $\E\bco{f(\Bx)}{\Bx}=f(\Bx)$
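The identities $\E\b{\E\bco\By\Bx} = \E\b\By$ and $\Var[\BY] = \Var\b{\E\bco\BY\BX} + \E\b{\Var\bco\BY\BX}$ can be checked on a small finite joint distribution. This is a sketch with made-up numbers; `capture`, `conditional`, `E` and `Var` are illustrative names.

```python
from collections import defaultdict
from fractions import Fraction

def capture(P, f):
    """Distribution of f(x) when x is drawn from P."""
    out = defaultdict(Fraction)
    for x, p in P.items():
        out[f(x)] += p
    return dict(out)

def E(Q):
    return sum(q * v for v, q in Q.items())

def Var(Q):
    m = E(Q)
    return sum(q * (v - m) ** 2 for v, q in Q.items())

def conditional(R, x):
    """Distribution of y given that the first component equals x."""
    px = sum(r for (a, _), r in R.items() if a == x)
    return {b: r / px for (a, b), r in R.items() if a == x}

R = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 2)}
Px = capture(R, lambda z: z[0])
Py = capture(R, lambda z: z[1])

# E[y|x] and Var[y|x] are random over x only, so we capture them with x ~ Px.
cond_mean = capture(Px, lambda x: E(conditional(R, x)))
cond_var = capture(Px, lambda x: Var(conditional(R, x)))

print(E(cond_mean) == E(Py))                      # True
print(Var(Py) == Var(cond_mean) + E(cond_var))    # True
```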
Using conditionals
- what people traditionally call the “conditional entropy” isn’t $\H\bco{\BY}{\BX}$ (which is a random variable depending on $\BX$) but instead $\E\b{\H\bco{\BY}{\BX}}$
- so chain rule of entropy is more properly written as $\H\b{\BX,\BY} = \H\b{\BX} + \E\b{\H\bco{\BY}{\BX}}$
- and i think it makes sense for this random variable $\H\bco{\BY}{\BX}$ to be the conditional entropy: it is the entropy of the conditional over $\BX$, and it itself also depends on $\BX$: one way to describe it is that it takes the value $\H\bat{\BY}{\BX = x}$ when $\BX$ takes the value $x$
- similarly, the “consistent conditioning” property of Ratio divergences is more properly written as $D^f\bff{\BX,\BY'}{\BX,\BY} = \E\b{D^f\bffco{\BY'}{\BX}{\BY}{\BX}}$ (where $D^f\bffco{\BY'}{\BX}{\BY}{\BX}$ is a random variable depending on $\BX$, and the expectation is over $\BX$)
- more precisely, $D^f\pff{\b{\BX,\BY'}}{\b{\BX,\BY}} = \E\b{D^f\pff{\bco{\BY'}{\BX}}{\bco{\BY}{\BX}}}$
- and maybe it even makes it clear that there’s no super natural answer to the question of “what variable should you be averaging over when you try define the conditional divergence $D^f\pff{\b{\BX',\BY'}}{\b{\BX,\BY}}$”: $\bco{\BY'}{\BX'}$ and $\bco{\BY}{\BX}$ are both random variables, and they’re (a priori) not jointly distributed
- $\Cov[\BX, \BY] = \Cov\b{\BX, \E\bco{\BY}{\BX}}$
- the expression makes sense because $\E\bco{\BY}{\BX}$ is indeed jointly distributed with $\BX$
- clear that $\E\b\Bx\E\b\By = \E\b\Bx\E\b{\E\bco\By\Bx}$, so suffices to show $\E\b{\Bx\By} = \E\b{\Bx\E\bco\By\Bx}$
- make sense of $\cond{\BZ}{\BX} = \at{\BZ}{\BY = \cond{\BY}{\BX}}$ (provided that we have the Markov property $\BX \leftarrow \BY \rightarrow \BZ$)
- i guess first of all you’d have to write it as $\co\BZ\BX = \BZ|^\BY_{\co\BY\BX}$?
- but then the LHS is level-2 ($\BZ$’s distribution is random over $\BX$) while the RHS seems level-3 ($\BZ$’s distribution is random over $\BY$, whose distribution is random over $\BX$)
- so maybe the claim we’re trying to make is that $\bco\BZ\BX = \E\b{\b{\BZ|^\BY_{\co\BY\BX}}}$?
- aka $\co\BZ\BX = \mix{\BZ|^\BY_{\co\BY\BX}}$
- where the mixing folds the randomness of $\BY$ into the randomness of $\BZ$!
- should we talk about $\D\b{\at{(\cond{\BX_S}{\BX_{\comp{S}}})}{\BE}}$ or $\D\b{\cond{\at{\BX_S}{\BE}}{\at{\BX_{\comp{S}}}{\BE}}}$?
- the former only conditions over the $\BX_S$ part (because you can only do zooms for the innermost randomness)
- the latter doesn’t make sense unless you consider $\at{\BX_S}{\BE}$ and $\at{\BX_{\comp{S}}}{\BE}$ to be joint
- really what you should do is to let $\BX' \ce \at\BX{\BE}$ then consider $\D\bco{\BX'_S}{\BX'_{\comp{S}}}$
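The chain rule in the form $\H\b{\BX,\BY} = \H\b{\BX} + \E\b{\H\bco{\BY}{\BX}}$ can be checked numerically on a small joint distribution. This sketch assumes base-2 logarithms and made-up numbers; `H_cond` plays the role of the random variable $\H\bco{\BY}{\BX}$, stored as its value for each outcome `x`.

```python
import math

def entropy(P):
    """Entropy in bits of a finite distribution stored as a dict."""
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

def conditional(R, x):
    """Distribution of y given that the first component equals x."""
    px = sum(r for (a, _), r in R.items() if a == x)
    return {b: r / px for (a, b), r in R.items() if a == x}

R = {(0, 0): 0.25, (0, 1): 0.25, (1, 1): 0.5}   # joint distribution of (X, Y)

Px = {}
for (x, _), r in R.items():
    Px[x] = Px.get(x, 0.0) + r

# H[Y|X] takes the value H[Y at X = x] when X takes the value x.
H_cond = {x: entropy(conditional(R, x)) for x in Px}

# What is traditionally called "conditional entropy" is its expectation over X.
expected_H_cond = sum(Px[x] * H_cond[x] for x in Px)

print(entropy(R), entropy(Px) + expected_H_cond)
```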
#to-think related jumble of issues:
- figure out the best possible way to write the math of Variational inference then give up on using this notation for ML once and for all
- is $\ff{\Bx'}{\Bx} \ce \f{\b{\Bx'}}{\b{\Bx}}\p{\Bx'}$ meaningful?
- how should you actually notate cross-entropy? (or entropy, for that matter)
- actually cross-entropy is very specific to $f(t) = t \log t$: if we let $g(t) \ce \f{f(t)}{t}$ it’s only because $g\p{\f Q P} - g(Q)$ has the nice form of $g\p{\f1P}$ that things work out!
- yup, and for general ratio divergences we don’t have $\H_Q\p{P} \ge \H\p{Q}$
- … i guess this is in the same way that entropy is nice because it’s also the divergence $\E\b{\D\pff{1_\Bx}{\b\Bx}}$
- and that view is much more amenable to cross-entropy, which is $\E\b{\D\pff{1_\By}{\b{\Bx}}}$
- suggests you might want to denote cross-entropy as something like
- $\H\pat{P}{Q}$? (though this would be )
- (maybe you should give up on the double bar having a consistent meaning?)
[^1]: The technical term is “almost surely”, to account for edge cases with continuous random variables where they technically could differ, but only with probability $0$.