The update rule
The inferential update rule (aka Bayes’ rule) describes how the probability of $A$ changes if you know that $B$ happened:
\[\Pr[A \mid B] = \Pr[A] \frac{\Pr[B \mid A]}{\Pr[B]}.\]
Proof. $\Pr[A \and B]$ can be written as either $\Pr[B]\cdot \Pr[A \mid B]$ or $\Pr[A] \cdot \Pr[B \mid A]$. Equating the two expressions and dividing by $\Pr[B]$ gives the rule.
Sign of the correlation
Rearranged, Bayes’ rule shows an intuitive (?) yet powerful fact:
\[\frac{\Pr[A \mid B]}{\Pr[A]} = \frac{\Pr[B \mid A]}{\Pr[B]}\ \p{= \frac{\Pr[A \and B]}{\Pr[A]\Pr[B]}}.\]
In particular, $B$ makes $A$ more likely if and only if $A$ makes $B$ more likely. And even more particularly, if $A \implies B$ then $A$ must make $B$ more likely (unless $\Pr[B]=1$), and thus conversely $B$ makes $A$ more likely.
Put another way, there is a meaningful notion of “positively correlated” vs “negatively correlated” events, whose sign agrees with that of the Pearson correlation coefficient of the corresponding $\zo$ variables (but there is no direct quantitative relationship between correlation and either this probability ratio or the likelihood ratio $\f{\Pr\bco{A}{B}}{\Pr\bco{A}{\neg B}}$).
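As a sanity check, here is a minimal sketch that verifies the symmetry of the two ratios and the sign of the covariance on one small joint distribution. The distribution `joint` and all variable names are made up for illustration. Exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint distribution over (A, B); the numbers are made up.
joint = {
    (True, True): F(3, 10),
    (True, False): F(1, 10),
    (False, True): F(2, 10),
    (False, False): F(4, 10),
}

pr_a = sum(p for (a, _), p in joint.items() if a)  # Pr[A]
pr_b = sum(p for (_, b), p in joint.items() if b)  # Pr[B]
pr_ab = joint[(True, True)]                        # Pr[A and B]

ratio_a_given_b = (pr_ab / pr_b) / pr_a  # Pr[A | B] / Pr[A]
ratio_b_given_a = (pr_ab / pr_a) / pr_b  # Pr[B | A] / Pr[B]

# Covariance of the 0/1 indicator variables of A and B.
covariance = pr_ab - pr_a * pr_b

print(ratio_a_given_b, ratio_b_given_a, covariance)
```

Both ratios equal $\frac{\Pr[A \and B]}{\Pr[A]\Pr[B]}$, and that quantity exceeds $1$ exactly when the covariance $\Pr[A \and B] - \Pr[A]\Pr[B]$ is positive, which is why the signs agree.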
Concretely, this can be useful if you want to show that some probability $\Pr[A \mid B]$ is small but you can only deal with $\Pr[A]$, and you want to argue that “$B$ shouldn’t make $A$ more likely”. This is the case as long as you can prove that $A$ doesn’t make $B$ more likely, i.e. $\Pr[B \mid A] \leq \Pr[B]$. This trick is used in Håstad’s proof of the switching lemma.
Updating the odds of a hypothesis
This section is adapted from two videos by 3blue1brown.
Bayes’ rule tells us how much to update the probability of a hypothesis $H$ (e.g. “do I have COVID”) based on evidence $E$ (e.g. “my PCR test came back positive”). Concretely, the information you have is typically:
- $\Pr[H]$, the prior probability of the hypothesis $H$ (the prevalence of COVID);
- $\Pr[E \mid H]$, the likelihood of seeing the evidence if $H$ is true (the sensitivity of the test);
- $\Pr[E \mid \neg H]$, the likelihood of seeing the evidence if $H$ is false (the false positive rate).
To compute the posterior probability, you need the gruesome computation
\[\Pr[H \mid E] = \Pr[H] \frac{\Pr[E \mid H]}{\Pr[E]} = \Pr[H] \frac{\Pr[E \mid H]}{\Pr[H] \Pr[E \mid H] + \Pr[\neg H] \Pr[E \mid \neg H]}.\]
Indeed, usually the only way to compute something like “what is the overall chance I get a positive test result” is to split into the case where I do have COVID and the case where I don’t.
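To make the computation concrete, here is a small sketch of the formula above. The function name `posterior` and the numbers (1% prevalence, 90% sensitivity, 5% false positive rate) are made up for illustration and are not real test statistics.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return Pr[H | E] given Pr[H], Pr[E | H] and Pr[E | not H]."""
    # Pr[E], split into the case where H holds and the case where it doesn't.
    pr_evidence = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / pr_evidence


# Made-up numbers: 1% prevalence, 90% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(p)
```

With these numbers the posterior is $2/13 \approx 15\%$: most positive results still come from the much larger group of people who don't have the disease.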
But if you’re working with odds, it’s just a multiplication! You go from
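\[\Odds[H] = \frac{\Pr[H]}{\Pr[\neg H]}\]
to
\[\begin{aligned}
\Odds[H \mid E] &= \frac{\Pr[H \mid E]}{\Pr[\neg H \mid E]}\\
&= \frac{\Pr[H \and E]}{\Pr[\neg H \and E]}\\
&= \frac{\Pr[H] \Pr[E \mid H]}{\Pr[\neg H] \Pr[E \mid \neg H]}\\
&= \Odds[H] \frac{\Pr[E \mid H]}{\Pr[E \mid \neg H]},
\end{aligned}\]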
where the factor $\frac{\Pr[E \mid H]}{\Pr[E \mid \neg H]}$ is known as the likelihood ratio.
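Here is a minimal sketch of the odds-form update, checked against the direct probability formula. The helper names `prob_to_odds` and `odds_to_prob` and the numbers (1% prior, 90% sensitivity, 5% false positive rate) are assumptions chosen for illustration.

```python
def prob_to_odds(p):
    """Convert a probability p < 1 into odds p : (1 - p)."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds back into a probability."""
    return odds / (1 + odds)


# Made-up numbers, for illustration only.
prior, sensitivity, false_positive_rate = 0.01, 0.90, 0.05

prior_odds = prob_to_odds(prior)                      # 1 : 99
likelihood_ratio = sensitivity / false_positive_rate  # 18
posterior_odds = prior_odds * likelihood_ratio        # 18 : 99 = 2 : 11
posterior_prob = odds_to_prob(posterior_odds)

# The same posterior, computed directly in terms of probabilities.
direct = prior * sensitivity / (
    prior * sensitivity + (1 - prior) * false_positive_rate
)
print(posterior_prob, direct)
```

The whole update is the single multiplication by `likelihood_ratio`; the conversions to and from probabilities are only needed at the very start and the very end.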
Multiple possible outcomes
More generally, suppose that instead of a single hypothesis $H$, we have a random variable $\BX$ with some prior distribution $P$ before we see the evidence, and the probability of seeing evidence $E$ depends on $\BX$. What is the posterior distribution $Q$ that we can deduce for $\BX$ after seeing the evidence? We have
\[\begin{aligned}
Q(x) &\ce \Pr_{\BX \sim P}\bco{\BX=x}{E}\\
&= P(x) \f{\Pr\bco{E}{\BX = x}}{\Pr_{\BX \sim P}[E]}\\
&\propto P(x) \Pr\bco{E}{\BX = x}.
\end{aligned}\]
That is, the relative odds of different values of $\BX$ are multiplied by the factors $\Pr\bco{E}{\BX=x}$. Because of this, it can be convenient to think about distributions in terms of logits (log odds) and only generate probabilities from them when needed, using the softmax function. Indeed, updating on evidence means adding the log likelihood to each logit.
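The logit view can be sketched as follows. The prior, the likelihoods and the names are made up for illustration, and the sketch assumes every prior probability and likelihood is strictly positive (otherwise `math.log` fails). Logits here are log-probabilities up to an additive constant, which softmax ignores.

```python
import math


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up prior P over three values of X, and likelihoods Pr[E | X = x].
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.8]

prior_logits = [math.log(p) for p in prior]
# Updating on the evidence: add the log likelihood to each logit.
posterior_logits = [v + math.log(lk) for v, lk in zip(prior_logits, likelihood)]
posterior = softmax(posterior_logits)

# The same posterior, computed directly as P(x) * Pr[E | X = x], normalized.
unnormalized = [p * lk for p, lk in zip(prior, likelihood)]
direct = [u / sum(unnormalized) for u in unnormalized]
print(posterior, direct)
```

Several independent pieces of evidence can be absorbed by adding their log likelihoods one after another, and the normalization only has to happen once, at the end.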
What we mean by priors and posteriors
These factors $\Pr\bco{E}{\BX=x}$ can have arbitrary ratios between different values of $x$, so $Q$ can be any rescaling of $P$ (while keeping the sum at $1$). In other words, the only constraint that $P$ and $Q$ need to satisfy in order to be a realistic prior-posterior pair is
\[P(x)=0 \implies Q(x) = 0,\]
i.e. the support of $Q$ is a subset of the support of $P$.
So more broadly, it makes sense to talk of “priors” and “posteriors” when we have two distributions $P$ and $Q$ that satisfy the above constraint and for which
- $P$ is thought of as some kind of more uniform baseline,
- and $Q$ is thought of as taking more specific values (and is often the more interesting object of study).
In this context, we’ll call the ratio $\f{Q(x)}{P(x)}$ the density of $Q$ relative to $P$.
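As an illustration, the sketch below computes the relative density for a made-up pair $(P, Q)$, checks the constraint $P(x)=0 \implies Q(x)=0$, and then builds one possible set of likelihoods that turns $P$ into $Q$. The function name `relative_density` and the dictionaries `P` and `Q` are hypothetical.

```python
def relative_density(q, p):
    """Return {x: Q(x) / P(x)} for every x with P(x) > 0.

    Raises ValueError if Q puts mass where P has none.
    """
    density = {}
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        if px == 0:
            if qx != 0:
                raise ValueError(f"Q({x!r}) > 0 but P({x!r}) = 0")
            continue
        density[x] = qx / px
    return density


# Made-up prior and posterior; Q rules out "a" and rescales the rest.
P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.0, "b": 0.6, "c": 0.4}

density = relative_density(Q, P)

# One choice of likelihoods Pr[E | X = x] realizing this pair: the density,
# scaled down so that every value is a valid probability (at most 1).
top = max(density.values())
likelihood = {x: d / top for x, d in density.items()}

# Updating P with these likelihoods gives back Q.
unnormalized = {x: P[x] * likelihood[x] for x in P}
total = sum(unnormalized.values())
recovered = {x: u / total for x, u in unnormalized.items()}
print(density, recovered)
```

The density equals $\Pr\bco{E}{\BX=x}$ divided by the overall probability of $E$, so any likelihoods proportional to it would work equally well; dividing by the maximum is just one convenient choice.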