Mutual information
The mutual information $I(X;Y)$ measures the dependence between two random variables $X$ and $Y$:
$$I(X;Y) := \sum_{x,y} p_{X,Y}(x,y)\,\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}.$$
Equivalently, let $p_X \otimes p_Y$ denote the product of the marginals; then $I(X;Y)$ is the information divergence between the joint distribution and that product:
$$I(X;Y) = D(p_{X,Y} \,\|\, p_X \otimes p_Y).$$
Since the divergence is nonnegative, $I(X;Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent.
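A minimal numeric sketch of this definition (the `mutual_information` helper and the choice of numpy and bits are mine):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a finite joint distribution.

    `joint` is a 2-D array with joint[x, y] = p(x, y), summing to 1.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y
    mask = joint > 0                        # 0 log 0 = 0 by convention
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask])))

# Two perfectly correlated uniform bits: I(X;Y) = 1 bit.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # -> 1.0
# Two independent uniform bits: I(X;Y) = 0.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # -> 0.0
```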
Properties
Another link to information divergence
We also have
$$I(X;Y) = \mathbb{E}_{x \sim p_X}\big[D(p_{Y \mid X=x} \,\|\, p_Y)\big].$$
In other words, the mutual information measures how much (on average) the distribution of $Y$ changes when we learn the value of $X$.
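Indeed, expanding the expectation recovers the original definition:
$$\mathbb{E}_{x \sim p_X}\big[D(p_{Y \mid X=x} \,\|\, p_Y)\big] = \sum_x p_X(x) \sum_y p_{Y \mid X}(y \mid x)\,\log\frac{p_{Y \mid X}(y \mid x)}{p_Y(y)} = \sum_{x,y} p_{X,Y}(x,y)\,\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)} = I(X;Y).$$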
Random bits
If $X$ and $Y$ are independent, then $I(X;Y) = 0$.

If $X$ and $Y$ are random bits with the same bias $p < 1/2$, their mutual information can be
- up to $H(X)$ (the full entropy of either bit) if they’re positively correlated;
- only up to a strictly smaller amount if they’re negatively correlated (note that their correlation cannot be lower than $-\frac{p}{1-p}$).
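A concrete check, with the helper `mi_bits` and the bias $p = 0.25$ chosen just for illustration:

```python
import numpy as np

def mi_bits(joint):
    # I(X;Y) in bits for a 2-D joint probability table (repeated here so the
    # snippet runs on its own).
    joint = np.asarray(joint, dtype=float)
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    m = joint > 0
    return float(np.sum(joint[m] * np.log2(joint[m] / (px * py)[m])))

p = 0.25  # illustrative bias; both bits are Bernoulli(p)

# Maximally positive coupling: Y = X, correlation +1, I(X;Y) = H(X).
pos = [[1 - p, 0.0], [0.0, p]]
# Maximally negative coupling: P[X = Y = 1] = 0, correlation -p/(1-p) = -1/3 here.
neg = [[1 - 2 * p, p], [p, 0.0]]

print(mi_bits(pos))  # ~0.811 bits (= binary entropy of 0.25)
print(mi_bits(neg))  # ~0.123 bits, much smaller
```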
Additivity
Mutual information itself is not generally sub- or superadditive. For example, if $X_1 = X_2 = Y$ is a uniformly random bit, then $I(X_1, X_2; Y) = 1$ bit while $I(X_1; Y) + I(X_2; Y) = 2$ bits, but if $X_1$ and $X_2$ are independent uniform bits and $Y = X_1 \oplus X_2$, then $I(X_1, X_2; Y) = 1$ bit while $I(X_1; Y) + I(X_2; Y) = 0$.
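Numeric check of both counterexamples (the dict-based `mi_bits` helper is a throwaway, repeated so the snippet runs on its own):

```python
import numpy as np
from itertools import product

def mi_bits(pxy):
    # I(X;Y) in bits from a dict {(x, y): prob}.
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * np.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

# Not superadditive: X1 = X2 = Y, a single uniform bit copied.
copy_joint = {((b, b), b): 0.5 for b in (0, 1)}      # law of ((X1, X2), Y)
copy_pair = {(b, b): 0.5 for b in (0, 1)}            # law of (X1, Y) = law of (X2, Y)
print(mi_bits(copy_joint), 2 * mi_bits(copy_pair))   # 1.0 < 2.0

# Not subadditive: X1, X2 independent uniform bits, Y = X1 XOR X2.
xor_joint = {((x1, x2), x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
xor_pair = {(x1, y): 0.25 for x1 in (0, 1) for y in (0, 1)}  # (X1, Y) are independent
print(mi_bits(xor_joint), 2 * mi_bits(xor_pair))     # 1.0 > 0.0
```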
On the other hand, whenever the $X_i$ are independent, the mutual information is superadditive in them:
$$I(X_1, \dots, X_n; Y) \ge \sum_{i=1}^n I(X_i; Y).$$
Indeed, by the chain rule (below), $I(X_1, \dots, X_n; Y) = \sum_i I(X_i; Y \mid X_{<i})$ where $X_{<i} := (X_1, \dots, X_{i-1})$, and each term satisfies
$$I(X_i; Y \mid X_{<i}) = H(X_i \mid X_{<i}) - H(X_i \mid X_{<i}, Y) = H(X_i) - H(X_i \mid X_{<i}, Y) \ge H(X_i) - H(X_i \mid Y) = I(X_i; Y),$$
using the independence of the $X_i$ for the second equality and the fact that conditioning reduces entropy for the inequality.
This is an example of a reverse entropic inequality.
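And a quick check of a case where the inequality is strict, e.g. independent uniform bits with $Y = X_1 \wedge X_2$ (my example, chosen arbitrarily):

```python
import numpy as np
from itertools import product

def mi_bits(pxy):
    # I(X;Y) in bits from a dict {(x, y): prob} (repeated for self-containment).
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * np.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

# Independent uniform bits X1, X2 and Y = X1 AND X2.
joint = {((x1, x2), x1 & x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
pair1 = {}
for (x1, x2), y in joint:  # marginalize out X2 to get the law of (X1, Y)
    pair1[(x1, y)] = pair1.get((x1, y), 0) + 0.25
# By symmetry I(X1;Y) = I(X2;Y).
print(mi_bits(joint), 2 * mi_bits(pair1))  # ~0.811 >= ~0.623
```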
Lower and upper bounds
#to-write
- two main bounds from murphy--probabilistic-ml-advanced.pdf
- and entropy as a special case?
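Not sure yet which two bounds are meant, but for the “entropy as a special case” point the relevant identity is presumably $I(X;X) = H(X)$, which follows directly from the definition:
$$I(X;X) = \sum_x p_X(x)\,\log\frac{p_X(x)}{p_X(x)^2} = -\sum_x p_X(x)\,\log p_X(x) = H(X).$$
(Relatedly, $I(X;Y) \le \min\{H(X), H(Y)\}$ for discrete variables.)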
Conditional mutual information
Similarly, the conditional mutual information measures the dependence between $X$ and $Y$ given $Z$:
$$I(X;Y \mid Z) := \mathbb{E}_{z \sim p_Z}\big[D(p_{X,Y \mid Z=z} \,\|\, p_{X \mid Z=z} \otimes p_{Y \mid Z=z})\big].$$
#to-write
- also define this one using distributions instead of random variables?
- or rather, just define it based on the non-conditional version
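A small sketch of the “define it based on the non-conditional version” route (helper names are mine):

```python
import numpy as np

def mi_bits(joint_xy):
    # I(X;Y) in bits for a 2-D joint probability table (numpy array).
    px = joint_xy.sum(1, keepdims=True)
    py = joint_xy.sum(0, keepdims=True)
    m = joint_xy > 0
    return float(np.sum(joint_xy[m] * np.log2(joint_xy[m] / (px * py)[m])))

def conditional_mi_bits(joint_xyz):
    # I(X;Y|Z) = E_z[ I(X;Y | Z=z) ], built on the unconditional definition.
    # `joint_xyz[x, y, z]` = p(x, y, z).
    pz = joint_xyz.sum(axis=(0, 1))
    return sum(pz[z] * mi_bits(joint_xyz[:, :, z] / pz[z])
               for z in range(joint_xyz.shape[2]) if pz[z] > 0)

# Sanity check: X, Y independent uniform bits, Z = X XOR Y.
# Then I(X;Y) = 0 but I(X;Y|Z) = 1 bit.
joint = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        joint[x, y, x ^ y] = 0.25
print(conditional_mi_bits(joint))  # -> 1.0
print(mi_bits(joint.sum(axis=2)))  # -> 0.0
```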
Chain rule
By telescoping, we get the chain rule
$$I(X_1, \dots, X_n; Y) = \sum_{i=1}^n I(X_i; Y \mid X_{<i}).$$
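Spelled out, using the entropy chain rule together with $I(A;B \mid C) = H(A \mid C) - H(A \mid B, C)$:
$$I(X_1, \dots, X_n; Y) = H(X_1, \dots, X_n) - H(X_1, \dots, X_n \mid Y) = \sum_{i=1}^n \big[ H(X_i \mid X_{<i}) - H(X_i \mid X_{<i}, Y) \big] = \sum_{i=1}^n I(X_i; Y \mid X_{<i}).$$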
Interaction information
#to-write
- argue that the base case (1 variable) is under-defined: it could be taken as either $H(X)$ or $-H(X)$, given that the entropy of the reference distribution gets cancelled out from then on?
- in this choice of sign, the interaction information $I(X;Y;Z)$ (or denoted with …?) makes sense (?) because
    - it’s positive when $X$, $Y$, $Z$ have a “shared effect”, e.g. $Z = X \oplus Y$ for independent bits $X$ and $Y$, which leads us to be able to predict them better together than expected
        - (but otoh if you think about it like shared entropy, then it’s clearly negative: everything was looking completely independent so far and now you have a correlation reducing your entropy)
    - it’s negative when $X$ and $Y$ have a “shared cause”, e.g. $X = Z$ and $Y = Z$, which leads us to be somewhat disappointed in our ability to predict them together given that the pairwise dependences were so strong!
        - (and if you think about it like shared entropy, then it’s positive (??))
    - it’s positive when …
- I think this choice of sign makes more sense because it’s about revealing information about things that looked more random, it’s about being able to predict better than expected!
- information != entropy
- if you want …, that would force the prior to be … itself?
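A numeric check of the two examples above, using the sign convention $I(X;Y;Z) := I(X;Y \mid Z) - I(X;Y)$ (which seems to be the one favoured here; helper names are mine):

```python
import numpy as np

def mi_bits(joint_xy):
    # I(X;Y) in bits for a 2-D joint table (numpy array; repeated for self-containment).
    px = joint_xy.sum(1, keepdims=True)
    py = joint_xy.sum(0, keepdims=True)
    m = joint_xy > 0
    return float(np.sum(joint_xy[m] * np.log2(joint_xy[m] / (px * py)[m])))

def cond_mi_bits(joint_xyz):
    # I(X;Y|Z) = E_z[ I(X;Y | Z=z) ].
    pz = joint_xyz.sum(axis=(0, 1))
    return sum(pz[z] * mi_bits(joint_xyz[:, :, z] / pz[z])
               for z in range(joint_xyz.shape[2]) if pz[z] > 0)

def interaction_bits(joint_xyz):
    # Interaction information with the sign convention I(X;Y;Z) = I(X;Y|Z) - I(X;Y).
    return cond_mi_bits(joint_xyz) - mi_bits(joint_xyz.sum(axis=2))

# “Shared effect”: X, Y independent uniform bits, Z = X XOR Y  ->  positive.
xor = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        xor[x, y, x ^ y] = 0.25
print(interaction_bits(xor))   # -> +1.0

# “Shared cause”: Z uniform bit, X = Z and Y = Z  ->  negative.
copy = np.zeros((2, 2, 2))
for z in (0, 1):
    copy[z, z, z] = 0.5
print(interaction_bits(copy))  # -> -1.0
```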