The mutual information measures the dependence between $X, Y$ by comparing their distribution to what it would have been if $X$ and $Y$ had been independent, using information divergence. Formally, the mutual information is the divergence between the joint distribution $P_{X,Y}$ and the product of the marginals $P_X$ and $P_Y$:

$$
\begin{aligned}
I[X;Y] &:= D\bigl(P_{X,Y} \,\big\|\, P_X \otimes P_Y\bigr) \\
&= \mathbb{E}_{(x,y) \sim (X,Y)}\left[\log \frac{\Pr[X=x \wedge Y=y]}{\Pr[X=x]\,\Pr[Y=y]}\right] \\
&= H[X] + H[Y] - H[X,Y] && (1) \\
&= H[Y] - H[Y \mid X]. && (2)
\end{aligned}
$$

Equivalently, let $X'$ be distributed identically to $X$ and $Y'$ be distributed identically to $Y$, but with $X', Y'$ independent. Then

$$I[X;Y] = D\bigl[(X,Y) \,\big\|\, (X',Y')\bigr].$$

Since the divergence is nonnegative, (1) shows the subadditivity of entropy and (2) shows that conditioning only decreases entropy.
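
As a concrete sanity check, here is a minimal numerical sketch (in Python with numpy; the joint pmf is an arbitrary made-up example) that computes $I[X;Y]$ directly as the divergence between the joint and the product of the marginals, and verifies identities (1) and (2):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a pmf given as a flat array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I[X;Y] = D(P_{X,Y} || P_X x P_Y), with joint[x, y] = Pr[X=x, Y=y]."""
    px = joint.sum(axis=1)              # marginal of X
    py = joint.sum(axis=0)              # marginal of Y
    prod = np.outer(px, py)             # product distribution P_X x P_Y
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / prod[mask]))

# Arbitrary example joint pmf (rows = values of X, columns = values of Y).
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
px, py = joint.sum(axis=1), joint.sum(axis=0)

i_xy = mutual_information(joint)
# Identity (1): I[X;Y] = H[X] + H[Y] - H[X,Y]
assert np.isclose(i_xy, entropy(px) + entropy(py) - entropy(joint.ravel()))
# Identity (2): I[X;Y] = H[Y] - H[Y|X], using H[Y|X] = H[X,Y] - H[X]
assert np.isclose(i_xy, entropy(py) - (entropy(joint.ravel()) - entropy(px)))
print(f"I[X;Y] = {i_xy:.4f} bits")
```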

Properties

We also have

$$I[X;Y] = \mathbb{E}_{(x,y) \sim (X,Y)}\left[\log \frac{\Pr[Y=y \mid X=x]}{\Pr[Y=y]}\right] = \mathbb{E}_{x \sim X}\Bigl[D\bigl[(Y \mid X=x) \,\big\|\, Y\bigr]\Bigr].$$

In other words, the mutual information measures how much (on average) the distribution of $Y$ changes when it gets conditioned on $X$. This expression is convenient if we want to lower bound the mutual information, because the term $D\bigl[(Y \mid X=x) \,\big\|\, Y\bigr]$ is always nonnegative, and if $Y$ is binary we can use asymptotic estimates for the binary divergence to compute it.
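
A small self-contained check of this equivalence, on the same kind of made-up joint pmf as above:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits between two pmfs on the same support."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

joint = np.array([[0.3, 0.2],          # joint[x, y] = Pr[X=x, Y=y]
                  [0.1, 0.4]])
px, py = joint.sum(axis=1), joint.sum(axis=0)

# I[X;Y] computed directly as D(P_{X,Y} || P_X x P_Y) ...
i_direct = kl(joint.ravel(), np.outer(px, py).ravel())
# ... and as E_{x~X}[ D[(Y | X=x) || Y] ]: the average divergence of the conditionals.
i_conditional = sum(px[x] * kl(joint[x] / px[x], py) for x in range(len(px)))
assert np.isclose(i_direct, i_conditional)
```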

Random bits

If $X, Y$ are uniform random bits with correlation $\rho$, then using the properties of entropy deficit (and assuming WLOG that $X, Y \in \{\pm 1\}$), their mutual information is

$$
\begin{aligned}
I[X;Y] &= \mathbb{E}_{x \sim X}\bigl[D\bigl[(Y \mid X=x) \,\big\|\, Y\bigr]\bigr] \\
&= \mathbb{E}_{x \sim X}\bigl[D[Y \mid X=x]\bigr] && (Y \text{ uniform}) \\
&= \mathbb{E}_{x \sim X}\bigl[\Theta\bigl(\mathbb{E}[Y \mid X=x]^2\bigr)\bigr] && (Y \in \{\pm 1\}) \\
&= \mathbb{E}_{x \sim X}\bigl[\Theta\bigl((\rho x)^2\bigr)\bigr] \\
&= \Theta(\rho^2).
\end{aligned}
$$
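
A quick numerical check of the $\Theta(\rho^2)$ scaling (for $\pm 1$ bits with correlation $\rho$, the joint pmf is $\Pr[X=x, Y=y] = (1 + \rho x y)/4$):

```python
import numpy as np

def mutual_information(joint):
    """I[X;Y] in bits, with joint[x, y] = Pr[X=x, Y=y]."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask]))

# Uniform +-1 bits with correlation rho: Pr[X = Y] = (1 + rho)/2.
for rho in [0.3, 0.1, 0.03, 0.01]:
    joint = np.array([[(1 + rho) / 4, (1 - rho) / 4],
                      [(1 - rho) / 4, (1 + rho) / 4]])
    print(rho, mutual_information(joint) / rho**2)
# The ratio I[X;Y] / rho^2 converges to a constant (1 / (2 ln 2) ~ 0.72),
# consistent with I[X;Y] = Theta(rho^2).
```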

If $X, Y$ are biased bits with $\Pr[X=1] = \Pr[Y=1] = p$ (with $p < 1/2$), then their mutual information is

  • up to $H[X] = H(p) = \Theta\bigl(p \log \frac{1}{p}\bigr)$ if they’re positively correlated;
  • only up to $\Theta(p^2)$ if they’re negatively correlated¹ (note that their correlation cannot be lower than $-\frac{p}{1-p}$); both extremes are illustrated numerically below.
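
Here is a small numerical illustration of this asymmetry (a sketch, not a proof): it compares the two extreme couplings, $X = Y$ on one side and “$X=1$ and $Y=1$ mutually exclusive” on the other, and checks the claimed scalings.

```python
import numpy as np

def mutual_information(joint):
    """I[X;Y] in bits, with joint[x, y] = Pr[X=x, Y=y]."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask]))

for p in [0.1, 0.01, 0.001]:
    # Most positively correlated coupling of two p-biased bits: X = Y.
    pos = np.array([[1 - p, 0.0],
                    [0.0,   p]])
    # Most negatively correlated coupling: X = 1 and Y = 1 mutually exclusive.
    neg = np.array([[1 - 2 * p, p],
                    [p,         0.0]])
    print(p,
          mutual_information(pos) / (p * np.log2(1 / p)),  # stays Theta(1): I = Theta(p log(1/p))
          mutual_information(neg) / p**2)                  # tends to 1/ln 2: I = Theta(p^2)
```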

Additivity

Mutual information itself is not generally sub- or superadditive. For example, writing $X := (X_1, \ldots, X_n)$, if $X_1, \ldots, X_n, Y$ are all the same random coin flip, we have

$$I[X;Y] = 1 \qquad \text{while} \qquad \sum_i I[X_i;Y] = n,$$

but if $X_1, \ldots, X_n$ are independent coin flips and $Y := X_1 \oplus \cdots \oplus X_n$ is their parity, then

$$I[X;Y] = 1 \qquad \text{while} \qquad \sum_i I[X_i;Y] = 0.$$
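
Both counterexamples are easy to check numerically; here is a minimal sketch (with $n = 3$, using the same $X := (X_1, \ldots, X_n)$ convention):

```python
import itertools
import numpy as np

def mutual_information(joint):
    """I[U;V] in bits from a dict {(u, v): probability > 0}."""
    pu, pv = {}, {}
    for (u, v), pr in joint.items():
        pu[u] = pu.get(u, 0.0) + pr
        pv[v] = pv.get(v, 0.0) + pr
    return sum(pr * np.log2(pr / (pu[u] * pv[v])) for (u, v), pr in joint.items())

def project(joint, i):
    """Marginal pmf of (X_i, Y), from the full joint pmf of ((X_1, ..., X_n), Y)."""
    out = {}
    for (x, y), pr in joint.items():
        out[(x[i], y)] = out.get((x[i], y), 0.0) + pr
    return out

n = 3
# Example 1: X_1 = ... = X_n = Y = a single fair coin flip.
same = {((b,) * n, b): 0.5 for b in (0, 1)}
# Example 2: X_1, ..., X_n independent fair coins, Y = their parity.
parity = {(x, sum(x) % 2): 2.0**-n for x in itertools.product((0, 1), repeat=n)}

for name, joint in [("same coin", same), ("parity", parity)]:
    i_joint = mutual_information(joint)
    i_singles = sum(mutual_information(project(joint, i)) for i in range(n))
    print(f"{name}: I[X;Y] = {i_joint:.1f}, sum_i I[X_i;Y] = {i_singles:.1f}")
# same coin: I[X;Y] = 1.0, sum_i I[X_i;Y] = 3.0
# parity:    I[X;Y] = 1.0, sum_i I[X_i;Y] = 0.0
```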

On the other hand, whenever the $X_i$’s are independent (as in the second example), mutual information is superadditive:

$$I[X;Y] \geq \sum_i I[X_i;Y].$$

Indeed,

$$
\begin{aligned}
I[X;Y] &= H[X] - H[X \mid Y] \\
&= \Bigl(\sum_i H[X_i]\Bigr) - H[X \mid Y] && \text{(independence)} \\
&\geq \sum_i \bigl(H[X_i] - H[X_i \mid Y]\bigr) && \text{(subadditivity)} \\
&= \sum_i I[X_i;Y].
\end{aligned}
$$

This is an example of a reverse entropic inequality.
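
As a sanity check on a concrete instance, here is a minimal sketch with two independent fair bits and the arbitrary choice $Y = X_1 \wedge X_2$:

```python
import numpy as np

def mutual_information(joint):
    """I[U;V] in bits from a dict {(u, v): probability > 0}."""
    pu, pv = {}, {}
    for (u, v), pr in joint.items():
        pu[u] = pu.get(u, 0.0) + pr
        pv[v] = pv.get(v, 0.0) + pr
    return sum(pr * np.log2(pr / (pu[u] * pv[v])) for (u, v), pr in joint.items())

def project(joint, i):
    """Marginal pmf of (X_i, Y), from the full joint pmf of ((X_1, ..., X_n), Y)."""
    out = {}
    for (x, y), pr in joint.items():
        out[(x[i], y)] = out.get((x[i], y), 0.0) + pr
    return out

# X_1, X_2 independent fair bits, Y = X_1 AND X_2.
joint = {((x1, x2), x1 & x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
i_joint = mutual_information(joint)
i_singles = sum(mutual_information(project(joint, i)) for i in range(2))
assert i_joint >= i_singles - 1e-12   # superadditivity for independent X_i
print(i_joint, i_singles)             # ~0.811 vs ~0.623
```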

Lower and upper bounds

#to-write

  • two main bounds from murphy--probabilistic-ml-advanced.pdf
    • and entropy as a special case?

Conditional mutual information

Similarly, the conditional mutual information $I[X;Y \mid Z]$ measures the average (over $z \sim Z$) divergence between the conditional random variable $(X,Y) \mid Z=z$ and an independent copy $(X' \mid Z=z,\ Y' \mid Z=z)$ (same conditional marginals, but conditionally independent given $Z$):

$$
\begin{aligned}
I[X;Y \mid Z] &:= \mathbb{E}_{z \sim Z}\Bigl[D\bigl[\bigl((X,Y) \mid Z=z\bigr) \,\big\|\, \bigl(X' \mid Z=z,\ Y' \mid Z=z\bigr)\bigr]\Bigr] \\
&= \mathbb{E}\left[\log \frac{\Pr[X=x \wedge Y=y \mid Z=z]}{\Pr[X=x \mid Z=z]\,\Pr[Y=y \mid Z=z]}\right] \\
&= H[X \mid Z] + H[Y \mid Z] - H[X,Y \mid Z] \\
&= H[Y \mid Z] - H[Y \mid X,Z]. && (3)
\end{aligned}
$$
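
A numerical sketch of this definition (the joint pmf over $(X,Y,Z)$ below is randomly generated, purely for illustration), also checking identity (3):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Arbitrary made-up joint pmf, with joint[x, y, z] = Pr[X=x, Y=y, Z=z].
rng = np.random.default_rng(0)
joint = rng.random((2, 3, 2))
joint /= joint.sum()

# I[X;Y|Z] as the average divergence D[ ((X,Y)|Z=z) || (X|Z=z) x (Y|Z=z) ].
pz = joint.sum(axis=(0, 1))
i_cond = 0.0
for z in range(joint.shape[2]):
    slice_z = joint[:, :, z] / pz[z]                     # pmf of (X, Y) given Z = z
    px_z, py_z = slice_z.sum(axis=1), slice_z.sum(axis=0)
    i_cond += pz[z] * np.sum(slice_z * np.log2(slice_z / np.outer(px_z, py_z)))

# Identity (3): I[X;Y|Z] = H[Y|Z] - H[Y|X,Z].
h_yz  = entropy(joint.sum(axis=0).ravel()) - entropy(pz)               # H[Y|Z]
h_yxz = entropy(joint.ravel()) - entropy(joint.sum(axis=1).ravel())    # H[Y|X,Z]
assert np.isclose(i_cond, h_yz - h_yxz)
```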

#to-write

  • also define this one using distributions instead of random variables?
    • or rather, just define it based on the non-conditional version

Chain rule

By telescoping, we get

$$
\begin{aligned}
I[X_1, \ldots, X_n; Y] &= H[Y] - H[Y \mid X_1, \ldots, X_n] && \text{(by (2))} \\
&= \sum_i \bigl(H[Y \mid X_1, \ldots, X_{i-1}] - H[Y \mid X_1, \ldots, X_i]\bigr) \\
&= \sum_i I[X_i; Y \mid X_1, \ldots, X_{i-1}]. && \text{(by (3))}
\end{aligned}
$$
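
A quick numerical check of the chain rule in the smallest nontrivial case, $I[X_1, X_2; Y] = I[X_1;Y] + I[X_2;Y \mid X_1]$ (the joint pmf below is randomly generated, purely for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Arbitrary made-up joint pmf with joint[x1, x2, y] = Pr[X_1=x1, X_2=x2, Y=y].
rng = np.random.default_rng(1)
joint = rng.random((2, 2, 3))
joint /= joint.sum()

h = entropy
hx1   = h(joint.sum(axis=(1, 2)))         # H[X_1]
hy    = h(joint.sum(axis=(0, 1)))         # H[Y]
hx1x2 = h(joint.sum(axis=2).ravel())      # H[X_1, X_2]
hx1y  = h(joint.sum(axis=1).ravel())      # H[X_1, Y]
hall  = h(joint.ravel())                  # H[X_1, X_2, Y]

# Left-hand side: I[X_1, X_2; Y] = H[X_1, X_2] + H[Y] - H[X_1, X_2, Y].
lhs = hx1x2 + hy - hall
# Chain rule terms: I[X_1;Y] + I[X_2;Y | X_1].
i1 = hx1 + hy - hx1y
i2_given_1 = (hx1y - hx1) - (hall - hx1x2)      # H[Y|X_1] - H[Y|X_1,X_2]
assert np.isclose(lhs, i1 + i2_given_1)
```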

Interaction information

#to-write

  • argue that the base case (1 variable) is under-defined: it could be taken as either $H[X]$ or $D[X]$, given that the entropy of the reference distribution gets cancelled out from then on?
  • in this choice of sign, $I[X;Y;Z] = I[X;Y \mid Z] - I[X;Y]$ (or denoted with $D$?) makes sense (?) because
    • it’s positive when $X, Y$ have a “shared effect”, e.g. $Z = X \oplus Y$, which leads us to be able to predict them better together than expected (see the sketch after this list)
      • (but otoh if you think about it like shared entropy, then it’s clearly negative: everything was looking completely independent so far and now you have a correlation reducing your entropy)
    • it’s negative when $X, Y$ have a “shared cause”, e.g. $X = Z$ and $Y = Z$, which leads us to be somewhat disappointed in our ability to predict them together given that the pairwise dependences were so strong!
      • (and if you think about it like shared entropy, then it’s positive (??))
  • I think this choice of sign makes more sense because it’s about revealing information about things that looked more random, it’s about being able to predict better than expected!
    • information != entropy
  • if you want $I[X;E] = \mathbb{E}[I[X \mid E]]$, that would force the prior to be $[X]$ itself?
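
For what it’s worth, the “shared effect” and “shared cause” examples above can be checked numerically under the sign convention $I[X;Y;Z] := I[X;Y \mid Z] - I[X;Y]$; a minimal sketch:

```python
import numpy as np

def mutual_information(joint):
    """I[U;V] in bits from a dict {(u, v): probability > 0}."""
    pu, pv = {}, {}
    for (u, v), pr in joint.items():
        pu[u] = pu.get(u, 0.0) + pr
        pv[v] = pv.get(v, 0.0) + pr
    return sum(pr * np.log2(pr / (pu[u] * pv[v])) for (u, v), pr in joint.items())

def interaction(joint_xyz):
    """I[X;Y;Z] := I[X;Y|Z] - I[X;Y], from a dict {(x, y, z): probability > 0}."""
    # I[X;Y] from the (X, Y) marginal.
    xy = {}
    for (x, y, z), pr in joint_xyz.items():
        xy[(x, y)] = xy.get((x, y), 0.0) + pr
    i_xy = mutual_information(xy)
    # I[X;Y|Z] as the p(z)-weighted average of I[X;Y | Z=z].
    pz, by_z = {}, {}
    for (x, y, z), pr in joint_xyz.items():
        pz[z] = pz.get(z, 0.0) + pr
        by_z.setdefault(z, {})[(x, y)] = pr
    i_xy_given_z = sum(
        pz[z] * mutual_information({k: pr / pz[z] for k, pr in cond.items()})
        for z, cond in by_z.items())
    return i_xy_given_z - i_xy

# "Shared effect": X, Y independent fair bits, Z = X xor Y  ->  interaction +1.
shared_effect = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
# "Shared cause": X = Y = Z = a single fair bit             ->  interaction -1.
shared_cause = {(b, b, b): 0.5 for b in (0, 1)}
print(interaction(shared_effect), interaction(shared_cause))   # 1.0 -1.0
```
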
  1. Proof. (a) We have $I[X;Y] \le O(p^2)$ because the total variation distance between $(X,Y)$ and $(X',Y')$ is $O(p^2)$ and the smallest probability of an outcome of $(X',Y')$ is $p^2$, so the information divergence must be $O(p^2)$. (b) We get $I[X;Y] = \Theta(p^2)$ when $X=1$ and $Y=1$ are mutually exclusive: this means that $\Pr[X=Y=1] = 0$, while $\Pr[X'=Y'=1] = p^2$, so $I[X;Y] = D\bigl((X,Y) \,\big\|\, (X',Y')\bigr) \ge D\bigl(0 \,\big\|\, p^2\bigr) = \Theta(p^2)$.