Adapted from the tutorial Information theory in combinatorics I by Shachar Lovett at Simons.

Entropy is subadditive: the information you have in a pair of random variables $(X,Y)$ is bounded by the sum of the information you have in each of them individually:

$$H[X,Y] \le H[X] + H[Y].$$

In fact, the difference between the two sides is precisely the mutual information $I[X:Y]$.

Shearer’s lemma generalizes this principle from individual variables to groups of variables: if both sides involve the same variables, but the right side takes entropies of smaller groups, then the right side overcounts some of the mutual information, so it’s bigger. For example:

$$2H[X,Y,Z] \le H[X,Y] + H[X,Z] + H[Y,Z].$$

And in general, if you have $n$ variables $X_1, \ldots, X_n$ and $m$ groups $I_1, \ldots, I_m \subseteq [n]$ such that each variable is contained in exactly¹ $q$ of the groups, Shearer’s lemma says that

$$q \cdot H[X_1, \ldots, X_n] \le H[X_{I_1}] + \cdots + H[X_{I_m}].$$
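To make this concrete, here’s a quick numerical sanity check (my own addition, not from the tutorial): draw a random joint distribution over three bits and verify the three-variable example above, where the groups are $\{X,Y\}$, $\{X,Z\}$, $\{Y,Z\}$ and $q = 2$.

```python
import itertools
import math
import random

def entropy(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, coords):
    """Marginal distribution of the coordinates in `coords`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

# Random joint distribution over three bits (X, Y, Z).
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

lhs = 2 * entropy(joint)             # q = 2 copies of H[X,Y,Z]
groups = [(0, 1), (0, 2), (1, 2)]    # {X,Y}, {X,Z}, {Y,Z}: each variable in exactly 2 groups
rhs = sum(entropy(marginal(joint, g)) for g in groups)

print(f"2 H[X,Y,Z] = {lhs:.4f}  <=  H[X,Y]+H[X,Z]+H[Y,Z] = {rhs:.4f}")
```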

Proof

You can “prove” the simple subadditivity principle using the chain rule:

$$H[X,Y] = H[X] + H[Y \mid X] \le H[X] + H[Y],$$

since conditioning can only decrease entropy.

Similarly, we can prove Shearer’s lemma by applying the chain rule on both sides, and observing that the left side has stronger conditioning. Specifically, we rewrite the LHS as

$$q \cdot H[X_1, \ldots, X_n] = q \cdot H[X_1] + q \cdot H[X_2 \mid X_1] + \cdots + q \cdot H[X_n \mid X_{[n-1]}],$$

and for each set $I_j = \{i_1, \ldots, i_k\}$ (with $i_1 < \cdots < i_k$), we rewrite the corresponding term of the RHS as

$$H[X_{I_j}] = H[X_{i_1}] + H[X_{i_2} \mid X_{i_1}] + \cdots + H[X_{i_k} \mid X_{i_1}, \ldots, X_{i_{k-1}}] \ge H[X_{i_1} \mid X_{[i_1 - 1]}] + H[X_{i_2} \mid X_{[i_2 - 1]}] + \cdots + H[X_{i_k} \mid X_{[i_k - 1]}].$$

Because each variable $i$ is contained in exactly $q$ of the sets $I_j$, summing these lower bounds over $j$ produces exactly $q$ copies of $H[X_i \mid X_{[i-1]}]$ for every $i$, which is precisely the expansion of the LHS above.

Shadows

Shearer’s lemma is particularly useful in combinatorics if you let $(X_1, \ldots, X_n)$ be uniform over some set of points you want to show is small.

As a toy example, suppose I have a set $S \subseteq \mathbb{R}^2$ and I tell you that its “shadows” on the $x$- and $y$-axes have $n$ points each.² How many points can $S$ have? The answer is easy: there are only $n$ possible $x$-coordinates and $n$ possible $y$-coordinates, so there can be at most $n^2$ points. This is achieved by an $n \times n$ grid.

Although it’s very much overkill, you could have used the subadditivity of entropy to prove this. Let $(X,Y)$ be a uniformly random point in $S$. Then $H[X,Y] = \log|S|$. But $X$ and $Y$ are distributions over $n$ points, so each can only contain $\log n$ bits of information. So we have

$$\log|S| = H[X,Y] \le H[X] + H[Y] \le 2\log n \implies |S| \le n^2.$$

If you move to three dimensions, though, it’s less obvious. This time, I have $S \subseteq \mathbb{R}^3$, and I tell you that its shadows on the $xy$-, $xz$- and $yz$-planes have $n$ points each. How many points can $S$ have? The best “elementary” argument observes that there are only $n$ possible values for $(x,y)$, and also only $n$ possible values for $z$, so $|S| \le n^2$. Again letting $(X,Y,Z)$ be a uniformly random point in $S$, this corresponds to the following entropy argument:

$$\log|S| = H[X,Y,Z] \le H[X,Y] + H[Z] \le H[X,Y] + H[X,Z] \le 2\log n \implies |S| \le n^2.$$

But the obvious example is the $\sqrt{n} \times \sqrt{n} \times \sqrt{n}$ grid, which has only $n^{3/2}$ points. To show that this is optimal, we use Shearer’s lemma:

$$2\log|S| = 2H[X,Y,Z] \le H[X,Y] + H[X,Z] + H[Y,Z] \le 3\log n \implies |S| \le n^{3/2}.$$
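In counting terms, the bound says $|S|^2 \le |S_{xy}| \cdot |S_{xz}| \cdot |S_{yz}|$ (the discrete Loomis–Whitney inequality), where $S_{xy}$, $S_{xz}$, $S_{yz}$ are the three shadows. Here’s a quick brute-force check of that counting form on random subsets of a small grid (my own illustration, with arbitrary parameters):

```python
import random

def shadow(points, coords):
    """Project a set of 3D points onto the given pair of coordinates."""
    return {tuple(p[i] for i in coords) for p in points}

random.seed(0)
for _ in range(5):
    # Random subset of a small 3D grid.
    S = {(random.randrange(6), random.randrange(6), random.randrange(6))
         for _ in range(random.randrange(1, 80))}
    xy, xz, yz = shadow(S, (0, 1)), shadow(S, (0, 2)), shadow(S, (1, 2))
    # Shearer's lemma in counting form: |S|^2 <= |S_xy| * |S_xz| * |S_yz|.
    assert len(S) ** 2 <= len(xy) * len(xz) * len(yz)
    print(f"|S| = {len(S):3d}  <=  sqrt(product of shadows) = {(len(xy) * len(xz) * len(yz)) ** 0.5:6.2f}")
```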

Entropy deficit version

Because entropy deficit is basically “negative entropy”, we get an inequality in the other direction:

$$q \cdot D[X_1, \ldots, X_n] \ge D[X_{I_1}] + \cdots + D[X_{I_m}]$$

(where each variable is in exactly³ $q$ of the groups $I_1, \ldots, I_m$). This says that if $X$ is “locally far from uniform”, in the sense that the “marginals” $X_{I_j}$ are far from uniform, then $X$ itself must also be proportionally far from uniform.

For example, suppose I have some events $E_1, \ldots, E_m$, and I tell you that each $\Pr[E_j] \le p$. What can you say about the probability $\Pr[\bigcap_j E_j]$ that they all happen simultaneously?

  • If the events are independent, you get $p^m$.
  • But absent any additional information, the best you can get is $p$: indeed, they could all be the same event.

Now, assume the events $E_1, \ldots, E_m$ are “read-$q$”: that is, each $E_j$ depends only on a group $I_j \subseteq [n]$ of some underlying independent random variables $x_1, \ldots, x_n$, and each variable is in $q$ of the groups. We’ll see that you get $p^{m/q}$. This is optimal, since those $m$ events could be $m/q$ independent events, each repeated $q$ times.

Let $I_j \subseteq [n]$ be the set of variables queried by event $E_j$. Let $X$ be uniform over $\{x \mid \bigwedge_j E_j(x)\}$. Then

  • $D[X] = \log \frac{1}{\Pr[\bigcap_j E_j]}$;
  • $X_{I_j}$ can only take values in $\{x_{I_j} \mid E_j(x)\}$, so $D[X_{I_j}] \ge \log \frac{1}{\Pr[E_j]}$.

#to-write make the logic above clearer

Therefore, we get

$$\Pr\Big[\bigcap_j E_j\Big] = 2^{-D[X]} \le 2^{-\frac{D[X_{I_1}] + \cdots + D[X_{I_m}]}{q}} \le \big(\Pr[E_1] \cdots \Pr[E_m]\big)^{1/q} \le p^{m/q}.$$
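Here’s a small brute-force illustration (my own, with arbitrarily chosen parameters): four fair bits, four events arranged in a cycle so that each bit is read by exactly $q = 2$ events, and each event asks that both of its bits equal $1$. Then $p = 1/4$, $m/q = 2$, and the bound $p^{m/q} = 1/16$ is exactly attained.

```python
import itertools

n, q = 4, 2
groups = [(0, 1), (1, 2), (2, 3), (3, 0)]   # each bit is read by exactly q = 2 events
events = [lambda x, g=g: x[g[0]] == 1 and x[g[1]] == 1 for g in groups]
m = len(events)

assignments = list(itertools.product([0, 1], repeat=n))
p = max(sum(E(x) for x in assignments) / len(assignments) for E in events)
all_hold = sum(all(E(x) for E in events) for x in assignments) / len(assignments)

print(f"Pr[all events] = {all_hold}  <=  p^(m/q) = {p ** (m / q)}")
```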

Other applications

  • Shearer’s lemma gives tight bounds on the number of independent sets in bipartite $n \times n$ $d$-regular graphs: at most $(2^{d+1}-1)^{n/d} \approx 2^{n + n/d}$, which is achieved by $n/d$ disjoint copies of $K_{d,d}$ (see the sanity check below). The variable to look at is the indicator vector (in $\{0,1\}^{2n}$) of a uniformly random independent set.
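As a quick sanity check of the extremal example (a brute-force sketch of mine, with $d = 3$ and $n = 9$ picked arbitrarily): a single copy of $K_{d,d}$ has exactly $2^{d+1} - 1$ independent sets, and for $n/d$ disjoint copies the counts multiply.

```python
import itertools

def count_independent_sets(num_vertices, edges):
    """Brute-force count of independent sets (including the empty set)."""
    count = 0
    for subset in itertools.product([0, 1], repeat=num_vertices):
        if all(not (subset[u] and subset[v]) for u, v in edges):
            count += 1
    return count

d = 3
# One copy of K_{d,d}: left vertices 0..d-1, right vertices d..2d-1.
edges = [(u, d + v) for u in range(d) for v in range(d)]
per_copy = count_independent_sets(2 * d, edges)
print(per_copy, 2 ** (d + 1) - 1)                      # both 15 for d = 3

n = 9                                                  # n/d = 3 disjoint copies
print(per_copy ** (n // d), (2 ** (d + 1) - 1) ** (n // d))
```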

Similar results

  • Gavinsky, Lovett, Saks, and Srinivasan⁴ proved a read-$q$ variant of the Chernoff bound (which similarly worsens the bound from $m$ to $m/q$).
  • The Local lemma is another result about a relaxed notion of independence. However,
    • the assumption is stronger: the events are not allowed to share variables with too many other events, whereas Shearer’s lemma only assumes that each variable is queried by few events;
    • it’s a lower bound on $\Pr[\bigcap_j E_j]$, not an upper bound.
  1. We could also write “at least $q$”, since adding more variables to a group can only increase the RHS.

  2. That is, if we write $S_x := \{x \mid (x,y) \in S \text{ for some } y\}$ then $|S_x| \le n$, and similarly for $y$.

  3. #to-write 

  4. Dmytro Gavinsky, Shachar Lovett, Michael Saks, Srikanth Srinivasan, “A Tail Bound for Read-k Families of Functions”.