The proof is adapted from Hamed Hatami’s lecture notes for “COMP 531: Advanced Complexity Theory” at McGill. Hamed Hatami doesn’t condone my crappy metaphors.

This note presents Håstad’s original proof of his switching lemma.

Hermits and terrorists

You’re at a board game night, playing a very disturbing version of one-night werewolf. You don’t know how many players there are, but you know that tonight:

  • some of the villagers are hermits, who will find their vocation and leave the village to go live in isolation;
  • those who are not hermits have probability p of being a terrorist, who will blow themselves up and kill everyone;
  • the rest are normal villagers, who are fast asleep.

What’s the chance that at least t villagers are still alive and present in the village the next day?

First of all, we can just disregard all the hermits. We don’t know how many non-hermits there are, but if at least t people survive, then that means there were at least t non-hermits, and all of them failed to be terrorists. That’s a pretty big surprise: the probability that they all fail to be terrorists was at most $(1-p)^t$! So regardless of how many non-hermits there were (and also regardless of how many people there were originally), the probability that at least t people survive is at most $(1-p)^t$.

Here, crucially, we assumed they all independently had a p chance of being a terrorist. If they were correlated (for example, if you know they’re either all terrorists or all non-terrorists), then the probability that at least t people survive could be much larger than $(1-p)^t$, and we would have to know something about the number of people there were originally.
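(If you want to convince yourself numerically, here is a tiny Monte Carlo sketch of the story; the numbers n, p and t below are arbitrary illustrative choices, not anything coming from the argument.)

```python
import random

# A tiny Monte Carlo sketch of the story; n, p and t are arbitrary illustrative
# numbers (n plays the role of the unknown number of non-hermits).
def survivors(n, p):
    if any(random.random() < p for _ in range(n)):
        return 0   # someone was a terrorist: nobody survives
    return n       # everyone failed to be a terrorist

n, p, t, trials = 30, 0.2, 10, 100_000
empirical = sum(survivors(n, p) >= t for _ in range(trials)) / trials
print(empirical, "<=", (1 - p) ** t)   # the (1-p)^t bound holds for any n >= t
```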

ANDs and ORs

The politically incorrect situation above is analogous to what happens to DNFs under random restrictions. Take some term $T_j$ of width $w$ in a DNF $f = T_1 \vee \dots \vee T_s$, and apply a random restriction $\rho$ with star probability $\delta$ (each variable is independently set to $\star$ with probability $\delta$, and otherwise to 0 or 1 with probability $\frac{1-\delta}{2}$ each).

  • If any of the variables in $T_j$ get set to 0, then $T_j$ evaluates to 0 and disappears from the DNF without affecting the other terms (i.e. $T_j$ is a hermit).1
  • On the other hand, assuming $T_j$ doesn’t get set to 0 (all its variables were set to 1 or $\star$), there is a chance $p := \left(\frac{1-\delta}{1+\delta}\right)^w$ that all of the variables in $T_j$ get set to 1, in which case $T_j$ evaluates to 1 and the whole DNF collapses to the constant 1 (i.e. $T_j$ is a terrorist).

As long as $\delta = O(1/w)$, this probability $p$ of being a terrorist is a constant, and assuming these are all independent, this shows that

$$\Pr[f|_\rho \text{ has size} \geq t] \leq (1-p)^t = 2^{-\Omega(t)}.$$
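Here is a small sketch of how a random restriction acts on the terms of a DNF, classifying each term as a hermit, a terrorist, or a survivor. The representation (a literal is a (variable, bit) pair, a term is a list of literals) and the specific DNF are made up for illustration.

```python
import random

# A sketch of how a random restriction acts on the terms of a DNF.
def random_restriction(n, delta):
    """Each variable is independently '*' w.p. delta, else 0 or 1 w.p. (1-delta)/2 each."""
    rho = {}
    for i in range(n):
        u = random.random()
        rho[i] = '*' if u < delta else (0 if u < delta + (1 - delta) / 2 else 1)
    return rho

def classify(term, rho):
    if any(rho[i] not in ('*', b) for i, b in term):
        return 'hermit'      # some literal is falsified: the term disappears
    if all(rho[i] == b for i, b in term):
        return 'terrorist'   # every literal is satisfied: the whole DNF collapses to 1
    return 'alive'           # only satisfied literals and stars: the term survives

# a width-3 DNF on 9 variables whose terms use disjoint variables (read-once)
dnf = [[(0, 1), (1, 1), (2, 0)], [(3, 0), (4, 1), (5, 1)], [(6, 1), (7, 0), (8, 0)]]
rho = random_restriction(9, delta=0.3)
print([classify(term, rho) for term in dnf])
```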

But the terms of a DNF tend to share a lot of variables,2 so the events that two terms $T_{j_1}$ and $T_{j_2}$ are terrorists are far from independent: if $T_{j_1}$ and $T_{j_2}$ share a lot of the same variables, then conditioned on the fact that $T_{j_1}$ has some stars, $T_{j_2}$ is quite likely to have some stars too.

To rephrase, what goes wrong is: The terms shared some stars, which allowed them to both not be terrorists without it being a huge surprise.

It turns out this is the only issue! So if somehow after looking at $T_{j_1}$ we could fix all its stars, then it’s as if down the line, $T_{j_2}$ has a fresh $p$ chance of being a terrorist. Indeed, if we do that, the only variables we’ll have seen are all fixed to 0 or 1 (either by $\rho$ or by us), so conditioning on those should only increase the chance that the next term is a terrorist!

This insight then allows us to show that $f|_\rho$ is very likely to be a small-depth decision tree. Indeed, say $T_{j_1}$ has $r$ stars. If we can inductively show that for all possible fixings of those $r$ stars, the resulting DNF has small decision tree depth, then so does $f|_\rho$: just start out by querying those $r$ stars in the first $r$ levels.3
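To make that recursion concrete, here is a sketch of the “query the stars of the first surviving term, then recurse on every fixing” procedure; it returns the depth of the resulting tree, which upper-bounds the true decision tree depth of the restricted DNF. It uses the same toy representation as the sketch above, runs in exponential time, and is purely for illustration.

```python
# Build a decision tree for the restricted DNF by querying the stars of the first
# surviving term and recursing on every fixing of them.
def restrict_term(term, rho):
    """None if the term is falsified by rho; otherwise the list of its starred literals."""
    stars = []
    for i, b in term:
        if rho[i] == '*':
            stars.append((i, b))
        elif rho[i] != b:
            return None          # hermit: the term disappears
    return stars

def canonical_depth(dnf, rho):
    for term in dnf:
        stars = restrict_term(term, rho)
        if stars is None:
            continue             # hermit: move on to the next term
        if not stars:
            return 0             # terrorist: the whole DNF is the constant 1
        # query the r starred variables of this term, then recurse on every fixing
        r, best = len(stars), 0
        for mask in range(2 ** r):
            rho2 = dict(rho)
            for k, (i, _) in enumerate(stars):
                rho2[i] = (mask >> k) & 1
            best = max(best, canonical_depth(dnf, rho2))
        return r + best
    return 0                     # every term was a hermit: the DNF is the constant 0

dnf = [[(0, 1), (1, 1)], [(1, 0), (2, 1)], [(0, 0), (2, 0)]]
rho = {0: '*', 1: 1, 2: '*'}
print(canonical_depth(dnf, rho))  # here: 2
```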

What we lost: a union bound

However, we did lose something: we now have to union-bound over all the possible assignments of those $r$ stars, and when the term isn’t a terrorist, that’s at least 2 possibilities! So to counterbalance this, it’s no longer enough for the probability $p$ of being a terrorist to be a constant; it should be at the very least $> 1/2$! This means that we cannot say anything when $\delta$ is bigger than $\approx 1/w$.

And this union bound over the $2^r$ branches seems to be justified in practice. Indeed, consider the “choose your DNF” DNF:

$$f(x, y^{(1)}, \dots, y^{(2^w)}) = \left(x_1 \wedge \dots \wedge x_{w-1} \wedge x_w \wedge g_1(y^{(1)})\right) \vee \left(x_1 \wedge \dots \wedge x_{w-1} \wedge \overline{x_w} \wedge g_2(y^{(2)})\right) \vee \dots \vee \left(\overline{x_1} \wedge \dots \wedge \overline{x_w} \wedge g_{2^w}(y^{(2^w)})\right),$$

where each $y^{(i)}$ is an input in $\{0,1\}^m$ and $g_j$ is itself a width-$w$ DNF (so that $f$ is a width-$2w$ DNF). Suppose that $r$ of the $w$ variables $x_1, \dots, x_w$ end up being stars in $\rho$; then $2^r$ of the $g_j$’s will still be relevant after the restriction. So even pretty rare events about the structure of the $g_j$’s after the random restriction (like “$g_j|_\rho$ has large decision tree depth”) could end up happening: they’re all independent and there are $2^r$ of them, so the probability that at least one of them happens is roughly $2^r$ times bigger than the probability of any specific one happening!
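Here is a toy instance of this construction, with small made-up parameters ($w = 3$ selector variables, each $g_j$ a random width-$w$ DNF on its own block of $m = 4$ fresh variables), just to see the $2^r$ count appear:

```python
import itertools, random

# A toy instance of the "choose your DNF" construction (parameters are arbitrary).
w, m = 3, 4

def random_g(var_offset, num_terms=4):
    """A random width-w DNF over variables var_offset .. var_offset + m - 1."""
    return [[(v, random.randint(0, 1)) for v in random.sample(range(var_offset, var_offset + m), w)]
            for _ in range(num_terms)]

# each term of f = "x matches the j-th sign pattern" AND (one term of g_j)
patterns = list(itertools.product([0, 1], repeat=w))
f = []
for j, pattern in enumerate(patterns):
    selector = list(enumerate(pattern))          # the w literals on x_1, ..., x_w
    for term in random_g(w + j * m):
        f.append(selector + term)                # a width-2w term of f

# now fix one selector variable and leave r = 2 of them as stars
rho_x = {0: 1, 1: '*', 2: '*'}
relevant = [p for p in patterns
            if all(rho_x[i] in ('*', b) for i, b in enumerate(p))]
print(len(patterns), "selector patterns;", len(relevant), "g_j's still relevant")  # 8; 2^r = 4
```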

Nevertheless, this case is not actually problematic for say $\delta = 10/w$, since the probability of having $r$ stars in the first $w$ variables itself decreases very quickly (on the order of $r^{-\Omega(r)}$, as mentioned in this earlier footnote3). It would be very interesting to still be able to say something beyond the $\delta \approx 1/w$ barrier.

Statement and proof

Let $f = T_1 \vee \dots \vee T_s$ be a width-$w$ DNF, and let $\rho$ be a random restriction with star probability $\delta$. Håstad’s switching lemma4 says that for any $t > 0$,

$$\Pr[\mathrm{DT}(f|_\rho) \geq t] \leq (C\delta w)^t$$

(for some constant C to be set later).
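Before the proof, here is a brute-force sanity check of the statement on a tiny made-up example: we sample restrictions, compute the exact decision tree depth of the restricted function by exhaustive search, and compare the empirical probability with $(C\delta w)^t$ for $C = 8$ (the constant we’ll end up with below). All the specific parameters are arbitrary illustrative choices.

```python
import itertools, random

# Brute-force sanity check of the lemma on a tiny example (parameters are arbitrary).
n, w, delta, t, C = 8, 3, 0.03, 2, 8

def eval_dnf(dnf, x):
    return any(all(x[i] == b for i, b in term) for term in dnf)

def dt_depth(dnf, fixed):
    """Exact decision tree depth of the DNF restricted by `fixed` (dict: var -> bit)."""
    free = [i for i in range(n) if i not in fixed]
    values = {eval_dnf(dnf, {**fixed, **dict(zip(free, a))})
              for a in itertools.product([0, 1], repeat=len(free))}
    if len(values) == 1:
        return 0  # constant function
    return min(1 + max(dt_depth(dnf, {**fixed, i: 0}),
                       dt_depth(dnf, {**fixed, i: 1})) for i in free)

def random_restriction(delta):
    """Each variable: star w.p. delta (left unfixed), else a uniform bit."""
    rho = {}
    for i in range(n):
        u = random.random()
        if u >= delta:
            rho[i] = 0 if u < delta + (1 - delta) / 2 else 1
    return rho

# a random width-w DNF with 8 terms
dnf = [[(v, random.randint(0, 1)) for v in random.sample(range(n), w)] for _ in range(8)]

trials = 2000
empirical = sum(dt_depth(dnf, random_restriction(delta)) >= t for _ in range(trials)) / trials
print("empirical Pr[DT >= t]:", empirical, "   bound (C*delta*w)^t:", (C * delta * w) ** t)
```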

We want to show this by induction: look at the first non-hermit term $T_j$, then recurse on the rest of the DNF after fixing all its stars to 0 or 1. But this requires us to condition on $T_1, \dots, T_{j-1}$ being hermits, and on the other variables in $T_j$ having the values that they had. Intuitively, this is fine, because those things only make it less likely to have stars going forward. But we do need to strengthen the induction hypothesis in order to handle this conditioning.

What we’ll actually prove is that for any auxiliary function $F : \{0,1\}^n \to \{0,1\}$,

$$\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0] \leq (C\delta w)^t.$$

This condition “$F|_\rho \equiv 0$” will be enough to encode all the things we wanted to condition on. And indeed, intuitively, conditioning on some function collapsing to a constant seems to only make it less likely for stars to appear.

We’ll induct on $s$, the number of terms of $f$. The base case when $f$ has no terms is obvious: that makes $f$ constant, so $\mathrm{DT}(f) = 0$. Otherwise, look at the first term in $f$, i.e. $T_1$.

We can split the event $\mathrm{DT}(f|_\rho) \geq t$ into two cases: either $T_1|_\rho \equiv 0$ (i.e. $T_1$ is a hermit), or $T_1|_\rho \not\equiv 0$ (it’s not). It suffices to show that conditioned on either of those, the probability is at most $(C\delta w)^t$.

Case 1: $T_1$ is a hermit

Nothing interesting happens here: we’re just showing that “conditioning on the first few terms being hermits” can be achieved by “conditioning on some auxiliary function F collapsing to 0”.

We want to bound $\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ T_1|_\rho \equiv 0]$. We can observe two things:

  • Once we’ve conditioned on $T_1$ collapsing to 0, then $f$ is equivalent to $T_2 \vee \dots \vee T_s$, and so their depths are the same.
  • Conditioning on $F$ and $T_1$ both collapsing to 0 is the same as conditioning on their OR $F \vee T_1$ collapsing to 0.

So we can rewrite

$$\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ T_1|_\rho \equiv 0] = \Pr[\mathrm{DT}((T_2 \vee \dots \vee T_s)|_\rho) \geq t \mid (F \vee T_1)|_\rho \equiv 0],$$

and then the induction hypothesis for $f' := T_2 \vee \dots \vee T_s$ and $F' := F \vee T_1$ tells us that the RHS is at most $(C\delta w)^t$.

Case 2: $T_1$ is not a hermit

This is the interesting case.

Without loss of generality (up to a flipping of the variables), we can assume that $T_1$ contains no negations. This means that under our assumption that $T_1|_\rho \not\equiv 0$, all the variables in $T_1$ are set to either 1 or $\star$. We will roughly split the argument into two parts:

  • the probability of getting $r$ stars in $T_1$ decreases fast as $r$ grows;
  • the probability that “the rest of $f$ after fixing those stars” has depth $\geq t - r$ is small.

First, split into cases based on the subset $S \subseteq T_1$ of the variables in $T_1$ that are set to $\star$:

$$\begin{aligned}
&\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ T_1|_\rho \not\equiv 0]\\
&\quad= \Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ \rho_{T_1} \in \{1,\star\}]\\
&\quad= \sum_{S \subseteq T_1} \Pr[\mathrm{DT}(f|_\rho) \geq t,\ \rho_S = \star,\ \rho_{T_1 \setminus S} = 1 \mid F|_\rho \equiv 0,\ \rho_{T_1} \in \{1,\star\}]\\
&\quad= \sum_{r=1}^{w} \sum_{\substack{S \subseteq T_1\\ |S| = r}} \Pr[\rho_S = \star,\ \rho_{T_1 \setminus S} = 1 \mid F|_\rho \equiv 0,\ \rho_{T_1} \in \{1,\star\}] \cdot \Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ \rho_S = \star,\ \rho_{T_1 \setminus S} = 1]\\
&\quad\leq \sum_{r=1}^{w} \underbrace{\sum_{\substack{S \subseteq T_1\\ |S| = r}} \Pr[\rho_S = \star \mid F|_\rho \equiv 0,\ \rho_{T_1} \in \{1,\star\}]}_{\text{(a): ``}\Pr[T_1 \text{ has } r \text{ stars}]\text{''}} \cdot \underbrace{\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ \rho_S = \star,\ \rho_{T_1 \setminus S} = 1]}_{\text{(b): ``}\Pr[\text{the rest of } f \text{ has depth } \geq t - r]\text{''}}
\end{aligned}$$

(note that we’re using shortcuts like “$\rho_{T_1} \in \{1,\star\}$” to mean “$\rho$ sets all of the variables of $T_1$ to either 1 or $\star$”). A crucial step in the above was to exclude the case $S = \emptyset$. We can do this because when there are no stars, $T_1$ is a terrorist: it evaluates to 1 and collapses the whole function to the constant 1, which sets the decision tree depth to 0.

Part (a): $T_1$ is unlikely to have many stars

First, we deal with the sum $\sum_{S \subseteq T_1, |S| = r} \Pr[\rho_S = \star \mid F|_\rho \equiv 0,\ \rho_{T_1} \in \{1,\star\}]$, which represents the probability that $T_1$ has at least $r$ stars (conditioned on being all 1’s and $\star$’s). These probabilities would be trivial to compute if only we were not conditioning on $F|_\rho \equiv 0$: we have

$$\Pr[\rho_S = \star \mid \rho_{T_1} \in \{1,\star\}] = \left(\frac{\Pr[\rho_i = \star]}{\Pr[\rho_i \in \{1,\star\}]}\right)^r = \left(\frac{\delta}{\frac{1+\delta}{2}}\right)^r \leq (2\delta)^r.$$
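(As a quick numerical check of this conditional probability, here is a small simulation; the values of $\delta$, $w$ and $r$ below are arbitrary illustrative choices.)

```python
import random

# Estimate Pr[rho_S = * | rho_{T_1} in {1,*}] for a fixed set S of r coordinates.
delta, w, r, trials = 0.1, 4, 2, 500_000

def coord():
    u = random.random()
    return '*' if u < delta else (0 if u < delta + (1 - delta) / 2 else 1)

hits = conditioned = 0
for _ in range(trials):
    t1 = [coord() for _ in range(w)]
    if all(v in (1, '*') for v in t1):          # condition on T_1 being all 1's and stars
        conditioned += 1
        hits += all(v == '*' for v in t1[:r])   # the first r coordinates are all stars
print(hits / conditioned)                        # empirical conditional probability
print((delta / ((1 + delta) / 2)) ** r)          # exact value (2*delta/(1+delta))^r
print((2 * delta) ** r)                          # the (2*delta)^r upper bound
```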

We want to somehow argue that conditioning on $F|_\rho \equiv 0$ can only decrease the probability that $\rho_S = \star$. By Bayes’ rule, it suffices to prove the opposite: that conditioning on $\rho_S = \star$ can only decrease the probability of $F|_\rho \equiv 0$, that is

$$\Pr[F|_\rho \equiv 0 \mid \rho_S = \star,\ \rho_{T_1} \in \{1,\star\}] \leq \Pr[F|_\rho \equiv 0 \mid \rho_{T_1} \in \{1,\star\}].$$

We’re comparing two situations. In both of them, we’re forcing $\rho$ to take values 1 and $\star$ over $T_1 \setminus S$. But in the first situation, we’re forcing $\rho_S = \star$, while in the second one we’re only forcing $\rho_S \in \{1,\star\}$. Clearly, if $F$ becomes fixed to 0 in the first situation, it will still be fixed to 0 in the second situation where some of its stars in $S$ are fixed to 1’s.5

Therefore, we obtain6

$$\sum_{\substack{S \subseteq T_1 \\ |S| = r}} \Pr[\rho_S = \star \mid F|_\rho \equiv 0,\ \rho_{T_1} \in \{1,\star\}] \leq \sum_{\substack{S \subseteq T_1 \\ |S| = r}} \Pr[\rho_S = \star \mid \rho_{T_1} \in \{1,\star\}] \leq \sum_{\substack{S \subseteq T_1 \\ |S| = r}} (2\delta)^r \leq \binom{w}{r} (2\delta)^r \leq (2\delta w)^r.$$

Part (b): the rest is unlikely to be deep

To bound the probability $\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ \rho_S = \star,\ \rho_{T_1 \setminus S} = 1]$, it suffices to union bound over all possible fixings $\sigma \in \{0,1\}^S$ of the stars: indeed, if $f_\sigma|_\rho$ has decision tree depth $< t - r$ for each $\sigma$, then $f|_\rho$ must have decision tree depth $< t$.

  • If $\sigma = 1^S$, then $T_1$ gets set to 1, so $f_\sigma|_\rho \equiv 1$, so $\Pr[\mathrm{DT}(f_\sigma|_\rho) \geq t - r] = 0$ and we can just drop that case.
  • Otherwise, $T_1$ gets set to 0, so $f_\sigma|_\rho \equiv (T_2 \vee \dots \vee T_s)_\sigma|_\rho$. This will allow us to induct on $f' := (T_2 \vee \dots \vee T_s)_\sigma$, which has fewer terms than $f$.

But in order to apply the inductive hypothesis, we also need to somehow get rid of the conditioning on $\rho_S = \star$ and incorporate $\rho_{T_1 \setminus S} = 1$ into the condition $F|_\rho \equiv 0$. To do this, observe that:

  • conditioned on $\rho_S = \star$, $F|_\rho \equiv 0$ holds iff it holds under every possible fixing of the variables in $S$: that is, $F$ can be replaced by $\bigvee_{\sigma \in \{0,1\}^S} F_\sigma$;7
  • to make sure that $\rho_{T_1 \setminus S} = 1$, it suffices to add $\bigvee_{i \in T_1 \setminus S} \overline{x_i}$ to $F$.

All in all, “$F|_\rho \equiv 0,\ \rho_S = \star,\ \rho_{T_1 \setminus S} = 1$” is equivalent to “$F'|_\rho \equiv 0,\ \rho_S = \star$”, where the new function

$$F' := \left(\bigvee_{\sigma \in \{0,1\}^S} F_\sigma\right) \vee \left(\bigvee_{i \in T_1 \setminus S} \overline{x_i}\right)$$

doesn’t depend on the variables in $S$ anymore.

This gives us

$$\begin{aligned}
\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ \rho_S = \star,\ \rho_{T_1 \setminus S} = 1]
&\leq \sum_{\sigma \in \{0,1\}^S} \Pr[\mathrm{DT}(f_\sigma|_\rho) \geq t - r \mid F|_\rho \equiv 0,\ \rho_S = \star,\ \rho_{T_1 \setminus S} = 1]\\
&= \sum_{\sigma \in \{0,1\}^S} \Pr[\mathrm{DT}(f_\sigma|_\rho) \geq t - r \mid F'|_\rho \equiv 0,\ \rho_S = \star]\\
&= \sum_{\sigma \in \{0,1\}^S} \Pr[\mathrm{DT}(f_\sigma|_\rho) \geq t - r \mid F'|_\rho \equiv 0]\\
&= \sum_{\substack{\sigma \in \{0,1\}^S\\ \sigma \neq 1^S}} \Pr[\mathrm{DT}((\underbrace{(T_2 \vee \dots \vee T_s)_\sigma}_{f'})|_\rho) \geq t - r \mid F'|_\rho \equiv 0]\\
&\leq \sum_{\substack{\sigma \in \{0,1\}^S\\ \sigma \neq 1^S}} (C\delta w)^{t-r}\\
&\leq 2^r (C\delta w)^{t-r},
\end{aligned}$$

where the second equality holds because neither $f_\sigma$ nor $F'$ depends on the variables in $S$.
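Since this re-encoding of the conditioning is the one step that is easy to get wrong, here is a brute-force check, on a tiny made-up example (the $n$, $T_1$, $S$ and $F$ below are arbitrary choices), that the two events really coincide for every restriction $\rho$:

```python
import itertools

# Check that "F|rho = 0, rho_S = *, rho_{T1\S} = 1" coincides with "F'|rho = 0, rho_S = *",
# where F' = (OR over sigma in {0,1}^S of F_sigma) OR (OR over i in T1\S of NOT x_i).
n = 4
T1, S = {0, 1, 2}, {1, 2}                                   # so T1 \ S = {0}
S_list = sorted(S)
F_sat = {(0, 1, 1, 0), (1, 1, 0, 1), (0, 0, 1, 1)}          # satisfying set of some F

def F(y):
    return tuple(y) in F_sat

def F_prime(y):
    # F_sigma(y) = F(y with the S-coordinates overwritten by sigma)
    or_sigma = any(F([sigma[S_list.index(i)] if i in S else y[i] for i in range(n)])
                   for sigma in itertools.product([0, 1], repeat=len(S)))
    return or_sigma or any(y[i] == 0 for i in T1 - S)        # OR of the negated literals

def collapses_to_0(g, rho):
    """g|rho = 0: g vanishes on every completion of the stars of rho."""
    stars = [i for i in range(n) if rho[i] == '*']
    for fill in itertools.product([0, 1], repeat=len(stars)):
        y = list(rho)
        for i, b in zip(stars, fill):
            y[i] = b
        if g(y):
            return False
    return True

for rho in itertools.product([0, 1, '*'], repeat=n):
    lhs = (collapses_to_0(F, rho) and all(rho[i] == '*' for i in S)
           and all(rho[i] == 1 for i in T1 - S))
    rhs = collapses_to_0(F_prime, rho) and all(rho[i] == '*' for i in S)
    assert lhs == rhs
print("the two events coincide on all", 3 ** n, "restrictions")
```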

Putting (a) and (b) together

Coming back to our sum, we have

$$\begin{aligned}
\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ T_1|_\rho \not\equiv 0]
&\leq \sum_{r=1}^{w} \underbrace{\sum_{\substack{S \subseteq T_1\\ |S| = r}} \Pr[\rho_S = \star \mid F|_\rho \equiv 0,\ \rho_{T_1} \in \{1,\star\}]}_{\text{(a): ``}\Pr[T_1 \text{ has } r \text{ stars}]\text{''}} \cdot \underbrace{\Pr[\mathrm{DT}(f|_\rho) \geq t \mid F|_\rho \equiv 0,\ \rho_S = \star,\ \rho_{T_1 \setminus S} = 1]}_{\text{(b): ``}\Pr[\text{the rest of } f \text{ has depth } \geq t - r]\text{''}}\\
&\leq \sum_{r=1}^{w} (2\delta w)^r \cdot 2^r (C\delta w)^{t-r}\\
&\leq (C\delta w)^t \sum_{r=1}^{w} (4/C)^r.
\end{aligned}$$

It then suffices to set $C := 8$ in order to bound the sum $\sum_{r=1}^w (4/C)^r$ by 1 and prove the lemma.
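(For completeness, a quick numerical sanity check of the last two steps, with arbitrary values of $\delta$, $w$ and $t$ and $C = 8$ as in the proof:)

```python
# Check the summand identity and the geometric sum; delta, w, t are arbitrary.
delta, w, C, t = 0.01, 5, 8, 7
for r in range(1, w + 1):
    summand = (2 * delta * w) ** r * 2 ** r * (C * delta * w) ** (t - r)
    assert abs(summand - (C * delta * w) ** t * (4 / C) ** r) < 1e-12
print(sum((4 / C) ** r for r in range(1, 200)))   # the geometric sum: just below 1
```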

  1. When $w$ is at least somewhat large, this is by far the most common outcome: the chance that no variable in $T_j$ gets set to 0 is only $\left(\frac{1+\delta}{2}\right)^w = 2^{-\Omega(w)}$. But Håstad’s switching lemma doesn’t use this fact at all! 

  2. If they don’t (i.e. if f is read-once), then the above reasoning is completely valid, though! 

  3. To make this complete, we also have to argue that $r$ won’t be too big, but this is the case. In fact, the probability that $T_{j_1}$ has $r$ stars decreases as $r^{-\Omega(r)}$, because we’re counterbalancing a union bound $\binom{w}{r} \leq w^r / r!$ over the choice of the $r$ variables with a probability $\delta^r = O(1/w)^r$. 

  4. Johan Håstad, “Computational limitations of small-depth circuits” (PhD thesis). 

  5. This is not quite formal, but can be made formal by first drawing $\rho_{[n] \setminus S}$, then drawing $\rho_S$ (which is okay to do because all coordinates of $\rho$ are independent). 

  6. Note that we’re not using the factor $1/r!$ that we could have gotten from the binomial $\binom{w}{r}$. In this case, it would only help us improve the constant $C$ slightly. 

  7. Instead of applying this ORing trick, Håstad instead makes the induction hypothesis more general by allowing $f$ and $\rho$ to be defined on a subset $I \subseteq [n]$ of the variables whereas $F$ remains a function over all $n$ variables ($F|_\rho$ is understood to leave the variables outside of $I$ unfixed). In that case, we can replace $\rho$ by its projection $\rho'$ on $[n] \setminus S$, which doesn’t change anything because under the assumption $\rho_S = \star$, we have $f_\sigma|_\rho \equiv f_\sigma|_{\rho'}$ and $F|_\rho \equiv F|_{\rho'}$. And once that’s done, the conditioning $\rho_S = \star$ disappears because $\rho'$ doesn’t depend on $\rho_S$.