Suppose we have two (unnormalized) density functions p(x) and q(x) that can be easily evaluated on a point, and suppose that we’re able to get sample from q:

Pr[X=x]q(x).

Then, assuming that p is not too much more concentrated than q at any point, we can get samples from p by sampling from q then “tweaking the probabilities” appropriately.

Algorithm

Repeatedly:

  1. sample x from q;
  2. with probability cp(x)q(x), return x;
  3. otherwise, “reject” x and continue.

Analysis

As long as the “probability” in step 2 is never greater than 1, we’ll get value x with probability

q(x)×(cp(x)q(x))=cp(x),

so this has the right distribution. And it’s enough to set c:=minxq(x)p(x).

But the issue is that the smaller we set c, the longer the algorithm might take until it accepts for the first time. Indeed, assuming that p and q are actually probabilities (they sum to 1), the probability of accepting at any iteration of the loop is

xq(x)×(cp(x)q(x))=cxp(x)=c,

so it would take 1/c=maxxp(x)q(x) iterations on average until the algorithm succeeds. This quantity maxxp(x)q(x) is the max-divergence between p and q (see Power entropies and divergences).

Example: sampling from DNFs

Say we want to sample from the set of satisfying assignments of a DNF formula

f=T1Tm.

It’s easy to sample uniformly over the assignments Ti1(1) that satisfy a particular term Ti of the formula, since they form a hypercube. But the overlaps between the hypercubes Ti1(1) for different i make it nontrivial to sample uniformly over the entirety of f1(1).

That said, we can still easily sample according to the following distribution, which is somewhat close to the desired distribution:

  • pick a term Ti at random according to how large |Ti1(1)| is,
  • then pick a uniformly random x within Ti1(1).

This would be perfect if the hypercubes Ti1(1) were disjoint, but in general, assignments x that satisfy many terms have a higher chance of being drawn: we get density

q(x):=#{i|Ti(x)},

whereas the distribution we wish for is just

p(x):=1[T1(x)Tn(x)].

The ratio p(x)/q(x) is always at most 1, so we can just set c:=1, and each iteration has at least a

p(x)q(x)1m

chance of accepting x, so it takes m iterations on average.

#to-write swap p and q and capitalize