Adapted from a talk by Guy Blanc at the teach-us-anything seminar at Stanford.

We explain the link between PAC learning and VC dimension.

Theorem

Let $X$ be a set, let $\mathcal{F} \subseteq \{f : X \to \{0,1\}\}$ be a concept class over $X$, and let $\Pi_{\mathcal{F}}(m)$ be the largest possible number of different ways functions in $\mathcal{F}$ can label $m$ given points. Then except with probability $2\,\Pi_{\mathcal{F}}(2m)\,2^{-\epsilon m/2}$, a consistent learner¹ will output a hypothesis with error at most $\epsilon$.
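As a concrete illustration, here is a minimal runnable sketch (our own demo code, not from the talk; `consistent_threshold_learner` and `true_error` are hypothetical names) using the class of 1-D thresholds $f_t(x) = \mathbb{1}[x \geq t]$, for which $\Pi_{\mathcal{F}}(m) = m + 1$. A consistent learner fits the sample exactly, and its estimated true error comes out small, as the theorem predicts.

```python
import random

# Demo of the theorem for the concept class of 1-D thresholds
# f_t(x) = 1[x >= t], where Pi_F(m) = m + 1: m distinct points on the
# line admit only m + 1 distinct threshold labelings.

def consistent_threshold_learner(sample):
    """Return a threshold consistent with a labeled sample of a threshold concept."""
    positives = [x for x, y in sample if y == 1]
    # Placing the threshold at the smallest positive point fits every example.
    return min(positives) if positives else float("inf")

def true_error(t_hat, t_star, trials=100_000):
    """Monte Carlo estimate of L_D(f, h) under D = Uniform[0, 1]."""
    disagreements = sum(
        (x >= t_hat) != (x >= t_star)
        for x in (random.random() for _ in range(trials))
    )
    return disagreements / trials

random.seed(0)
t_star, m = 0.3, 500                         # hidden concept and sample size
sample = [(x, int(x >= t_star)) for x in (random.random() for _ in range(m))]
t_hat = consistent_threshold_learner(sample)
print(f"estimated true error: {true_error(t_hat, t_star):.4f}")
```

For scale: with $m = 500$ and, say, $\epsilon = 0.1$, the theorem's failure probability $2(2m+1)2^{-\epsilon m/2}$ is already well below $10^{-4}$.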

Proof sketch

Let $f \in \mathcal{F}$ be the function being learned, and let $D$ be any distribution over $X$. For any hypothesis $h$, let $L_D(f,h) := \Pr_{x \sim D}[f(x) \neq h(x)]$ be the “true error”. For a sample $S$ from $X$, let $L_S(f,h)$ be the “training error”, i.e. the fraction of points in $S$ on which $h$ and $f$ disagree.

Let $S$ be a set of $m$ samples. Let $A$ be the event that some $h \in \mathcal{F}$ gets $L_S(f,h) = 0$ (i.e. $h$ is consistent with the sample) and yet $L_D(f,h) > \epsilon$ ($h$’s true error is big). We want to upper bound $\Pr[A]$.

However, it is hard to reason about $A$ directly: $A$ involves the quantity $L_D(f,h)$, which depends on the behavior of $h$ over the entire domain, so if we wanted to union bound over $h$, we would need to union bound over a potentially infinite number of functions. Crucial trick number 1 is to instead consider a proxy for event $A$ which, instead of looking at $L_D(f,h)$, looks at an approximation $L_T(f,h)$ over an imaginary fresh sample $T$ of $m$ points. That is, let $B$ be the event that some $h \in \mathcal{F}$ gets $L_S(f,h) = 0$ and yet $L_T(f,h) \geq \epsilon/2$ (for a fresh sample $T$ of $m$ points).

We’ll upper bound $\Pr[B]$ instead. Why is this okay? Because conditioned on $A$ happening, $B$ is likely to happen. This can be seen from any old concentration bound: given that we know $h$ has true error $> \epsilon$, it’s likely to display a large error on a (large enough, on the order of $\Theta(1/\epsilon)$) fresh sample $T$. So at the cost of making $m$ large enough, we get $\Pr[B \mid A] \geq 1/2$, and thus $\Pr[B] \geq \Pr[A] \cdot \Pr[B \mid A] \geq \Pr[A]/2$. Therefore, it is enough to show that $\Pr[B] \leq \Pi_{\mathcal{F}}(2m)\,2^{-\epsilon m/2}$.
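To make the concentration step above concrete, here is one standard instantiation (a sketch; the constant $8$ comes from one common form of the multiplicative Chernoff bound, and other forms give other constants). Conditioned on $A$, the witnessing $h$ has some true error $p > \epsilon$, so the number of errors it makes on the fresh sample $T$ is distributed as $\mathrm{Bin}(m, p)$, and

$$\Pr\big[L_T(f,h) < \epsilon/2\big] \leq \Pr\big[\mathrm{Bin}(m,p) < \tfrac{pm}{2}\big] \leq e^{-pm/8} \leq e^{-\epsilon m/8},$$

which is at most $1/2$ as soon as $m \geq (8 \ln 2)/\epsilon = \Theta(1/\epsilon)$.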

All that’s good, but it isn’t clear that we’ve made any progress. Sure, now technically only $2m$ points are involved, but we’re still “selecting” $h$ from a potentially infinite set of functions. Why would “testing it again once” be enough to counterbalance this potentially infinite selection bias and make this sound? Crucial trick number 2 is not to focus on the chance that some $h$ can fool us, but rather to focus on the imbalance between the number of errors that show up in $S$ and $T$, irrespective of what $h$ actually ends up being.

Here’s the thought experiment: Fix some $h$, and suppose that you first run $h$ on a sample $U$ of $2m$ points, and only then split $U$ randomly into $S$ and $T$. Let $U_{\mathrm{bad}} \subseteq U$ be the set of points in $U$ that $h$ labels incorrectly. In order for $B$ to happen, $U_{\mathrm{bad}}$ must contain at least $\epsilon m/2$ points (otherwise $L_T(f,h) \geq \epsilon/2$ is impossible), and all of them must end up in $T$ (otherwise $L_S(f,h) \neq 0$). Each point in $U_{\mathrm{bad}}$ has a $1/2$ chance to end up in $T$, and these events are negatively correlated, so the chance that they all end up in $T$ is at most $2^{-\epsilon m/2}$. Now, $U_{\mathrm{bad}}$ only depends on the behavior of $h$ on the $2m$ points of the sample $U$, so there are only $\Pi_{\mathcal{F}}(2m)$ possible values for $U_{\mathrm{bad}}$. Therefore,

$$\Pr[B] \leq (\text{number of possible } U_{\mathrm{bad}}) \times (\text{probability for each } U_{\mathrm{bad}}) \leq \Pi_{\mathcal{F}}(2m)\,2^{-\epsilon m/2}.$$
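As a sanity check on the splitting step, here is a small simulation (again our own demo code, not from the talk): fix $k = \epsilon m/2$ bad points among $2m$, split the $2m$ points into random halves, and count how often every bad point lands in $T$. The empirical frequency should come in at or below $2^{-k}$.

```python
import random

# Sanity check of the splitting argument: fix k = eps*m/2 "bad" points
# among 2m, split the 2m points into random halves S and T, and estimate
# the chance that every bad point lands in T.

def all_bad_in_T(m, k):
    """Shuffle 2m indices, call the first half T; are bad points 0..k-1 all in T?"""
    idx = list(range(2 * m))
    random.shuffle(idx)
    T = set(idx[:m])
    return all(i in T for i in range(k))

random.seed(0)
m, eps, trials = 100, 0.2, 200_000
k = int(eps * m / 2)                 # 10 bad points
hits = sum(all_bad_in_T(m, k) for _ in range(trials))
print(f"empirical: {hits / trials:.2e}   vs. bound 2^-k = {2**-k:.2e}")
```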
1. Any learner that returns some hypothesis $h \in \mathcal{F}$ that is consistent with all samples.