Adapted from a talk by Guy Blanc at the teach-us-anything seminar at Stanford.
We explain the link between PAC learning and VC dimension.
Theorem
Let $X$ be a set, let $\mathcal{C}$ be a concept class over $X$, and let $\Pi_{\mathcal{C}}(m)$ be the largest possible number of different ways functions in $\mathcal{C}$ can label $m$ given points. Then except with probability $\delta$, a consistent learner given $m$ samples will output a hypothesis with error at most $\varepsilon = O\!\left(\frac{\log \Pi_{\mathcal{C}}(2m) + \log(1/\delta)}{m}\right)$.
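To make the growth function concrete (a toy example of my own, not from the talk): for one-dimensional thresholds $h_t(x) = \mathbf{1}[x \ge t]$, any $m$ distinct points admit only $m + 1$ distinct labelings, which a short enumeration confirms:

```python
def threshold_labelings(points):
    """Distinct labelings of `points` by threshold functions
    h_t(x) = 1 if x >= t else 0, as t ranges over the reals."""
    xs = sorted(points)
    # One threshold below everything, one in each gap between consecutive
    # points, and one above everything covers every achievable labeling.
    cuts = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return {tuple(int(x >= t) for x in xs) for t in cuts}

print(len(threshold_labelings([0.1, 0.4, 0.7, 0.9])))  # prints 5, i.e. m + 1
```

So for this class $\Pi_{\mathcal{C}}(2m) = 2m + 1$, and the theorem gives error roughly $(\log m + \log(1/\delta))/m$.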
Proof sketch
Let $f \in \mathcal{C}$ be the function being learned, and $\mathcal{D}$ be any distribution over $X$. For any hypothesis $h$, let $\mathrm{err}(h) := \Pr_{x \sim \mathcal{D}}[h(x) \neq f(x)]$ be the “true error”. For some sample $S$ from $\mathcal{D}$, let $\widehat{\mathrm{err}}_S(h) := \Pr_{x \sim S}[h(x) \neq f(x)]$ be the “training error”.
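A tiny numerical illustration of the two quantities (the setup here is my own toy example): the domain is $[0,1]$ with the uniform distribution, the target is the threshold $\mathbf{1}[x \ge 0.5]$, and the hypothesis is the shifted threshold $\mathbf{1}[x \ge 0.6]$, so the true error is the mass of the disagreement region $[0.5, 0.6)$, namely $0.1$:

```python
import random

random.seed(0)

f = lambda x: x >= 0.5  # target concept
h = lambda x: x >= 0.6  # hypothesis; disagrees with f exactly on [0.5, 0.6)

# Training error: fraction of a sample S on which h disagrees with f.
S = [random.random() for _ in range(1000)]
train_err = sum(h(x) != f(x) for x in S) / len(S)

# True error, estimated by Monte Carlo over a huge fresh sample.
big = [random.random() for _ in range(200_000)]
true_err = sum(h(x) != f(x) for x in big) / len(big)

print(f"training error ~ {train_err:.3f}, true error ~ {true_err:.3f}")
```

Both estimates come out close to $0.1$; the proof below is about ruling out hypotheses whose training error is $0$ while their true error stays large.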
Let $S \sim \mathcal{D}^m$ be a set of $m$ samples. Let $B$ be the event that some $h \in \mathcal{C}$ gets $\widehat{\mathrm{err}}_S(h) = 0$ (i.e. $h$ is consistent with the sample) and yet $\mathrm{err}(h) > \varepsilon$ ($h$’s true error is big). We want to upper bound $\Pr[B]$.
However, it is hard to reason about $B$ directly: $B$ involves the quantity $\mathrm{err}(h)$, which depends on the behavior of $h$ over the entire domain $X$, so if we wanted to union bound over $h \in \mathcal{C}$, we would need to union bound over a potentially infinite number of functions. Crucial trick number 1 is to instead consider a proxy $B'$ for event $B$, which instead of looking at $\mathrm{err}(h)$, looks at an approximation $\widehat{\mathrm{err}}_{S'}(h)$ over an imaginary fresh sample $S'$ of $m$ samples. That is, let $B'$ be the event that some $h \in \mathcal{C}$ gets $\widehat{\mathrm{err}}_S(h) = 0$ and yet $\widehat{\mathrm{err}}_{S'}(h) > \varepsilon/2$ (for some fresh sample $S' \sim \mathcal{D}^m$ of $m$ samples).
We’ll upper bound $\Pr[B']$ instead. Why is this okay? Because conditioned on $B$ happening, $B'$ is likely to happen. This can be seen from any old concentration bound: given that we know $h$ has true error $> \varepsilon$, it’s likely to display an empirical error $> \varepsilon/2$ on a (large enough, on the order of $1/\varepsilon$) fresh sample $S'$. So at the cost of making $m$ large enough, we get $\Pr[B' \mid B] \ge 1/2$, and thus $\Pr[B] \le 2\Pr[B']$. Therefore, it is enough to show that $\Pr[B'] \le \delta/2$.
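This step can be checked exactly rather than via a concentration bound (the parameters below are my own illustrative choices): for a hypothesis with true error $p$, the number of mistakes on a fresh sample of $m$ points is a $\mathrm{Binomial}(m, p)$ count, and already for moderate $m \gtrsim 1/\varepsilon$ the chance that it exceeds $\varepsilon m / 2$ is well above $1/2$:

```python
from math import comb

def prob_fresh_error_exceeds(p, m, threshold):
    """Exact probability that a Binomial(m, p) mistake count exceeds
    threshold * m, i.e. that a hypothesis with true error p shows
    empirical error > threshold on a fresh sample of m points."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m + 1) if k > threshold * m)

eps = 0.1  # worst case for the proof: true error exactly eps
for m in [10, 50, 100]:
    q = prob_fresh_error_exceeds(eps, m, eps / 2)
    print(f"m = {m:3d}: Pr[fresh empirical error > eps/2] = {q:.3f}")
```

Since this probability is monotone in $p$, checking it at $p = \varepsilon$ covers every hypothesis with true error above $\varepsilon$.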
All that’s good, but it isn’t clear that we’ve made any progress. Sure, now technically only $2m$ points are involved, but we’re still “selecting” $h$ from a potentially infinite set of functions. Why would “testing $h$ again once” be enough to counterbalance this potentially infinite selection bias and make this sound? Crucial trick number 2 is not to focus on the chance that some $h$ can fool us, but rather to focus on the imbalance between the number of errors that show up in $S$ and $S'$, irrespective of what $h$ actually ends up being.
Here’s the thought experiment: Fix some $h \in \mathcal{C}$, and suppose that you first run $h$ on a combined sample $T \sim \mathcal{D}^{2m}$ of $2m$ points, and only then split $T$ randomly into two halves $S$ and $S'$. Let $E$ be the set of points in $T$ that $h$ labels incorrectly; for $B'$ to happen with this $h$, we need $|E| > \varepsilon m/2$ and all points in $E$ to end up in $S'$. Each point in $E$ has a $1/2$ chance to end up in $S'$, and these events are negatively correlated, so the chance that they all end up in $S'$ is at most $2^{-|E|} \le 2^{-\varepsilon m/2}$. Now, this event only depends on the behavior of $h$ on the $2m$ points of the sample $T$, so there are only $\Pi_{\mathcal{C}}(2m)$ possible values for $h|_T$. Therefore, by a union bound over these restrictions, $\Pr[B'] \le \Pi_{\mathcal{C}}(2m) \cdot 2^{-\varepsilon m/2}$, which is at most $\delta/2$ as soon as $\varepsilon \ge \frac{2\left(\log_2 \Pi_{\mathcal{C}}(2m) + \log_2(2/\delta)\right)}{m}$, as claimed.
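The random-split probability admits an exact closed form worth sanity-checking (a sketch with parameters of my own choosing): if $S'$ is a uniformly random $m$-subset of the $2m$ points, the chance that all $k$ error points land in $S'$ is the hypergeometric ratio $\binom{2m-k}{m-k}/\binom{2m}{m}$, and the negative-correlation claim says this never exceeds $2^{-k}$:

```python
from math import comb

def prob_all_errors_in_Sprime(m, k):
    """Exact probability that, when 2m points are split uniformly at
    random into halves S and S' of size m, all k marked (error) points
    land in S'."""
    return comb(2 * m - k, m - k) / comb(2 * m, m)

# Compare the exact value against the 2^{-k} bound used in the proof.
for k in range(1, 6):
    p = prob_all_errors_in_Sprime(m=20, k=k)
    print(f"k = {k}: exact = {p:.5f}, bound 2^-k = {2**-k:.5f}")
```

For $k = 1$ the two coincide exactly ($1/2$), and for larger $k$ the exact value dips below the bound, which is the negative correlation at work.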