Adapted from lecture 12 of Ryan O’Donnell’s Fall 2017 graduate complexity theory class at CMU, lecture 12c of his Spring 2020 “CS theory toolkit” class (also at CMU), and sections 1.2-1.3 of the lecture notes for Emanuele Viola’s Fall 2017 “Special topics in complexity theory” class at Northeastern.
A distribution $X$ over $\{0,1\}^n$ is $k$-wise uniform if $X$ looks uniform over every set of $k$ variables. Similarly, a random hash function $h: \{0,1\}^m \to \{0,1\}$ is $k$-wise uniform if the hashes of any $k$ distinct inputs are distributed uniformly over $\{0,1\}^k$. In fact (and this took me way too much time to realize), this is the same notion: a $k$-wise uniform hash function over $\{0,1\}^m$ corresponds to a $k$-wise uniform distribution over $\{0,1\}^{2^m}$: just interpret each of the $2^m$ outputs as one coordinate of the distribution. So we will use them interchangeably.
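To make the correspondence concrete, here’s a small brute-force sketch in Python (the helpers `truth_table` and `is_k_wise_uniform` are my own names, and this is only feasible for tiny parameters):

```python
from itertools import product, combinations

def truth_table(h, m):
    """View a hash function h: {0,1}^m -> {0,1} as a single outcome of a
    distribution over {0,1}^(2^m): one coordinate per input."""
    return tuple(h(x) for x in product((0, 1), repeat=m))

def is_k_wise_uniform(outcomes, k):
    """Brute-force check that the uniform distribution over `outcomes`
    (a list of equally likely {0,1}-tuples) is k-wise uniform: every set
    of k coordinates must hit each of the 2^k patterns equally often."""
    n = len(outcomes[0])
    for coords in combinations(range(n), k):
        counts = {}
        for w in outcomes:
            pattern = tuple(w[i] for i in coords)
            counts[pattern] = counts.get(pattern, 0) + 1
        if len(counts) < 2 ** k or len(set(counts.values())) > 1:
            return False
    return True
```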
$k$-wise uniformity is equivalent to fooling all polynomial tests of degree $\le k$ over $\mathbb{F}_2$. Indeed,
- if $p$ has degree $\le k$, then we can decompose $\mathbb{E}[p(X)]$ into the expectation of its monomials, and those must be fooled since they each depend on only $\le k$ variables;
- conversely, if $\Pr[X_S = y] \ne 2^{-|S|}$ for some set $S$ of $\le k$ variables and some assignment $y \in \{0,1\}^S$, then the polynomial $p(x) := \prod_{i \in S}(x_i + y_i + 1)$ (the indicator that $x_S = y$) has degree $\le k$ and isn’t fooled.
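As a tiny sanity check, take $X = (X_1, X_2, X_1 \oplus X_2)$: it’s pairwise uniform but not $3$-wise uniform ($\Pr[X = 111] = 0 \ne 1/8$), and the polynomial from the second bullet with $S = \{1,2,3\}$ and $y = 111$ is just the monomial $x_1 x_2 x_3$. A brute-force sketch (`poly_expectation` is a made-up helper):

```python
from itertools import product

def poly_expectation(monomials, outcomes):
    """Average of p over equally likely `outcomes`, where the GF(2)
    polynomial p is a list of monomials (tuples of variable indices)
    and p(w) is the XOR of the monomials' values."""
    total = 0
    for w in outcomes:
        value = 0
        for mono in monomials:
            term = 1
            for i in mono:
                term &= w[i]
            value ^= term          # sum of monomials mod 2
        total += value
    return total / len(outcomes)

X = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
U = list(product((0, 1), repeat=3))
print(poly_expectation([(0, 1, 2)], X))  # 0.0   -- p = x1*x2*x3 on X
print(poly_expectation([(0, 1, 2)], U))  # 0.125 -- vs 1/8 under uniform
```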
In this note, we’ll try to build $k$-wise uniform distributions and hash functions that can be efficiently computed from a short random seed.
The case $k = 1$ is trivial: we just need to make sure that individually, each coordinate of the distribution is $0$ or $1$ with probability $1/2$ each. So we can let $X$ be uniform over the two strings $00\cdots0$ and $11\cdots1$. In terms of hash functions, we can let $h$ be the all-zeroes or all-ones function with probability $1/2$ each.
This requires only $1$ bit of randomness: the seed length is $1$.
When $k = 2$, in addition to making sure that $\Pr[X_i = 1] = 1/2$ for each $i$, we also need to make sure that the hashes of any two distinct points are equal as often as they’re unequal. In particular, $h(x)$ and $h(x')$ shouldn’t always have the same value.
In the abstract, how would you try to design a hash function that avoids systematic pairwise collisions? One attempt might be to do some kind of “parity check”, say XOR all the bits together. But for whatever fixed parity check you come up with, there will be distinct strings $x \ne x'$ that nevertheless map to the same value.
On the other hand, taking a random parity check seems more promising: that is, we’ll draw a random “parity check” string $r \in \{0,1\}^m$, then hash $x$ to
$$h_r(x) := r \cdot x = r_1 x_1 \oplus \cdots \oplus r_m x_m.$$
If $x \ne x'$, then their hashes will differ with probability $1/2$: indeed, the “difference” is $h_r(x) \oplus h_r(x') = r \cdot (x \oplus x')$, and if you take a random parity check of a nonzero string, it will be $1$ with probability $1/2$.
This doesn’t give us pairwise uniformity, though: the issue is that $h_r(0^m)$ is always $0$. Luckily, this is the only issue, because for every $x$ that’s not the all-zeros string, we do have $h_r(x) = 1$ with probability $1/2$. In fact, this means we already have a pairwise-uniform distribution over $2^m - 1$ coordinates given by $(r \cdot x)_{x \ne 0^m}$.
To get pairwise uniformity over all $2^m$ coordinates, it’s enough to avoid the all-zeros string by first mapping each $x$ to $(x, 1)$, then taking a random parity check $r' \in \{0,1\}^{m+1}$. This is equivalent to drawing a random string $r \in \{0,1\}^m$ and a random bit $b$, and mapping $x$ to
$$h_{r,b}(x) := r \cdot x \oplus b.$$
This requires seed length $m + 1$ for hash functions, and $\log n + 1$ for distributions over $n$ coordinates. And it’s very easy to compute. :)
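Here’s what this looks like in code, checked by brute force with `is_k_wise_uniform` from the earlier sketch (illustration only, for tiny $m$):

```python
from itertools import product

def dot(r, x):
    """The parity check r.x: parity of the bitwise AND of r and x."""
    return sum(ri & xi for ri, xi in zip(r, x)) % 2

m = 3
outcomes = []
for r in product((0, 1), repeat=m):      # m random bits for r ...
    for b in (0, 1):                     # ... plus one random bit b
        outcomes.append(tuple(dot(r, x) ^ b
                              for x in product((0, 1), repeat=m)))

# 2^m = 8 coordinates from a seed of m + 1 = 4 bits:
print(is_k_wise_uniform(outcomes, 2))    # True
```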
Dual distance
It’s fairly natural to try and achieve $k$-wise uniformity by making the distribution uniform over a linear code $C \subseteq \mathbb{F}_2^n$. Indeed, the uniform distribution over $C$ is uniform on a set $S$ of coordinates if and only if $C$ is “complete over $S$”, i.e. $C_S = \mathbb{F}_2^S$. In particular, this means that $C$ will be $k$-wise uniform as long as any $k$ columns in the generator matrix $G$ of $C$ are linearly independent.
For example, the distribution with $2^m - 1$ coordinates we described in the previous section is the Hadamard code: when $m = 2$, it is generated by the matrix of all nonzero vectors
$$G = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix},$$
with column $x$ giving coordinate $r \cdot x$. It’s pairwise uniform because any two distinct nonzero vectors are linearly independent.
The condition “any $k$ columns in $G$ are linearly independent” can be rephrased as “no nonzero vector of Hamming weight $\le k$ is perpendicular to all rows of $G$”: such a vector would give the coefficients of a linear combination of $\le k$ columns that sums to zero. Since the vectors perpendicular to all rows of $G$ form the dual code $C^\perp$, this is also equivalent to the dual code having distance $> k$ (we call this distance the dual distance of $C$).
This means that to find a small $k$-wise uniform distribution, it’s enough to find a large linear code $C^\perp$ with distance $> k$: the parity checks of $C^\perp$ will be the generators of $C$, so their number is the seed length needed. In the example above, $C^\perp$ is the Hamming code of length $2^m - 1$, which has distance $3$ and whose $m$ parity checks are the rows of $G$.
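As a sanity check, here’s the $m = 2$ example in code: any two columns of $G$ are linearly independent, and the dual code (here the repetition code $\{000, 111\}$) indeed has distance $3 > k = 2$. Brute force, just for illustration:

```python
from itertools import combinations, product

G = [(0, 1, 1),        # generator matrix of the m = 2 example
     (1, 0, 1)]
cols = list(zip(*G))   # the three nonzero vectors of F_2^2

# over F_2, two columns are linearly independent iff distinct and nonzero:
print(all(any(c) and any(c2) and c != c2
          for c, c2 in combinations(cols, 2)))             # True

# dual distance: min weight of a nonzero vector orthogonal to all rows of G
n = len(cols)
dual = [v for v in product((0, 1), repeat=n)
        if all(sum(gi * vi for gi, vi in zip(g, v)) % 2 == 0 for g in G)]
print(min(sum(v) for v in dual if any(v)))                 # 3, i.e. > k = 2
```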
General case: Reed-Solomon codes
The Hamming bound tells us that a code of distance $> k$ can only cover a $\approx n^{-k/2}$ fraction of $\{0,1\}^n$. So $C^\perp$ would need $\approx \frac{k}{2}\log n$ parity checks, and thus even in the best case the seed length would be at least $\approx \frac{k}{2}\log n$.
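Concretely, the arithmetic behind this (standard Hamming-bound bookkeeping, writing $s$ for the seed length, i.e. the number of parity checks of $C^\perp$):
$$|C^\perp| \;\le\; \frac{2^n}{\binom{n}{\lfloor k/2 \rfloor}} \quad\Longrightarrow\quad s \;=\; n - \log_2 |C^\perp| \;\ge\; \log_2 \binom{n}{\lfloor k/2 \rfloor} \;\approx\; \frac{k}{2}\log n.$$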
Embarrassingly, I don’t know how to match the Hamming bound in general over $\mathbb{F}_2$, but I do know how to do this over larger fields such as $\mathbb{F}_{2^\ell}$. And fortunately, if we have a $k$-wise uniform distribution over $\mathbb{F}_{2^\ell}^{n'}$, we still get a $k$-wise uniform distribution over $\{0,1\}^{\ell n'}$ by interpreting the elements of $\mathbb{F}_{2^\ell}$ as strings in $\{0,1\}^\ell$ and concatenating them together: any $k$ of the resulting bits depend on at most $k$ of the field elements, which are jointly uniform.
Specifically, Reed-Solomon codes are great at achieving dual distance. Pick a field $\mathbb{F}_q$ of size $q \ge n$, and fix $n$ distinct elements $\alpha_1, \ldots, \alpha_n$ in it (the “evaluation points”). Let $P$ be the set of all polynomials over $\mathbb{F}_q$ of degree $< k$. Consider the code $C$ where each codeword gives the values of some polynomial $p \in P$ at the evaluation points $\alpha_1, \ldots, \alpha_n$:
$$C := \{(p(\alpha_1), \ldots, p(\alpha_n)) : p \in P\}.$$
Polynomials of degree $< k$ are uniquely determined by their values at $k$ distinct points, so $C$ is complete on every set of $k$ coordinates, and therefore the uniform distribution on $C$ is $k$-wise uniform.
Each coefficient takes $\log q$ bits of randomness, so by setting $q := 2^{\lceil \log n \rceil}$ and $\ell := \log q$, we get a $k$-wise uniform distribution over $\{0,1\}^n$ with seed length $k \log q \approx k \log n$ (within a factor of $2$ of the Hamming-bound benchmark), and it’s very easy to compute.
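Here’s a sketch of the sampler in code. For brevity I take $q$ prime (so that `% q` implements the field arithmetic) instead of $q = 2^\ell$; note that with a prime $q$, the final “interpret field elements as bit strings” step would not be exactly uniform, which is exactly why the real construction works over $\mathbb{F}_{2^\ell}$:

```python
import random

def rs_sample(n, k, q, rng=random):
    """One draw from a k-wise uniform distribution over F_q^n, for prime
    q >= n, using evaluation points alpha_i = i. The seed is the k random
    coefficients of p, i.e. k * log2(q) bits of randomness."""
    assert q >= n
    coeffs = [rng.randrange(q) for _ in range(k)]  # a random p of degree < k
    def p(a):
        value = 0
        for c in reversed(coeffs):                 # Horner evaluation
            value = (value * a + c) % q
        return value
    return [p(i) for i in range(n)]

print(rs_sample(n=7, k=3, q=7))  # one 3-wise uniform sample from F_7^7
```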
Note: If you really care, you can get seed length $\frac{k}{2}\log n + O(k)$ from BCH codes, and this generalizes to the setting where both the seeds and the outputs are over $\mathbb{F}_q$ instead of $\mathbb{F}_2$.
Lower bounds
We seemed to get the best possible seed length we could have gotten from linear codes, but maybe there was a more efficient way to do things, with a seed length much smaller than $\frac{k}{2}\log n$?
Let’s start with a very basic observation. Consider the matrix whose rows are all possible outcomes of the distribution: if the seed length is $s$, it has $2^s$ rows and $n$ columns. If the distribution is pairwise uniform, then at the very least, no two columns should be completely identical: otherwise those two coordinates would be perfectly correlated. There are only $2^{2^s}$ possible columns, so we get $n \le 2^{2^s}$, i.e. $s \ge \log \log n$.
But we have a stronger guarantee than this: for any two columns $c \ne c'$, we should have $c_i = c'_i$ as often as $c_i \ne c'_i$. If we switch the domain from $\{0,1\}$ to $\{\pm 1\}$, this means their inner product is $0$: $c$ and $c'$ are orthogonal. A space of dimension $2^s$ cannot have more than $2^s$ pairwise orthogonal vectors, so we have $n \le 2^s$, i.e. $s \ge \log n$.
More generally, if the distribution is $k$-wise uniform, then for any $k$ columns $c^{(1)}, \ldots, c^{(k)}$, the $k$-tuple $(c^{(1)}_i, \ldots, c^{(k)}_i)$ should take every value in $\{\pm 1\}^k$ equally often. If we split them into two groups of $k/2$ (assume $k$ is even), then the products $c^{(1)}_i \cdots c^{(k/2)}_i$ and $c^{(k/2+1)}_i \cdots c^{(k)}_i$ should be equal as often as they’re unequal. This means that the element-wise products $c^{(1)} \odot \cdots \odot c^{(k/2)}$ and $c^{(k/2+1)} \odot \cdots \odot c^{(k)}$ are orthogonal. There are $\binom{n}{k/2}$ such products of $k/2$ distinct columns, so by the same argument we get $\binom{n}{k/2} \le 2^s$, i.e. $s \ge \log \binom{n}{k/2} \approx \frac{k}{2}\log n$.
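To see the orthogonality concretely, here’s the pairwise case ($k = 2$) checked on the outcome matrix of the $h_{r,b}$ family from before ($m = 3$, so $s = 4$ and $n = 8$; the matrix is rebuilt inline so the snippet stands alone):

```python
from itertools import product, combinations

m = 3
outcomes = [tuple((sum(ri & xi for ri, xi in zip(r, x)) + b) % 2
                  for x in product((0, 1), repeat=m))
            for r in product((0, 1), repeat=m) for b in (0, 1)]

# switch to +-1 and check that all n = 2^m columns are pairwise orthogonal
signed = [[(-1) ** bit for bit in row] for row in outcomes]
cols = list(zip(*signed))
print(all(sum(a * b for a, b in zip(c, c2)) == 0
          for c, c2 in combinations(cols, 2)))   # True, so n <= 2^s = 16
```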