See also: p-norms.
Suppose you want to show some vector $(x_1, \ldots, x_d)$ is sparse in $q$-norm. That is, you want to pick out some small subset of the coordinates $I \sse [d]$ such that the rest is small:
\[\sum_{i \not\in I} |x_i|^q \leq \eps.\]
Then intuitively, it should suffice to prove an upper bound on its $p$-norm for some $p<q$. Indeed, in lower-degree norms, small elements count for much more. So if they’re bounded in $p$-norm, they should become negligible in $q$-norm.
Bounded perimeter means sparse in area
The idea is simplest to grasp with $p=1$ and $q=2$. Indeed, if we visualize the $x_i$’s as $x_i \times x_i$ squares and draw them next to each other on a line, then
- the $1$-norm is the total length (or up to a factor $4$, the “total perimeter”);
- the squared $2$-norm is the total area.
Say the total length is $L$. The goal here is to pick a small subset of the squares so as to minimize the total area of the squares you don’t pick.
Clearly, to minimize the area you give up on, you should pick all the squares whose height is above some threshold $h$. If you do this, then the squares you didn’t pick are all within an $L \times h$ rectangle, so their total area is $\leq Lh$. On the other hand, since all the squares you picked had side length $\geq h$ and their total length is $\leq L$, there can only be $\leq L/h$ of them.
If you want to make sure the area you abandon is $\leq \eps$, then you need to set $h \ce \eps/L$, which means you’ll need to pick up to $L/h = L^2/\eps$ squares.
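As a sanity check, here is a minimal Python sketch of the $p=1$, $q=2$ argument above. The function name `pick_large_squares` and the toy side lengths are made up for illustration; the code only thresholds at $h = \eps/L$ and checks the two bounds.

```python
def pick_large_squares(sides, eps):
    """Keep the squares with side >= h = eps / L, where L is the total length.

    Returns (kept indices, total area of the dropped squares, L).
    """
    L = sum(abs(s) for s in sides)
    if L == 0:
        return [], 0.0, 0.0  # nothing to keep, nothing to lose
    h = eps / L  # threshold from the argument above
    kept = [i for i, s in enumerate(sides) if abs(s) >= h]
    dropped_area = sum(s * s for s in sides if abs(s) < h)
    return kept, dropped_area, L


# Toy data (an assumption for illustration): two big squares, fifty tiny ones.
sides = [1.0, 0.5] + [0.01] * 50
eps = 0.1
kept, dropped_area, L = pick_large_squares(sides, eps)

assert dropped_area <= eps        # abandoned area is at most eps
assert len(kept) <= L * L / eps   # at most L^2 / eps squares are kept
```

On this toy input only the two big squares survive the threshold, far fewer than the worst-case $L^2/\eps$.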
This trick is used in the proof of Mansour’s theorem about $w$-DNFs: one of the consequences of the switching lemma is that the coefficients of degree $O(w)$ have Fourier 1-norm at most $L \leq w^{O(w)}$, which implies Fourier concentration on about $L^2/\eps =w^{O(w)}$ of those sets (for constant $\eps$).
In general, suppose that
\[\norm{x}_p^{p} = \sum_{i=1}^d |x_i|^p \leq L.\]
If you let $I \ce \set{i \mid |x_i| \geq h}$ be the set of coordinates whose magnitude is at least some threshold $h$, then the total $q$-norm mass outside of $I$ is
\[\sum_{i \not\in I} |x_i|^q = \sum_{i \not\in I} |x_i|^p |x_i|^{q-p} \leq \p{\sum_{i \not \in I} |x_i|^p}h^{q-p} \leq Lh^{q-p}.\]
On the other hand, each $i \in I$ contributes at least $h^p$ to $\norm{x}_p^p$, so $|I| \leq L/h^p$.
To make sure the error in $q$-norm is at most $\eps$, it suffices to set $h \ce (\eps/L)^{\frac{1}{q-p}}$, which gives sparsity
\[|I| \leq L / h^p = \frac{L}{(\eps/L)^{\frac{p}{q-p}}} = \frac{L^{1+\frac{p}{q-p}}}{\eps^{\frac{p}{q-p}}}.\]
Those $\frac{p}{q-p}$ terms in the exponents make sense: the closer $p$ is to $q$, the less small elements will shrink when going from $p$-norm to $q$-norm, so you’ll need to include more of them. In the limit when $q$ is much larger than $p$, you only have to pay $L$: the worst case is when you just have $x_1 = \cdots = x_L = 1$, because they don’t get smaller as the exponent grows.
This more general form of the trick is used in Lovett, Wu and Zhang’s DNF sparsification result. By hypercontractivity, they show that the $j\nth$ term is only used an $x_j^{3/2}$ fraction of the time, where $x_j$ is the probability that the $j\nth$ term stays alive under a mild random restriction. In addition, they use a switching lemma to show that $\sum_j x_j \leq 2^{O(w)}$. With $p\ce 1$, $q \ce 3/2$ and $L \ce 2^{O(w)}$, this means that you can get rid of all but $L^3/\eps^2 = 2^{O(w)}/\eps^2$ terms without changing the value of the function more than an $\eps$ fraction of the time.
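The general thresholding argument can also be checked numerically. The sketch below is an illustration under stated assumptions, not code from any of the cited papers: the name `sparsify` and the test vector are hypothetical, and the exponents $p=1$, $q=3/2$ are borrowed from the application above only to make the $L^3/\eps^2$ bound concrete.

```python
def sparsify(x, p, q, eps):
    """Threshold x at h = (eps / L)^(1/(q-p)), where L = sum_i |x_i|^p.

    Returns (I, tail, bound):
      I     -- indices i with |x_i| >= h
      tail  -- sum of |x_i|^q over the indices outside I
      bound -- the guarantee L^(1 + p/(q-p)) / eps^(p/(q-p)) on len(I)
    """
    assert 0 < p < q and eps > 0
    L = sum(abs(v) ** p for v in x)
    if L == 0:
        return [], 0.0, 0.0  # the zero vector is trivially sparse
    h = (eps / L) ** (1.0 / (q - p))
    I = [i for i, v in enumerate(x) if abs(v) >= h]
    tail = sum(abs(v) ** q for v in x if abs(v) < h)
    bound = L ** (1 + p / (q - p)) / eps ** (p / (q - p))
    return I, tail, bound


# Toy vector (an assumption for illustration): x_i = 1 / i^2 for i = 1..1000.
x = [1.0 / (i * i) for i in range(1, 1001)]
I, tail, bound = sparsify(x, p=1, q=1.5, eps=0.01)

assert tail <= 0.01      # q-norm mass outside I is at most eps
assert len(I) <= bound   # here bound = L^3 / eps^2
```

On vectors that decay quickly, like this one, the number of coordinates actually kept is far below the worst-case bound, which is only tight for nearly flat vectors.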
As a fraction of the $q$-norm
If we express the loss $\eps$ we’re willing to incur as a fraction $\delta$ of the $q$-norm, that is $\eps = \delta \norm{x}_q^q$, and take $L = \norm{x}_p^p$, then we get
\[h \ce \p{\frac{\eps}{L}}^{\frac{1}{q-p}}= \p{\frac{\delta \norm{x}_q^q}{\norm{x}_p^p}}^{\frac{1}{q-p}},\]
and the sparsity becomes
\[|I| \leq \frac{\norm{x}_p^p}{h^p} = \p{\frac{\p{\norm{x}_p/\norm{x}_q}^q}{\delta}}^{\frac{p}{q-p}}.\]
This says something pretty deep: as long as the ratio $\frac{\norm{x}_p}{\norm{x}_q}$ between the $p$-norm and the $q$-norm isn’t too big, the vector $x$ must be sparse.
Note: you can probably get even stronger guarantees when $\frac{\norm{x}_p}{\norm{x}_q}$ is very close to $1$ (you should be able to prove that basically all the weight is in a single coordinate).
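To get a feel for the ratio form of the bound, here is a small Python sketch that evaluates it on two extreme vectors. The helper name `relative_sparsity_bound` and the choice $p=1$, $q=2$, $\delta=1/2$ are assumptions made for illustration.

```python
def relative_sparsity_bound(x, p, q, delta):
    """Evaluate ((||x||_p / ||x||_q)^q / delta)^(p/(q-p)) for a nonzero x."""
    assert 0 < p < q and delta > 0
    norm_p = sum(abs(v) ** p for v in x) ** (1.0 / p)
    norm_q = sum(abs(v) ** q for v in x) ** (1.0 / q)
    return ((norm_p / norm_q) ** q / delta) ** (p / (q - p))


d = 100
one_hot = [1.0] + [0.0] * (d - 1)  # all the weight on one coordinate
flat = [1.0] * d                   # weight spread evenly

# Norm ratio 1: the bound depends only on delta, not on d.
sparse_bound = relative_sparsity_bound(one_hot, p=1, q=2, delta=0.5)

# Norm ratio sqrt(d): the bound exceeds d, so it says nothing here.
flat_bound = relative_sparsity_bound(flat, p=1, q=2, delta=0.5)

assert sparse_bound < d < flat_bound
```

The flat vector is the case where the ratio is as large as it can be for $d$ coordinates, and the bound correctly refuses to promise any sparsity there.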
Other applications
Daniel Hsu told me this is often used in the study of sparse recovery problems (e.g. compressed sensing), which makes sense. :)