p-norms are ways to aggregate numbers based on their $p$th power (for $1 \le p \le \infty$). The higher $p$ is, the more it cares about extreme values. In fact, the $\infty$-norm is precisely the maximum value.

Depending on how you scale them, there are two kinds of p-norms:

  • p-lengths, which “sum” the numbers,
  • p-means, which “average” the numbers.

p-length

This is the usual notion of $p$-norm of a vector $x \in \mathbb{R}^n$:

$$\|x\|_p := \left(|x_1|^p + \cdots + |x_n|^p\right)^{1/p}.$$

In particular:

  • $\|x\|_1$ is the Manhattan length,
  • $\|x\|_2$ is the Euclidean length,
  • $\|x\|_\infty$ is just the (absolute) maximum: $\|x\|_\infty = \max_i |x_i|$.

The $p$-length decreases with $p$: when $p=1$, every number contributes to the sum “equally”, but as the power $p$ increases, the lower elements become less and less significant to the sum, until only the maximum value matters.
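For a quick numerical sanity check (a minimal numpy sketch; the example vector is made up):

```python
import numpy as np

x = np.array([3.0, 1.0, 0.5, 0.1])  # arbitrary example vector

def p_length(x: np.ndarray, p: float) -> float:
    """p-norm ||x||_p = (sum_i |x_i|^p)^(1/p); p = inf gives max_i |x_i|."""
    if np.isinf(p):
        return np.abs(x).max()
    return (np.abs(x) ** p).sum() ** (1 / p)

for p in [1, 2, 4, 8, np.inf]:
    print(f"p = {p}: {p_length(x, p):.4f}")
# The values decrease with p and approach max_i |x_i| = 3.
```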

It decreases a lot unless $x$ is sparse

Let $1 \le p < q < \infty$. Since the function $x \mapsto x^{q/p}$ is strictly convex and vanishes at $0$, it is superadditive, so we have

$$\begin{aligned}
\|x\|_q &= \left(|x_1|^q + \cdots + |x_n|^q\right)^{1/q} = \left((|x_1|^p)^{q/p} + \cdots + (|x_n|^p)^{q/p}\right)^{1/q} \\
&\overset{\text{(superadditivity)}}{\le} \left(\left(|x_1|^p + \cdots + |x_n|^p\right)^{q/p}\right)^{1/q} = \left(|x_1|^p + \cdots + |x_n|^p\right)^{1/p} = \|x\|_p,
\end{aligned}$$

with equality iff all but one of the coordinates are zero: the vector $x$ is “perfectly sparse”.

This is actually a robust fact: if the ratio $\|x\|_p / \|x\|_q$ is small (i.e. close to $1$), then $x$ must be sparse (in $q$-norm). Indeed, for any threshold $h$, you can separate $x = x_{\ge h} + x_{<h}$, where $x_{\ge h}$ contains the entries bigger than $h$ and $x_{<h}$ contains the rest. Then by the same calculation as above,

$$\|x_{<h}\|_q^q \le \|x_{<h}\|_p^p \, h^{q-p} \le \|x\|_p^p \, h^{q-p},$$

so you can make sure that $x_{<h}$ contains at most a $\delta$ fraction of the $q$-norm by setting

$$\|x\|_p^p \, h^{q-p} = \delta \|x\|_q^q \iff h := \left(\frac{\delta \|x\|_q^q}{\|x\|_p^p}\right)^{\frac{1}{q-p}},$$

and then the number of nonzero entries in $x_{\ge h}$ is at most

$$\frac{\|x\|_p^p}{h^p} = \left(\frac{(\|x\|_p / \|x\|_q)^q}{\delta}\right)^{\frac{p}{q-p}}.$$

See “Lower-degree norms give sparsity” for more intuition and details.
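Here is a small numpy sketch of the threshold argument (the vector, $p$, $q$, and $\delta$ are arbitrary choices, nothing canonical):

```python
import numpy as np

def sparsify(x: np.ndarray, p: float, q: float, delta: float):
    """Split x at the threshold h from the argument above and check that
    x_{>=h} keeps a (1 - delta) fraction of ||x||_q^q with few entries."""
    lp = (np.abs(x) ** p).sum() ** (1 / p)
    lq = (np.abs(x) ** q).sum() ** (1 / q)
    h = (delta * lq**q / lp**p) ** (1 / (q - p))
    big = np.abs(x)[np.abs(x) >= h]
    lost = 1 - (big**q).sum() / lq**q            # q-norm fraction in x_{<h}
    bound = ((lp / lq) ** q / delta) ** (p / (q - p))
    assert lost <= delta and len(big) <= bound
    return len(big), bound

x = np.array([10.0, 9.0, 0.3, 0.2, 0.1, 0.05])  # a nearly sparse vector
print(sparsify(x, p=1, q=2, delta=0.1))          # e.g. (2, ~21.3)
```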

p-mean

Let $a_1, \ldots, a_n \in \mathbb{R}_{\ge 0}$. If you divide the sum of the $p$th powers by $n$, you get a sort of average called the “$p$-mean”¹

$$M_p(a) := \left(\frac{a_1^p + \cdots + a_n^p}{n}\right)^{1/p}.$$

More generally, we can consider the $p$th moment of a random variable $X$ (which we’ll assume is nonnegative for simplicity):

$$\|X\|_p := \mathbb{E}[X^p]^{1/p}.$$

In particular:

  • $M_1(a)$ is the arithmetic mean and $\|X\|_1$ is the expectation,
  • $M_2(a)$ is the quadratic mean,
  • $M_\infty(a)$ and $\|X\|_\infty$ are just the maximum.²

The $p$-mean / $p$th moment increases with $p$: when $p=1$, every number contributes to the average “equally”, but as the power $p$ increases, the lower elements become less and less significant to the average, until only the maximum value matters.
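And the mirror-image sketch for $p$-means (same made-up numbers as before):

```python
import numpy as np

def p_mean(a: np.ndarray, p: float) -> float:
    """M_p(a) = (mean of a_i^p)^(1/p); p = inf gives max(a)."""
    if np.isinf(p):
        return a.max()
    return (a ** p).mean() ** (1 / p)

a = np.array([3.0, 1.0, 0.5, 0.1])
for p in [1, 2, 4, 8, np.inf]:
    print(f"p = {p}: {p_mean(a, p):.4f}")
# The values increase with p and approach max(a) = 3.
```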

It increases a lot unless $a$ is spread / $X$ is reasonable

Let $1 \le p < q < \infty$. Since the function $x \mapsto x^{q/p}$ is strictly convex, by convexity, we have

$$\|X\|_p = \left(\mathbb{E}[X^p]\right)^{1/p} = \left(\mathbb{E}[X^p]^{q/p}\right)^{1/q} \overset{\text{(convexity)}}{\le} \left(\mathbb{E}[X^q]\right)^{1/q} = \|X\|_q,$$

and thus also $M_p(a) \le M_q(a)$. In particular, the inequality $M_1(a) \le M_2(a)$ (known as “AM-QM”) is a special case of Cauchy–Schwarz, and the inequality $\|X\|_1 \le \|X\|_2$ is a consequence of the definition of variance.

The equality case is when

  • $X$ is constant: it’s “perfectly reasonable”,
  • all the $a_i$’s are equal: they are “perfectly spread”.

This is also a robust fact: if the ratio $\|X\|_q / \|X\|_p$ is small (i.e. close to $1$), then $X$ must be reasonable: in particular, it can’t be almost always smaller than its $p$-norm. Indeed, for any $0 < t < 1$,

$$\begin{aligned}
\mathbb{E}[X^p] &\le (t\|X\|_p)^p + \mathbb{E}\!\left[X^p \, \mathbf{1}[X \ge t\|X\|_p]\right] \\
&\overset{\text{(by Hölder's)}}{\le} (t\|X\|_p)^p + \mathbb{E}[X^q]^{\frac{p}{q}} \, \Pr[X \ge t\|X\|_p]^{\frac{q-p}{q}} \\
\implies \quad \Pr[X \ge t\|X\|_p]^{\frac{q-p}{q}} &\ge \frac{(1-t^p)\,\|X\|_p^p}{\|X\|_q^p} \\
\implies \quad \Pr[X \ge t\|X\|_p] &\ge \left(\frac{1-t^p}{(\|X\|_q/\|X\|_p)^p}\right)^{\frac{q}{q-p}},
\end{aligned}$$

so in particular, when the ratio is constant, $\Pr[X \ge t\|X\|_p] \ge \Omega(1)$ for any $t < 1$.
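A quick Monte Carlo sanity check of this bound (a sketch; the exponential distribution and the values of $p$, $q$, $t$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(size=100_000)  # a fairly "reasonable" nonnegative variable
p, q, t = 1, 2, 0.5
norm_p = np.mean(X**p) ** (1 / p)  # empirical ||X||_p
norm_q = np.mean(X**q) ** (1 / q)  # empirical ||X||_q
lhs = np.mean(X >= t * norm_p)     # empirical Pr[X >= t ||X||_p]
rhs = ((1 - t**p) / (norm_q / norm_p) ** p) ** (q / (q - p))
print(lhs, rhs)                    # ~0.61 vs the bound ~0.12
assert lhs >= rhs
```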

Note: I’m pretty sure you can also show that when $\|X\|_q / \|X\|_p \to 1$, it’s not only reasonable but actually concentrated:

$$\Pr\left[X \notin (1 \pm o(1))\|X\|_p\right] \le o(1).$$

For $p=1, q=2$, it is true by Chebyshev since $\|X\|_2^2 - \|X\|_1^2$ is precisely the variance of $X$. I suspect you could prove it in general using a quantitative version of Jensen’s inequality and a lot of courage.
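To spell out the $p=1, q=2$ case (just Chebyshev, rewritten in terms of the norm ratio):

$$\Pr\Big[|X - \|X\|_1| \ge \varepsilon \|X\|_1\Big] \le \frac{\mathrm{Var}(X)}{\varepsilon^2 \|X\|_1^2} = \frac{(\|X\|_2/\|X\|_1)^2 - 1}{\varepsilon^2},$$

which is indeed $o(1)$ for any $\varepsilon$ with $(\|X\|_2/\|X\|_1)^2 - 1 = o(\varepsilon^2)$.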

Overall picture

Bounding either side

The $p$-length inequalities and the $p$-mean inequalities complement each other. On the one hand, we can use the $p$-mean inequalities to bound how fast the $p$-length decreases:

$$\|x\|_q = n^{\frac{1}{q}} M_q(x) \ge n^{\frac{1}{q}} M_p(x) = \frac{\|x\|_p}{n^{\frac{1}{p} - \frac{1}{q}}}.$$

And on the other hand, we can use the $p$-length inequalities to bound how fast the $p$-mean increases:

$$M_q(a) = \frac{\|a\|_q}{n^{\frac{1}{q}}} \le \frac{\|a\|_p}{n^{\frac{1}{q}}} = n^{\frac{1}{p} - \frac{1}{q}} M_p(a).$$
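Both scaling bounds are easy to spot-check numerically (a sketch with arbitrary $n$, $p$, $q$):

```python
import numpy as np

n, p, q = 16, 1.0, 3.0
x = np.random.default_rng(1).random(n)  # arbitrary nonnegative vector
lp, lq = (x**p).sum() ** (1 / p), (x**q).sum() ** (1 / q)
Mp, Mq = (x**p).mean() ** (1 / p), (x**q).mean() ** (1 / q)
scale = n ** (1 / p - 1 / q)
# The p-length decreases with p, but by at most a factor n^(1/p - 1/q):
assert lq <= lp <= scale * lq
# The p-mean increases with p, but by at most the same factor:
assert Mp <= Mq <= scale * Mp
```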

But on the third hand, you can’t bound how fast the moments of a continuous random variable increase: in fact, it could be that the $p$th moment is finite but the $q$th moment is infinite (e.g. a Pareto variable with tail $\Pr[X \ge x] = x^{-r}$ for some $p < r \le q$).
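To see this concretely (a sketch; numpy’s `pareto(a)` sampler is shifted by $1$ here so that the tail is exactly $\Pr[X \ge x] = x^{-a}$):

```python
import numpy as np

# Pareto tail Pr[X >= x] = x^(-1.5): the 1st moment is finite (= 3),
# but the 2nd moment is infinite, so its sample version never settles.
rng = np.random.default_rng(0)
for n in [10**3, 10**5, 10**7]:
    X = rng.pareto(1.5, size=n) + 1
    print(n, X.mean(), (X**2).mean())
```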

So we get the following picture (for $1 \le p < q$):

  • The $q$-norm $\|x\|_q$ is stuck between $\|x\|_p / n^{\frac{1}{p}-\frac{1}{q}}$ (“spread” regime) and $\|x\|_p$ (“sparse” regime).
  • The $q$-mean $M_q(a)$ is stuck between $M_p(a)$ (“spread” regime) and $n^{\frac{1}{p}-\frac{1}{q}} M_p(a)$ (“sparse” regime).
  • The $q$th moment $\|X\|_q$ is stuck between $\|X\|_p$ (“reasonable/concentrated” regime) and
    • either $n^{\frac{1}{p}-\frac{1}{q}} \|X\|_p$ if $X$ is discrete (“sparse” regime),
    • or $\infty$ if $X$ is continuous (“spiky” regime).

TODO: change “concentrated” to “spread” in this graph #figure
TODO: also make a “table” #figure

Qualities being tracked

We can use ratios between norms to track different qualitative behaviors (the ratios below are $\ge 1$, with $1$ indicating perfection):

  • for a vector $x \in \mathbb{R}^n$, the length ratios $\|x\|_p / \|x\|_q$ track its sparsity;
  • for a sequence $a \in \mathbb{R}_{\ge 0}^n$, the mean ratios $M_q(a) / M_p(a)$ track its spreadness;
  • for a random variable $X$, the moment ratios $\|X\|_q / \|X\|_p$ track its reasonableness;
    • by extension, so do the norm ratios $\|f\|_q / \|f\|_p$ for a function $f$ (by considering its value $f(X)$ over a random input $X$);
  • for a discrete random variable $X$, the power entropy ratios $H_\alpha[X] / H_\beta[X]$ (for $0 \le \alpha < \beta$) track its flatness (see the sketch below).
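A sketch of that last item, reading the power entropies as Rényi entropies $H_\alpha[X] = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha$ (that reading is my assumption, and the two distributions are made up):

```python
import numpy as np

def renyi_entropy(probs: np.ndarray, alpha: float) -> float:
    """Renyi entropy H_alpha; decreasing in alpha, all equal iff uniform."""
    if np.isclose(alpha, 1.0):
        return -(probs * np.log(probs)).sum()  # Shannon limit
    return np.log((probs ** alpha).sum()) / (1 - alpha)

flat = np.full(8, 1 / 8)
spiky = np.array([0.9] + [0.1 / 7] * 7)
for name, dist in [("flat", flat), ("spiky", spiky)]:
    ratio = renyi_entropy(dist, 0.0) / renyi_entropy(dist, 2.0)
    print(name, ratio)  # 1.0 for the flat distribution, ~10 for the spiky one
```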

TODO: add something about Density functions (are inner products like $\langle f, f^p \rangle$ the right notion?)

See also

1. In fact, $p$-means can be defined for any $p$, but they’re only norms for $p \ge 1$ (in the sense of satisfying the triangle inequality). 

2. at least, assuming $X$ is either discrete or its pdf is continuous