Entropies are dimensionless notions of how spread out a random variable is (as opposed to dispersions, which care about distance in space). Concretely, for any convex function $f$ such that $f(0) = f(1) = 0$, we can define the $f$-entropy of a distribution $P(X)$ as

$$H_f(P) := -\sum_{x \in X} f(P(x)) = -\mathbb{E}_{x \sim P}\!\left[\frac{f(P(x))}{P(x)}\right].$$
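
As a minimal numerical sketch of this definition (the helper name `f_entropy` and the particular choice of $f$ are just illustrative, assuming NumPy is available):

```python
import numpy as np

def f_entropy(p, f):
    """H_f(P) = -sum_x f(P(x)) for a convex f with f(0) = f(1) = 0."""
    p = np.asarray(p, dtype=float)   # P as a vector of probabilities summing to 1
    return -np.sum(f(p))             # apply f elementwise, sum, and negate

# Illustrative choice: f(p) = p^2 - p is convex and vanishes at 0 and 1,
# so the corresponding entropy is H_f(P) = 1 - sum_x P(x)^2.
f_quad = lambda p: p**2 - p
print(f_entropy([0.5, 0.25, 0.25], f_quad))  # 0.625
```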

Range

By convexity, $f(p) = f\big((1-p)\cdot 0 + p\cdot 1\big) \le (1-p)\,f(0) + p\,f(1) = 0$ for any $p \in [0,1]$, so $H_f(P)$ is nonnegative. On the other hand, if we let $n := |X|$ and write $\mathbb{E}_{x \in X}$ for an average over $x$ drawn uniformly from $X$, then by convexity we have

$$\frac{H_f(P)}{n} = \mathbb{E}_{x \in X}\big[{-f(P(x))}\big] \overset{\text{(convexity)}}{\leq} -f\big(\mathbb{E}_{x \in X}[P(x)]\big) = -f\!\left(\frac{1}{n}\right),$$

so the $f$-entropy takes a maximum of $-n\,f\!\left(\frac{1}{n}\right)$, attained when $P$ is uniform.
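
A quick check of both bounds on random distributions, again using the illustrative $f(p) = p^2 - p$ from the sketch above (this is a sanity check, not a rigorous test):

```python
import numpy as np

f = lambda p: p**2 - p               # convex, f(0) = f(1) = 0
n = 5
rng = np.random.default_rng(0)

for _ in range(3):
    p = rng.dirichlet(np.ones(n))    # a random distribution on n points
    H = -np.sum(f(p))                # H_f(P)
    print(0 <= H <= -n * f(1 / n))   # nonnegative and below the uniform bound -> True

uniform = np.full(n, 1 / n)
print(np.isclose(-np.sum(f(uniform)), -n * f(1 / n)))  # bound attained at uniform -> True
```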

Informally, $\sum_{x \in X} f(P(x))$ can be understood as the $f$-divergence $D_f(P \,\|\, \mathbb{1})$, where $\mathbb{1}$ is an imaginary prior distribution that puts probability $1$ on every point. In this sense, the $f$-entropy is (the negative of) the “absolute” version of the $f$-divergence, for a neutral prior. The closer $P$ is to uniform, the lower $D_f(P \,\|\, \mathbb{1})$ is, and the higher the entropy $H_f(P)$ is.
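
Spelling this out with the usual $f$-divergence formula $D_f(P \,\|\, Q) = \sum_{x \in X} Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right)$, setting $Q(x) \equiv 1$ gives

$$D_f(P \,\|\, \mathbb{1}) = \sum_{x \in X} 1 \cdot f\!\left(\frac{P(x)}{1}\right) = \sum_{x \in X} f(P(x)) = -H_f(P).$$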

More broadly, we’ll call an “$f$-entropy” any (usually decreasing) function of $\sum_{x \in X} f(P(x))$, i.e. of $-H_f(P)$. For example (see the numerical sketch after this list),

  • the Shannon entropy is given by $H(P) := H_f(P)$ for $f(p) := p \log p$, and it ranges from $0$ (single point) to $\log n$ (uniform);
  • the power entropies are given by $H_\alpha(P) := \big({-H_f(P)}\big)^{\frac{1}{1-\alpha}} = \big(\sum_{x \in X} P(x)^\alpha\big)^{\frac{1}{1-\alpha}}$ for $f(p) := p^\alpha$, and they all range from $1$ (single point) to $n$ (uniform).
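
A minimal sketch of both examples, checked at the two extremes listed above (the function names are mine, not from any standard library):

```python
import numpy as np

def shannon_entropy(p):
    """H(P) = sum_x P(x) log(1 / P(x)), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                          # drop zero-probability points
    return float(np.sum(nz * np.log(1 / nz)))

def power_entropy(p, alpha):
    """H_alpha(P) = (sum_x P(x)^alpha)^(1 / (1 - alpha)), for alpha != 1."""
    p = np.asarray(p, dtype=float)
    return np.sum(p ** alpha) ** (1 / (1 - alpha))

n = 4
point = np.array([1.0, 0.0, 0.0, 0.0])     # all mass on a single point
uniform = np.full(n, 1 / n)

print(shannon_entropy(point), shannon_entropy(uniform))    # 0.0, log(4) ≈ 1.386
print(power_entropy(point, 2), power_entropy(uniform, 2))  # 1.0, 4.0
```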