The cross-entropy of Q relative to a prior P measures the “average surprise” you will experience when seeing draws from Q if you were expecting draws from P:

$$H(Q \parallel P) := \mathbb{E}_{x \sim Q}\!\left[\log \frac{1}{P(x)}\right].$$
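As a quick sanity check, here is a minimal numeric sketch of the definition; the two distributions below are arbitrary illustrative choices over a three-element support, and the logarithm is natural, so the result is in nats.

```python
import numpy as np

Q = np.array([0.7, 0.2, 0.1])   # what the draws actually follow
P = np.array([0.5, 0.3, 0.2])   # what we were expecting

# H(Q || P) = E_{x~Q}[ log 1/P(x) ]
cross_entropy = np.sum(Q * np.log(1.0 / P))
print(cross_entropy)   # the “average surprise”, in nats
```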

Interpretation

It’s called “cross-entropy” because it mixes and matches elements from the entropies of Q and P: the expectation is taken over Q, but the probability values are from P. More formally,

| Quantity | Relation with $H(Q \parallel P)$ |
| --- | --- |
| $H[Q]$ | As we’ll see in the next section, $H(Q \parallel P) \ge H[Q]$. This is intuitive: you ought to be the least surprised if you were expecting the correct distribution, and $H(Q \parallel Q)$ is just $H[Q]$. |
| $H[P]$ | There is no general relation between $H(Q \parallel P)$ and $H[P]$, but when P is uniform over some set $X$, then $H(Q \parallel P) = \log \lvert X \rvert = H[P]$ no matter what Q is. |
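Both rows of the table can be checked numerically; the distributions below are the same arbitrary toy choices as above.

```python
import numpy as np

def cross_entropy(Q, P):
    # H(Q || P) = E_{x~Q}[ log 1/P(x) ], in nats.
    return np.sum(Q * np.log(1.0 / P))

Q = np.array([0.7, 0.2, 0.1])
P = np.array([0.5, 0.3, 0.2])

# H(Q || P) >= H[Q], with equality when the expected distribution is Q itself:
print(cross_entropy(Q, P), ">=", cross_entropy(Q, Q))

# When P is uniform over a set of size 3, H(Q || P) = log 3 regardless of Q:
uniform = np.full(3, 1.0 / 3.0)
print(cross_entropy(Q, uniform), np.log(3))
```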

Just like the entropy of Q measures the number of bits you need to describe draws from Q using an optimal encoding scheme, the cross-entropy of Q relative to P is the number of bits you need to describe draws from Q using an encoding scheme that was designed to be optimal for P (and thus may be suboptimal for Q).

#to-write express as I[X at E; X]?

Relation to information divergence

We can decompose cross-entropy into the entropy of Q (the “inherent uncertainty” of draws from Q) and the information divergence between Q and P (the “additional surprise” from expecting distribution P but getting Q):

$$
\begin{aligned}
\underbrace{H(Q \parallel P)}_{\text{``total surprise''}}
= \mathbb{E}_{x \sim Q}\!\left[\log \frac{1}{P(x)}\right]
&= \mathbb{E}_{x \sim Q}\!\left[\log \frac{1}{Q(x)}\right]
+ \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P(x)}\right] \\
&= \underbrace{H[Q]}_{\text{``inherent uncertainty''}}
+ \underbrace{D(Q \parallel P)}_{\text{``additional surprise''}}
\;\ge\; H[Q],
\end{aligned}
\tag{1}
$$

with equality iff P and Q are identical.
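A quick numeric check of this decomposition, again with the toy distributions from above:

```python
import numpy as np

Q = np.array([0.7, 0.2, 0.1])
P = np.array([0.5, 0.3, 0.2])

cross_entropy = np.sum(Q * np.log(1.0 / P))   # "total surprise"
entropy_Q     = np.sum(Q * np.log(1.0 / Q))   # "inherent uncertainty"
divergence    = np.sum(Q * np.log(Q / P))     # "additional surprise"

# Equation (1): both sides agree up to floating-point error,
# and the nonnegative divergence is what pushes the total above H[Q].
print(cross_entropy, entropy_Q + divergence)
print(divergence >= 0)
```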

#to-write chain rule (and observe that it’s not “truly” conditioned in the LHS)

Minimization

When training a generative model, we want to get a model distribution P to approach some target distribution Q, so it makes sense to try to minimize a statistical distance like the information divergence

$$D(Q \parallel P) = \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P(x)}\right].$$

However, we typically don’t have access to the probability values Q(x) (they’re exactly what we’re trying to learn); we can only sample from Q. Fortunately, equation (1) tells us that it’s equivalent to minimize the cross-entropy $H(Q \parallel P)$, since the two differ by H[Q], which is fixed, and

$$H(Q \parallel P) = \mathbb{E}_{x \sim Q}\!\left[\log \frac{1}{P(x)}\right]$$

only requires knowledge of P(x).
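In other words, the cross-entropy can be estimated by Monte Carlo from samples of Q alone. A minimal sketch, with toy distributions standing in for a real data source and model:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([0.7, 0.2, 0.1])   # pretend we can only sample from this
P = np.array([0.5, 0.3, 0.2])   # the model, which we can evaluate pointwise

samples = rng.choice(3, size=100_000, p=Q)

# Monte Carlo estimate of H(Q || P): only P(x) is ever evaluated, never Q(x).
estimate = np.mean(np.log(1.0 / P[samples]))
exact = np.sum(Q * np.log(1.0 / P))
print(estimate, exact)   # close
```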

When working with a finite dataset D, minimizing cross-entropy also corresponds to maximizing the likelihood of observing D, assuming the data points are independent:

$$\mathbb{E}_{x \sim D}\!\left[\log \frac{1}{P(x)}\right] = -\frac{1}{\lvert D \rvert} \log\!\left(\prod_{x \in D} P(x)\right).$$
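The identity is easy to verify on a small synthetic dataset (kept small here only so that the explicit product doesn’t underflow):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.2])                    # model distribution
D = rng.choice(3, size=20, p=[0.7, 0.2, 0.1])    # a small observed dataset

# Empirical cross-entropy loss over D ...
loss = np.mean(np.log(1.0 / P[D]))
# ... equals the negated, averaged log-likelihood of observing the whole dataset.
neg_avg_log_likelihood = -np.log(np.prod(P[D])) / len(D)
print(loss, neg_avg_log_likelihood)   # identical up to floating-point error
```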

The use of cross-entropy as a loss has a few interesting consequences:

  • Since $H(Q \parallel P) \ge H[Q]$, the cross-entropy loss can never drop to zero. But as the loss approaches H[Q], the divergence $D(Q \parallel P)$ tends to 0, which forces P to mimic Q with ever more accuracy.
    • On the other hand, we usually don’t know the inherent uncertainty H[Q] of the real-world distribution, so we can never know for sure how close we are to matching it, nor how small the information divergence is.
  • Because cross-entropy is the cost of encoding draws from Q using the optimal scheme for P, any model with a low cross-entropy loss gives a cheap encoding scheme for the real-world distribution Q, and vice versa: generative models are compression algorithms (see the sketch after this list).
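A small sketch of that last point, comparing ideal per-symbol code lengths; it ignores the overhead a real entropy coder would add, and reuses the same toy distributions as before.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([0.7, 0.2, 0.1])             # real-world distribution
P = np.array([0.5, 0.3, 0.2])             # model distribution
draws = rng.choice(3, size=100_000, p=Q)

# Ideal per-symbol code length, in bits, when encoding the draws with a
# scheme optimal for the model P versus one optimal for Q itself:
bits_with_P = np.mean(np.log2(1.0 / P[draws]))
bits_with_Q = np.mean(np.log2(1.0 / Q[draws]))

# The first approaches H(Q || P), the second H[Q]; a better model means a
# cheaper encoding of real-world data.
print(bits_with_P, bits_with_Q)
```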

Gradient over logits

Suppose that the learned distribution P is generated by first computing a logit $\tau_x$ for each possible value $x$, then taking the softmax

$$P(x) := \frac{e^{\tau_x}}{\sum_z e^{\tau_z}}.$$

Then the direction of greatest decrease of $H(Q \parallel P)$ as a function of the logits $\tau$ is given by

$$-\nabla_\tau H(Q \parallel P) = \mathbb{E}_{y \sim Q}\!\left[\nabla_\tau \log P(y)\right],$$

which, using the expression for the logarithmic derivative of the softmax, becomes

$$
\begin{aligned}
\bigl(-\nabla_\tau H(Q \parallel P)\bigr)_x &= \mathbb{E}_{y \sim Q}\!\left[\mathbf{1}(y = x) - P(x)\right] \\
&= Q(x) - P(x).
\end{aligned}
$$

When learning a real-world distribution Q, since we don’t have access to the values of the density Q, we can’t use the expression $Q(x) - P(x)$ on the second line. Instead, we can use the expression on the first line (a small sketch follows the list):

  • draw some $y \sim Q$;
  • push $\tau_y$ up by $1 - P(y)$;
  • push every other $\tau_x$ down by $P(x)$.
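A minimal sketch of this single-sample update, assuming a three-element support and a fixed toy target (the learning rate and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([0.7, 0.2, 0.1])   # target, only available through samples
tau = np.zeros(3)               # logits of the model P
lr = 0.1

for _ in range(5_000):
    P = np.exp(tau) / np.exp(tau).sum()   # softmax
    y = rng.choice(3, p=Q)                # draw some y ~ Q
    step = -P                             # push every other tau_x down by P(x)
    step[y] += 1.0                        # net effect: push tau_y up by 1 - P(y)
    tau += lr * step

print(np.exp(tau) / np.exp(tau).sum())    # close to Q
```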

Backpropagation

If the logits themselves depend on some underlying parameters θ, then

$$-\nabla_\theta H(Q \parallel P) = -\sum_x \frac{\partial H(Q \parallel P)}{\partial \tau_x}\,\nabla_\theta \tau_x = \sum_x \bigl(Q(x) - P(x)\bigr)\,\nabla_\theta \tau_x = \mathbb{E}_{x \sim Q}\!\left[\nabla_\theta \tau_x\right] - \mathbb{E}_{x \sim P}\!\left[\nabla_\theta \tau_x\right].$$

In particular, if the logits come from some energy function

$$\tau_x := -E_\theta(x),$$

then

$$-\nabla_\theta H(Q \parallel P) = -\,\mathbb{E}_{x \sim Q}\!\left[\nabla_\theta E_\theta(x)\right] + \mathbb{E}_{x \sim P}\!\left[\nabla_\theta E_\theta(x)\right].$$

Pushing in this direction means trying to decrease the energy over the target distribution Q but compensating by increasing the energy over the current model distribution P, hence the name contrastive divergence.
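Here is a minimal sketch of that update on a toy three-element support, where the energy is just a lookup table ($E_\theta(x) = \theta_x$) and we can sample from the model P exactly; in a real energy-based model that sampling step is intractable and has to be approximated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy energy model on {0, 1, 2}: E_theta(x) = theta[x], so tau_x = -theta[x]
# and grad_theta E_theta(x) is a one-hot vector at x.
theta = np.zeros(3)
Q = np.array([0.7, 0.2, 0.1])   # target, accessed only through samples
lr = 0.05

for _ in range(20_000):
    P = np.exp(-theta) / np.exp(-theta).sum()
    x_data  = rng.choice(3, p=Q)    # sample from the target Q
    x_model = rng.choice(3, p=P)    # sample from the current model P
    step = np.zeros(3)
    step[x_data]  -= 1.0            # decrease the energy on the data sample
    step[x_model] += 1.0            # increase the energy on the model sample
    theta += lr * step

print(np.exp(-theta) / np.exp(-theta).sum())   # close to Q
```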

#to-write the loss isn’t computable in that case but you can get a proxy loss back