Entropy
The entropy[^1] $H(X)$ of a random variable $X \sim p$ is a quantitative notion of how uncertain we are about its outcome. More precisely, it’s the “average surprise”[^2] you get from samples, where surprise is measured as $\log \frac{1}{p(x)}$:
$$H(X) := \mathbb{E}_{x \sim p}\left[\log \frac{1}{p(x)}\right] = \sum_x p(x) \log \frac{1}{p(x)}.$$
It’s the average “amount of information” you get from observing $X$.

By extension, we define the entropy of a distribution as $H(p) := H(X)$ for $X \sim p$.
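As a quick sanity check of the definition, here is a minimal numpy sketch (the helper name `entropy` and the choice of bits as the unit are mine, not fixed by the note):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy: sum_x p(x) log(1/p(x)), skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(1/0) is taken to be 0 by continuity
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

print(entropy([0.5, 0.5]))            # 1.0 bit: a fair coin
print(entropy([1.0, 0.0]))            # 0.0 bits: a deterministic outcome
print(entropy([0.25] * 4))            # 2.0 bits: uniform over 4 outcomes
```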
Properties
- Entropy ranges between $0$ (if $X$ is deterministic) and $\log |\mathcal{X}|$ (if $X$ is uniform), where $\mathcal{X}$ is the support of $X$.
- It’s subadditive: $H(X, Y) \le H(X) + H(Y)$, with equality iff $X$ and $Y$ are independent (numerical check below).
	- Intuitively, you can’t be more surprised by the combination of $X$ and $Y$ than you were by seeing both outcomes separately.
	- The “correlated part” of $X$ and $Y$ is counted once in $H(X, Y)$ but twice in $H(X) + H(Y)$.
We will prove these below.
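Here is a quick numerical illustration of subadditivity; the joint table is made up and the `H` helper is just the definition above:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# A hypothetical correlated joint distribution p(x, y): rows are x, columns are y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)

print(H(joint) <= H(p_x) + H(p_y))                  # subadditivity: H(X, Y) <= H(X) + H(Y)

# For an independent joint p(x)p(y) with the same marginals, equality holds.
print(np.isclose(H(np.outer(p_x, p_y)), H(p_x) + H(p_y)))
```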
Binary entropy
In particular, let $X \sim \mathrm{Bern}(p)$ be a biased coin flip. Its entropy is the binary entropy function
$$H_2(p) := p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p}.$$
- $H_2$ is concave,
- $H_2$ is symmetric around $\frac{1}{2}$: $H_2(p) = H_2(1 - p)$,
- when $p \in \{0, 1\}$, $H_2(p) = 0$,
- when $p = \frac{1}{2}$, $H_2(p) = 1$ bit, its maximum.
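A small sketch of these properties (the helper name `H2` is mine):

```python
import numpy as np

def H2(p):
    """Binary entropy in bits, with H2(0) = H2(1) = 0 by continuity."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    q = p[mask]
    out[mask] = q * np.log2(1 / q) + (1 - q) * np.log2(1 / (1 - q))
    return out

ps = np.linspace(0, 1, 11)
print(np.allclose(H2(ps), H2(1 - ps)))    # symmetric around 1/2
print(H2(np.array([0.0, 0.5, 1.0])))      # [0., 1., 0.]: zero at the endpoints, 1 bit at 1/2

# Concavity, checked at a midpoint: H2((a + b)/2) >= (H2(a) + H2(b)) / 2
a, b = np.array([0.1]), np.array([0.6])
print(H2((a + b) / 2) >= (H2(a) + H2(b)) / 2)
```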
Conditional entropy
#to-write make the expectation explicit
Similarly, we can define conditional entropy as the average surprise you get from $Y$ when you already know $X$:
$$H(Y \mid X) := \mathbb{E}_{x \sim p_X}\big[H(Y \mid X = x)\big] = \mathbb{E}_{(x, y) \sim p_{X, Y}}\left[\log \frac{1}{p(y \mid x)}\right],$$
where $H(Y \mid X = x)$ is the entropy of the conditional distribution $p_{Y \mid X = x}$.
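A minimal sketch of that average, using a hypothetical 2×2 joint table:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1 / p)))

joint = np.array([[0.4, 0.1],    # p(x, y), rows indexed by x
                  [0.1, 0.4]])
p_x = joint.sum(axis=1)

# H(Y | X) = E_{x ~ p_X}[ H(Y | X = x) ]: average the entropy of each conditional row.
H_Y_given_X = sum(p_x[i] * H(joint[i] / p_x[i]) for i in range(len(p_x)))
print(H_Y_given_X)               # ≈ 0.722 bits, less than H(Y) = 1 bit here
```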
Chain rule
The conditional entropy gives a chain rule:
$$H(X, Y) = H(X) + H(Y \mid X),$$
which immediately shows that adding data can only increase the entropy: $H(X, Y) \ge H(X)$.

Together with subadditivity, this shows that conditioning can only decrease entropy:
$$H(Y \mid X) \le H(Y),$$
with equality iff $X$ and $Y$ are independent.
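Continuing with the same made-up joint table, a quick check of the chain rule and of conditioning decreasing entropy:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1 / p)))

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
H_Y_given_X = sum(p_x[i] * H(joint[i] / p_x[i]) for i in range(2))

print(np.isclose(H(joint), H(p_x) + H_Y_given_X))   # chain rule: H(X, Y) = H(X) + H(Y | X)
print(H_Y_given_X <= H(p_y))                        # conditioning can only decrease entropy
```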
Mixtures and concavity
If we see $p = \sum_i \lambda_i p_i$ as a mixture[^3] of distributions $p_i$ with weights $\lambda_i$, its entropy is squeezed between
$$\sum_i \lambda_i H(p_i) \;\le\; H(p) \;\le\; \sum_i \lambda_i H(p_i) + H(\lambda),$$
with equality respectively iff the variables are identically distributed (left) or have disjoint supports (right).

If we take a multiplicative view of entropy, then the entropy of a disjoint mixture is upper-bounded by the sum of the entropies:
$$2^{H(p)} \le \sum_i 2^{H(p_i)},$$
and this is achieved if $\lambda_i \propto 2^{H(p_i)}$.
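A numerical check of both bounds, plus the multiplicative statement for disjoint mixtures; the components and weights below are arbitrary:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1 / p)))

lam = np.array([0.3, 0.7])
components = np.array([[0.9, 0.1, 0.0, 0.0],      # overlapping supports
                       [0.2, 0.5, 0.2, 0.1]])
mix = lam @ components

lower = float(np.dot(lam, [H(c) for c in components]))
upper = lower + H(lam)
print(lower <= H(mix) <= upper)                    # concavity and the +H(lambda) upper bound

# For disjoint supports the upper bound is tight, and choosing lambda_i ∝ 2^{H(p_i)}
# makes 2^{H(mix)} equal the sum of the components' 2^{H(p_i)}.
disjoint = np.array([[0.9, 0.1, 0.0, 0.0],
                     [0.0, 0.0, 0.5, 0.5]])
w = np.array([2 ** H(c) for c in disjoint])
w /= w.sum()
print(np.isclose(2 ** H(w @ disjoint), sum(2 ** H(c) for c in disjoint)))
```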
Use as a prior
#to-write term like
Gradient
Over logits
Suppose that the learned distribution $p$ is parameterized by logits $z$, i.e. $p = \mathrm{softmax}(z)$. Then the direction of greatest increase of $H(p)$ is the gradient
$$\nabla_z H(p) = \underbrace{\mathbb{E}_{i \sim p}\!\left[\log \frac{1}{p_i} \, \nabla_z \log p_i\right]}_{\text{score term}} + \underbrace{\mathbb{E}_{i \sim p}\!\left[\nabla_z \log \frac{1}{p_i}\right]}_{\text{direct term}},$$
where the direct gradient term goes to zero because $\mathbb{E}_{i \sim p}\left[\nabla_z \log p_i\right] = \nabla_z \sum_i p_i = 0$.
(It makes sense that this term goes to zero, since it roughly corresponds to trying to increase the entropy by “decreasing the probability uniformly everywhere at once”, which is impossible.)
Using the expression for the relative derivatives of softmax, $\frac{\partial \log p_i}{\partial z_j} = \delta_{ij} - p_j$, this further becomes
$$\frac{\partial H(p)}{\partial z_j} = p_j \left(\log \frac{1}{p_j} - H(p)\right).$$
Intuitively, this says that logit $z_j$ should be pushed up when its outcome is more surprising than average ($\log \frac{1}{p_j} > H(p)$) and pushed down otherwise, with a step proportional to $p_j$.
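A small sketch comparing this closed form against a finite-difference gradient (the logit values are arbitrary; natural log, so no base conversion is needed):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def H(p):
    return float(np.sum(p * np.log(1 / p)))     # entropy in nats

z = np.array([1.0, 0.5, -2.0, 0.0])
p = softmax(z)
analytic = p * (np.log(1 / p) - H(p))            # dH/dz_j = p_j (log(1/p_j) - H(p))

eps = 1e-6
numeric = np.array([
    (H(softmax(z + eps * np.eye(len(z))[j])) - H(softmax(z - eps * np.eye(len(z))[j]))) / (2 * eps)
    for j in range(len(z))
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```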
Over probabilities
Suppose we increase $p_i$ directly, treating $H$ as a function of the probabilities themselves. Then
$$\frac{\partial H(p)}{\partial p_i} = \log \frac{1}{p_i} - 1.$$
In addition, we have the constraint $\sum_i p_i = 1$, so any increase in $p_i$ must be paid for by some other $p_j$; the effect of increasing $p_i$ at the expense of $p_j$ is
$$\frac{\partial H(p)}{\partial p_i} - \frac{\partial H(p)}{\partial p_j} = \log \frac{p_j}{p_i},$$
which is positive exactly when $p_i < p_j$: moving mass from a likelier outcome to a rarer one increases the entropy.
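And a sketch of the mass-transfer view: moving a small amount $\varepsilon$ of probability from $p_j$ to $p_i$ changes the entropy by about $\varepsilon \log \frac{p_j}{p_i}$ (the distribution and indices are arbitrary):

```python
import numpy as np

def H(p):
    return float(np.sum(p * np.log(1 / p)))

p = np.array([0.1, 0.2, 0.3, 0.4])
i, j, eps = 0, 3, 1e-5                      # move mass from the likelier p_j to the rarer p_i

q = p.copy()
q[i] += eps
q[j] -= eps

print((H(q) - H(p)) / eps)                  # ≈ log(p_j / p_i) = log(4) ≈ 1.386
print(np.log(p[j] / p[i]))
```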
#to-write
- the “amount of information” thing makes sense? because it’s the ==information you gain== if you started from a prior of $X$’s distribution and then you actually observe $X$? i.e. $H(X) = \mathbb{E}_{x \sim p}\left[D_{\mathrm{KL}}(\delta_x \,\|\, p)\right]$ (see the sketch right after this list)
- okay I’m now thoroughly KL-pilled, it’s the source of everything: $D_{\mathrm{KL}}(q \,\|\, p)$ is the information you’ve gained if you started from a prior of $p$ and somehow your prior is now $q$
- but I want to understand the sense in which it is a lower bound on the “number of questions you must have asked in order to get there”?
- i.e. the information $X$ gains you about $Y$ is limited by the information that it gives about itself
- i.e. $I(X;Y) \le I(X;X)$?
- actually I believe you’ve just rediscovered
- darn information theory is so cool
- and btw you can write this as $I(X;Y) = H(X) + H(Y) - H(X, Y)$
- and this directly shows that it’s equal to $H(X)$ when $Y = X$, I guess
- though maybe you should swallow the pill and not talk about entropy ever again!
- note: I guess it also means $H(X \mid X) = 0$?
- cool story bro
- wait is it the case that in general if you add a variable that’s already present it just flips the sign?
- and somehow I’m supposed to just know that?
- I guess this just suggests that if any variable is a constant then the interaction is $0$, which would make a whole lot of sense tbh
- very similar to Wick in that case!
- really, entropy should be negative…
- okay I’m this close to just replacing all occurrences of $H$ by $D_{\mathrm{KL}}$
- and to have mutual information as a more core concept than entropy
- and to literally just define $H(X)$ as $I(X; X)$
- this would explain why you have to somehow bother to use two copies of $X$ in the definition of entropy!
- (and same thing in basically all of the power entropies)
- ultimately ==information is about something providing information about another thing==
- there is no such thing as “the information of one thing”
- and the information divergence is the core of what information means
- and it’s also what makes information theory so nice: it provides all the nontrivial inequalities!
- information divergence measures the size of a partial update, whereas entropy measures the (average) size of a total update
- does
make sense? it’s hmm maybe this is slightly cursed? - and
? so it keeps alternating, hmmm…- actually this is support for defining
instead of- indeed,
- and I’m not sure what this first quantity
is supposed to mean but it sure seems like it should be ?- if you set
then this quantity is entropy of the prior on ?
- actually very confusing
- and
- i.e. “how much does
help you predict nothing”?? - so we’d have to accept that
- i.e.
is actively detrimental to the task of predicting nothing? (???)
- actually feels like we should have
given that is not dependent with anything in the LHS (this is definitely a rule right?)- which would mean
- but then
, even though and is literally a point mass no matter what is- actually that’s a general argument for
being a constant (not clear which constant, but would seem like a reasonable choice here since we get zero whenever there is any deterministic constant in the expression?) - so
should be
- to summarize:
- given that
is the only tenable position, we would have to have - but then that violates the rule of “if
is independent from the joint distribution of the things in the LHS then the information should be ”
- okay I vote for
just being undefined!! is clearly the core quantity here- and it seems like this makes things better in the case of continuous random variables as well
- it’s in fact somewhat baffling that the interaction information is even well-defined (that it’s consistent and symmetric)
- actually I think I’d be on board to keep
only for the deficit
- information theory is specifically about making dimensionless statistics additive
- KL divergence is to entropy the same as squared distance (!!) is to variance?
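A tiny check of the “total update” reading from the first bullet of this list: the entropy equals the average KL divergence from the prior to the point mass you land on after observing the outcome (the distribution below is arbitrary):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])

def kl(q, p):
    """D_KL(q || p), summed over the support of q."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

entropy = float(np.sum(p * np.log(1 / p)))

# Observing outcome x updates the prior p to the point mass delta_x; the average
# size of that total update is exactly H(p).
point_masses = np.eye(len(p))
avg_total_update = float(np.dot(p, [kl(d, p) for d in point_masses]))
print(np.isclose(entropy, avg_total_update))   # True
```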
[^1]: More precisely “Shannon entropy” or “information entropy”, to separate it from the wider class of dimensionless uncertainty measures described in Generalized entropies.

[^2]: Note that this is a pretty stupid notion of surprise. If I ask you to draw me numbers between $1$ and $N$ and every time you do so I go “woah that’s crazy, this number only had a $\frac{1}{N}$ chance of appearing”, then you would be justified in questioning my sanity.

[^3]: A convex combination of distributions.