Feature benefit vs regularization
For a moment, let’s forget that interference ever existed, and figure out how (and how fast) regularization will push towards sparsity in some encoding
On each weight:

- feature benefit pushing up by ,
- regularization pushing down by .
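To make this concrete, here is a minimal numerical sketch of the two forces, under assumed dynamics: feature benefit as a multiplicative (relative) push proportional to the weight, and regularization as a constant downward pull on each nonzero weight. The names `b`, `lam`, and all numeric values are hypothetical illustrations, not taken from the write-up.

```python
import numpy as np

# Hypothetical per-weight dynamics (illustrative values): feature benefit
# pushes each positive weight up in proportion to its current value, while
# regularization pulls it down by a constant amount lam per unit time.
def step(w, b=0.1, lam=0.05, dt=0.01):
    push_up = b * w                    # feature benefit: relative to w
    push_down = lam * np.sign(w)       # regularization: constant on nonzero weights
    w = w + dt * (push_up - push_down)
    return np.maximum(w, 0.0)          # a weight that reaches zero stays at zero

w = np.linspace(0.1, 1.0, 10)          # some made-up initial positive weights
for _ in range(1000):                  # integrate to t = 10
    w = step(w)
# Weights below lam/b = 0.5 decay toward zero (the smallest hit it first);
# the rest keep growing.
```

Under these assumed dynamics the constant pull wins for small weights and loses for large ones, which is the sparsification effect described above.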
Crucially, the upwards push is relative to
And we have
#figure
We call this expression the sparsity force. #to-write use that term in the rest of the write-up
#to-write already note informally that this stretches the values equally and therefore is affine (will maintain relative differences between nonzero values)?
Note that the threshold
A dynamic balance on
As before, let’s consider the parallel push, which is the time derivative of
- Feature benefit pushes up with force , which is when .
- Regularization pushes down by , which is initially , and can only get smaller as gets sparser while maintaining .[^5]
Therefore, the push from regularization will always remain sub-constant, and in particular will never be able to push
#to-think Is this assumption of
Balance happens when
Let’s call this the balance condition. What’s great about it is that if we have a guess about what
and the current threshold
How fast does it go?
See Feature benefit vs regularization through blowup for a more complicated and probably erroneous previous attempt.
Summing up equation (1) and letting
where the last inequality is essentially the identity
where the random variable
If
or if we define
so
with high probability in
Will the relative variance be a constant?
Empirically, the relative variance is indeed a constant not too far from
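As a concrete check, this quantity can be measured directly. A small helper, assuming "relative variance" means the variance divided by the squared mean (the squared coefficient of variation), which is the natural scaling-invariant choice:

```python
import numpy as np

# Assumption: "relative variance" here means Var(w) / E[w]^2, i.e. the
# squared coefficient of variation, which is invariant under w -> c * w.
def relative_variance(w):
    w = np.asarray(w, dtype=float)
    return w.var() / w.mean() ** 2

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 1.5, size=1000)   # made-up positive weights
rv = relative_variance(w)              # unchanged if w is rescaled
```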

Suppose that currently
Using relative derivatives, we have
Since feature benefit is a relative force, it contributes nothing to the difference of the relative derivatives of
Note that this differential equation doesn’t involve
such that for all
In other words, the relative spacing of the nonzero weights never changes: their change between times
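One way to see the affine claim, assuming per-weight dynamics of the form $\dot w_i = b(t)\,w_i - \lambda(t)$ on the nonzero weights (hypothetical notation, matching the relative feature-benefit push and the uniform regularization pull above):

```latex
% The regularization term cancels in the difference of any two nonzero weights:
\frac{d}{dt}\,(w_i - w_j) = b(t)\,(w_i - w_j),
\qquad\text{so}\qquad
w_i(t_2) - w_j(t_2) = e^{\int_{t_1}^{t_2} b(s)\,ds}\,\bigl(w_i(t_1) - w_j(t_1)\bigr).
```

Every pairwise gap is stretched by the same factor, so the map from $w(t_1)$ to $w(t_2)$ is affine, $w \mapsto A\,w - B$, preserving the relative ordering and spacing of the nonzero weights.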
Since the relative variance is scaling-invariant, we can think of this affine transformation as a simple translation. The value of the relative variance of the remaining nonzero weights can then be computed as follows:

- take the initial values ,
- translate them left by some amount which leaves weights positive,
- drop the values that have become ,
- then compute the relative variance of what's left.
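The steps above can be sketched directly; the choice of shift amount and the definition of relative variance (variance over squared mean) are assumptions:

```python
import numpy as np

def relative_variance(w):
    # assumed definition: variance over squared mean (scaling-invariant)
    w = np.asarray(w, dtype=float)
    return w.var() / w.mean() ** 2

def truncated_relative_variance(w0, shift):
    """Translate the initial values left by `shift`, drop the ones that
    became nonpositive, and return the relative variance of the rest."""
    w0 = np.asarray(w0, dtype=float)
    survivors = w0[w0 > shift] - shift
    return relative_variance(survivors)
```

With `shift = 0` this is just the relative variance of the initial values; larger shifts leave fewer weights positive and, since translation preserves variance while shrinking the mean, tend to increase the result.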
In particular, the relative variance when
and the relative variance of
(since these extremes have the same variance but the latter has a smaller mean).
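The parenthetical claim is easy to verify numerically: translating a set of positive values left preserves the variance but shrinks the mean, so the relative variance (variance over squared mean, assumed definition) goes up. A toy check with made-up values:

```python
import numpy as np

def relative_variance(w):
    return w.var() / w.mean() ** 2     # assumed scaling-invariant definition

a = np.array([2.0, 3.0, 4.0])          # made-up positive values
b = a - 1.0                            # translated: same variance, smaller mean
```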
These relative variances are functions of

We can see that the orange curve does indeed lie within the red curves, and that the red and pink curves only start to diverge significantly at later time steps when