For a moment, let’s forget that interference ever existed, and figure out how (and how fast) regularization will push towards sparsity in some encoding $W_i$. Since we’re only looking at feature benefit and regularization, the other encodings have no influence at all on what happens in $W_i$.

On each weight $W_{ik}$, we have

  • feature benefit pushing up by $(1-\|W_i\|^2)\,W_{ik}$,
  • regularization pushing down by $\lambda\,\mathrm{sign}(W_{ik})$.

Crucially, the upwards push is relative (proportional to $W_{ik}$), while the downwards push is absolute (the same $\lambda$ magnitude for every nonzero weight). This means that weights whose absolute value is above some threshold $\theta$ will increase, while those below the threshold will decrease, creating a “rich gets richer and poor gets poorer” dynamic that will push for sparsity. This threshold is

$$
(1-\|W_i\|^2)\,W_{ik} = \lambda\,\mathrm{sign}(W_{ik})
\iff |W_{ik}| = \frac{\lambda}{1-\|W_i\|^2} =: \theta.
$$

And we have

$$
\frac{d|W_{ik}|}{dt}
= \underbrace{(1-\|W_i\|^2)\,|W_{ik}|}_{\text{feature benefit}} - \underbrace{\lambda\,\mathbb{1}[W_{ik}\neq 0]}_{\text{regularization}}
= \begin{cases}
\underbrace{(1-\|W_i\|^2)}_{\text{constant in }k}\;\underbrace{\bigl(|W_{ik}|-\theta\bigr)}_{\text{distance from threshold}} & \text{if } W_{ik}\neq 0\\[1.5ex]
0 & \text{otherwise.}
\end{cases}
\tag{1}
$$

#figure

We call this expression the sparsity force. #to-write use that term in the rest of the write-up

#to-write already note informally that this stretches the values equally and therefore is affine (will maintain relative differences between nonzero values)?

Note that the threshold $\theta$ is not fixed: we will see that as $W_i$ gets sparser, $\|W_i\|^2$ gets closer to $1$, which increases the threshold and allows it to get rid of higher and higher entries, until only one is left. But how do we know what values $1-\|W_i\|^2$ will take over time?
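As a quick sanity check on this picture, here is a minimal Euler simulation of the idealized dynamics above (a sketch only: the dimension $m$, the mean-0 normal initialization at scale $\sim 1/\sqrt{m}$, $\lambda$, the step size and the horizon are all arbitrary illustrative choices). The printed threshold should creep up while the nonzero count collapses.

```python
# Minimal Euler simulation of the idealized per-weight dynamics
#   d|W_ik|/dt = (1 - ||W_i||^2) |W_ik| - lambda * 1[W_ik != 0],
# where a weight that reaches 0 stays at 0. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m, lam, dt, steps = 256, 3e-3, 0.02, 100_000

W = np.abs(rng.normal(0.0, 0.5 / np.sqrt(m), size=m))  # track |W_ik| directly

for step in range(steps):
    sq = np.sum(W ** 2)
    dW = (1.0 - sq) * W - lam * (W > 0)      # feature benefit - regularization
    W = np.maximum(W + dt * dW, 0.0)         # once a weight hits 0, it stays there
    if step % 20_000 == 0:
        theta = lam / (1.0 - sq) if sq < 1 else float("inf")
        print(f"t={step * dt:>7.0f}  nonzero={np.count_nonzero(W):>4}  "
              f"||W||^2={sq:.3f}  theta={theta:.4f}")

print("nonzero weights at the end:", np.count_nonzero(W))
```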

A dynamic balance on $\|W_i\|^2$

As before, let’s consider the parallel push, which is the time derivative of $\|W_i\|^2$.

  • Feature benefit pushes $\|W_i\|^2$ up with force $(1-\|W_i\|^2)\,\|W_i\|^2$, which is $\Omega(1)$ when $\Omega(1) \le \|W_i\|^2 \le 1-\Omega(1)$.
  • Regularization pushes $\|W_i\|^2$ down by $\lambda\,\|W_i\|_1$, which is initially $\Theta(\lambda\sqrt{m})$, and can only get smaller as $W_i$ gets sparser while maintaining $\|W_i\|_2 \approx 1$.[^5]

Therefore, the push from regularization will always remain sub-constant, and in particular will never be able to push $\|W_i\|^2$ below $1/2$ (since that would require an $\Omega(1)$ push down to counteract feature benefit when $\|W_i\|^2 = 1/2$). This means that over long-ish timescales, the two pushes will tend to be in balance (technically they could oscillate, but that seems unlikely so let’s assume they don’t): the upwards push on $\|W_i\|^2$ from feature benefit roughly matches the downwards push on $\|W_i\|^2$ from regularization.
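To get a feel for these scales, here is a tiny numerical check (assuming mean-0 normal entries at scale $1/\sqrt{m}$, so that $\|W_i\|^2 = \Theta(1)$; the value of $\lambda$ is arbitrary): the regularization push $\lambda\|W_i\|_1$ grows like $\lambda\sqrt{m}$, so it stays sub-constant as long as $\lambda \ll 1/\sqrt{m}$.

```python
# Size of the downwards push lambda * ||W_i||_1 on ||W_i||^2 at initialization,
# assuming mean-0 normal entries of scale 1/sqrt(m) (so that ||W_i||^2 ~ 1).
import numpy as np

rng = np.random.default_rng(0)
lam = 1e-3
for m in (64, 256, 1024, 4096):
    W = rng.normal(0.0, 1.0 / np.sqrt(m), size=m)
    reg_push = lam * np.sum(np.abs(W))               # ~ lam * sqrt(2/pi) * sqrt(m)
    print(f"m={m:>5}  ||W||^2={np.sum(W ** 2):.3f}  "
          f"reg push={reg_push:.4f}  lam*sqrt(m)={lam * np.sqrt(m):.4f}")
```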

#to-think Is this assumption of $\|W_i\|^2$ being static truly realistic? Need to check once I’ve found out what the dynamics of $\|W_i\|_1$ are assuming this. I think we’ll find that the derivative of $\|W_i\|_1$ will be proportional to $\lambda$ and thus the derivative of $\|W_i\|^2$ will be $\approx \lambda\frac{d\|W_i\|_1}{dt}$, which is proportional to $\lambda^2$, so we can make it vanish by setting $\lambda$ arbitrarily low.

Balance happens when

$$
(1-\|W_i\|^2)\,\|W_i\|^2 = \lambda\,\|W_i\|_1
\iff 1-\|W_i\|^2 = \frac{\lambda\,\|W_i\|_1}{\|W_i\|^2} \approx \lambda\,\|W_i\|_1.
$$

Let’s call this the balance condition. What’s great about it is that if we have a guess about what $\|W_i\|_1$ is at some point in time, it also tells us what $1-\|W_i\|^2$ is, and therefore it tells us that the sparsity force will be

$$
\frac{d|W_{ik}|}{dt} \approx
\begin{cases}
\lambda\,\|W_i\|_1\;\underbrace{\bigl(|W_{ik}|-\theta\bigr)}_{\text{distance from threshold}} & \text{if } W_{ik}\neq 0\\[1.5ex]
0 & \text{otherwise,}
\end{cases}
$$

and the current threshold $\theta$ is

$$
\theta = \frac{\lambda}{1-\|W_i\|^2} \approx \frac{\lambda}{\lambda\,\|W_i\|_1} = \frac{1}{\|W_i\|_1}.
$$
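Here is a quick self-contained check (same illustrative toy dynamics and arbitrary constants as in the earlier sketch) that once $\|W_i\|^2$ has equilibrated, $\lambda/(1-\|W_i\|^2)$ does sit close to $1/\|W_i\|_1$:

```python
# Check that at the dynamic balance, theta = lam / (1 - ||W||^2) is close to 1 / ||W||_1.
# The toy dynamics and all constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
m, lam, dt = 256, 1e-3, 0.02
W = np.abs(rng.normal(0.0, 1.0 / np.sqrt(m), size=m))

for _ in range(20_000):              # long enough for ||W||^2 to reach the balance
    sq = np.sum(W ** 2)
    W = np.maximum(W + dt * ((1.0 - sq) * W - lam * (W > 0)), 0.0)

sq = np.sum(W ** 2)
print("theta     =", lam / (1.0 - sq))
print("1/||W||_1 =", 1.0 / np.sum(W))   # should agree up to a factor ||W||^2 ~ 1
```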

How fast does it go?

See Feature benefit vs regularization through blowup for a more complicated and probably erroneous previous attempt.

Summing up equation (1) over $k$ and letting $m' := \#\{k \mid W_{ik} \neq 0\}$ be the number of nonzero weights in $W_i$, we have

$$
\begin{aligned}
\frac{d\|W_i\|_1}{dt}
&= -\underbrace{\lambda m'}_{\text{regularization}} + \underbrace{(1-\|W_i\|^2)\,\|W_i\|_1}_{\text{feature benefit}}\\
&\overset{\text{(by balance condition)}}{=} -\frac{\lambda}{\|W_i\|^2}\left(m'\,\|W_i\|^2 - \|W_i\|_1^2\right)\\
&= -\frac{\lambda (m')^2}{\|W_i\|^2}\left(\frac{\|W_i\|^2}{m'} - \left(\frac{\|W_i\|_1}{m'}\right)^2\right)\\
&= -\frac{\lambda (m')^2}{\|W_i\|^2}\times\underbrace{\frac{1}{m'}\sum_{k:\,W_{ik}\neq 0}\Bigl(\,\overbrace{|W_{ik}| - \tfrac{\|W_i\|_1}{m'}}^{\text{“deviation from mean”}}\,\Bigr)^2}_{\text{“sample variance over nonzero weights”}},
\end{aligned}
$$

where the last equality is essentially the identity

$$
\mathbb{E}[X^2] - \mathbb{E}[X]^2 = \mathrm{Var}[X]
$$

where the random variable $X$ is drawn by picking a $k$ uniformly at random in $\{k \mid W_{ik} \neq 0\}$ and outputting $|W_{ik}|$.
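A one-off numerical sanity check of this identity, on an arbitrary made-up vector:

```python
# Check that m' * ||W||^2 - ||W||_1^2 = (m')^2 * Var[X], where X = |W_ik| for a
# uniformly random nonzero coordinate k. The example vector is arbitrary.
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=100)
W[rng.random(100) < 0.3] = 0.0           # zero out some coordinates
nz = np.abs(W[W != 0])
m_prime = nz.size

lhs = m_prime * np.sum(nz ** 2) - np.sum(nz) ** 2
rhs = m_prime ** 2 * np.var(nz)          # np.var = population variance (divides by m')
print(lhs, rhs)                          # equal up to floating-point error
```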

If $X$’s relative variance $\mathrm{Var}[X]/\mathbb{E}[X]^2$ is a constant (i.e. if $X$ is “well-rounded”), then

$$
\begin{aligned}
\frac{d\|W_i\|_1}{dt}
&= -\frac{\lambda (m')^2}{\|W_i\|^2}\,\mathrm{Var}[X]\\
&\overset{\text{($X$ well-rounded)}}{=} -\frac{\lambda (m')^2}{\|W_i\|^2}\,\Theta\!\left(\mathbb{E}[X]^2\right)
= -\Theta\!\left(\frac{\lambda\,\|W_i\|_1^2}{\|W_i\|^2}\right)
\overset{\text{(assuming $\|W_i\|^2 = \Theta(1)$)}}{=} -\Theta\!\left(\lambda\,\|W_i\|_1^2\right),
\end{aligned}
$$

or if we define $w := \frac{1}{\|W_i\|_1}$ (which is a proxy for the “typical nonzero weight”, and is $\approx\theta$ when $\|W_i\|^2 \approx 1$), this becomes

$$
\frac{dw}{dt} = \Theta(\lambda),
$$

so $w(t) = w(0) + \Theta(\lambda t)$ and

$$
\|W_i(t)\|_1 = \frac{1}{\Theta(w(0)+\lambda t)} = \frac{1}{\Theta\!\left(\frac{1}{\sqrt{m}}+\lambda t\right)}
$$

with high probability in $m$.
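Here is a sketch checking this prediction numerically (same illustrative dynamics and arbitrary constants as before): $w(t) = 1/\|W_i(t)\|_1$ should grow roughly linearly at rate $\Theta(\lambda)$ while many weights are still nonzero.

```python
# Check that w(t) = 1 / ||W_i(t)||_1 grows roughly linearly at rate Theta(lambda).
# Toy dynamics and constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
m, lam, dt = 1024, 1e-3, 0.02
W = np.abs(rng.normal(0.0, 1.0 / np.sqrt(m), size=m))
w0 = 1.0 / np.sum(W)

for step in range(1, 60_001):
    sq = np.sum(W ** 2)
    W = np.maximum(W + dt * ((1.0 - sq) * W - lam * (W > 0)), 0.0)
    if step % 10_000 == 0:
        t, w = step * dt, 1.0 / np.sum(W)
        print(f"t={t:>6.0f}  nonzero={np.count_nonzero(W):>5}  w(t)={w:.3f}  "
              f"slope (w(t)-w(0))/(lam*t) = {(w - w0) / (lam * t):.3f}")
```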

Will the relative variance be a constant?

Empirically, the relative variance is indeed a constant not too far from 1 (see plot below). But why is that?

Suppose that currently $W_{i1} \ge W_{i2} \ge \cdots \ge W_{im} \ge 0$, and let’s look at the relative difference between the biggest weight $W_{i1}$ and some other weight $W_{ik} > 0$, i.e.

$$
\gamma_k := \frac{W_{i1}-W_{ik}}{W_{i1}} = 1 - \frac{W_{ik}}{W_{i1}}.
$$

Using relative derivatives, we have

$$
\frac{d\gamma_k}{dt} = -\frac{d(W_{ik}/W_{i1})}{dt} = -\frac{W_{ik}}{W_{i1}}\left(\frac{dW_{ik}/dt}{W_{ik}} - \frac{dW_{i1}/dt}{W_{i1}}\right).
$$

Since feature benefit is a relative force, it contributes nothing to the difference of the relative derivatives of $W_{ik}$ and $W_{i1}$, so we just have the contribution from regularization

$$
\frac{d\gamma_k}{dt} = -\frac{W_{ik}}{W_{i1}}\left(\frac{-\lambda}{W_{ik}} - \frac{-\lambda}{W_{i1}}\right)
= \lambda\,\frac{W_{ik}}{W_{i1}}\left(\frac{1}{W_{ik}} - \frac{1}{W_{i1}}\right)
= \frac{\lambda}{W_{i1}}\left(1 - \frac{W_{ik}}{W_{i1}}\right)
= \frac{\lambda}{W_{i1}}\,\gamma_k.
$$

Note that this differential equation doesn’t involve $W_{ik}$ at all! This means that there is a single function $\gamma(t)$ defined by

$$
\begin{cases}
\gamma(0) = 1\\[0.5ex]
\dfrac{d\gamma}{dt}(t) = \dfrac{\lambda}{W_{i1}(t)}\,\gamma(t)
\end{cases}
$$

such that for all $k$, as long as $W_{ik}(t) > 0$,

$$
1 - \frac{W_{ik}(t)}{W_{i1}(t)} = \gamma(t)\left(1 - \frac{W_{ik}(0)}{W_{i1}(0)}\right)
\iff
W_{ik}(t) = \underbrace{W_{i1}(t)\,\bigl(1-\gamma(t)\bigr)}_{\text{doesn't depend on }k} + \underbrace{\gamma(t)\,\frac{W_{i1}(t)}{W_{i1}(0)}}_{\text{doesn't depend on }k}\,W_{ik}(0).
$$

In other words, the relative spacings of the nonzero weights never change: their evolution between times $0$ and $t$ is a single affine transformation.
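Here is a sketch checking this numerically (same illustrative dynamics): over the coordinates that are still nonzero at time $t$, $W_{ik}(t)$ should be an affine function of $W_{ik}(0)$ up to numerical error.

```python
# Check that the surviving weights at time t are an affine function a + b * W_ik(0)
# of the initial weights (relative spacings preserved). Constants are illustrative.
import numpy as np

rng = np.random.default_rng(4)
m, lam, dt = 512, 1e-3, 0.02
W0 = np.abs(rng.normal(0.0, 1.0 / np.sqrt(m), size=m))
W = W0.copy()

for _ in range(30_000):
    sq = np.sum(W ** 2)
    W = np.maximum(W + dt * ((1.0 - sq) * W - lam * (W > 0)), 0.0)

alive = W > 0
b, a = np.polyfit(W0[alive], W[alive], 1)      # best affine fit over the survivors
max_residual = np.max(np.abs(W[alive] - (a + b * W0[alive])))
print(f"survivors={alive.sum()}  slope={b:.3f}  intercept={a:.4f}  "
      f"max residual={max_residual:.2e}")
```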

Since the relative variance is scaling-invariant, we can think of this affine transformation as a simple translation. The value of the relative variance of the remaining nonzero weights $W_{i1}(t), \ldots, W_{im'}(t)$ at some point in time must be of the following form:

  • take the initial values $W_{i1}(0), \ldots, W_{im}(0)$,
  • translate them left by some amount which leaves $m'$ weights positive,
  • drop the values that have become $\leq 0$,
  • then compute the relative variance of what’s left.

In particular, the relative variance when $m'$ weights are left must lie between the relative variance of

$$
\Bigl(W_{i1}(0) - W_{i(m'+1)}(0),\; W_{i2}(0) - W_{i(m'+1)}(0),\; \ldots,\; W_{im'}(0) - W_{i(m'+1)}(0)\Bigr)
$$

and the relative variance of

$$
\Bigl(W_{i1}(0) - W_{im'}(0),\; W_{i2}(0) - W_{im'}(0),\; \ldots,\; 0\Bigr)
$$

(since the actual translation amount lies between these two extremes, which have the same variance but different means; the latter has the smaller mean and hence the larger relative variance).
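For concreteness, here is a sketch of how these two extreme relative variances can be computed from a mean-0 normal initialization (the dimension and the values of $m'$ shown are arbitrary):

```python
# Lower/upper relative variances of the m' surviving weights, obtained by translating
# the sorted initial values by the (m'+1)-th or the m'-th largest value respectively.
import numpy as np

rng = np.random.default_rng(5)
m = 1024
w0 = np.sort(np.abs(rng.normal(0.0, 1.0 / np.sqrt(m), size=m)))[::-1]  # decreasing

def rel_var(x):
    return np.var(x) / np.mean(x) ** 2

for m_prime in (512, 256, 64, 16, 4):
    survivors = w0[:m_prime]
    lower = rel_var(survivors - w0[m_prime])        # translate by the (m'+1)-th value
    upper = rel_var(survivors - w0[m_prime - 1])    # translate by the m'-th value
    print(f"m'={m_prime:>4}: relative variance between {lower:.3f} and {upper:.3f}")
```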

These relative variances are functions of $m'$ and the initial value of $W_i$ only, and (when $W_i$ is made of mean-0 normals) they will be $\Theta(1)$ with high probability in $m$. The plot below shows these lower and upper values for $W_i(0)$ itself (in red) and for an idealized version of $W_i(0)$ that hits regular percentiles (in pink, dashed).

We can see that the orange curve does indeed lie within the red curves, and that the red and pink curves only start to diverge significantly at later time steps, when $m'$ is smaller.