Adapted from Deep linear networks#Speed of lowering interference.
The interactions between all $n$ encodings through the interference forces are complex, but it seems reasonable to consider just the “first-order” interaction between two specific encodings $W_i$ and $W_j$, while zeroing out all other interference forces.
Our metric: cosine similarity
Since the encodings don’t start out with unit length, it turns out that the right quantity to look at is the cosine similarity
\kappa_{ij} \ce \frac{W_i \cdot W_j}{\norm{W_i} \norm{W_j}}.
Indeed, $\kappa_{ij}$ doesn’t change when $W_i$ or $W_j$ gets scaled up by some factor (without changing direction), so the feature benefit force doesn’t affect $\kappa_{ij}$. So for the purposes of studying how fast $\kappa_{ij}$ decreases at any point in time, assuming the regularization is low, the only relevant changes in $W_i$ are the interference forces:
\frac{\d W_i}{\d t} \approx -\sum_{j \ne i}\ReLU\p{W_i \cdot W_j}W_j.
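As a minimal sketch of the two objects just introduced, the snippet below computes $\kappa_{ij}$ and the approximate interference force with NumPy, and checks that rescaling either vector leaves $\kappa_{ij}$ unchanged. The function names, the sizes `n` and `m`, and the Gaussian initialization are assumptions made for illustration, not part of the setup.

```python
import numpy as np

def cosine_similarity(w_i, w_j):
    """kappa_ij = (W_i . W_j) / (|W_i| |W_j|)."""
    return float(w_i @ w_j / (np.linalg.norm(w_i) * np.linalg.norm(w_j)))

def interference_force(W, i):
    """Approximate dW_i/dt = -sum_{j != i} ReLU(W_i . W_j) W_j for W of shape (n, m)."""
    dots = W @ W[i]                      # dots[j] = W_i . W_j (fresh array)
    dots[i] = 0.0                        # drop the j = i term
    return -(np.maximum(dots, 0.0) @ W)  # weighted sum of the rows W_j

rng = np.random.default_rng(0)
n, m = 5, 50                             # hypothetical sizes, illustration only
W = rng.normal(size=(n, m)) / np.sqrt(m) # assumed init: entries ~ N(0, 1/m)

kappa = cosine_similarity(W[0], W[1])
# Rescaling by positive factors changes lengths but not directions.
kappa_scaled = cosine_similarity(3.0 * W[0], 0.5 * W[1])
force = interference_force(W, 0)
```

Here `kappa` and `kappa_scaled` agree up to floating-point error, which is the scale invariance used above to ignore the feature benefit force.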
Relative change in the numerator
Take some $i,j$ such that $W_i \cdot W_j > 0$ currently. The interference forces affect the numerator as
\frac{\d\p{W_i \cdot W_j}}{\d t}
&= \frac{\d W_i}{\d t}\cdot W_j + W_i \cdot \frac{\d W_j}{\d t}\\
&= -\p{W_i \cdot W_j}\p{\norm{W_i}^2 + \norm{W_j}^2} - \UB{\sum_{k \ne i,j}\p{\ReLU\p{W_i \cdot W_k}\p{W_j \cdot W_k} + \ReLU\p{W_j \cdot W_k}\p{W_i \cdot W_k}}}_\text{second-order effects},
which, ignoring the second-order effects, is a relative change of $-\p{\norm{W_i}^2+\norm{W_j}^2}$.
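The first-order term can be sanity-checked numerically. In this sketch only the pair $(i, j)$ interacts, so the second-order sum is absent; the two vectors are arbitrary illustrative values chosen to have a positive dot product, and the step size `dt` is likewise an assumption.

```python
import numpy as np

# Arbitrary illustrative vectors with W_i . W_j > 0 (not taken from the setup).
w_i = np.array([1.0, 0.5, 0.0])
w_j = np.array([0.8, -0.2, 0.3])

dot = float(w_i @ w_j)
relu_dot = max(dot, 0.0)

# Pairwise interference forces only: dW_i/dt = -ReLU(W_i . W_j) W_j, and symmetrically.
dw_i = -relu_dot * w_j
dw_j = -relu_dot * w_i

# Analytic first-order rate of change of the dot product.
analytic = -dot * float(w_i @ w_i + w_j @ w_j)

# Finite-difference estimate from one small Euler step.
dt = 1e-6
finite_diff = (float((w_i + dt * dw_i) @ (w_j + dt * dw_j)) - dot) / dt

# Relative rate of change, to compare with -(|W_i|^2 + |W_j|^2).
relative_rate = finite_diff / dot
```

Up to $O(\mathrm{d}t)$ error, `finite_diff` matches `analytic`, and `relative_rate` matches $-\p{\norm{W_i}^2+\norm{W_j}^2}$.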
Relative change in the denominator
The norms are affected as
\frac{\d \p{\norm{W_i}^2}}{\d t} = 2\frac{\d W_i}{\d t} \cdot W_i,
and we’ve already seen in Incidental polysemanticity setup and initial forces#Strength at initialization that (at least at initialization) the contribution of the interference forces to this is $-\Theta(n/m)$. Since $\norm{W_i}^2 = \Theta(1)$, this is a relative change of $-\Theta(n/m)$ for both $\norm{W_i}$ and $\norm{W_j}$.
Combining the changes
Overall, we get
\frac{\d \kappa_{ij}}{\d t}
&\approx -\kappa_{ij}\p{\norm{W_i}^2 + \norm{W_j}^2 - \Theta(n/m)}.\\
%&= -\kappa_{ij}\times \Theta\p{\max\p{\norm{W_i}^2, \norm{W_j}^2}}
Both $\norm{W_i}$ and $\norm{W_j}$ stay $\Theta(1)$ throughout, while $n/m = o(1)$ by assumption #2, so we just get
\frac{\d \kappa_{ij}}{\d t}
\approx -\Theta(\kappa_{ij}).
Therefore, $\kappa_{ij}$ decays exponentially over time. Since the norms stay $\Theta(1)$, so does the dot product $W_i \cdot W_j$, which evolves as
W_i(t) \cdot W_j(t) \approx e^{-\Theta(t)}\p{W_i(0) \cdot W_j(0)} = \Theta\p{\frac{e^{-\Theta(t)}}{\sqrt{m}}}.
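The exponential decay can be illustrated with a small Euler simulation of the two-encoding dynamics. This is only a sketch: the dimension `m`, the Gaussian initialization with entries of variance $1/m$, the sign flip that forces a positive initial dot product, and the step size are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                                  # hypothetical dimension
w_i = rng.normal(size=m) / np.sqrt(m)     # assumed init: norm close to 1
w_j = rng.normal(size=m) / np.sqrt(m)
if w_i @ w_j < 0:
    w_j = -w_j                            # make the initial interference positive

d0 = float(w_i @ w_j)                     # typically of order 1/sqrt(m)
rate0 = float(w_i @ w_i + w_j @ w_j)      # predicted decay rate |W_i|^2 + |W_j|^2

dt, steps = 1e-3, 2000
for _ in range(steps):
    d = max(float(w_i @ w_j), 0.0)        # ReLU(W_i . W_j)
    # Simultaneous Euler step of the pairwise interference forces.
    w_i, w_j = w_i - dt * d * w_j, w_j - dt * d * w_i

T = dt * steps
d_T = float(w_i @ w_j)
predicted = d0 * np.exp(-rate0 * T)       # exponential decay at the initial rate
ratio = d_T / predicted
```

If the norms really stay close to their initial values, `ratio` should come out close to 1, meaning the simulated dot product tracks $e^{-\Theta(t)}\p{W_i(0) \cdot W_j(0)}$.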
Total change in $W_i$ and $W_j$ over time
Since $W_i \cdot W_j$ decreases exponentially starting at $\Theta\p{1/\sqrt{m}}$, the total push that $W_i$ will receive from the force $-\ReLU(W_i \cdot W_j) W_j$ over all of time will be about $-\Theta\p{1/\sqrt{m}}W_j$. Another way to think about it is that this force approximates the operation of removing from $W_i$ its projection onto $W_j$, which orthogonalizes them:
W_i \gets W_i - \frac{W_i \cdot W_j}{\norm{W_j}^2}W_j,
and the coefficient $\frac{W_i \cdot W_j}{\norm{W_j}^2}$ is indeed $\Theta\p{1/\sqrt{m}}$.
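The two descriptions can be compared directly under a simplifying assumption made here for illustration: both norms stay fixed and only the pair $(i, j)$ interacts, so the dot product decays at the constant rate $\norm{W_i}^2 + \norm{W_j}^2$. The total coefficient of the push along $W_j$ is then roughly

```latex
\int_0^\infty \ReLU\p{W_i(t) \cdot W_j(t)} \,\d t
\approx \p{W_i(0) \cdot W_j(0)} \int_0^\infty e^{-\p{\norm{W_i}^2 + \norm{W_j}^2}t} \,\d t
= \frac{W_i(0) \cdot W_j(0)}{\norm{W_i}^2 + \norm{W_j}^2}
= \Theta\p{\frac{1}{\sqrt{m}}}.
```

With near-unit norms this is about half of the projection coefficient, since $W_j$ is being pushed away from $W_i$ at the same time; the two agree up to a constant factor, which is all the $\Theta$ statement needs.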
Each such orthogonalization operation only decreases $\norm{W_i}^2$ by about
\Theta\p{\frac{1}{\sqrt{m}}} W_i \cdot W_j = \Theta\p{\frac{1}{m}},
which only sums up to $O(n/m)$ over all $j \ne i$. So arguably, since $n/m = o(1)$ by assumption #2, $W_i$ is barely affected by the resolution of the interferences. Therefore, by the end of phase 2, I expect we can morally assume that the encodings are distributed like a sample of $n$ independent uniform unit vectors in $\R^m$.
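A rough numerical check of this bookkeeping, as a sketch under assumed values: `n`, `m`, and the Gaussian initialization are illustrative, `remove_projection` is a hypothetical helper, and projecting out the other encodings one at a time only approximates the continuous dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5000                           # hypothetical sizes with n/m small
W = rng.normal(size=(n, m)) / np.sqrt(m)  # assumed init: norms close to 1

def remove_projection(w_i, w_j):
    """W_i <- W_i - (W_i . W_j / |W_j|^2) W_j."""
    return w_i - (w_i @ w_j) / (w_j @ w_j) * w_j

w0 = W[0]

# One projection: the squared norm drops by exactly (W_i . W_j)^2 / |W_j|^2.
single = remove_projection(w0, W[1])
single_drop = float(w0 @ w0 - single @ single)
predicted_drop = float((w0 @ W[1]) ** 2 / (W[1] @ W[1]))

# Project out every other encoding in turn and measure the total relative drop.
w = w0.copy()
for j in range(1, n):
    w = remove_projection(w, W[j])
total_relative_drop = float(1.0 - (w @ w) / (w0 @ w0))
```

Each single drop is of order $1/m$, and `total_relative_drop` should land within a small constant factor of `n / m`, consistent with $W_i$ being barely affected.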
#to-think explorer derive the combined force you get if you assume the norm doesn’t change, like you did in Feature benefit vs regularization