Suppose a product $w_1 \cdots w_d$ is trained by gradient descent to approach a target value $s$. How fast does it approach, and how does the speed depend on the degree $d$?
More concretely, let’s define a squared loss
\CL \ce \f12(w_1\cdots w_d - s)^2,
and let each “weight” $w_i$ evolve over time by gradient flow, that is, along its negative gradient:
\frac{\d w_i}{\d t}= -\frac{\partial \CL}{\partial w_i} = (s-w_1 \cdots w_d) \prod_{j \ne i} w_j.
This corresponds to what would happen if we were somehow training a linear network of depth $d$ and width $1$ on a 1D linear regression task where the ground truth is $f(x) = sx$. But more generally, as shown in Deep linear networks, this also roughly describes the dynamics with which linear networks of any width learn the singular modes of the linear relationship between inputs and outputs.
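The dynamics above are easy to simulate. Below is a minimal sketch that integrates them with explicit Euler steps; the function name `simulate_product`, the step size `dt`, and the initial weights are illustrative assumptions, not part of the original setup.

```python
import math

def simulate_product(w0, s, dt=1e-3, steps=20000):
    """Euler-integrate dw_i/dt = (s - prod(w)) * prod_{j != i} w_j.

    Returns the list of products w_1 * ... * w_d, one per step (including t=0).
    The step size dt and the number of steps are arbitrary illustrative choices.
    """
    w = list(w0)
    products = [math.prod(w)]
    for _ in range(steps):
        residual = s - math.prod(w)
        # Product of all weights except the i-th, computed without division
        # so that zero weights are handled correctly.
        grads = [residual * math.prod(w[:i] + w[i + 1:]) for i in range(len(w))]
        w = [wi + dt * gi for wi, gi in zip(w, grads)]
        products.append(math.prod(w))
    return products

# Hypothetical example: degree 2, target s = 2, both weights start at 0.3.
traj = simulate_product([0.3, 0.3], s=2.0)
print(traj[0], traj[-1])
```

Plotting `traj` against time for different lengths of `w0` gives a quick visual check of how the growth and approach phases depend on the degree.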
Degree $d=2$
When the degree is $2$, we have just
\f{\d w_1}{\d t} &= (s-w_1w_2)w_2\\
\f{\d w_2}{\d t} &= (s-w_1w_2)w_1.
Phase 1: growth
Let’s first think about the early stage, when $w_1w_2 \le s/2$, where we have
\f{\d w_1}{\d t} &= \Theta(sw_2)\\
\f{\d w_2}{\d t} &= \Theta(sw_1).
Intuitively, we should expect $w_1$ and $w_2$ to approach each other (at least in relative terms): if initially $w_1 \gg w_2$, then $w_2$ grows much faster than $w_1$ until it mostly catches up. We describe this catching-up process in more detail in the section How fast do weights equalize? below, but for now let’s just assume that $w_1$ and $w_2$ have caught up, so that there is a single variable $w = w_1 = w_2$. The early stage corresponds to $w^2 \le s/2$, and
\f{\d w}{\d t} = \Theta(sw) \implies \f{\ds{\p{w^2}}{t}}{w^2} = \Theta(s)
The relative derivative is constant, so this solves to an exponential $w^2(t) = e^{\Theta(st)}w^2(0)$, which means that it would take time $\Theta\p{\f{\log(s/\eps)}{s}}$ to go from a small value $w^2=\eps$ to the halfway point $w^2 = s/2$.
Phase 2: approach
In the later stage of learning, when $w^2 \ge s/2$, we have
\f{\d\p{w^2}}{\d t} = 2\p{s-w^2}w^2,
\f{\ds{\p{s-w^2}}{t}}{s-w^2} = -2w^2 = -\Theta(s),
which solves to $s-w^2(t) = e^{-\Theta(s)t}\p{s-w^2(0)}$; it therefore takes time $\Theta\p{\f{\log(1/\eps)}{s}}$ to go from the halfway point $w^2=s/2$ to $w^2 = s(1-\eps)$.
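As a cross-check on both phases: once $w_1 = w_2 = w$, the product $y = w^2$ obeys a logistic equation that can be solved exactly. The symbols $y$, $y_0 = y(0)$, $t_{\text{growth}}$ and $t_{\text{approach}}$ are shorthand introduced only for this check, and we assume $0 < y_0 < s$.

```latex
\frac{dy}{dt} = 2(s-y)\,y
\quad\Longrightarrow\quad
y(t) = \frac{s}{1 + \left(\frac{s}{y_0}-1\right)e^{-2st}}.

% Phase 1: time from y_0 = \epsilon to the halfway point y = s/2
t_{\text{growth}} = \frac{1}{2s}\log\left(\frac{s}{\epsilon}-1\right)

% Phase 2: time from y = s/2 to y = s(1-\epsilon)
t_{\text{approach}} = \frac{1}{2s}\log\left(\frac{1-\epsilon}{\epsilon}\right)
```

Both expressions agree with the $\Theta$ estimates for the two phases when $\eps$ is small.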
Learning rate
- derive the maximum learning rate
- adapt the conclusion in terms of training steps
- but think about how this would change things with width >1 when there’s several different objectives with different singular values $s$: make the point that larger singular values are generally favored (which is not the case when $d=1$)
- (note that in Deep linear networks we didn’t worry about the learning rate because it was all relative between different singular values; on the other hand you can’t make claims about (absolute) acceleration without worrying about those learning rates)
General case
In general, similar dynamics will force the weights $w_1, \ldots, w_d$ to approach each other in relative terms, and we would be left with one variable $w$ with
\f{\d w}{\d t} = \p{s - w^d} w^{d-1} \implies \f{\d\p{w^d}}{\d t} = d\p{s - w^d} w^{2d-2}.
Phase 1: growth
When $w^d \le s/2$, treating $d$ as a constant, we have
\f{\ds{\p{w^d}}t}{w^d} = \Theta\p{s w^{d-2}} = \Theta\p{s\p{w^d}^{1-2/d}}.
Depending on the sign of the exponent $1-2/d$ in this relative derivative, $w^d$ will experience different types of growth:
- If $d < 2$, the only relevant case is $d=1$, where $w$ grows linearly: $w(t) = w(0) + \Theta(st)$. So going from a small value $w\approx 0$ to the halfway point $w=s/2$ takes $\Theta(1)$ time.
- If $d=2$, as seen above, $w^2$ grows exponentially, and going from $\eps$ to $s/2$ takes $\Theta\p{\f{\log(s/\eps)}{s}}$ time.
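- If $d>2$, then $w^d$ will grow hyperbolically as $w^d(t) = \Theta\p{\f1{s(t_0-t)}}^{1+\f{2}{d-2}}$ for some blow-up time $t_0 > t$, and going from a small value $w^d = \eps$ to the halfway point $w^d = s/2$ takes $\Theta\p{\f{\eps^{-(1-2/d)}}{s}}$ time.
In particular, if $w^d$ starts from a small value $\eps$, degrees $d>2$ take a long time to take off: the wait is polynomial in $1/\eps$, versus logarithmic for $d=2$ and constant for $d=1$. On the other hand, the times for $d \ge 2$ shrink like $1/s$ while the time for $d=1$ does not, so which degree is fastest depends on how large $s$ is relative to $\eps$.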
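To make the comparison between degrees concrete, the growth time can be computed numerically from $\f{\d\p{w^d}}{\d t} = d\p{s - w^d} w^{2d-2}$. The sketch below is only illustrative: the helper name `takeoff_time`, the midpoint quadrature, and the values of `s` and `eps` are assumptions made for this example.

```python
import math

def takeoff_time(d, s, eps, n=20000):
    """Time for y = w^d to go from eps to s/2 under dy/dt = d (s - y) y^(2 - 2/d).

    Computed as the integral of dy / (d (s - y) y^(2 - 2/d)) using the
    midpoint rule in the variable u = log(y).
    """
    lo, hi = math.log(eps), math.log(s / 2)
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = math.exp(lo + (k + 0.5) * h)
        # dy = y du, so the integrand in u is y / (d (s - y) y^(2 - 2/d)).
        total += h * y / (d * (s - y) * y ** (2 - 2 / d))
    return total

# Hypothetical values: target s = 1, starting product eps = 1e-6.
for d in (1, 2, 3, 4):
    print(d, takeoff_time(d, s=1.0, eps=1e-6))
```

Unlike the $\Theta$ estimates, this keeps the constant factor $d$, so it can be used to compare degrees for specific values of $s$ and $\eps$.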
Phase 2: approach
When $w^d \ge s/2$, treating $d$ as a constant, we have
\f{\ds{\p{s-w^d}}t}{s-w^d} = -\Theta\p{w^{2d-2}} = -\Theta\p{\p{w^d}^{2-2/d}} = -\Theta\p{s^{2-2/d}},
so in any case the approach is exponential, with
s - w^d(t) = e^{-\Theta\p{s^{2-2/d}t}}\p{s-w^d(0)},
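which means that going from the halfway point $w^d = s/2$ to $w^d = (1-\eps)s$ takes
\Theta\p{\f{\log(1/\eps)}{s^{2-2/d}}}
time. In this phase, a bigger degree is better whenever $s > 1$, since the rate $s^{2-2/d}$ then increases with $d$; for $s < 1$ the rate decreases with $d$ instead (ignoring the constant factor $d$ we dropped).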
Learning rate
#to-write same as above
Now suppose that we’re trying to approach $s$ with a monomial like $w_1^2 w_2$, where some of the weights are squared, and the loss is still the square
\CL \ce \f12\p{w_1^2w_2 - s}^2.
This is roughly the situation that would arise if you’re trying to approximate the function $f(x) = sx^2$ using a depth-2 network with a quadratic nonlinearity $\fi(h) \ce h^2$ on its hidden layer.
Then the dynamics are
\f{\d w_1}{\d t} &= 2(s-w_1^2w_2)w_1w_2\\
\f{\d w_2}{\d t} &= (s-w_1^2w_2)w_1^2.
Again, in the growth phase where $w_1^2w_2 \le s/2$, we have
\f{\d w_1}{\d t} &= \Theta(sw_1w_2)\\
\f{\d w_2}{\d t} &= \Theta(sw_1^2),
so again, $w_1$ and $w_2$ will tend to equalize, because if e.g. $w_1 \gg w_2$, then $w_2$ will grow much faster. So we can approximate the dynamics by
\f{\d w}{\d t} = \Theta\p{\p{s-w^3}w^2}.
Not too surprisingly, these are exactly the same dynamics that we got for the problem of approximating $s$ by $w_1 w_2 w_3$. So, depending on how a nonlinearity acts, it seems like it could make a neural network act like it has more depth than it actually has!
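A quick numerical experiment can illustrate the two-weight dynamics for $w_1^2 w_2$. The sketch below uses explicit Euler steps; the function name `simulate_quadratic_monomial`, the step size, and the initial values are illustrative assumptions. As a side observation, the gradient-flow equations imply that $w_1^2 - 2w_2^2$ stays constant over time, so "equalize" here holds up to a constant factor, and the code prints that quantity as a sanity check.

```python
def simulate_quadratic_monomial(w1, w2, s, dt=1e-3, steps=50000):
    """Euler-integrate gradient flow on L = (w1^2 * w2 - s)^2 / 2."""
    for _ in range(steps):
        residual = s - w1 * w1 * w2
        dw1 = 2 * residual * w1 * w2
        dw2 = residual * w1 * w1
        # Update both weights simultaneously from the old values.
        w1, w2 = w1 + dt * dw1, w2 + dt * dw2
    return w1, w2

# Hypothetical initialization with w1 noticeably larger than w2.
w1, w2 = simulate_quadratic_monomial(0.5, 0.1, s=1.0)
print("monomial:", w1 * w1 * w2)
print("ratio w1/w2:", w1 / w2)
print("w1^2 - 2 w2^2:", w1 * w1 - 2 * w2 * w2)  # started at 0.25 - 0.02 = 0.23
```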
How fast do weights equalize?
Let’s study how fast the ratio $w_1/w_2$ approaches $1$ when weights are subject to the dynamics
\f{\d w_1}{\d t} &= (s-w_1w_2)w_2\\
\f{\d w_2}{\d t} &= (s-w_1w_2)w_1.
For this, it’s convenient to look at the relative derivative of the ratio $w_1/w_2$ and compare it with the relative derivative of the product $w_1w_2$, since this lets us compare the timescales on which equalizing and learning happen.
We have
\f{\ds{(w_1w_2)}t}{w_1w_2} &= \f{\ds{w_1}t}{w_1} + \f{\ds{w_2}t}{w_2} = (s-w_1w_2)\p{\f{w_2}{w_1} + \f{w_1}{w_2}}\\
\f{\ds{(w_1/w_2)}t}{w_1/w_2} &= \f{\ds{w_1}t}{w_1} - \f{\ds{w_2}t}{w_2} = (s-w_1w_2)\p{\f{w_2}{w_1} - \f{w_1}{w_2}},
which at the least confirms that (as long as $w_1w_2 < s$) if $w_1 > w_2$ then $w_1/w_2$ is decreasing over time, and vice versa. Let’s assume we start out with $w_1 \gg w_2 > 0$. Then we have
\f{\ds{(w_1w_2)}t}{w_1w_2} &\approx (s-w_1w_2)\f{w_1}{w_2}\\
\f{\ds{(w_1/w_2)}t}{w_1/w_2} &\approx -(s-w_1w_2)\f{w_1}{w_2}.
That is, $w_1w_2$ increases at the same relative rate as $w_1/w_2$ decreases, which means (to a first approximation), we can expect that $w_1$ and $w_2$ will roughly equalize as long as we start out with
\UB{\f{s}{w_1w_2}}_\text{how much $w_1w_2$ needs to grow} \ge \UB{\f{w_1}{w_2}}_\text{how much $w_1/w_2$ needs to decrease} \iff w_1 \le \sqrt{s}.
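The criterion $w_1 \le \sqrt{s}$ can be probed numerically. The sketch below integrates the $d=2$ dynamics with explicit Euler steps from two hypothetical initializations, one on each side of the threshold; the function name `final_ratio`, the step size, and the starting values are assumptions chosen for illustration.

```python
def final_ratio(w1, w2, s, dt=1e-3, steps=50000):
    """Euler-integrate the d = 2 dynamics and return w1 / w2 at the end."""
    for _ in range(steps):
        residual = s - w1 * w2
        # Tuple assignment evaluates both right-hand sides with the old values.
        w1, w2 = w1 + dt * residual * w2, w2 + dt * residual * w1
    return w1 / w2

s = 1.0
print(final_ratio(0.5, 1e-3, s))  # starts with w1 < sqrt(s)
print(final_ratio(3.0, 1e-3, s))  # starts with w1 > sqrt(s)
```

In the first case the final ratio should end up close to $1$, while in the second the weights stay visibly unbalanced even after the product has converged to $s$.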