Dynamics depend on degree
Suppose a product $w = w_1 w_2 \cdots w_k$ of $k$ scalar weights is trained by gradient descent to match a target value. How do the dynamics depend on the degree $k$ of the product?
Setup
More concretely, let’s define a squared loss $L = \frac{1}{2}(w - \lambda)^2$ between the product $w = w_1 w_2 \cdots w_k$ and a target $\lambda > 0$,
and make each “weight” equal, $w_i = u$, starting from a small initialization $u(0) = \epsilon^{1/k}$ so that $w(0) = \epsilon$; then $w = u^k$ and gradient flow gives $\dot{u} = -\partial L / \partial u = -k u^{k-1}(w - \lambda)$.
This corresponds to what would happen if we were somehow training a linear network of depth $k$ and width $1$ on a single scalar target.
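As a quick sanity check of this setup, here’s a minimal gradient-descent simulation (the function name, learning rate, and step count are my own illustrative choices):

```python
import numpy as np

def train_product(k, target=1.0, eps=1e-2, lr=1e-3, steps=100_000):
    """Gradient descent on L = 0.5 * (prod(w) - target)**2 with k scalar
    weights, all initialized to eps**(1/k) so the product starts at eps."""
    w = np.full(k, eps ** (1.0 / k))
    for _ in range(steps):
        prod = np.prod(w)
        grad = (prod - target) * prod / w  # dL/dw_i = (prod - target) * prod / w_i
        w -= lr * grad
    return np.prod(w)

print(train_product(2), train_product(3))  # both should end up near the target
```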
Degree 2
When the degree is $k = 2$, we have $w = u^2$ and $\dot{u} = -2u(w - \lambda)$, so the product follows logistic dynamics: $\dot{w} = 2u\dot{u} = 4w(\lambda - w)$.
Phase 1: growth
Let’s first think about the early stage, when $w \ll \lambda$.
Intuitively, we should expect $\dot{w} \approx 4\lambda w$: the gradient on each weight is proportional to the other weight, so the growth compounds.
The relative derivative $\dot{w}/w \approx 4\lambda$ is constant, so this solves to an exponential: $w(t) \approx w(0)\, e^{4\lambda t}$.
Phase 2: approach
In the later stage of learning, when $w$ is close to $\lambda$, write the gap as $\delta = \lambda - w$,
so $\dot{\delta} = -\dot{w} = -4w\delta \approx -4\lambda \delta,$
which solves to $\delta(t) \approx \delta(0)\, e^{-4\lambda t}$.
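Both phases can be checked by Euler-integrating the logistic ODE directly (step size and the window thresholds below are arbitrary choices of mine):

```python
import numpy as np

# Euler-integrate the degree-2 dynamics  dw/dt = 4*w*(lam - w)
lam, eps, dt = 1.0, 1e-4, 1e-4
w, ws = eps, []
for _ in range(int(8 / dt)):
    ws.append(w)
    w += dt * 4 * w * (lam - w)
ws = np.array(ws)

# Phase 1: while w << lam, log(w) grows at rate ~ 4*lam.
early = ws[ws < 1e-3 * lam]
growth_rate = np.log(early[-1] / early[0]) / (dt * (len(early) - 1))

# Phase 2: once w is close to lam, log(lam - w) shrinks at rate ~ 4*lam.
gap = lam - ws
late = gap[(gap < 1e-2 * lam) & (gap > 1e-8 * lam)]
decay_rate = np.log(late[0] / late[-1]) / (dt * (len(late) - 1))

print(round(growth_rate, 2), round(decay_rate, 2))
```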
Learning rate
#to-write
- derive the maximum learning rate
- adapt the conclusion in terms of training steps
- but think about how this would change things with width $> 1$, when there are several different objectives with different singular values: make the point that larger singular values are generally favored (which is not the case when $k = 1$) (note that in Deep linear networks we didn’t worry about the learning rate because it was all relative between different singular values; on the other hand you can’t make claims about (absolute) acceleration without worrying about those learning rates)
General case
In general, similar dynamics will force the weights to equalize,[^1] so we can again take $w_i = u$ and $w = u^k$; substituting into $\dot{w} = k u^{k-1} \dot{u}$ gives $\dot{w} = k^2 w^{2 - 2/k}(\lambda - w)$.
Phase 1: growth
When $w \ll \lambda$, the relative derivative is $\dot{w}/w \approx k^2 \lambda\, w^{1 - 2/k}$.
Depending on the sign of the exponent $1 - 2/k$, there are three regimes:
- If $1 - 2/k < 0$, then $w$ will grow polynomially. In particular, the only relevant case is $k = 1$, and gives $\dot{w} \approx \lambda$, i.e. $w(t) \approx \epsilon + \lambda t$. So going from a small value $\epsilon$ to the halfway point $\lambda/2$ takes $\Theta(1)$ time.
- If $1 - 2/k = 0$, i.e. $k = 2$, as seen above, $w$ grows exponentially, and going from $\epsilon$ to $\lambda/2$ takes $\Theta\left(\frac{1}{\lambda} \log \frac{\lambda}{\epsilon}\right)$ time.
- If $1 - 2/k > 0$, i.e. $k > 2$, then $w$ will grow hyperbolically as $w(t) \propto (t_* - t)^{-k/(k-2)}$, and going from a small value $\epsilon$ to the halfway point takes $\Theta\left(\frac{\epsilon^{-(1 - 2/k)}}{k^2 \lambda}\right)$ time.
In particular, if the initialization $\epsilon$ is very small, a higher degree is slower in this phase: the escape time grows from $\Theta(1)$ at $k = 1$, to $\Theta(\log(1/\epsilon))$ at $k = 2$, to polynomial in $1/\epsilon$ for $k > 2$.
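These escape times are easy to verify numerically; a sketch (using the general dynamics above, with step size and the particular $\epsilon$ values being my own choices):

```python
def escape_time(k, eps, lam=1.0, dt=1e-4):
    """Euler-integrate dw/dt = k**2 * w**(2 - 2/k) * (lam - w) from w = eps
    and return the time taken to reach the halfway point lam/2."""
    w, t = eps, 0.0
    while w < lam / 2:
        w += dt * k**2 * w ** (2 - 2 / k) * (lam - w)
        t += dt
    return t

for eps in (1e-3, 1e-6):
    # k=1: ~constant; k=2: grows like log(1/eps); k=3: grows like eps**(-1/3)
    print(eps, [round(escape_time(k, eps), 2) for k in (1, 2, 3)])
```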
Phase 2: approach
When $w$ is close to $\lambda$, write $\delta = \lambda - w$ again; then $\dot{\delta} \approx -k^2 \lambda^{2 - 2/k}\, \delta,$
so in any case the approach is exponential, with rate $k^2 \lambda^{2 - 2/k}$,
which means that going from the halfway point $\lambda/2$ to within $\delta$ of the target takes $\Theta\left(\frac{1}{k^2 \lambda^{2 - 2/k}} \log \frac{\lambda}{\delta}\right)$
time. In this phase, bigger degree is always better (at least when $\lambda \geq 1$, so that both the $k^2$ and the $\lambda^{2 - 2/k}$ factors increase with $k$).
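Measuring the approach rate numerically for a few degrees (with $\lambda = 1$, so the predicted rate is just $k^2$; the fitting window and step size are arbitrary choices):

```python
import numpy as np

def approach_rate(k, lam=1.0, dt=1e-4, T=5.0):
    """Starting from the halfway point, measure the exponential rate at which
    the gap delta = lam - w shrinks under dw/dt = k**2 * w**(2-2/k) * (lam - w)."""
    w = lam / 2
    gaps = []
    for _ in range(int(T / dt)):
        gaps.append(lam - w)
        w += dt * k**2 * w ** (2 - 2 / k) * (lam - w)
    gaps = np.array(gaps)
    # fit on a window where the gap is small (linear regime) but not yet
    # at floating-point resolution
    tail = gaps[(gaps < 1e-2 * lam) & (gaps > 1e-8 * lam)]
    return np.log(tail[0] / tail[-1]) / (dt * (len(tail) - 1))

for k in (1, 2, 3):
    print(k, round(approach_rate(k), 2))  # theory: k**2 * lam**(2 - 2/k)
```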
Learning rate
#to-write same as above
Nonlinearities
Now suppose that we’re trying to approach the target $\lambda$ with $w = v\, u^p$ for some power $p \geq 1$.
This is roughly the situation that would arise if you’re trying to approximate function $x \mapsto \lambda x^p$ with a one-neuron network $x \mapsto v\, \sigma(u x)$ with activation $\sigma(z) = z^p$.
Then the dynamics are $\dot{u} = -p u^{p-1} v\,(w - \lambda)$ and $\dot{v} = -u^p (w - \lambda)$, so $\dot{w} = u^p \dot{v} + p u^{p-1} v\, \dot{u} = -\left(u^{2p} + p^2 u^{2p-2} v^2\right)(w - \lambda)$.
Again, in the growth phase where $w \ll \lambda$, balanced weights ($u^2 = p v^2$, which the flow preserves) give $u^{2p} + p^2 u^{2p-2} v^2 = (1 + p)\, u^{2p} \propto w^{2p/(p+1)}$,
so again, $\dot{w}/w \propto \lambda\, w^{1 - 2/(p+1)}$.
Not too surprisingly, this is exactly the same dynamics that we got for the problem of approximating $\lambda$ with a product of degree $k = p + 1$.
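A sketch of this in code, assuming the concrete parametrization $w = v\, u^p$ with power $p = 2$, target $\lambda = 1$, and balanced initialization $u^2 = p v^2$ (a quantity the continuous-time flow conserves, which the simulation also checks):

```python
import numpy as np

# Gradient flow (Euler) on L = 0.5 * (v * u**p - lam)**2 -- the coefficient
# w = v * u**p should behave like a product of degree p + 1.
p, lam, dt = 2, 1.0, 1e-4
u, v = 0.1 * np.sqrt(p), 0.1      # balanced: u**2 == p * v**2
for _ in range(100_000):
    resid = v * u**p - lam
    u, v = u - dt * resid * p * u ** (p - 1) * v, v - dt * resid * u**p
print(round(v * u**p, 4), abs(u**2 - p * v**2) < 1e-6)
```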
How fast do weights equalize?
Let’s study how fast the ratio $r = w_1/w_2$ approaches $1$, in the degree-$2$ case.
For this, it’s convenient to look at the relative derivative of both the ratio $r$ and the product $w = w_1 w_2$.
We have $\frac{\dot{r}}{r} = \frac{\dot{w}_1}{w_1} - \frac{\dot{w}_2}{w_2} = (w - \lambda)\left(r - \frac{1}{r}\right)$ and $\frac{\dot{w}}{w} = \frac{\dot{w}_1}{w_1} + \frac{\dot{w}_2}{w_2} = (\lambda - w)\left(r + \frac{1}{r}\right),$
which at the least confirms that if $w < \lambda$, the ratio moves toward $1$: $\dot{r} < 0$ when $r > 1$, and $\dot{r} > 0$ when $r < 1$.
That is, $\left|\frac{\dot{r}}{r}\right| = \frac{r - 1/r}{r + 1/r} \cdot \frac{\dot{w}}{w}$ during the growth phase (for $r > 1$). Equivalently, since $w_1^2 - w_2^2$ is conserved, $r - \frac{1}{r} = \frac{w_1^2 - w_2^2}{w}$ decays like $1/w$ as the product grows.
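A quick check of both claims (the unbalanced initialization and target are arbitrary choices): the product converges, $w_1^2 - w_2^2$ stays essentially constant even under Euler integration, and the final imbalance matches $(w_1^2 - w_2^2)/w$:

```python
# Degree-2 product w = w1*w2, gradient flow (Euler) on L = 0.5*(w - lam)**2,
# starting from unbalanced weights.
lam, dt = 1.0, 1e-4
w1, w2 = 0.3, 0.1
c0 = w1**2 - w2**2          # conserved by the continuous-time dynamics
for _ in range(100_000):
    resid = w1 * w2 - lam
    w1, w2 = w1 - dt * resid * w2, w2 - dt * resid * w1
r = w1 / w2
print(round(w1 * w2, 3), round(r - 1 / r, 3), round(c0, 3))
```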
[^1]: or at least, it seems like their absolute values would