$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \newcommand{\Om}{\Omega} \newcommand{\om}{\omega} \newcommand{\Th}{\Theta} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}[1]{ {\sqrt{#1}}} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. miscer \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{ {\left.#1\right|_{#2}}} \newcommand{\pat}[2]{\p{\at{#1}{#2}}} \newcommand{\bat}[2]{\b{\at{#1}{#2}}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) 
\newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) \newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{ {\f{\d #1}{\d #2}}} \newcommand{\ds}[2]{ {\s{\d #1}{\d #2}}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. 
sets of numbers \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. 
keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} \newcommand{\zpmo}{\set{0,\pm 1}} \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \newcommand{\Der}[1]{\operatorname{D}_{#1}\mathopen{}} \newcommand{\Exp}[1]{\operatorname{E}_{#1}\mathopen{}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} \DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Thr}{\mathrm{Thr}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. 
bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} \newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. 
typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

This note uses Named tensor notation.

Suppose we have a regression problem with

  • inputs: $\ndef{\batch}{batch}\ndef{\coords}{coords}X \ce (x^{(1)}, \ldots, x^{(n)}) \in \R^{\batch \times\coords}$,
  • outputs: $y \ce (y^{(1)}, \ldots, y^{(n)}) \in \R^{\batch}$,

and we want to learn some weights $w \in \R^{\coords}$ to linearly approximate the relationship between inputs and outputs:

\[X \ndot\coords w \approx y.\]

More precisely, say we want to minimize the square loss¹

\[ \HAT\CL = \sum_\batch \p{X \ndot\coords w - y}^2 = \norm{X \ndot\coords w - y}^2_\batch. \]
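
As a concrete reference point, here is a minimal numpy sketch of this setup (the sizes and the random data are arbitrary choices for illustration, and the $\batch$ and $\coords$ axes are assumed to map to array axes 0 and 1):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 5                      # |batch|, |coords| (arbitrary illustrative sizes)
X = rng.normal(size=(n, d))        # inputs,  axes (batch, coords)
y = rng.normal(size=n)             # outputs, axis  (batch,)

def empirical_loss(w: np.ndarray) -> float:
    """Square loss: sum over the batch axis of (X .coords w - y)^2."""
    residual = X @ w - y           # contraction over coords leaves the batch axis
    return float(np.sum(residual ** 2))

print(empirical_loss(np.zeros(d)))
```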

Let $d \ce |\coords|$ be the number of input features, and $n \ce |\batch|$ be the number of data points. There are two main regimes, which lead to (at least on the surface) two completely different ways to solve this problem:

  • overdetermined regime ($n \gg d$):
    • many data points but few input features and parameters (underparametrized);
    • usually can’t fit the data perfectly;
    • the inputs usually span the whole space $\R^{\coords}$, in which case the loss has a unique minimum.
  • underdetermined regime ($n \ll d$):
    • few data points but many input features and parameters (overparametrized);
    • usually can fit the data perfectly;
    • the inputs don’t span the whole space $\R^{\coords}$, so the minimum is not unique, and we need to break ties somehow.

Overdetermined regime

In the overdetermined regime ($n \gg d$), as long as $X$ has full rank $d$, the empirical loss $\HAT\CL$ is strictly convex, so we only need to find the value of $w$ that makes the gradient zero. We have

\[ \d \HAT\CL = 2\p{X\ndot\coords w-y} \ndot\batch X \ndot\coords \d w, \]

so

\[ \frac{\partial \HAT\CL}{\partial w} = 2\p{X\ndot\coords w-y} \ndot\batch X. \]

Setting it to zero, we have

\[ X \ndot\batch \p{X \ndot\coords w} = X \ndot\batch y \]

so by associativity² we get

\[ \ub{X\nt\coords \ndot\batch X}_\text{second-moment matrix} \ndot\coords w = \ub{X\nt\coords \ndot\batch y}_\text{``input-output correlations''}, \]

where the second-moment matrix $X \ndot\batch X\nt\coords \in \R^{\coords' \times \coords}$ represents how different input coordinates correlate over the input data. As long as the input data spans $\R^\coords$ (i.e. $X$ has rank $d$), the second-moment matrix is invertible, so we get

\[ w = \p{X \ndot\batch X\nt\coords}^{-1} \ndot{\coords'} X\nt\coords \ndot\batch y. \]
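
A small numpy sketch of this closed form (random data and sizes are just for illustration), checking at the end that the gradient vanishes at the solution and that it matches ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                      # overdetermined: n >> d
X = rng.normal(size=(n, d))        # (batch, coords)
y = rng.normal(size=n)             # (batch,)

second_moment = X.T @ X            # the note's second-moment matrix, (coords', coords)
correlations = X.T @ y             # the "input-output correlations", (coords',)

# solve (second-moment matrix) .coords w = correlations
w = np.linalg.solve(second_moment, correlations)

# sanity checks: the gradient vanishes, and w agrees with least squares
grad = 2 * X.T @ (X @ w - y)
print(np.allclose(grad, 0, atol=1e-8))
print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0]))
```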

Underdetermined regime

In the underdetermined regime ($n \ll d$), there are many more parameters than constraints, so the loss doesn’t have a unique minimum, and we need to break ties somehow.

For this, let’s impose the inductive bias of gradient descent. Since all gradients are of the form

\[ X \ndot\batch \text{something}, \]

assuming we start at $w = 0^\coords$, the final value of the weights will be

\[ w = X \ndot\batch \psi \]

for some $\psi \in \R^{\batch}$. That is, $w$ will be a linear combination of the input vectors, with coefficients given by $\psi$. This makes sense, since the input vectors $x^{(1)}, \ldots, x^{(n)}$ are the only directions in $\R^{\coords}$ that the learning algorithm even “knows about”. And as it turns out, enforcing this is equivalent to picking the loss minimizer with the smallest norm $\norm{w}_\coords^2$.³
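
Here is a quick numpy check of this claim (the step size, step count, and sizes are ad-hoc choices): run plain gradient descent from $w = 0$ and verify that the result lies in the span of the inputs while fitting the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 100                      # underdetermined: n << d
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                    # start at the origin
lr = 1e-3                          # small enough for this problem's curvature
for _ in range(20_000):
    w -= lr * 2 * X.T @ (X @ w - y)    # every step adds something of the form X .batch (...)

# w should be X .batch psi for some psi, i.e. lie in the span of the input vectors
psi, *_ = np.linalg.lstsq(X.T, w, rcond=None)
print(np.allclose(X.T @ psi, w, atol=1e-6))      # in the span of the inputs
print(np.allclose(X @ w, y, atol=1e-6))          # and it fits the data perfectly
```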

Assuming we can fit the data perfectly (i.e. achieve zero loss), this gives

\[ y = X \ndot\coords w = X \ndot\coords \p{X \ndot\batch \psi}, \]

so by associativity we get

\[ \ub{X\nt\batch \ndot\coords X}_\text{Gram matrix} \ndot\batch \psi = y\nt\batch, \]

where the Gram matrix $X \ndot\coords X\nt\batch \in \R^{\batch \times \batch'}$ contains the inner products between the input vectors. As long as the input vectors are linearly independent, the Gram matrix is invertible, and we get

\[ \psi = \p{X \ndot\coords X\nt\batch}^{-1} \ndot{\batch'} y\nt\batch, \]

and thus

\[ w = X \ndot\batch \psi = X \ndot\batch \p{X \ndot\coords X\nt\batch}^{-1} \ndot{\batch'} y\nt\batch. \]
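
In numpy (again on arbitrary random data), this minimum-norm solution looks like the following; the last line checks it against `np.linalg.pinv`, which computes exactly this minimum-norm least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 100                      # underdetermined: n << d
X = rng.normal(size=(n, d))        # (batch, coords)
y = rng.normal(size=n)

gram = X @ X.T                     # Gram matrix (batch, batch'): inner products of the inputs
psi = np.linalg.solve(gram, y)     # coefficients of w in the basis of input vectors
w = X.T @ psi                      # w = X .batch psi

print(np.allclose(X @ w, y))                   # fits the data perfectly
print(np.allclose(w, np.linalg.pinv(X) @ y))   # equals the minimum-norm solution
```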

General case

It could be that neither the second-moment matrix $X \ndot\batch X\nt\coords$ nor the Gram matrix $X \ndot\coords X\nt\batch$ is invertible. This happens when $X$ doesn’t have full rank (for example, if we have $n=2$ inputs in $d=3$ dimensions, but the inputs are linearly dependent).

To deal with the general case, we use the singular value decomposition of $X$

\[ \ndef{\sing}{sing} X = \sum_\sing \tld{x}U V, \]

where

  • $|\sing|= \rank(X)$;
  • $\tld{x} \in \R^{\sing}$ contains the nonzero singular values of $X$;
  • $U \in \R^{\sing \times \coords}$ is an orthonormal basis of the span of $X$ within the input space $\R^{\coords}$;
  • $V\in \R^{\sing \times \batch}$ is an orthonormal basis of the span of $X$ within $\R^{\batch}$.

Keeping the inductive bias that $w$ must be in the span of $X$ (within $\R^{\coords}$), we can express it as

\[ w = U \ndot\sing \tld{w}. \]

Also, we can⁴ decompose $y$ into its projection onto the span of $X$ (within $\R^\batch$) and the perpendicular part:

\[ y = V \ndot \sing \tld{y} + y_\perp, \]

where $\tld{y} \ce V \ndot\batch y$ and $V \ndot\batch y_\perp = 0^\sing$. Then we can rewrite the loss as

\[ \al{ \HAT\CL &= \norm{X \ndot\coords w - y}_\batch^2\\ &= \norm{V \ndot\sing \p{\tld{x}\tld{w}} - V \ndot\sing\tld{y} - y_\perp}_\batch^2\tag{$U$ orthonormal}\\ &= \norm{V \ndot\sing \p{\tld{x}\tld{w} - \tld{y}}}_\batch^2 +\norm{y_\perp}_\batch^2\\ &= \norm{\tld{x}\tld{w} - \tld{y}}_\sing^2 + \norm{y_\perp}_\batch^2.\tag{$V$ orthonormal} } \]

So the minimum loss is $\norm{y_\perp}_\batch^2$, and it is achieved when

\[ \tld{x} \tld{w} = \tld{y} \Leftrightarrow \tld{w} = \frac{\tld{y}}{\tld{x}} \]

(where both the multiplication and the division are elementwise). Note how we made the nonsensical dream “$X w = y \Leftrightarrow w = \frac{y}{X}$” come true just by expressing $X$, $w$ and $y$ in the right way!
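
A numpy sketch of this general-case recipe, on a deliberately rank-deficient example (the sizes and the dependent row are arbitrary choices). The variables `U`, `V`, `x_tilde` are named to match the note; numpy's `svd` returns them in a different layout, so they get sliced and transposed accordingly. The last line checks that this is exactly the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 6
X = rng.normal(size=(n, d))
X[3] = X[0] + X[1]                 # make the inputs linearly dependent: rank(X) < min(n, d)
y = rng.normal(size=n)

U_np, s, Vt_np = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))  # |sing| = rank(X) (3 here), with an explicit tolerance
x_tilde = s[:r]                    # nonzero singular values, axis (sing,)
U = Vt_np[:r]                      # (sing, coords): orthonormal basis of the span of the inputs
V = U_np[:, :r].T                  # (sing, batch):  orthonormal basis of the span of X in batch-space

y_tilde = V @ y                    # y~ = V .batch y
w_tilde = y_tilde / x_tilde        # w~ = y~ / x~   (elementwise!)
w = U.T @ w_tilde                  # w  = U .sing w~

print(np.allclose(w, np.linalg.pinv(X, rcond=1e-10) @ y))   # matches the pseudoinverse solution
```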

How well does it learn truly linear relationships?

Suppose that the data is from some distribution $\CD$ where

  • the inputs $x$ are drawn from a standard $d$-dimensional Gaussian $\Normal(0, I)$,
  • the relationship is truly linear: $y = x \ndot\coords w^*$ for some fixed $w^*$.

Then how does the loss over the whole distribution

\[ \CL(w) \ce \E_{\Bx,\By \sim \CD}\b{\p{\Bx \ndot\coords w - \By}^2} \]

evolve as we increase the number $n$ of samples?

First, let’s figure out what got learned. We have

\[ \tld{y} = V \ndot\batch y = V \ndot\batch \p{X \ndot \coords w^*} = \tld{x}\p{U \ndot\coords w^*}, \]

so

\[ w = U \ndot\sing \tld{w} = U \ndot\sing \frac{\tld{y}}{\tld{x}} = U \ndot\sing \p{U \ndot\coords w^*} = \proj_U (w^*), \]

where $\proj_U(\cdot)$ denotes the orthogonal projection onto the space spanned by $U$.

Therefore,

\[ \al{ \CL(w) &= \E_{\Bx}\b{\p{\Bx \ndot\coords \proj_U(w^*) - \Bx \ndot\coords w^*}^2}\\ &= \E_{\Bx}\b{\p{\Bx \ndot\coords \p{\proj_U(w^*) - w^*}}^2}\\ &= \norm{\proj_U(w^*) - w^*}_\coords^2\tag{$\Bx \sim \Normal(0, I)$}\\ &= \norm{w^*}_\coords^2 - \norm{\proj_U(w^*)}_\coords^2\tag{perpendicularity}, } \]

and since the span of $U$ is distributed like a uniformly random subspace of $\R^\coords$ of dimension $\min(n, d)$, this loss has expectation

\[ \al{ \E_{X,y}\b{\CL(w)} &=\norm{w^*}_\coords^2 - \E_{X,y}\b{\norm{\proj_U(w^*)}_\coords^2}\\ &=\p{1-\f{\min(n,d)}{d}}\norm{w^*}_\coords^2\\ &= \begin{cases} \p{1-\f{n}{d}}\norm{w^*}_\coords^2 & \text{if $n < d$}\\ 0 & \text{otherwise.} \end{cases} } \]

over the randomness of the sample $X,y$.
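
A small simulation consistent with this formula (the dimension, seed, and number of trials are arbitrary): for each $n$, fit the minimum-norm solution on fresh Gaussian data and compare the average population loss to $\p{1-\f{n}{d}}\norm{w^*}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
w_star = rng.normal(size=d)
norm2_w_star = float(np.sum(w_star ** 2))

for n in [10, 25, 40, 50, 100]:
    losses = []
    for _ in range(200):
        X = rng.normal(size=(n, d))                # x ~ N(0, I), one row per data point
        y = X @ w_star                             # truly linear targets
        w = np.linalg.pinv(X) @ y                  # min-norm fit; equals proj_U(w*) here
        losses.append(np.sum((w - w_star) ** 2))   # population loss L(w) = ||w - w*||^2
    predicted = max(1 - n / d, 0) * norm2_w_star
    print(f"n={n:3d}  simulated={np.mean(losses):7.3f}  predicted={predicted:7.3f}")
```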

Stability

Since we’re dividing by $\tld{x}$, perturbations of $\tld{y}$ in directions that have particularly small (but nonzero) singular values get amplified in the learned weights by a factor of $1/\tld{x}$. This instability is ultimately what causes double descent.
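
A quick numpy illustration of this (the noise level and sizes are arbitrary choices): adding a little noise to $y$ barely matters when $n$ is far from $d$, but blows up the error near $n \approx d$, where the smallest nonzero singular values of $X$ are tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
w_star = rng.normal(size=d)
sigma = 0.1                                        # small perturbation of the targets

for n in [10, 25, 45, 50, 55, 100, 200]:
    losses = []
    for _ in range(200):
        X = rng.normal(size=(n, d))
        y = X @ w_star + sigma * rng.normal(size=n)   # slightly noisy targets
        w = np.linalg.pinv(X) @ y                     # divides the noise by the singular values
        losses.append(np.sum((w - w_star) ** 2))
    print(f"n={n:3d}  mean population loss = {np.mean(losses):8.3f}")
```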

#to-write

  • natural derivation of the affine case
    • either see it as an extra feature
    • or see it as “do regression on $X$ and $Y$ minus their averages”
  • connect to Correlation
    • variance unexplained
    • (revisit the “screening-off” property)
    • phrase directions of the correlations as features?
      • actually the learned weights are just a rescaling of the correlation!
      • they’re essentially $\f{\Cov[X,Y]}{\Var[X]}$
      • actually the correlation matrix is the linear regression between the whitened versions of $X$ and $Y$?
  1. One reason to care about minimizing the square loss is that it is equivalent to maximum likelihood estimation under the assumption that the relation between inputs and outputs is “truly linear but noisy”: $y^{(i)} \ce x^{(i)} \ndot\coords w + \eps^{(i)}$, where $\eps^{(i)} \sim \Normal(0, \sigma^2)$ is a noise term that is independent for each data point. 

  2. Here, we use the named tensor notation for transpose in order to keep the two $\coords$ axes separate. 

  3. To see this, observe that the set $\CW_\text{min} = \setco{w}{\HAT\CL(w) = \min_{w'}\HAT\CL(w')}$ of loss-minimizing weights is an affine subspace whose direction is the orthogonal complement of the space $\CW_X =\setco{X \ndot\batch \psi}{\psi}$ of weights that we’re considering, so the smallest weight vector in $\CW_\text{min}$ lies at the intersection of $\CW_\text{min}$ and $\CW_X$. See more details in “What do deep networks learn and when do they learn it” on Windows On Theory.

  4. The reason why it makes sense to do this a priori is that the gradient involves the term $X \ndot\batch y$, which only depends on the projection of $y$ onto the span of $V$.