$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \newcommand{\Om}{\Omega} \newcommand{\om}{\omega} \newcommand{\Th}{\Theta} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}[1]{ {\sqrt{#1}}} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. miscer \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{ {\left.#1\right|_{#2}}} \newcommand{\pat}[2]{\p{\at{#1}{#2}}} \newcommand{\bat}[2]{\b{\at{#1}{#2}}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) 
\newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) \newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{ {\f{\d #1}{\d #2}}} \newcommand{\ds}[2]{ {\s{\d #1}{\d #2}}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. 
sets of numbers \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. 
keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} \newcommand{\zpmo}{\set{0,\pm 1}} \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \newcommand{\Der}[1]{\operatorname{D}_{#1}\mathopen{}} \newcommand{\Exp}[1]{\operatorname{E}_{#1}\mathopen{}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} \DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Thr}{\mathrm{Thr}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. 
bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} \newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. 
typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

A personal summary of (part of) “Grokking modular arithmetic” by Andrey Gromov, though it can also be understood as “[Victor] Grokking [why Fourier transforms are natural features of] modular arithmetic.”

Non-modular addition

Suppose we train a network as follows:

  • task: given $x, y \in \set{-D, \ldots, D}$, compute their sum $s \in \set{-2D, \ldots, 2D}$;
  • encodings: both inputs and outputs are one-hot encoded (so there are $2(2D+1) = 4D+2$ input dimensions and $4D+1$ output dimensions);
  • network: fully connected, one hidden layer with activation $\fi$ (sketched in code below).
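Here is a minimal numpy sketch of this setup, just to fix ideas. The hidden width, the initialization, and the placeholder activation are arbitrary choices of mine (the post leaves $\fi$ generic for now), not something taken from the original.

```python
import numpy as np

D = 10                                              # hypothetical size
n_in, n_out = 2 * (2 * D + 1), 4 * D + 1            # 4D+2 input dims, 4D+1 output dims

def one_hot(v, lo, size):
    e = np.zeros(size)
    e[v - lo] = 1.0
    return e

def encode(x, y):
    # concatenate the one-hot encodings of x and y
    return np.concatenate([one_hot(x, -D, 2 * D + 1), one_hot(y, -D, 2 * D + 1)])

m = 64                                              # hidden width (arbitrary)
rng = np.random.default_rng(0)
W_in = rng.normal(size=(m, n_in)) / np.sqrt(n_in)   # input -> hidden weights
W_out = rng.normal(size=(n_out, m)) / np.sqrt(m)    # hidden -> output weights
phi = np.tanh                                       # placeholder activation

def forward(x, y):
    # logits over the possible sums s in {-2D, ..., 2D}
    return W_out @ phi(W_in @ encode(x, y))
```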

How do we figure out what will be learned? I claim that the question to ask is:

If you were a hidden neuron, how would you try to help?

Let’s take some neuron. It has weights coming in from the one-hot encodings of the inputs $x$ and $y$, and weights going out to each output $s$. Let’s denote them by the functions

\[ \left\{ \al{ f &: \set{-D, \ldots, D} \to \R\\ g &: \set{-D, \ldots, D} \to \R\\ h &: \set{-2D, \ldots, 2D} \to \R } \right. \]

respectively. That means that on inputs $x,y$, it contributes

\[ \fi(f(x)+g(y))h(s) \]

to the value of the $s\th$ output.

What does it mean to “try to help”? It would be helpful to be correlated with the correct behavior, so let’s define it as “activating the correct output more than the average output”. This means that, over a uniform $\Bx, \By, \Bs$, we want

\[ \E_{\Bx,\By}[\fi(f(\Bx)+g(\By))h(\Bx+\By)] > \E_{\Bx,\By,\Bs}[\fi(f(\Bx)+g(\By))h(\Bs)]. \]
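To make this criterion concrete (and to check the computations below numerically), here is a small helper of my own that evaluates both sides exactly by averaging over all inputs; it is not from the original post, and the later snippets assume it has been run.

```python
import numpy as np
from itertools import product

def correlation_gap(xs, ss, f, g, h, phi, add):
    """Both sides of the 'helpfulness' inequality, for x, y uniform over xs
    and s uniform over ss (exact averages over all tuples, no sampling)."""
    pairs = list(product(xs, xs))
    acts = np.array([phi(f(x) + g(y)) for x, y in pairs])
    aligned = np.mean([a * h(add(x, y)) for a, (x, y) in zip(acts, pairs)])
    # (x, y) is independent of s, so the right-hand side factors into a product
    baseline = acts.mean() * np.mean([h(s) for s in ss])
    return aligned, baseline
```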

For a moment, let’s imagine that there is no nonlinearity $\fi(\cdot)$, and we just want

\[ \E_{\Bx,\By}[(f(\Bx)+g(\By))h(\Bx+\By)] > \E_{\Bx,\By,\Bs}[(f(\Bx)+g(\By))h(\Bs)]. \]

That is, we want $h(\Bx+\By)$ to be as correlated as possible with $f(\Bx)+g(\By)$ (assuming they have mean $0$). The most correlated you can be is to be equal! This gives the following functional equation:

\[ f(x) + g(y) = h(x+y). \]

Olympiad math interlude

In particular, we have

\[ f(x) + g(0) = h(x) = f(0) + g(x), \]

so

\[ \al{ f(x) + f(y) &= f(x) + (g(y) + f(0) - g(0))\\ &= h(x+y) + f(0)-g(0)\\ &= (f(x+y) + g(0)) + f(0)-g(0)\\ &= f(x+y) + f(0). } \]

This is basically Cauchy’s functional equation ($F(x) \ce f(x) - f(0)$ satisfies $F(x)+F(y) = F(x+y)$), which over the integers solves to $f(x) = ax + b$. We can then plug this back in to get $g(y) = ay+b'$ and $h(s) = as + b + b'$.

Note: Since we’re only interested in making the correlation big, not perfect, it’s not quite enough to prove that this is the only solution using a small subset of the constraints (e.g. using $f(x)+g(y)=h(x+y)$ for only $o(D^2)$ values of $(x,y)$). But I do think that this is actually robust.

What correlation gap does this give us?

Since the correlation only cares about the sum $f(x)+g(y)$ (rather than $f$ and $g$ separately), let’s define $B \ce b+b'$. By independence, we have

\[ \al{ \E[(f(\Bx)+g(\By))h(\Bx+\By)] &= \E\b{(a\Bx+a\By+B)^2}\\ &= \E\b{(a\Bx)^2} + \E\b{(a\By)^2} + B^2\\ &= \Theta\p{a^2D^2} + B^2 } \]

and

\[ \al{ \E[(f(\Bx)+g(\By))h(\Bs)] &= \E[(a\Bx+a\By+B)(a\Bs+B)]\\ &= (a\E[\Bx]+a\E[\By]+B)(a\E[\Bs]+B)\\ &= B^2. } \]

We do have an appreciable gap! And just setting $B=0$ sounds reasonable if you’re the only neuron.¹
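As a numerical sanity check (reusing the `correlation_gap` helper from above, with hypothetical values of $D$, $a$, $b$, $b'$):

```python
D, a, b, bp = 10, 1.0, 0.3, -0.2                 # hypothetical parameters
B = b + bp
aligned, baseline = correlation_gap(
    xs=range(-D, D + 1), ss=range(-2 * D, 2 * D + 1),
    f=lambda x: a * x + b, g=lambda y: a * y + bp, h=lambda s: a * s + B,
    phi=lambda w: w,                             # no nonlinearity yet
    add=lambda x, y: x + y,
)
# aligned  = 2 a^2 D(D+1)/3 + B^2   (the Theta(a^2 D^2) + B^2 above)
# baseline = B^2
print(aligned, baseline)
```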

Modular addition

Now how about addition modulo some prime $p$?

As before, let’s define the weights of a particular neuron as

\[ \left\{ \al{ f &: \set{0, \ldots, p-1} \to \R\\ g &: \set{0, \ldots, p-1} \to \R\\ h &: \set{0, \ldots, p-1} \to \R. } \right. \]

If we want to maximize the correlation, we still have the functional equation

\[ f(x) + g(y) = h(x+y), \]

which (if we extend $f,g,h$ to $\N$ with period $p$) we’ve already seen requires $h(s)$ to be affine:

\[ h(s) = as+B. \]

But now we also have the constraint that $h(0) = h(p)$, which would require $a=0$? That would force all outputs to always be equal, which is obviously no good.

Okay so that means we can’t have perfect correlation, but we can maybe compromise on that a little bit. Intuitively, affine means “equally spaced on a line”. What’s the closest thing you can get to “equally spaced on a line”, while ensuring that you come back to the same point after $p$ steps? The answer is: “equally spaced on a circle”!

Unit circle

Concretely, the easiest way to think about it is using complex numbers $\C$, and placing the weights on the unit circle. Let² $h(s) \ce e^{i2\pi\frac{s}{p}}$ be the target value. Is it the case that setting $f(x) \ce e^{i2\pi \frac{x}{p}}$ and $g(y) \ce e^{i2\pi \frac{y}{p}}$ gives us

\[ f(x) + g(y) \approx h(x+y)? \]

Well, not quite, but close enough to fix.

There are two minor technical problems:

  • As it turns out, $e^{i2\pi \frac{x}{p}} + e^{i2\pi \frac{y}{p}}$ is aligned with the complex number $e^{i2\pi \frac{(x+y)/2}{p}}$, not $e^{i2\pi\frac{x+y}{p}}$ (indeed, $e^{i\theta_1} + e^{i\theta_2} = 2\cos\p{\frac{\theta_1-\theta_2}{2}}e^{i\frac{\theta_1+\theta_2}{2}}$). This could be fixed simply by changing $f$ and $g$ to $f(x) \ce e^{i2\pi \frac{2x}{p}}$ and $g(y) \ce e^{i2\pi \frac{2y}{p}}$.
  • But half of the time, $e^{i2\pi \frac{2x}{p}} + e^{i2\pi \frac{2y}{p}}$ is actually in the opposite direction from $e^{i2\pi\frac{x+y}{p}}$ (the cosine factor above can be negative). So the solution is to use a nonlinearity $\fi(w) \ce w^2$ to square the sum (and come back to the original definition of $f$ and $g$).

So now we have

\[ \fi(f(x)+g(y)) = \p{e^{i2\pi \frac{x}{p}} + e^{i2\pi \frac{y}{p}}}^2 = e^{i2\pi \frac{2x}{p}} + 2 \ub{e^{i2\pi\frac{x+y}{p}}}_\text{the important part} + e^{i2\pi \frac{2y}{p}}, \]

which is indeed correlated with $e^{i2\pi\frac{x+y}{p}}$.
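A quick numeric check of this expansion (with an arbitrary small prime and arbitrary inputs of my choosing):

```python
import numpy as np

p = 11
w = lambda t: np.exp(2j * np.pi * t / p)      # e^{i 2 pi t / p}
x, y = 3, 7                                    # arbitrary inputs
lhs = (w(x) + w(y)) ** 2                       # phi(f(x) + g(y)) with phi(w) = w^2
rhs = w(2 * x) + 2 * w(x + y) + w(2 * y)       # the three-term expansion
assert np.isclose(lhs, rhs)
```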

What correlation gap does this give us?

Though really, what we want is for the (expected) product

\[ \E[\fi(f(\Bx)+g(\By))h(\Bx+\By)] \]

to be big as a real number, which means one factor should be close to the complex conjugate of the other. So let’s conjugate $h$ into $h(s) \ce e^{-i2\pi\frac{s}{p}}$, getting

\[ \al{ \fi(f(x)+g(y))h(s) &= \p{e^{i2\pi \frac{2x}{p}} + 2 e^{i2\pi\frac{x+y}{p}} + e^{i2\pi \frac{2y}{p}}}e^{-i2\pi \frac{s}{p}}\\ &= e^{i2\pi \frac{2x-s}{p}} + 2 e^{i2\pi\frac{x+y-s}{p}} + e^{i2\pi \frac{2y-s}{p}}. } \]

We can now compute the correlations as

\[ \E[\fi(f(\Bx)+g(\By))h(\Bx+\By)] = \E\b{e^{i2\pi \frac{\Bx-\By}{p}} + 2+ e^{i2\pi \frac{\By-\Bx}{p}}} = 2, \]

whereas for any fixed $s$,

\[ \E[\fi(f(\Bx)+g(\By))h(s)] = \E\b{e^{i2\pi \frac{2\Bx-s}{p}} + 2 e^{i2\pi\frac{\Bx+\By-s}{p}} + e^{i2\pi \frac{2\By-s}{p}}} = 0, \]

since $\sum_{t=0}^{p-1} e^{i2\pi \frac{kt}{p}} = 0$ for any $k \in \set{1, \ldots, p-1}$.

So we do have a good gap!³ The fair comparison is to set $a = \frac{1}{D}$ in the non-modular case (so that the weights stay of constant size, as they do here), in which case the gap was $\Theta(1)$ there too. So by “bending the line into a circle”, we only lost a constant factor in what the correlation could be.
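Numerically (reusing the `correlation_gap` helper from earlier; there, the baseline averages over a uniform $\Bs$, which plays the role of the “average output”):

```python
p = 11
w = lambda t: np.exp(2j * np.pi * t / p)
aligned, baseline = correlation_gap(
    xs=range(p), ss=range(p),
    f=w, g=w,
    h=lambda s: np.conj(w(s)),          # the conjugated target e^{-i 2 pi s / p}
    phi=lambda z: z ** 2,
    add=lambda x, y: (x + y) % p,
)
print(np.round(aligned, 6), np.round(baseline, 6))   # 2 and 0 (up to rounding)
```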

#to-write compute the actual correlations (i.e. normalized by variance) instead of talking out of your ass?

But neural networks work with real numbers!

Exercise for the reader: prove that essentially the same thing happens if you replace $e^{i2\pi \frac{x}{p}}$ with $\cos\p{2\pi\frac{x}{p}}$.
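Not a proof, but here is a numeric check of the real-valued version (same helper as above; the constants come out a bit different, but the gap survives):

```python
p = 11
c = lambda t: np.cos(2 * np.pi * t / p)
aligned, baseline = correlation_gap(
    xs=range(p), ss=range(p),
    f=c, g=c, h=c,                       # all real now, so no conjugation needed
    phi=lambda w: w ** 2,
    add=lambda x, y: (x + y) % p,
)
print(round(aligned, 6), round(baseline, 6))   # 0.5 and 0.0
```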

Appendix: if you actually want a perfect solution

I still need to figure out how to do this (and whether it’s possible to make it work with the given $f$ and $g$; in particular I think you need a non-linearity $\fi$ here).

I think the intuition still holds even if this fails though, since the dynamics only care about current correlations, not about what the perfect solution would eventually look like.

So-far unsuccessful attempt:

  1. In practice, if you want the network to learn the function, you’d want the neurons to collaborate to create a gap for every possible output $s$, not for a uniformly random $\Bs$, and force the value to be small for every possible $x,y$ such that $x+y \ne s$, not just for a uniformly random $\Bx,\By$. I attempted this in the appendix but gave up after a while because it didn’t seem immediately crucial. 

  2. More generally, $h(s) \ce e^{i2\pi\frac{ks}{p}}$ is an option for any $k \in \set{1, \ldots, p-1}$, since we didn’t require

  3. Again, in practice, if you want the network to learn the function, this value would need to be small for every possible pair $x,y$ such that $x+y \ne s$, not just for a uniformly random $\Bx,\By$. To achieve this effect, you need to set the offset $B$ to random values.