$\require{mathtools} % %%% GENERIC MATH %%% % % Environments \newcommand{\al}[1]{\begin{align}#1\end{align}} % need this for \tag{} to work \renewcommand{\r}{\mathrm} % % Greek \newcommand{\eps}{\epsilon} \newcommand{\veps}{\varepsilon} \newcommand{\Om}{\Omega} \newcommand{\om}{\omega} \newcommand{\Th}{\Theta} \let\fi\phi % because it looks like an f \let\phi\varphi % because it looks like a p % % Miscellaneous shortcuts % .. over and under \newcommand{\ss}[1]{_{\substack{#1}}} \newcommand{\ob}{\overbrace} \newcommand{\ub}{\underbrace} \newcommand{\ol}{\overline} \newcommand{\tld}{\widetilde} \newcommand{\HAT}{\widehat} \newcommand{\f}{\frac} \newcommand{\s}[2]{#1 /\mathopen{}#2} \newcommand{\rt}[1]{ {\sqrt{#1}}} % .. relations \newcommand{\sr}{\stackrel} \newcommand{\sse}{\subseteq} \newcommand{\ce}{\coloneqq} \newcommand{\ec}{\eqqcolon} \newcommand{\ap}{\approx} \newcommand{\ls}{\lesssim} \newcommand{\gs}{\gtrsim} % .. miscer \newcommand{\q}{\quad} \newcommand{\qq}{\qquad} \newcommand{\heart}{\heartsuit} % % Delimiters % (I needed to create my own because the MathJax version of \DeclarePairedDelimiter doesn't have \mathopen{} and that messes up the spacing) % .. one-part \newcommand{\p}[1]{\mathopen{}\left( #1 \right)} \newcommand{\b}[1]{\mathopen{}\left[ #1 \right]} \newcommand{\set}[1]{\mathopen{}\left\{ #1 \right\}} \newcommand{\abs}[1]{\mathopen{}\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\mathopen{}\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\mathopen{}\left\lceil #1 \right\rceil} \newcommand{\inner}[1]{\mathopen{}\left\langle #1 \right\rangle} % .... (use phantom to force at least the standard height of double bars) \newcommand{\norm}[1]{\mathopen{}\left\lVert #1 \vphantom{f} \right\rVert} \newcommand{\frob}[1]{\norm{#1}_\mathrm{F}} %% .. two-part \newcommand{\incond}[2]{#1 \mathop{}\middle|\mathop{} #2} \newcommand{\cond}[2]{ {\left.\incond{#1}{#2}\right.}} \newcommand{\pco}[2]{\p{\incond{#1}{#2}}} \newcommand{\bco}[2]{\b{\incond{#1}{#2}}} \newcommand{\setco}[2]{\set{\incond{#1}{#2}}} \newcommand{\at}[2]{ {\left.#1\right|_{#2}}} \newcommand{\pat}[2]{\p{\at{#1}{#2}}} \newcommand{\bat}[2]{\b{\at{#1}{#2}}} % ..... (use phantom to force at least the standard height of double bar) \newcommand{\oldpara}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} %\newcommand{\para}[2]{#1\vphantom{f} \mathop{}\middle\|\mathop{} #2} \newcommand{\para}[2]{\mathchoice{\begin{matrix}#1\\\hdashline#2\end{matrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}{\begin{smallmatrix}#1\\\hdashline#2\end{smallmatrix}}} \newcommand{\ppa}[2]{\p{\para{#1}{#2}}} \newcommand{\bpa}[2]{\b{\para{#1}{#2}}} %\newcommand{\bpaco}[4]{\bpa{\incond{#1}{#2}}{\incond{#3}{#4}}} \newcommand{\bpaco}[4]{\bpa{\cond{#1}{#2}}{\cond{#3}{#4}}} % % Levels of closeness \newcommand{\scirc}[1]{\sr{\circ}{#1}} \newcommand{\sdot}[1]{\sr{.}{#1}} \newcommand{\slog}[1]{\sr{\log}{#1}} \newcommand{\createClosenessLevels}[7]{ \newcommand{#2}{\mathrel{(#1)}} \newcommand{#3}{\mathrel{#1}} \newcommand{#4}{\mathrel{#1\!\!#1}} \newcommand{#5}{\mathrel{#1\!\!#1\!\!#1}} \newcommand{#6}{\mathrel{(\sdot{#1})}} \newcommand{#7}{\mathrel{(\slog{#1})}} } \let\lt\undefined \let\gt\undefined % .. vanilla versions (is it within a constant?) \newcommand{\ez}{\scirc=} \newcommand{\eq}{\simeq} \newcommand{\eqq}{\mathrel{\eq\!\!\eq}} \newcommand{\eqqq}{\mathrel{\eq\!\!\eq\!\!\eq}} \newcommand{\lez}{\scirc\le} \newcommand{\lq}{\preceq} \newcommand{\lqq}{\mathrel{\lq\!\!\lq}} \newcommand{\lqqq}{\mathrel{\lq\!\!\lq\!\!\lq}} \newcommand{\gez}{\scirc\ge} \newcommand{\gq}{\succeq} \newcommand{\gqq}{\mathrel{\gq\!\!\gq}} \newcommand{\gqqq}{\mathrel{\gq\!\!\gq\!\!\gq}} \newcommand{\lz}{\scirc<} \newcommand{\lt}{\prec} \newcommand{\ltt}{\mathrel{\lt\!\!\lt}} \newcommand{\lttt}{\mathrel{\lt\!\!\lt\!\!\lt}} \newcommand{\gz}{\scirc>} \newcommand{\gt}{\succ} \newcommand{\gtt}{\mathrel{\gt\!\!\gt}} \newcommand{\gttt}{\mathrel{\gt\!\!\gt\!\!\gt}} % .. dotted versions (is it equal in the limit?) \newcommand{\ed}{\sdot=} \newcommand{\eqd}{\sdot\eq} \newcommand{\eqqd}{\sdot\eqq} \newcommand{\eqqqd}{\sdot\eqqq} \newcommand{\led}{\sdot\le} \newcommand{\lqd}{\sdot\lq} \newcommand{\lqqd}{\sdot\lqq} \newcommand{\lqqqd}{\sdot\lqqq} \newcommand{\ged}{\sdot\ge} \newcommand{\gqd}{\sdot\gq} \newcommand{\gqqd}{\sdot\gqq} \newcommand{\gqqqd}{\sdot\gqqq} \newcommand{\ld}{\sdot<} \newcommand{\ltd}{\sdot\lt} \newcommand{\lttd}{\sdot\ltt} \newcommand{\ltttd}{\sdot\lttt} \newcommand{\gd}{\sdot>} \newcommand{\gtd}{\sdot\gt} \newcommand{\gttd}{\sdot\gtt} \newcommand{\gtttd}{\sdot\gttt} % .. log versions (is it equal up to log?) \newcommand{\elog}{\slog=} \newcommand{\eqlog}{\slog\eq} \newcommand{\eqqlog}{\slog\eqq} \newcommand{\eqqqlog}{\slog\eqqq} \newcommand{\lelog}{\slog\le} \newcommand{\lqlog}{\slog\lq} \newcommand{\lqqlog}{\slog\lqq} \newcommand{\lqqqlog}{\slog\lqqq} \newcommand{\gelog}{\slog\ge} \newcommand{\gqlog}{\slog\gq} \newcommand{\gqqlog}{\slog\gqq} \newcommand{\gqqqlog}{\slog\gqqq} \newcommand{\llog}{\slog<} \newcommand{\ltlog}{\slog\lt} \newcommand{\lttlog}{\slog\ltt} \newcommand{\ltttlog}{\slog\lttt} \newcommand{\glog}{\slog>} \newcommand{\gtlog}{\slog\gt} \newcommand{\gttlog}{\slog\gtt} \newcommand{\gtttlog}{\slog\gttt} % % Miscellaneous \newcommand{\LHS}{\mathrm{LHS}} \newcommand{\RHS}{\mathrm{RHS}} % .. operators \DeclareMathOperator{\poly}{poly} \DeclareMathOperator{\polylog}{polylog} \DeclareMathOperator{\quasipoly}{quasipoly} \DeclareMathOperator{\negl}{negl} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} % .. functions \DeclareMathOperator{\id}{id} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\err}{err} \DeclareMathOperator{\ReLU}{ReLU} % .. analysis \let\d\undefined \newcommand{\d}{\operatorname{d}\mathopen{}} \newcommand{\df}[2]{ {\f{\d #1}{\d #2}}} \newcommand{\ds}[2]{ {\s{\d #1}{\d #2}}} \newcommand{\part}{\partial} \newcommand{\partf}[2]{\f{\part #1}{\part #2}} \newcommand{\parts}[2]{\s{\part #1}{\part #2}} \newcommand{\grad}[1]{\mathop{\nabla\!_{#1}}} % .. sets of numbers \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\F}{\mathbb{F}} % %%% SPECIALIZED MATH %%% % % Logic \renewcommand{\and}{\wedge} \newcommand{\AND}{\bigwedge} \newcommand{\or}{\vee} \newcommand{\OR}{\bigvee} \newcommand{\xor}{\oplus} \newcommand{\XOR}{\bigoplus} \newcommand{\union}{\cup} \newcommand{\inter}{\cap} \newcommand{\UNION}{\bigcup} \newcommand{\INTER}{\bigcap} \newcommand{\comp}{\overline} \newcommand{\true}{\r{true}} \newcommand{\false}{\r{false}} \newcommand{\tf}{\set{\true,\false}} \DeclareMathOperator{\One}{\mathbb{1}} \DeclareMathOperator{\1}{\mathbb{1}} % % Linear algebra \renewcommand{\span}{\mathrm{span}} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\dom}{dom} \DeclareMathOperator{\Img}{Im} \newcommand{\transp}{\mathsf{T}} \renewcommand{\t}{^\transp} % ... named tensors \newcommand{\namedtensorstrut}{\vphantom{fg}} % milder than \mathstrut \newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}} \newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}} \newcommand{\ndot}[1]{\nbin{#1}{\odot}} \newcommand{\ncat}[1]{\nbin{#1}{\oplus}} \newcommand{\nsum}[1]{\sum\limits_{\substack{#1}}} \newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}} \newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}} \newcommand{\nt}[1]{^{\transp(#1)}} % % Probability \newcommand{\Normal}{\mathcal{N}} \let\Pr\undefined \DeclareMathOperator*{\Pr}{Pr} \DeclareMathOperator*{\G}{\mathbb{G}} \DeclareMathOperator*{\Odds}{Od} \DeclareMathOperator*{\E}{E} \DeclareMathOperator*{\Var}{Var} \DeclareMathOperator*{\Cov}{Cov} \DeclareMathOperator*{\corr}{corr} \DeclareMathOperator*{\median}{median} \newcommand{\dTV}{d_{\mathrm{TV}}} \newcommand{\dHel}{d_{\mathrm{Hel}}} \newcommand{\dJS}{d_{\mathrm{JS}}} % ... information theory \let\H\undefined \DeclareMathOperator*{\H}{H} \DeclareMathOperator*{\I}{I} \DeclareMathOperator*{\D}{D} % %%% SPECIALIZED COMPUTER SCIENCE %%% % % Complexity classes % .. classical \newcommand{\Poly}{\mathsf{P}} \newcommand{\NP}{\mathsf{NP}} \newcommand{\PH}{\mathsf{PH}} \newcommand{\PSPACE}{\mathsf{PSPACE}} \renewcommand{\L}{\mathsf{L}} % .. probabilistic \newcommand{\formost}{\mathsf{Я}} \newcommand{\RP}{\mathsf{RP}} \newcommand{\BPP}{\mathsf{BPP}} \newcommand{\MA}{\mathsf{MA}} \newcommand{\AM}{\mathsf{AM}} \newcommand{\IP}{\mathsf{IP}} \newcommand{\RL}{\mathsf{RL}} % .. circuits \newcommand{\NC}{\mathsf{NC}} \newcommand{\AC}{\mathsf{AC}} \newcommand{\ACC}{\mathsf{ACC}} \newcommand{\TC}{\mathsf{TC}} \newcommand{\Ppoly}{\mathsf{P}/\poly} \newcommand{\Lpoly}{\mathsf{L}/\poly} % .. resources \newcommand{\TIME}{\mathsf{TIME}} \newcommand{\SPACE}{\mathsf{SPACE}} \newcommand{\TISP}{\mathsf{TISP}} \newcommand{\SIZE}{\mathsf{SIZE}} % .. keywords \newcommand{\co}{\mathsf{co}} \newcommand{\Prom}{\mathsf{Promise}} % % Boolean analysis \newcommand{\zo}{\set{0,1}} \newcommand{\pmo}{\set{\pm 1}} \newcommand{\zpmo}{\set{0,\pm 1}} \newcommand{\harpoon}{\!\upharpoonright\!} \newcommand{\rr}[2]{#1\harpoon_{#2}} \newcommand{\Fou}[1]{\widehat{#1}} \DeclareMathOperator{\Ind}{\mathrm{Ind}} \DeclareMathOperator{\Inf}{\mathrm{Inf}} \newcommand{\Der}[1]{\operatorname{D}_{#1}\mathopen{}} \newcommand{\Exp}[1]{\operatorname{E}_{#1}\mathopen{}} \DeclareMathOperator{\Stab}{\mathrm{Stab}} \DeclareMathOperator{\T}{T} \DeclareMathOperator{\sens}{\mathrm{s}} \DeclareMathOperator{\bsens}{\mathrm{bs}} \DeclareMathOperator{\fbsens}{\mathrm{fbs}} \DeclareMathOperator{\Cert}{\mathrm{C}} \DeclareMathOperator{\DT}{\mathrm{DT}} \DeclareMathOperator{\CDT}{\mathrm{CDT}} % canonical \DeclareMathOperator{\ECDT}{\mathrm{ECDT}} \DeclareMathOperator{\CDTv}{\mathrm{CDT_{vars}}} \DeclareMathOperator{\ECDTv}{\mathrm{ECDT_{vars}}} \DeclareMathOperator{\CDTt}{\mathrm{CDT_{terms}}} \DeclareMathOperator{\ECDTt}{\mathrm{ECDT_{terms}}} \DeclareMathOperator{\CDTw}{\mathrm{CDT_{weighted}}} \DeclareMathOperator{\ECDTw}{\mathrm{ECDT_{weighted}}} \DeclareMathOperator{\AvgDT}{\mathrm{AvgDT}} \DeclareMathOperator{\PDT}{\mathrm{PDT}} % partial decision tree \DeclareMathOperator{\DTsize}{\mathrm{DT_{size}}} \DeclareMathOperator{\W}{\mathbf{W}} % .. functions (small caps sadly doesn't work) \DeclareMathOperator{\Par}{\mathrm{Par}} \DeclareMathOperator{\Maj}{\mathrm{Maj}} \DeclareMathOperator{\HW}{\mathrm{HW}} \DeclareMathOperator{\Thr}{\mathrm{Thr}} \DeclareMathOperator{\Tribes}{\mathrm{Tribes}} \DeclareMathOperator{\RotTribes}{\mathrm{RotTribes}} \DeclareMathOperator{\CycleRun}{\mathrm{CycleRun}} \DeclareMathOperator{\SAT}{\mathrm{SAT}} \DeclareMathOperator{\UniqueSAT}{\mathrm{UniqueSAT}} % % Dynamic optimality \newcommand{\OPT}{\mathsf{OPT}} \newcommand{\Alt}{\mathsf{Alt}} \newcommand{\Funnel}{\mathsf{Funnel}} % % Alignment \DeclareMathOperator{\Amp}{\mathrm{Amp}} % %%% TYPESETTING %%% % % In text \renewcommand{\th}{^{\mathrm{th}}} \newcommand{\degree}{^\circ} % % Fonts % .. bold \newcommand{\BA}{\boldsymbol{A}} \newcommand{\BB}{\boldsymbol{B}} \newcommand{\BC}{\boldsymbol{C}} \newcommand{\BD}{\boldsymbol{D}} \newcommand{\BE}{\boldsymbol{E}} \newcommand{\BF}{\boldsymbol{F}} \newcommand{\BG}{\boldsymbol{G}} \newcommand{\BH}{\boldsymbol{H}} \newcommand{\BI}{\boldsymbol{I}} \newcommand{\BJ}{\boldsymbol{J}} \newcommand{\BK}{\boldsymbol{K}} \newcommand{\BL}{\boldsymbol{L}} \newcommand{\BM}{\boldsymbol{M}} \newcommand{\BN}{\boldsymbol{N}} \newcommand{\BO}{\boldsymbol{O}} \newcommand{\BP}{\boldsymbol{P}} \newcommand{\BQ}{\boldsymbol{Q}} \newcommand{\BR}{\boldsymbol{R}} \newcommand{\BS}{\boldsymbol{S}} \newcommand{\BT}{\boldsymbol{T}} \newcommand{\BU}{\boldsymbol{U}} \newcommand{\BV}{\boldsymbol{V}} \newcommand{\BW}{\boldsymbol{W}} \newcommand{\BX}{\boldsymbol{X}} \newcommand{\BY}{\boldsymbol{Y}} \newcommand{\BZ}{\boldsymbol{Z}} \newcommand{\Ba}{\boldsymbol{a}} \newcommand{\Bb}{\boldsymbol{b}} \newcommand{\Bc}{\boldsymbol{c}} \newcommand{\Bd}{\boldsymbol{d}} \newcommand{\Be}{\boldsymbol{e}} \newcommand{\Bf}{\boldsymbol{f}} \newcommand{\Bg}{\boldsymbol{g}} \newcommand{\Bh}{\boldsymbol{h}} \newcommand{\Bi}{\boldsymbol{i}} \newcommand{\Bj}{\boldsymbol{j}} \newcommand{\Bk}{\boldsymbol{k}} \newcommand{\Bp}{\boldsymbol{p}} \newcommand{\Bq}{\boldsymbol{q}} \newcommand{\Br}{\boldsymbol{r}} \newcommand{\Bs}{\boldsymbol{s}} \newcommand{\Bt}{\boldsymbol{t}} \newcommand{\Bu}{\boldsymbol{u}} \newcommand{\Bv}{\boldsymbol{v}} \newcommand{\Bw}{\boldsymbol{w}} \newcommand{\Bx}{\boldsymbol{x}} \newcommand{\By}{\boldsymbol{y}} \newcommand{\Bz}{\boldsymbol{z}} \newcommand{\Balpha}{\boldsymbol{\alpha}} \newcommand{\Bbeta}{\boldsymbol{\beta}} \newcommand{\Bgamma}{\boldsymbol{\gamma}} \newcommand{\Bdelta}{\boldsymbol{\delta}} \newcommand{\Beps}{\boldsymbol{\eps}} \newcommand{\Bveps}{\boldsymbol{\veps}} \newcommand{\Bzeta}{\boldsymbol{\zeta}} \newcommand{\Beta}{\boldsymbol{\eta}} \newcommand{\Btheta}{\boldsymbol{\theta}} \newcommand{\Biota}{\boldsymbol{\iota}} \newcommand{\Bkappa}{\boldsymbol{\kappa}} \newcommand{\Blambda}{\boldsymbol{\lambda}} \newcommand{\Bmu}{\boldsymbol{\mu}} \newcommand{\Bnu}{\boldsymbol{\nu}} \newcommand{\Bxi}{\boldsymbol{\xi}} \newcommand{\Bomicron}{\boldsymbol{\omicron}} \newcommand{\Bpi}{\boldsymbol{\pi}} \newcommand{\Brho}{\boldsymbol{\rho}} \newcommand{\Bsigma}{\boldsymbol{\sigma}} \newcommand{\Btau}{\boldsymbol{\tau}} \newcommand{\Bupsilon}{\boldsymbol{\upsilon}} \newcommand{\Bphi}{\boldsymbol{\phi}} \newcommand{\Bfi}{\boldsymbol{\fi}} \newcommand{\Bchi}{\boldsymbol{\chi}} \newcommand{\Bpsi}{\boldsymbol{\psi}} \newcommand{\Bomega}{\boldsymbol{\omega}} % .. calligraphic \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} % .. typewriter \newcommand{\TA}{\mathtt{A}} \newcommand{\TB}{\mathtt{B}} \newcommand{\TC}{\mathtt{C}} \newcommand{\TD}{\mathtt{D}} \newcommand{\TE}{\mathtt{E}} \newcommand{\TF}{\mathtt{F}} \newcommand{\TG}{\mathtt{G}} \newcommand{\TH}{\mathtt{H}} \newcommand{\TI}{\mathtt{I}} \newcommand{\TJ}{\mathtt{J}} \newcommand{\TK}{\mathtt{K}} \newcommand{\TL}{\mathtt{L}} \newcommand{\TM}{\mathtt{M}} \newcommand{\TN}{\mathtt{N}} \newcommand{\TO}{\mathtt{O}} \newcommand{\TP}{\mathtt{P}} \newcommand{\TQ}{\mathtt{Q}} \newcommand{\TR}{\mathtt{R}} \newcommand{\TS}{\mathtt{S}} \newcommand{\TT}{\mathtt{T}} \newcommand{\TU}{\mathtt{U}} \newcommand{\TV}{\mathtt{V}} \newcommand{\TW}{\mathtt{W}} \newcommand{\TX}{\mathtt{X}} \newcommand{\TY}{\mathtt{Y}} \newcommand{\TZ}{\mathtt{Z}}$

Personal summary of https://www.alignmentforum.org/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization.

Claim: Alignment predictably fails once capabilities generalize.

  • During evolution from apes to humans, our intelligence progressed but our (inner) alignment completely failed. This was predictable: apes were never reasoning about how to optimize reproductive fitness because they weren’t reasoning at all, so when reasoning suddenly appears we shouldn’t expect it to somehow be optimizing for reproductive fitness.
  • Intermediate AIs will probably be blown out of the water the same way that we blew all animals out of the water at the things they were best at (e.g. flying fast, carrying things) using machines.
  • Author mentions the counter-argument of culture as a one-time boost, but doesn’t argue further. :/
  • Good capabilities are an attractor well: the training scenarios encourage the AIs to figure out reasoning.
  • But alignment isn’t an attractor well: the objective is basically a free parameter. (??)

I don’t really understand the argument that alignment isn’t an attractor: isn’t a good understanding of how humans want you to behave also useful for getting good performance?