Personal summary of “A Bird’s Eye View of the ML Field” from the Pragmatic AI Safety sequence.
Driving dynamics of the ML field
- Empirical ML research progresses through well-defined metrics for progress towards well-defined goals.
- Metrics are objective and allow comparison across methods.
- Good metrics can detect minor improvements, so that we can accumulate successful tricks.
- Metrics that require human evaluations are not as good (more costly, slower feedback, more subjective).
- Theory is of limited use.
- Deep learning is more of an engineering science, where the main tools are “intuition, creative inspiration, tinkering, and trying many things”.
- Deep learning has many factors leading to unpredictability: complicatedness, fast changes, ambiguity, opacity, and interconnectedness / multiple causes.
- There are few theories, they provide little guidance, and their conclusions are fragile.
- When things work, it’s not even obvious why in hindsight (e.g. why residual connections or fractal data augmentation help).
- Further reading:
- New methods rapidly make older methods irrelevant
- The field’s rhythm is set by “tsunamis”, after which methods often work out-of-the-box and have much higher performance.
- RL appears to be poised for a tsunami because it is currently very capricious and hasn’t been revolutionized by large-scale models yet.
- We should focus on building things that won’t be washed away: ecosystems, prestige, safety culture, and datasets.
The ML research ecosystems
- The field is focused around a small number of conferences.
- (Also, a few flashy industry Nature papers.)
- Journal and workshop-only papers are less influential.
- The field moves fast, so one needs to keep up with arXiv submissions (Twitter can help).
- conferences by subfield (sorted by importance)
- generic: ICLR, NeurIPS, ICML
- computer vision: CVPR
- natural language processing: ACL
- reinforcement learning / robotics: ICRA
- Most research is on “microcosms”: simpler subproblems that mirror the larger problems but are more tractable. Some great microcosms are:
- image classification (e.g. ImageNet)
- where most deep learning building blocks have emerged (some activation functions, batch normalization, some optimizers, dropout, convolutions, residual connections, etc.)
- fundamentally about analyzing structured (continuous) signals
- funding is good because computer vision is useful in many industries
- benefits from good ideas, as opposed to just scale, which attracts researchers
- natural language processing
- analysis of structured discrete signals
- NLP and CV have started to coalesce, with more multimodal models able to handle both types of signals (e.g. transformers)
- conference review process
- it has serious flaws
- best paper awards, oral or spotlight designations mean very little
- in ML, biased towards theory papers, even though this is anticorrelated with impact
- in vision, political
- acceptance decisions and scores are very random and not very correlated with future citations
- this means it’s hard to filter good ideas without the test of time
- but not useless
- comments are useful in order to hear what people truly think
- limits parochialism
- requires some level of technical execution, and encourages authors to make their papers more solid
- Many inconsequential papers get published.
- e.g. by hiding metrics/datasets on which they underperform
- There is a bias for interestingness/novelty as opposed to practicality.
- On popular metrics, progress is often linear, or log-linear (in the loss).
- Image classification: very steady increase, ~90% top-1 accuracy.
- Video understanding: 10 years behind image classification, ~75%.
- Other tasks: see article.
- emergent properties
- The qualitative impact of an order of magnitude increase in parameters, or a new algorithm, is often difficult to predict. Capabilities can sometimes emerge suddenly and without warning.
- “AlphaZero experienced a phase transition where internal representations changed dramatically and capabilities altered significantly at about ~32,000 steps, when the system learned concepts like king safety, threats, and mobility suddenly.”
- Grokking shows that in some cases, performance can improve dramatically on test data even after it had already saturated on the training data.
- scaling
- in NLP, the algorithm (transformers) has stabilized and progress is led by scaling
- (on the other hand, in vision, progress is mostly led by algorithms)
- compute growing rapidly since 2010
- scaling laws tell us how to optimize performance on a limited compute budget
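As a rough illustration of the scaling-law bullet above, here is a minimal sketch of what "optimizing performance on a limited compute budget" can look like. It assumes a loss of the parametric form L(N, D) = E + A/N^alpha + B/D^beta and the approximation C ≈ 6·N·D for training compute. Neither assumption comes from these notes, and the constants are hypothetical placeholders, not fitted values.

```python
import math

# Hypothetical placeholder constants, NOT fitted values from any paper.
E, A, B = 1.0, 100.0, 100.0
ALPHA, BETA = 0.3, 0.3


def loss(n_params: float, n_tokens: float) -> float:
    """Assumed parametric loss: irreducible term plus two power-law terms."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def compute_optimal(compute: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) minimizing loss subject to compute = 6 * N * D."""
    # Substituting D = C / (6 N) and setting d(loss)/dN = 0 gives
    # N^(alpha + beta) = (alpha * A) / (beta * B) * (C / 6)^beta.
    ratio = (ALPHA * A) / (BETA * B)
    n_params = (ratio * (compute / 6) ** BETA) ** (1 / (ALPHA + BETA))
    n_tokens = compute / (6 * n_params)
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = compute_optimal(6e18)
    print(f"params={n:.3g} tokens={d:.3g} loss={loss(n, d):.4f}")
```

Under these assumptions, a fixed budget has a single best split between model size and data: moving parameters up and tokens down (or the reverse) at the same compute only increases the modeled loss.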
Analyzing the trajectory of researchers
- bibliometrics
- citations aren’t great but they work
- Semantic Scholar’s “Highly Influential Citations” is better
- ~1 for typical middle-stage PhD students
- ~1k for superstar graduating PhD students
- ~30k for Kaiming He
- long-tailed impact
- papers
- top 1% -> 50%
- top 0.1% -> 30%
- top 1 (0.02%) -> 10%
- individuals
- top 1% -> 35%
- top 0.1% -> 15%
- top 1 (0.004%) -> 3.5%
- not just due to number of papers
- not just due to luck (strong correlation between past and future citations)
- “Practically, most researchers can be considered to have essentially no impact.”
- transfer between paradigms is hard
- skills for traditional ML: math, optimization, statistics, reasoning
- skills for deep learning: gut, trying many things
- extreme accomplishment is fragile
- Matthew effect
- success in grad school -> job at top institution -> better collaborators, students, funding, invited to more talks, papers get read more, etc.
- this means researchers with high credibility aren’t always so much better than their peers
- incentives: everyone wants to continue being useful
- “Researchers who are good at math are incentivized to enjoy theory and think theory is important, because they are better at it. They are also incentivized to characterize rapid, successive empirical advancement as meaningless “benchmark chasing,” or many ML successes are just “glorified pattern matching, curve fitting, and memorization” not addressing “the fundamental problems.” They’re incentivized to work on what they can understand as opposed to work on what performs well; they’re incentivized to search in the space where their mathematics can help, which is not a comparatively large space (this is analogized to the streetlight effect).”
- small budget -> “scale isn’t important”
- currently unsuccessful -> “field needs a paradigm shift”
- on a roll -> “no need for a paradigm shift”
- deep in a niche -> “the niche still needs much more work”
- student -> “my advisor’s research is important”
- EA/rat -> “we’re doing the most important work”
- work on capabilities -> “there’s no risk”
- grad students at top universities
- median of two lead author papers at top conferences
- “A few students publish far more than two lead author papers. These students are substantially more visible, leading people to think nearly all graduate students publish many papers.”
- e.g. Rylan
- “a typical graduate student who becomes a professor at a top university usually has at least 8 papers at the time of graduation, and it’s common for them to have ~10 papers”
- little contact time with advisors but still large influence
- “Even if students meet with their advisors infrequently, they tend to emulate them in their head while researching.”
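The "top x% -> y%" figures in the long-tailed impact notes above are concentration shares. A minimal sketch of how such a share could be computed from a list of counts, assuming the shares are measured in citations; the function name `top_share` and the toy data are hypothetical, not taken from the article:

```python
import math


def top_share(citations: list[int], fraction: float) -> float:
    """Share of all citations held by the top `fraction` of items."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(citations, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    # Always count at least one item, so "top 1" is well defined.
    k = max(1, math.ceil(fraction * len(ranked)))
    return sum(ranked[:k]) / total


if __name__ == "__main__":
    # Hypothetical toy data: one heavily cited paper among many small ones.
    toy = [900, 40, 20, 10, 10, 5, 5, 5, 3, 2]
    print(f"top 10% share: {top_share(toy, 0.10):.2f}")
```

The same function applies whether the items are papers or individuals; only the input list changes.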
AGI and AI safety
Note: the article was written in May 2022; things have probably changed significantly since.
- currently very little safety-related work (2% at NeurIPS 2021)
- topics (with safety highlighted)
- General Machine Learning (e.g., classification, unsupervised learning, transfer learning)
- Deep Learning (e.g., architectures, generative models, optimization for deep networks)
- Reinforcement Learning (e.g., decision and control, planning, hierarchical RL)
- Applications (e.g., speech processing, computational biology, computer vision, NLP)
- Probabilistic Methods (e.g., variational inference, causal inference, Gaussian processes)
- Optimization (e.g., convex and non-convex optimization)
- Neuroscience and Cognitive Science (e.g., neural coding, brain-computer interfaces)
- Theory (e.g., control theory, learning theory, algorithmic game theory)
- Infrastructure (e.g., datasets, competitions, implementations, libraries)
- Social Aspects of Machine Learning (e.g., ==AI safety==, fairness, privacy, ==interpretability==)
- sentiment
- mostly technopositive
- AI winter made AGI taboo
- “safety”, “alignment”, and “superintelligent” are toxic words
- possible paths people see towards capabilities improvements
- richer environments where simple rewards incentivize general intelligence
- neurosymbolic AI paradigm shift
- bootstrap intelligence from a good theorem prover / programmer
- scaling
- better upstream representations
- not sure I quite understand this one, but seemingly it’s about first using self-supervised learning to learn the distribution rather than directly learning the policy
- see A path towards human-level intelligence
- timelines (median)
- human-level: 50 years
- human-level math: 12 years
- arguments for longer
- robustness, sequential decision making and generality will be hard for current paradigm
- this is before GPT-4’s MMLU results
- good to act based on longer timelines because that’s where we have time to help?
- arguments for shorter
- algorithmic improvements are possible
- bootstrapping
- we might be surprised by emergent capabilities
- good to act based on shorter timelines because surprise is bad?