Personal summary of “An overview of 11 proposals for building safe advanced AI”.
The four components
- Outer alignment: is the objective we’re training for aligned?
- Inner alignment: will the model be trying to accomplish the objective?
- Training competitiveness: is the process of producing that model competitive?
- Performance competitiveness: would that model satisfy the use cases people want?
- I’ll use the term “bad behavior” for things like deceptive behavior (pseudo-alignment) and otherwise catastrophic behavior. I should work on making this notion less vague in my head.
- Basically all proposals involving transparency tools could fail to be training-competitive if the checks pass too rarely (either because the models are hard to check confidently or because they are very often deceptive).
- Performance competitiveness is a bit like “outer competitiveness”, and training competitiveness is a bit like “inner competitiveness”. In particular, for amplification-based approaches:
- both outer alignment and performance competitiveness are concerned with the theoretical behavior of the amplification tree (assuming perfect learning): whether it’s aligned and whether it will be useful;
- both inner alignment and training competitiveness are concerned with how practice will differ from the theoretical behavior: whether the model will be aligned with its theoretical limit, and how difficult it is to train the model to match the theoretical limit.
- I think that conceiving of the amplification trees as infinite is confusing. We’re training $M_d$ to match the tree of depth $d$, not to match the infinite tree. The real performance competitiveness question is whether trees of reasonable depth $d$ are competitive, not whether the infinite tree is competitive. The claim that trees of increasing depth converge towards some limiting behavior is both dubious and not necessarily relevant to answering the question.
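To make the finite-depth picture concrete, here is one way to write it down. The notation is mine, not the post's: $H$ is the human, $T_d$ is the amplification tree of depth $d$, and I assume the depth-0 tree is the human answering unaided (other indexing conventions shift everything by one).

$$
T_0(q) = H(q), \qquad T_{d+1}(q) = H(q;\, T_d), \qquad M_d \approx T_d \ \text{(if learning is perfect)},
$$

where $H(q;\, T_d)$ means the human answering $q$ while able to query $T_d$ on subquestions. In these terms, the performance competitiveness question is whether $T_d$ is competitive for the values of $d$ that training can actually reach, rather than whether $\lim_{d \to \infty} T_d$ is competitive (or even exists).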
Reinforcement learning + transparency tools
- Train an RL agent in an environment where “good morals” are incentivized (e.g. corrigibility, honesty, multi-agent cooperation).
- Use transparency tools to check the model.
- What’s the dominant behavior in the training environment (hard to understand)?
- Is pseudo-alignment likely to occur and are transparency tools robust to deceptive models trying to trick them?
- Is the sort of agency produced by RL necessary for advanced AI systems? If not, RL might be too expensive because of the complex environments it requires.
- Can you create an environment to incentivize all the behavior you want, e.g. good decision-making, question-answering, learning abilities?
Imitative amplification + intermittent oversight
- Iteratively train $M_{i+1}$ to imitate $\Amp(M_i)$, i.e. a human consulting $M_i$.
- After training, $\Amp(M_i)$ uses transparency tools to check $M_{i+1}$.
- Is HCH (a tree of humans) aligned or not?
- Can $\Amp(M_i)$ detect bad behavior? Intuitively, $\Amp(M_i)$ should be about as powerful as $M_{i+1}$, which would help, but this is not clear, since there could be a jump if $M_{i+1}$ suddenly “crystallizes” something.
- Is pure imitation rich enough for sample-efficient learning?
- Is HCH universal enough, or are “intuitive flashes of insight” needed?
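The imitation loop described above can be sketched with a toy task. This is only an illustration under strong assumptions of my own. The task is summing a tuple of integers. The "human" can only answer trivial questions or add two sub-answers. "Training" is perfect memorization of the target's answers on a fixed question set. The oversight step with transparency tools is left out entirely. The names `human`, `amplify`, `imitate` and `train` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Question = Tuple[int, ...]
Model = Callable[[Question], Optional[int]]


def human(question: Question, assistant: Model) -> Optional[int]:
    """Toy human: answers trivial questions directly, otherwise adds two sub-answers."""
    if len(question) == 0:
        return 0
    if len(question) == 1:
        return question[0]
    mid = len(question) // 2
    left, right = assistant(question[:mid]), assistant(question[mid:])
    if left is None or right is None:
        return None  # the assistant could not help, so the human gives up
    return left + right


def amplify(model: Model) -> Model:
    """Amp(M): the human answering a question while consulting M."""
    return lambda q: human(q, model)


def imitate(target: Model, questions: Iterable[Question]) -> Model:
    """Stand-in for training: memorize the target's answers on the question set."""
    table: Dict[Question, Optional[int]] = {q: target(q) for q in questions}
    return lambda q: table.get(q)


def train(questions: Iterable[Question], iterations: int) -> Model:
    """Iteratively set M_{i+1} to an imitation of Amp(M_i), starting from an empty M_0."""
    questions = list(questions)
    model: Model = lambda q: None  # M_0 knows nothing
    for _ in range(iterations):
        model = imitate(amplify(model), questions)
    return model


data = (3, 1, 4, 1, 5, 9, 2, 6)
# All contiguous slices, so every subquestion the human asks is in the training set.
questions = [data[i:j] for i in range(len(data)) for j in range(i + 1, len(data) + 1)]
print(train(questions, 3)(data))  # None: three iterations only reach lists of length 4
print(train(questions, 4)(data))  # 31
```

In this toy, the model after $i$ iterations behaves exactly like a tree of bounded depth: it can sum lists of length up to $2^{i-1}$ and nothing longer. Real training would of course generalize imperfectly, which is where the inner alignment and training competitiveness questions come in.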
Imitative amplification + relaxed adversarial training
- Iteratively train $M_{i+1}$ to imitate $\Amp(M_i)$.
- But at the same time, train $M_{i+1}$ to decrease the probability of catastrophe as assessed by $\Amp(M_i)$.
- Is HCH aligned or not?
- Is it easier to detect bad behavior while it’s still developing? On the other hand, could the model overfit the oversight signal?
- See above. Also, is the extra oversight bandwidth worth it (it’s costly but might increase the probability the checks pass)?
- See above.
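One hedged way to write the combined training signal for this proposal. The weighted-sum form, the weight $\lambda$, the imitation loss $\mathcal{L}_{\text{imit}}$ and the symbol $P_{\text{cat}}$ are my own notation and assumptions, not something the post specifies:

$$
M_{i+1} \in \arg\min_{M} \; \mathcal{L}_{\text{imit}}\big(M, \Amp(M_i)\big) \;+\; \lambda \, P_{\text{cat}}\big(M \mid \Amp(M_i)\big),
$$

where $P_{\text{cat}}(M \mid \Amp(M_i))$ is the probability of catastrophe for $M$ as assessed by $\Amp(M_i)$. In this form, the overfitting worry is that $M$ lowers the second term by fooling the assessor rather than by being safe, and the bandwidth question is how often and at what cost that term gets evaluated during training.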
Approval-based amplification + relaxed adversarial training
- Iteratively train $M_{i+1}$ to maximize the approval of $\Amp(M_i)$.
- At the same time, train $M_{i+1}$ to decrease the probability of catastrophe as assessed by $\Amp(M_i)$.
- Not HCH! Is the approval-amplification tree aligned? Seems less likely because of approval gaming.
- See above. Does it help that it’s myopic? Do approval signals incentivize deception? Are models trained on approval more helpful for oversight subtasks?
- Is approval richer than imitation (because it’s easier to evaluate than to produce behavior)?
- Can a tree with machines produce ideas that a tree with humans alone never could? Does it produce good-sounding bad ideas?
Microscope AI
- Train a predictive model on some data.
- Mine the model for insights that humans can use.
- Inscrutable philosophical ramblings (convergent instrumental goals?).
- Is it possible to avoid mesa-optimization completely?
- Is it easy to extract a model’s insights?
- Is enhanced human understanding enough? Is most of the value in many low-level decisions or a few high-level decisions?
STEM AI
- Train models on exclusively STEM tasks so that the model doesn’t know about humans.
- See above.
- Is it easier to avoid mesa-optimization (and in particular, enforce myopia) if the model doesn’t know about humans?
- Is it easier to do math if you know about language and culture?
- Are most of the benefits from AI technological? Would it lead to a decisive strategic advantage? Would it advance AI capabilities without advancing alignment? (My personal question: Could we use a math AI to help us do alignment research?)
Narrow reward modeling + transparency tools
- Jointly train a reward model (based on human feedback) and an agent pursuing it.
- Do out-of-distribution detection (this is the “narrow” part).
- Use transparency to check the reward model and the agent.
- Is the feedback sufficient for the reward model to be correct in the limit over cases not detected to be OOD?
- Similar to previous uses of transparency tools.
- Is reward modeling efficient? Similar to approval-based approaches in that the human only needs to evaluate behavior, not produce it.
- Similar to microscope/STEM: are most of the benefits from AI achievable by narrow agents?
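A minimal sketch of the "narrow" part, under assumptions of my own. Actions are single floats. The reward model is a nearest-neighbour lookup over human-rated actions. The OOD check is a distance threshold (`ood_radius`, a hypothetical parameter). The agent is just an argmax over candidate actions. The joint training loop and the transparency checks are omitted, and `human_feedback` is a made-up stand-in for the human.

```python
from typing import List, Optional, Sequence, Tuple

Feedback = Tuple[float, float]  # (action, human-assigned reward)


def human_feedback(action: float) -> float:
    """Stand-in for the human: prefers actions near 2.0 (purely illustrative)."""
    return -abs(action - 2.0)


class NarrowRewardModel:
    """Nearest-neighbour reward model that refuses to score out-of-distribution actions."""

    def __init__(self, ood_radius: float) -> None:
        self.data: List[Feedback] = []
        self.ood_radius = ood_radius  # hypothetical threshold for the OOD check

    def update(self, action: float, reward: float) -> None:
        """Record one piece of human feedback."""
        self.data.append((action, reward))

    def predict(self, action: float) -> Optional[float]:
        """Return the reward of the nearest rated action, or None if flagged as OOD."""
        if not self.data:
            return None
        nearest_action, nearest_reward = min(self.data, key=lambda fb: abs(fb[0] - action))
        if abs(nearest_action - action) > self.ood_radius:
            return None
        return nearest_reward


def choose_action(model: NarrowRewardModel, candidates: Sequence[float]) -> Optional[float]:
    """Pick the in-distribution candidate with the highest predicted reward, else abstain."""
    scored = [(model.predict(a), a) for a in candidates]
    in_dist = [(r, a) for r, a in scored if r is not None]
    if not in_dist:
        return None
    return max(in_dist)[1]


model = NarrowRewardModel(ood_radius=0.5)
for a in (0.0, 1.0, 2.0, 3.0):
    model.update(a, human_feedback(a))

print(choose_action(model, [0.9, 2.2, 10.0]))  # 2.2; the action 10.0 is flagged OOD and ignored
print(choose_action(model, [10.0]))            # None: the agent abstains
```

The outer alignment question above then becomes whether the human feedback pins down the right reward everywhere the OOD check lets the agent act.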
Recursive reward modeling + relaxed adversarial training
- Iteratively train $M_{i+1}$ using reward modeling with feedback from $\Amp(M_i)$.
- At the same time, train $M_{i+1}$ (both the agent and the reward model) to decrease the probability of catastrophe as assessed by $\Amp(M_i)$.
- Is the reward modeling tree aligned? Similar to the approval maximization tree, seems dubious.
- See previous relaxed adversarial training approaches. (Myopia plays out a bit differently here.)
- Mostly similar to narrow reward modeling.
- Similar to approval-based amplification, but also it’s non-myopic, which could help with some problems (???).
AI safety via debate with transparency tools
- Train a model to win debates against a copy of itself, judged by a human.
- Debaters can ask for clarifications about the opponent’s moves.
- Debaters can use transparency tools against each other and win if they prove problematic cognition.
- Is giving honest, helpful advice the best strategy? Or is the true position sometimes too difficult to argue for? If the former is true, then the first player will give aligned advice.
- Are transparency tools powerful enough? How do you disincentivize deception while still allowing the second player to defend a (second-best) position they don’t believe in? How would you train the model to use the transparency tools given that your best hope is that deception actually never happens?
- Would it be sufficiently sample-efficient to make the cost of human feedback non-prohibitive?
- The theoretical limit of debate is $\mathrm{NEXP}$ (assuming the human is an arbitrary polynomial-time verifier), whereas most other frameworks limit to $\mathrm{NP}$. But would that difference be felt in practice?
TODO: last 2 proposals.