This is intended to be a compilation of the big AI safety ideas, problems, and approaches to solving them. To keep it readable, we provide links to the content instead of the content itself. Contributions are welcome!
- More is different [1] [2] [3]
- Vingean uncertainty [1] [2] [3]
- Collingridge dilemma [1]
- Bio-anchors [1]
- Thought experiments [1]
- Instrumental convergence [1] [2] [3] [4]
  - Self-preservation
  - Goal-content integrity
  - Cognitive enhancement
  - Resource acquisition
  - Power/influence acquisition [1]
- Specification gaming [1] [2]
- Deception. This is the optimal behavior for a misaligned mesa-optimizer. [1]
- Nearest unblocked strategy [1]
- Collaboration with other AIs
- Sycophant AI [1]
- Orthogonality Thesis [1] [2] [3] [4]
- Strawberry problem [1] [2]
- Paperclip maximizer [1] [2]
- Learning the wrong distribution [1] [2]
- High impact [1]
- Edge instantiation [1]
- Context disaster [1]
- Alignment tax / safety tax [1]
- Collingridge dilemma [1]
- Corrigibility [1] [2]
- Humans are not secure [1]
- AI-Box [1] [2]
- We need to get alignment right on the 'first critical try' [1]
- Shutdown problem [1]
- Robust totalitarianism [1] [2]
- Extreme first-strike advantages [1] [2]
- Misuse risks [1]
- Value erosion through competition [1] [2]
- Windfall clause [1]
- Compute governance [1]
- Risks from malevolent actors [1]
- Human-anchors [1]
- Bio-anchors [1]
- A super-smart, deceptive, manipulative psychopath with arbitrary (and possibly absurd) goals.
- A computer program that simply does what it is programmed to do. Just because it is super-capable does not mean it is wise, moral, smart, or that it cares about what humans want.
- Eliciting latent knowledge (Paul Christiano, Alignment Research Center) [1]
- Agent foundations (MIRI) [1]
- Brain-like design [1]
- Iterated Distillation and Amplification [1]
- Humans Consulting HCH (Christiano) [1]
- Learning from Humans [1] [2] [3]
- Reward modeling (DeepMind) [1]
- Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations [1]
- Imitation learning [1]
- Myopic reinforcement learning [1]
- Inverse reinforcement learning [1]
- Cooperative inverse reinforcement learning [1]
- Debate [1] [2]
- Capability control method
- Transparency / Interpretability
- "General Intelligence or Universal Intelligence is the ability to efficiently achieve goals in a wide range of domains". (This is a commonly held definition) [1] [2]
- "Intelligence is the ability to make models. General intelligence means that a sufficiently large computational substrate can be fitted to an arbitrary computable function, within the limits of that substrate." (Josha Bach) [1]
- "AI that is trying to do what you want it to do". (Paul Christiano) [1]
- "AI systems be designed with the sole objective of maximizing the realization of human preferences" (Stuart Russell) [1]
- "AI should be designed to align with our ‘coherent extrapolated volition’ (CEV)[1]. CEV represents an integrated version of what we would want ‘if we knew more, thought faster, were more the people we wished we were, and had grown up farther together" (Eliezer Yudkowsky) [1]
- 2022 AGI Safety Fundamentals alignment curriculum
- AI Safety Syllabus
- Awesome AI Safety
- Awesome AI Alignment
- AI Alignment resources (Arbital)
- AGISafety.org
- AINotKillEveryone.com
Contributions are welcome! Please open a merge request and I will do my best to approve it quickly.