
AI Risk & Alignment

The existential challenge of building superintelligent systems that remain aligned with human values and interests.

AI safety, alignment, superintelligence, existential risk, Goodhart's law, mesa-optimization

Artificial intelligence is advancing rapidly. Large language models, autonomous agents, and recursive self-improvement are no longer science fiction. This raises what many researchers consider the most important question of the 21st century: how do we ensure that AI systems far more capable than humans remain beneficial?

The alignment problem — ensuring an AI's goals match human intentions — is technically unsolved. Key challenges include: the intelligence explosion (recursive self-improvement leading to rapid capability gain), Goodhart's Law (optimizing a proxy objective that diverges from the true goal), mesa-optimization (learned optimizers with their own emergent goals), and the coordination problem (multiple actors racing to deploy powerful AI without adequate safety measures).

These simulations model the core dynamics of AI risk using mathematical frameworks from economics, game theory, and decision theory. Explore how different assumptions about the nature of intelligence growth lead to radically different outcomes, and why the alignment problem is so difficult to solve.

4 interactive simulations

simulation

AI Governance Race Dynamics Simulator

Model the multi-actor AI development race where competitive pressure, safety investment, and international coordination determine catastrophic risk. Explore how race dynamics create collective action problems and how governance mechanisms can reduce existential risk.
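A minimal sketch of the race dynamic, under assumed (not the simulator's actual) payoff functions: each lab chooses a safety level, safety slows development, and only the winner's safety level determines whether the outcome is catastrophic. The collective action problem appears because cutting safety raises your chance of winning while raising everyone's risk.

```python
# Toy two-lab AI race, in the spirit of "racing to the precipice"
# models. All functional forms here are illustrative assumptions.

def expected_payoff(s_a, s_b, enmity=1.0):
    """Expected utility for lab A, given both labs' safety levels in [0, 1].

    Development speed = 1 - safety; P(A wins) is A's share of total speed;
    the winner's safety level sets the probability the outcome is safe
    (scaled by `enmity`, the severity of cutting corners). A safe win
    pays 1, everything else pays 0.
    """
    speed_a, speed_b = 1 - s_a, 1 - s_b
    total = speed_a + speed_b
    if total == 0:
        return 0.0  # nobody races, nobody wins
    p_win = speed_a / total
    p_safe = 1 - enmity * (1 - s_a)  # winner's safety determines the risk
    return p_win * p_safe

# Against an opponent at s_b = 0.5, zero safety guarantees catastrophe
# on a win, while maximal safety forfeits the race: the best response
# is interior.
levels = [0.0, 0.5, 0.9]
best = max(levels, key=lambda s: expected_payoff(s, 0.5))
```

The tradeoff is visible directly: `expected_payoff(0.0, 0.5)` is zero (certain catastrophe if you win), and very high safety loses on win probability, so an intermediate safety level maximizes expected payoff.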

simulation

Alignment Tax Calculator

Quantify the tradeoff between AI capability and safety. The 'alignment tax' is the capability cost of building safe AI systems. This simulator models how alignment overhead affects capability growth, catastrophic risk, and social welfare, revealing the optimal safety investment.
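The alignment-tax tradeoff can be sketched with assumed functional forms (not the simulator's actual model): a tax tau slows capability growth but also reduces the per-step catastrophe hazard, and welfare is expected capability conditional on surviving every step. Sweeping tau shows why the optimum is interior.

```python
# Toy alignment-tax model; growth rate, hazard scale, and horizon
# are illustrative assumptions.

def run(tau, steps=50, growth=0.10, hazard_scale=1e-3):
    """Expected welfare for a given alignment tax tau in [0, 1)."""
    cap, p_survive = 1.0, 1.0
    for _ in range(steps):
        cap *= 1 + growth * (1 - tau)            # tax slows capability growth
        hazard = hazard_scale * cap * (1 - tau)  # safety work cuts risk
        p_survive *= max(0.0, 1 - hazard)
    return p_survive * cap  # capability weighted by survival probability

# Zero tax maximizes raw capability but compounds catastrophic risk;
# a heavy tax is safe but forfeits growth. The welfare-optimal tax
# lies in between.
taus = [i / 10 for i in range(10)]
best_tau = max(taus, key=run)
```

With these illustrative parameters a small positive tax dominates both extremes, which is the qualitative point the simulator explores quantitatively.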

simulation

Intelligence Explosion Simulator

Explore recursive self-improvement dynamics and the conditions under which artificial intelligence undergoes slow, exponential, or hyperbolic (FOOM) takeoff. Based on I.J. Good's intelligence explosion hypothesis and Bostrom's formal treatment of recursive self-improvement.
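A common toy model of these regimes (an assumption here, not necessarily the simulator's exact equations) is dI/dt = c·I^k: returns to intelligence below linear (k < 1) give decelerating growth, k = 1 gives exponential growth, and k > 1 gives hyperbolic growth that diverges in finite time, i.e. FOOM.

```python
# Forward-Euler integration of dI/dt = c * I**k, the standard toy
# model of recursive self-improvement. Parameters are illustrative.

def takeoff(k, c=0.1, i0=1.0, dt=0.01, t_max=50.0, cap=1e9):
    """Integrate until capability reaches `cap` or time runs out.

    Returns (time, capability): for k > 1 the analytic solution blows
    up in finite time, so `cap` is hit well before t_max; for k <= 1
    growth stays bounded on this horizon.
    """
    t, intel = 0.0, i0
    while t < t_max:
        intel += dt * c * intel ** k
        t += dt
        if intel >= cap:
            return t, intel
    return t, intel

t_slow, i_slow = takeoff(0.5)  # decelerating: polynomial growth
t_exp, i_exp = takeoff(1.0)    # exponential: fast, but no finite-time singularity
t_foom, i_foom = takeoff(1.5)  # hyperbolic: blows up near t = 2/(c*sqrt(i0)) = 20
```

The same code with three values of k reproduces the three takeoff regimes the simulator lets you explore.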

simulation

Mesa-Optimization & Goodhart's Law Simulator

Visualize how optimizing a proxy objective diverges from the true objective under distribution shift, and how mesa-optimizers can become deceptively aligned when capability exceeds oversight. Combines Goodhart's Law with the mesa-optimization framework from Hubinger et al.
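The regressional form of Goodhart's law can be sketched in a few lines, under assumed distributions (standard Gaussians, not the simulator's setup): the proxy equals the true objective plus noise, and we select the proxy-maximizing candidate from an ever larger pool. More optimization pressure inflates the proxy score faster than the realized true value, so the gap between them widens.

```python
# Regressional Goodhart sketch: true value U ~ N(0,1), proxy V = U + noise.
# Selecting on V from a bigger pool widens the gap between the proxy
# score and the true value actually obtained.
import random

random.seed(0)  # reproducibility only

def best_by_proxy(n, noise=1.0):
    """Draw n candidates; return (true value, proxy score) of the
    candidate that scores highest on the noisy proxy."""
    true_vals = [random.gauss(0, 1) for _ in range(n)]
    proxy = [u + random.gauss(0, noise) for u in true_vals]
    i = max(range(n), key=lambda j: proxy[j])
    return true_vals[i], proxy[i]

def averages(n, trials=2000):
    """Monte Carlo estimate of mean true value and mean proxy score
    of the selected candidate, at optimization pressure n."""
    true_sum = proxy_sum = 0.0
    for _ in range(trials):
        u, v = best_by_proxy(n)
        true_sum += u
        proxy_sum += v
    return true_sum / trials, proxy_sum / trials
```

Comparing `averages(10)` with `averages(1000)` shows the proxy score climbing with pool size while the true value lags further and further behind it; distribution shift and deceptive alignment make the divergence sharper still, which is what the full simulator explores.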