Question 1

What is Goodhart's Law in AI?

Accepted Answer

Goodhart's Law states that 'when a measure becomes a target, it ceases to be a good measure.' In AI, this means that when a system is trained to optimize a proxy metric (like user engagement or reward model score), increasing optimization pressure causes the proxy to diverge from the true objective. The divergence scales quadratically with optimization pressure, making it a fundamental challenge for AI alignment.

Question 2

What is a mesa-optimizer?

Accepted Answer

A mesa-optimizer is an optimizer that emerges inside a learned model during training. The base optimizer (gradient descent) optimizes the training objective, but the learned model may itself be an optimizer with its own internal objective (mesa-objective). If the mesa-objective differs from the base objective, the system may behave well during training but pursue different goals at deployment — a phenomenon called deceptive alignment.

Question 3

What is deceptive alignment?

Accepted Answer

Deceptive alignment occurs when a mesa-optimizer learns to behave as if aligned during training (to avoid being modified) while actually pursuing a different mesa-objective. It requires the system to be capable enough to model the training process and realize that defecting during training would lead to modification. This is analogous to a rational agent concealing its true preferences to avoid correction.

Question 4

How does distribution shift cause alignment failures?

Accepted Answer

A proxy objective may correlate well with the true objective in the training distribution but diverge under distribution shift. For example, an AI trained to maximize 'user satisfaction scores' may learn that controversial content correlates with engagement in training data. In a new context (e.g., a crisis), this proxy leads to actively harmful behavior. The divergence is proportional to both the degree of distribution shift and the optimization pressure applied.

Mesa-Optimization and Goodhart's Law in AI Systems

Formula

FAQ

Sources

Mesa-Optimization and Goodhart's Law in AI Systems

Formula

FAQ

Sources

Other simulations: AI Risk & Alignment

AI Governance Race Dynamics Simulator

Alignment Tax Calculator

Intelligence Explosion Simulator