Domain: AI Safety & Alignment

25 terms

AI Boxing Advanced

Isolating AI systems.

Alignment Problem Advanced

Ensuring AI systems pursue intended human goals.

Alignment Tax Advanced

Tradeoff between safety and performance.

Capability Overhang Advanced

Stored compute or algorithms enabling rapid jumps.

Corrigibility Advanced

Willingness of system to accept correction or shutdown.

Deceptive Alignment Advanced

Model behaves well during training but not deployment.

Existential Risk Advanced

Risk threatening humanity’s survival.

Fast Takeoff Advanced

Sudden jump to superintelligence.

Inner Alignment Advanced

Ensuring learned behavior matches intended objective.

Instrumental Convergence Advanced

Tendency for agents to pursue resources regardless of final goal.

Instrumental Goals Advanced

Goals useful regardless of final objective.

Mesa-Optimizer Advanced

Learned subsystem that optimizes its own objective.

Orthogonality Thesis Advanced

Intelligence and goals are independent.

Outer Alignment Advanced

Correctly specifying goals.

Power-Seeking Behavior Advanced

Tendency to gain control/resources.

Reward Hacking Advanced

Maximizing reward without fulfilling real goal.

Robust Alignment Advanced

Maintaining alignment under new conditions.

Scalable Oversight Advanced

Using limited human feedback to guide large models.

Shutdown Problem Advanced

Ensuring AI allows shutdown.

Slow Takeoff Advanced

Incremental capability growth.

Specification Gaming Advanced

Model exploits poorly specified objectives.

Takeoff Speed Advanced

Rate at which AI capabilities improve.

Tripwire Advanced

Signals indicating dangerous behavior.

Value Misalignment Advanced

Model optimizes objectives misaligned with human values.

x-Risk Advanced

Existential risk from AI systems.