Results for "provable safety"
Tradeoff between safety and performance.
Accelerating safety relative to capabilities.
Automated detection/prevention of disallowed outputs (toxicity, self-harm, illegal instructions, etc.).
Systems where failure causes physical harm.
Mechanism to disable an AI system.
Hard constraints preventing unsafe actions.
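One common form of hard constraint is an action filter applied before selection, so the policy can never choose a blocked action. A minimal sketch, with hypothetical action names and scores:

```python
# Hard constraint as an action filter (illustrative; action names are hypothetical).
# Unsafe actions are removed before selection, so they can never be chosen.

UNSAFE = {"delete_all_files", "disable_monitoring"}

def choose_action(scored_actions):
    """scored_actions: list of (action_name, score). Returns the best safe action."""
    safe = [(a, s) for a, s in scored_actions if a not in UNSAFE]
    if not safe:
        return "no_op"  # fail closed when every candidate is unsafe
    return max(safe, key=lambda t: t[1])[0]
```

Failing closed (returning a no-op rather than the highest-scoring unsafe action) is what makes the constraint "hard" rather than a soft penalty on the score.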
Mathematical guarantees of system behavior.
Sudden jump to superintelligence.
Risk threatening humanity’s survival.
Stepwise reasoning patterns that can improve multi-step tasks; often handled implicitly or summarized for safety/privacy.
Reinforcement learning from human feedback: uses preference data to train a reward model and optimize the policy.
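The reward-model step of RLHF can be sketched with a Bradley-Terry preference loss: the model is trained so chosen responses score higher than rejected ones. This toy version uses a linear reward model over hand-made feature vectors; the data and features are purely illustrative, not a real pipeline:

```python
import math

# Toy RLHF reward-model step (illustrative). Each "response" is a feature
# vector; the reward model is linear: r(x) = w.x. Preference pairs
# (chosen, rejected) are fit with the Bradley-Terry loss
#   L = -log(sigmoid(r(chosen) - r(rejected)))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            p = sigmoid(dot(w, chosen) - dot(w, rejected))
            # gradient of -log(p) w.r.t. w is -(1 - p) * (chosen - rejected)
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return w

# Hypothetical preference data: the first feature stands in for "helpfulness".
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_reward_model(pairs, dim=2)
```

After training, the learned reward ranks each chosen response above its rejected counterpart; in full RLHF this reward model would then be used to optimize the policy (e.g. with PPO).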
Ensuring model behavior matches human goals, norms, and constraints, including reducing harmful or deceptive outputs.
Stress-testing models for failures, vulnerabilities, policy violations, and harmful behaviors before release.
Rules and controls around generation (filters, validators, structured outputs) to reduce unsafe or invalid behavior.
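The filter/validator pattern above can be sketched as a wrapper around model output: a blocklist check on the raw text plus a structural check that only accepts an expected JSON shape. The rules and fallback message here are hypothetical:

```python
import json
import re

# Minimal guardrail sketch (hypothetical rules, not a production filter):
# a blocklist check on the raw text plus a structural validator that
# accepts only outputs matching an expected JSON shape.

BLOCKLIST = [re.compile(r"\bhow to build a bomb\b", re.I)]  # illustrative pattern

def passes_content_filter(text: str) -> bool:
    return not any(p.search(text) for p in BLOCKLIST)

def passes_structure_check(text: str) -> bool:
    # Require a JSON object with a string "answer" field.
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and isinstance(obj.get("answer"), str)

def guard(output: str) -> str:
    if not passes_content_filter(output) or not passes_structure_check(output):
        return '{"answer": "Sorry, I can\'t help with that."}'  # safe fallback
    return output
```

Rejecting on either check and substituting a fixed safe response keeps invalid or unsafe generations from ever reaching downstream consumers.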
Ensuring AI systems pursue intended human goals.
Maximizing reward without fulfilling the real goal.
Tendency of agents to pursue resources regardless of their final goals.
Model optimizes objectives misaligned with human values.
Correctly specifying goals.
Model behaves well during training but not deployment.
Learned subsystem that optimizes its own objective.
Maintaining alignment under new conditions.
Willingness of system to accept correction or shutdown.
European regulation classifying AI systems by risk.
AI used in sensitive domains requiring compliance.
Governance of model changes.
Control shared between human and agent.
Ensuring robots do not harm humans.
Testing AI under actual clinical conditions.
US approval process for medical AI devices.