Results for "safety science"
Stress-testing models for failures, vulnerabilities, policy violations, and harmful behaviors before release.
Ensuring AI systems pursue intended human goals.
Maximizing reward without fulfilling real goal.
Tendency for agents to pursue resources regardless of final goal.
Model optimizes objectives misaligned with human values.
Correctly specifying goals.
Learned subsystem that optimizes its own objective.
Model behaves well during training but not deployment.
Maintaining alignment under new conditions.
Willingness of system to accept correction or shutdown.
European regulation classifying AI systems by risk.
AI used in sensitive domains requiring compliance.
Governance of model changes.
Control shared between human and agent.
Ensuring robots do not harm humans.
Testing AI under actual clinical conditions.
US approval process for medical AI devices.
Software regulated as a medical device.
AI capable of performing most intellectual tasks humans can.
Existential risk from AI systems.
Incremental capability growth.
Isolating AI systems.
Signals indicating dangerous behavior.
Ensuring AI allows shutdown.
Tendency to gain control/resources.
International agreements on AI.