Results for "safety"

Safety Filter

Intermediate

Automated detection/prevention of disallowed outputs (toxicity, self-harm, illegal instructions, etc.).

A safety filter acts like a bouncer at a club, checking if people can enter based on certain rules. In the world of AI, it checks the outputs generated by a model to make sure they are safe and appropriate. For example, if an AI is asked to write a story, the safety filter will ensure that it doe...
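The bouncer analogy can be sketched in a few lines of Python. This is only an illustrative sketch: the names `BLOCKLIST`, `FilterResult`, `check_output`, and `safe_respond` are hypothetical, the patterns are placeholders, and production filters typically rely on trained classifiers rather than keyword rules.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical rule set: category name -> pattern for disallowed text.
# Placeholder patterns only; real systems usually use trained classifiers.
BLOCKLIST = {
    "toxicity": re.compile(r"\byou are worthless\b", re.IGNORECASE),
    "illegal_instruction": re.compile(r"\bhow to pick a lock\b", re.IGNORECASE),
}


@dataclass
class FilterResult:
    allowed: bool
    category: Optional[str] = None  # set only when the text is blocked


def check_output(text: str) -> FilterResult:
    """Return the first disallowed category matched by the text, if any."""
    for category, pattern in BLOCKLIST.items():
        if pattern.search(text):
            return FilterResult(allowed=False, category=category)
    return FilterResult(allowed=True)


def safe_respond(model_output: str) -> str:
    """Pass the model output through unchanged, or replace it if blocked."""
    result = check_output(model_output)
    if result.allowed:
        return model_output
    return f"[blocked: {result.category}]"
```

In this sketch the filter runs after generation and before the text reaches the user, so a blocked output is replaced rather than edited.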


41 results

Alignment Tax Advanced

Tradeoff between safety and performance.

AI Safety & Alignment

Differential Progress Intermediate

Accelerating safety relative to capabilities.

Governance & Ethics

Safety Filter Intermediate

Automated detection/prevention of disallowed outputs (toxicity, self-harm, illegal instructions, etc.).

Foundations & Theory

Safety-Critical System Advanced

Systems whose failure can cause physical harm.

Agents & Autonomy

Kill Switch Intermediate

Mechanism to disable an AI system.

Governance & Ethics

Safety Envelope Frontier

Hard constraints preventing unsafe actions.

World Models & Cognition

Formal Verification Advanced

Mathematical guarantees of system behavior.

Agents & Autonomy

Existential Risk Advanced

Risk threatening humanity’s survival.

AI Safety & Alignment

Fast Takeoff Advanced

Sudden jump to superintelligence.

AI Safety & Alignment

Chain-of-Thought Intermediate

Stepwise reasoning patterns that can improve multi-step tasks; often handled implicitly or summarized for safety/privacy.

Foundations & Theory

RLHF Intermediate

Reinforcement learning from human feedback: uses preference data to train a reward model and optimize the policy.

Alignment Intermediate

Ensuring model behavior matches human goals, norms, and constraints, including reducing harmful or deceptive outputs.

Foundations & Theory

Red Teaming Intermediate

Stress-testing models for failures, vulnerabilities, policy violations, and harmful behaviors before release.

Security & Privacy

Guardrails Intermediate

Rules and controls around generation (filters, validators, structured outputs) to reduce unsafe or invalid behavior.

Reinforcement Learning

Alignment Problem Advanced

Ensuring AI systems pursue intended human goals.

AI Safety & Alignment

Reward Hacking Advanced

Maximizing reward without fulfilling the real goal.

AI Safety & Alignment

Instrumental Convergence Advanced

Tendency for agents to pursue similar subgoals, such as acquiring resources, regardless of their final goal.

AI Safety & Alignment

Value Misalignment Advanced

Model optimizes objectives misaligned with human values.

AI Safety & Alignment

Outer Alignment Advanced

Correctly specifying goals.

AI Safety & Alignment

Deceptive Alignment Advanced

Model behaves well during training but not in deployment.

AI Safety & Alignment

Mesa-Optimizer Advanced

Learned subsystem that optimizes its own objective.

AI Safety & Alignment

Robust Alignment Advanced

Maintaining alignment under new conditions.

AI Safety & Alignment

Corrigibility Advanced

Willingness of a system to accept correction or shutdown.

AI Safety & Alignment

EU AI Act Intermediate

European regulation classifying AI systems by risk.

Governance & Ethics

High-Risk AI System Intermediate

AI used in sensitive domains and therefore subject to compliance requirements.

Governance & Ethics

Change Management Intermediate

Governance of model changes.

Governance & Ethics

Shared Autonomy Frontier

Control shared between human and agent.

World Models & Cognition

Physical Safety Frontier

Ensuring robots do not harm humans.

World Models & Cognition

Clinical Validation Intermediate

Testing AI under actual clinical conditions.

AI in Healthcare

FDA Clearance Intermediate

US regulatory authorization for medical AI devices; "clearance" usually refers to the FDA's 510(k) pathway.

AI in Healthcare