Results for "human values"
Model optimizes objectives misaligned with human values.
Required human review for high-risk decisions.
Ensuring AI systems pursue intended human goals.
Inferring and aligning with human preferences.
Humans assist or override autonomous behavior.
System design where humans validate or guide model outputs, especially for high-stakes decisions.
Control shared between human and agent.
Correctly specifying goals.
Reinforcement learning from human feedback: uses preference data to train a reward model and optimize the policy.
Ensuring model behavior matches human goals, norms, and constraints, including reducing harmful or deceptive outputs.
Using limited human feedback to guide large models.
Willingness of a system to accept correction or shutdown.
Research aimed at ensuring AI systems remain safe.
Existential risk from AI systems.
Risk threatening humanity’s survival.
Model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Variable whose values depend on chance.
Intelligence and goals are independent.
Model behaves well during training but not in deployment.
Interpreting human gestures.
AI subfield dealing with understanding and generating human language, including syntax, semantics, and pragmatics.
The field of building systems that perform tasks associated with human intelligence—perception, reasoning, language, planning, and decision-making—via algori...
The learned numeric values of a model, adjusted during training to minimize a loss function.
A function measuring prediction error (and sometimes calibration), guiding gradient-based optimization.
Scalar summary of the ROC curve; measures ranking ability, not calibration.
Average of squared residuals; common regression objective.
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
Feature attribution method grounded in cooperative game theory for explaining predictions, often used in tabular settings.
Expected return from taking a given action in a given state.
Predicting future values from past observations.
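
Two of the quantitative entries above (the average of squared residuals, and the scalar ROC summary that measures ranking rather than calibration) can be illustrated with a minimal sketch. The function names `mean_squared_error` and `roc_auc` and the sample data are hypothetical, chosen only for illustration; the AUC is computed here by direct pairwise comparison, which is one standard way to define it and not necessarily how any particular library implements it.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared residuals between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Only the ordering of the scores matters.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data (assumed, not from any real model).
mse = mean_squared_error([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# Shrinking every score by the same factor preserves their order but
# makes them poor probability estimates.
squashed = [s / 10 for s in scores]

auc_original = roc_auc(labels, scores)
auc_squashed = roc_auc(labels, squashed)
```

Because `squashed` keeps the same ordering as `scores`, both calls to `roc_auc` return the same value even though the squashed scores would be badly calibrated as probabilities, which is the sense in which the summary reflects ranking ability only.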