Results for "learned objectives"
Learned subsystem that optimizes its own objective.
Ensuring learned behavior matches intended objective.
Model exploits poorly specified objectives.
Configuration choices that govern training or architecture and are not (typically) learned directly.
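A minimal sketch of the distinction above (all names and values invented for illustration): hyperparameters are fixed configuration choices, while parameters such as weights are updated by training.

```python
# Hyperparameters: chosen before training, not updated by it.
hyperparameters = {
    "learning_rate": 1e-3,  # governs training dynamics
    "num_layers": 4,        # governs architecture
    "batch_size": 32,
}

# Learned parameters: updated during training, e.g. by gradient descent.
weights = [0.0] * 8
for step in range(100):
    gradient = [0.01] * 8  # placeholder gradient for the sketch
    weights = [w - hyperparameters["learning_rate"] * g
               for w, g in zip(weights, gradient)]
```

In practice the hyperparameters themselves can be tuned by an outer search loop, but they are still not fit by the gradient updates that fit the weights.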
Early architecture using learned gates for skip connections.
Maximizing the measured reward without fulfilling the real goal.
Tendency for agents to acquire resources regardless of their final goal.
Model optimizes objectives misaligned with human values.
Correctly specifying goals.
Willingness of system to accept correction or shutdown.
Agents have opposing objectives.
Model behaves well during training but not during deployment.
The internal space where learned representations live; operations here often correlate with semantics or generative factors.
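A toy sketch of how operations in a latent space can correlate with semantics (all vectors here are invented): the classic word-embedding example where king - man + woman lands near queen.

```python
import numpy as np

# Invented 3-d embeddings, constructed so the analogy works exactly.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 1.0]),
}

# Vector arithmetic in the latent space.
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]

def nearest(vec, table):
    """Return the key whose embedding is closest in Euclidean distance."""
    return min(table, key=lambda k: np.linalg.norm(table[k] - vec))

print(nearest(result, embeddings))  # with these toy vectors: queen
```

Real learned latent spaces only approximate this behavior; the linearity of semantic directions is an empirical observation, not a guarantee.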
A parameterized mapping from inputs to outputs; includes architecture + learned parameters.
Automatically learning useful internal features (latent variables) that capture salient structure for downstream tasks.
Applying learned patterns in contexts where they do not hold.
RL using learned or known environment models.
Learned model of environment dynamics.
Multiple agents interacting cooperatively or competitively.
Coordination arising without explicit programming.
Generator produces limited variety of outputs.
Decomposing goals into sub-tasks.
Maintaining alignment under new conditions.
Governance of model changes.
Coordinating models, tools, and logic.
Limiting inference usage.
The physical system being controlled.
Agents optimize collective outcomes.
Ensuring AI allows shutdown.
Tendency to gain control/resources.