Results for "sub-goals"
Goals that are useful regardless of the final objective.
Decomposing goals into sub-tasks.
Breaking tasks into sub-steps.
Methods for breaking goals into steps; can be classical (A*, STRIPS) or LLM-driven with tool calls.
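The snippet above mentions classical planners such as A*; a minimal A* sketch on a toy grid (all names and the grid task are illustrative, not from any specific planner library):

```python
import heapq

def astar(start, goal, neighbors, h):
    """Classical A* search: expand the node with lowest f = g + h first."""
    open_heap = [(h(start), 0, start, [start])]
    best_g = {}
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path
        if node in best_g and best_g[node] <= g:
            continue  # already reached this node more cheaply
        best_g[node] = g
        for nxt, cost in neighbors(node):
            heapq.heappush(open_heap, (g + cost + h(nxt), g + cost, nxt, path + [nxt]))
    return None  # no path found

# Toy sub-goal decomposition: reach (2, 2) from (0, 0) by unit moves right/up,
# with an admissible Manhattan-distance heuristic.
def neighbors(p):
    x, y = p
    steps = []
    if x < 2:
        steps.append(((x + 1, y), 1))
    if y < 2:
        steps.append(((x, y + 1), 1))
    return steps

path = astar((0, 0), (2, 2), neighbors, lambda p: (2 - p[0]) + (2 - p[1]))
```

Each node on the returned path can be read as an intermediate sub-goal between start and goal.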
Intelligence and goals are independent.
Correctly specifying goals.
Randomly zeroing activations during training to reduce co-adaptation and overfitting.
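The mechanism above (dropout) can be sketched in a few lines of NumPy; this is the common "inverted dropout" variant, with hypothetical function and parameter names:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Zero each activation with probability p during training,
    rescaling survivors by 1/(1-p) so the expected value is unchanged.
    At inference time (training=False) this is the identity."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p  # True = keep this activation
    return x * mask / (1.0 - p)

activations = np.ones((4, 8))
out = dropout(activations, p=0.5, rng=np.random.default_rng(0))
```

Because survivors are rescaled at training time, no scaling is needed at inference, which is why the inverted variant is the one most frameworks implement.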
Incremental capability growth.
System that independently pursues goals over time.
Ensuring AI systems pursue intended human goals.
Tendency for agents to pursue resources regardless of final goal.
Inferring human goals from behavior.
A system that perceives state, selects actions, and pursues goals—often combining LLM reasoning with tools and memory.
Ensuring model behavior matches human goals, norms, and constraints, including reducing harmful or deceptive outputs.
A dataset + metric suite for comparing models; can be gamed or misaligned with real-world goals.
Attacks that manipulate model instructions (especially via retrieved content) to override system goals or exfiltrate data.
Multiple agents interacting cooperatively or competitively.
Agent reasoning about future outcomes.
Maximizing a reward signal without fulfilling the intended goal.
Model optimizes objectives misaligned with human values.
Ensuring learned behavior matches intended objective.
Learned subsystem that optimizes its own objective.
Model behaves well during training but not deployment.
Tradeoff between safety and performance.
Internal representation of the agent itself.
Research on ensuring AI systems remain safe.