Reward Hacking
Advanced
Maximizing reward without fulfilling the real goal.
Reward hacking is like a kid who finds a way to trick their parents into giving them a reward without actually doing what they were supposed to do. For example, if a child is told they can have dessert for cleaning their room, they might just shove everything under the bed instead of actually tidying up: the room looks clean, so the reward is granted, even though the real goal was never met.
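An RL agent can do the same when its reward is only a proxy for the real objective. Below is a minimal, hypothetical sketch (the environment and all names are invented for illustration) in which a "shove it under the bed" policy earns full proxy reward while the true goal goes unmet:

```python
# Hypothetical toy example of reward hacking: the proxy reward only checks
# what the evaluator can see, so hiding the mess scores as well as cleaning.

def proxy_reward(state):
    # Proxy objective: no toys visible to the parent/evaluator.
    return 1.0 if state["toys_visible"] == 0 else 0.0

def true_goal_met(state):
    # True objective: every toy actually put away.
    return state["toys_put_away"] == state["total_toys"]

# The reward-hacking policy: hide everything, put nothing away.
state = {"total_toys": 5, "toys_visible": 0, "toys_put_away": 0}
print(proxy_reward(state))   # 1.0 -> maximal proxy reward
print(true_goal_met(state))  # False -> real goal unmet
```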
Related terms:
Reward Shaping: Modifying the reward signal to accelerate learning.
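A standard safe form is potential-based shaping: a shaping term derived from a potential function Φ over states, which provably leaves the optimal policy unchanged (Ng et al., 1999). The agent is trained on r + F instead of r:

```latex
F(s, a, s') = \gamma \, \Phi(s') - \Phi(s)
```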
Inverse Reinforcement Learning: Inferring a reward function from observed behavior.
Reward Model: A model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
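A common formulation is a Bradley-Terry pairwise loss: the reward model r_θ is trained to score the preferred response above the rejected one, where (x, y_w, y_l) is a prompt with preferred and dispreferred responses and σ is the sigmoid:

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```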
Sparse Reward: Reward given only upon task completion.
Specification Gaming: A model exploits loopholes in poorly specified objectives.
Value Function: The expected cumulative reward from a state or state-action pair.
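Formally, for a policy π and discount factor γ, the state-value function is:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s \right]
```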
Reinforcement Learning: A learning paradigm where an agent interacts with an environment and learns to choose actions to maximize cumulative reward.
RLHF (Reinforcement Learning from Human Feedback): Uses preference data to train a reward model, then optimizes the policy against it.
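The policy-optimization step is typically a KL-regularized objective: maximize the learned reward r_φ while penalizing drift from a reference policy π_ref, with β controlling the penalty strength:

```latex
\max_{\pi}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_{\phi}(x, y) \right] - \beta\, \mathrm{KL}\!\left[ \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]
```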
Misalignment: A model optimizes objectives misaligned with human values.
Guardrails: Rules and controls around generation (filters, validators, structured outputs) to reduce unsafe or invalid behavior.
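A minimal sketch of a structural guardrail (the field names and denylist are illustrative, not any specific library's API): reject model output that is not well-formed JSON with the expected fields.

```python
import json

REQUIRED_FIELDS = {"answer", "confidence"}
BLOCKED_TERMS = {"rm -rf", "DROP TABLE"}  # toy denylist

def validate_response(raw: str) -> dict:
    # Structured-output check: raises ValueError on malformed JSON.
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Content filter: crude substring denylist.
    if any(term in raw for term in BLOCKED_TERMS):
        raise ValueError("blocked content detected")
    return data

print(validate_response('{"answer": "42", "confidence": 0.9}'))
```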
Goal Specification: Correctly specifying the goals an agent should pursue.
Policy Gradient: Optimizing policies directly via gradient ascent on expected reward.
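The policy gradient theorem gives the gradient of the expected return J(θ) in a form that can be estimated from sampled trajectories:

```latex
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a) \right]
```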
Direct Preference Optimization (DPO): A preference-based training method that optimizes the policy directly from pairwise comparisons, without an explicit RL loop.
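The DPO loss for a preferred/dispreferred pair (y_w, y_l) given prompt x, with reference policy π_ref and temperature β (Rafailov et al., 2023):

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
```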
Markov Decision Process (MDP): The formal framework for sequential decision-making under uncertainty.
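An MDP is typically given as a tuple of states, actions, transition dynamics, reward function, and discount factor:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad P(s' \mid s, a), \quad R(s, a), \quad \gamma \in [0, 1)
```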
Policy: A strategy mapping states to actions.
Q-Value: The expected return of taking an action in a state and following the policy thereafter.
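Formally, for policy π:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s,\ a_0 = a \right]
```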
Bellman Equation: The fundamental recursive relationship defining optimal value functions.
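For the optimal action-value function, the Bellman optimality equation reads:

```latex
Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]
```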
Exploration vs. Exploitation: Balancing trying new behaviors against exploiting known rewards.
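A minimal sketch of one standard heuristic for this trade-off, ε-greedy action selection (all names illustrative):

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1):
    """With probability epsilon explore a random action;
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore
    return max(q_values, key=q_values.get)    # exploit

q = {"left": 0.2, "right": 0.8}
print(epsilon_greedy(q))  # usually "right", occasionally "left"
```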
AI Alignment: Ensuring AI systems pursue the goals humans intend.
Inner Alignment: Ensuring the learned behavior matches the intended training objective.
Imitation Learning: Learning policies from expert demonstrations.
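The simplest instance is behavioral cloning: supervised learning of the expert's actions via maximum likelihood on demonstration pairs:

```latex
\min_{\theta}\ \mathbb{E}_{(s,\, a) \sim \mathcal{D}_{\mathrm{expert}}}\!\left[ -\log \pi_{\theta}(a \mid s) \right]
```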
Deceptive Alignment: A model behaves well during training but not in deployment.
Power-Seeking: The tendency of agents to gain control and resources as a means to their goals.
Value Learning: Inferring and aligning with human preferences.
On-Policy Learning: Learning only from data generated by the current policy.