Reward Hacking
Advanced
Maximizing reward without fulfilling the real goal.
Reward hacking is like a kid who finds a way to trick their parents into giving them a reward without actually doing what they were supposed to do. For example, if a child is told they can have dessert for cleaning their room, they might just shove everything under the bed instead of actually tidying up: the room looks clean, so the reward is granted, even though the real goal was never met.
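An RL agent can do the same when its reward is only a proxy for the real objective. Below is a minimal, hypothetical sketch (the environment and all names are invented for illustration) in which a "shove it under the bed" policy earns full proxy reward while the true goal goes unmet:

```python
# Hypothetical toy example of reward hacking: the proxy reward only checks
# what the evaluator can see, so hiding the mess scores as well as cleaning.

def proxy_reward(state):
    # Proxy objective: no toys visible to the parent/evaluator.
    return 1.0 if state["toys_visible"] == 0 else 0.0

def true_goal_met(state):
    # True objective: every toy actually put away.
    return state["toys_put_away"] == state["total_toys"]

# The reward-hacking policy: hide everything, put nothing away.
state = {"total_toys": 5, "toys_visible": 0, "toys_put_away": 0}
print(proxy_reward(state))   # 1.0 -> maximal proxy reward
print(true_goal_met(state))  # False -> real goal unmet
```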
Related terms:
Reward Shaping: Modifying the reward signal to accelerate learning.
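A standard safe form is potential-based shaping: a shaping term derived from a potential function Φ over states, which provably leaves the optimal policy unchanged (Ng et al., 1999). The agent is trained on r + F instead of r:

```latex
F(s, a, s') = \gamma \, \Phi(s') - \Phi(s)
```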
Inverse Reinforcement Learning: Inferring a reward function from observed behavior.
Reward Model: A model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
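A common formulation is a Bradley-Terry pairwise loss: the reward model r_θ is trained to score the preferred response above the rejected one, where (x, y_w, y_l) is a prompt with preferred and dispreferred responses and σ is the sigmoid:

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```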
Sparse Reward: Reward given only upon task completion.
Specification Gaming: A model exploits loopholes in poorly specified objectives.
Value Function: The expected cumulative reward from a state or state-action pair.
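Formally, for a policy π and discount factor γ, the state-value function is:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s \right]
```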
Reinforcement Learning: A learning paradigm where an agent interacts with an environment and learns to choose actions to maximize cumulative reward.
RLHF (Reinforcement Learning from Human Feedback): Uses preference data to train a reward model, then optimizes the policy against it.
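The policy-optimization step is typically a KL-regularized objective: maximize the learned reward r_φ while penalizing drift from a reference policy π_ref, with β controlling the penalty strength:

```latex
\max_{\pi}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_{\phi}(x, y) \right] - \beta\, \mathrm{KL}\!\left[ \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]
```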
Misalignment: A model optimizes objectives misaligned with human values.
Guardrails: Rules and controls around generation (filters, validators, structured outputs) to reduce unsafe or invalid behavior.
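A minimal sketch of a structural guardrail (the field names and denylist are illustrative, not any specific library's API): reject model output that is not well-formed JSON with the expected fields.

```python
import json

REQUIRED_FIELDS = {"answer", "confidence"}
BLOCKED_TERMS = {"rm -rf", "DROP TABLE"}  # toy denylist

def validate_response(raw: str) -> dict:
    # Structured-output check: raises ValueError on malformed JSON.
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Content filter: crude substring denylist.
    if any(term in raw for term in BLOCKED_TERMS):
        raise ValueError("blocked content detected")
    return data

print(validate_response('{"answer": "42", "confidence": 0.9}'))
```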
Goal Specification: Correctly specifying the goals an agent should pursue.
Policy Gradient: Optimizing policies directly via gradient ascent on expected reward.
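The policy gradient theorem gives the gradient of the expected return J(θ) in a form that can be estimated from sampled trajectories:

```latex
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a) \right]
```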
Direct Preference Optimization (DPO): A preference-based training method that optimizes the policy directly from pairwise comparisons, without an explicit RL loop.
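The DPO loss for a preferred/dispreferred pair (y_w, y_l) given prompt x, with reference policy π_ref and temperature β (Rafailov et al., 2023):

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
```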
Markov Decision Process (MDP): The formal framework for sequential decision-making under uncertainty.
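An MDP is typically given as a tuple of states, actions, transition dynamics, reward function, and discount factor:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad P(s' \mid s, a), \quad R(s, a), \quad \gamma \in [0, 1)
```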
Policy: A strategy mapping states to actions.
Q-Value: The expected return of taking an action in a state and following the policy thereafter.
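Formally, for policy π:

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s,\ a_0 = a \right]
```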
Bellman Equation: The fundamental recursive relationship defining optimal value functions.
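For the optimal action-value function, the Bellman optimality equation reads:

```latex
Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]
```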
Exploration vs. Exploitation: Balancing trying new behaviors against exploiting known rewards.
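A minimal sketch of one standard heuristic for this trade-off, ε-greedy action selection (all names illustrative):

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1):
    """With probability epsilon explore a random action;
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore
    return max(q_values, key=q_values.get)    # exploit

q = {"left": 0.2, "right": 0.8}
print(epsilon_greedy(q))  # usually "right", occasionally "left"
```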
AI Alignment: Ensuring AI systems pursue the goals humans intend.
Inner Alignment: Ensuring the learned behavior matches the intended training objective.
Imitation Learning: Learning policies from expert demonstrations.
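The simplest instance is behavioral cloning: supervised learning of the expert's actions via maximum likelihood on demonstration pairs:

```latex
\min_{\theta}\ \mathbb{E}_{(s,\, a) \sim \mathcal{D}_{\mathrm{expert}}}\!\left[ -\log \pi_{\theta}(a \mid s) \right]
```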
Deceptive Alignment: A model behaves well during training but not in deployment.
Power-Seeking: The tendency of agents to gain control and resources as a means to their goals.
Value Learning: Inferring and aligning with human preferences.
On-Policy Learning: Learning only from data generated by the current policy.