Results for "reward inference"
Reward hacking: maximizing reward without fulfilling the real goal.
Reward shaping: modifying the reward signal to accelerate learning.
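A common, policy-invariant way to do this is potential-based shaping (named here as an assumption; the entry does not specify a method), which adds the discounted difference of a potential function Φ over states:

```latex
r'(s, a, s') = r(s, a, s') + \gamma \, \Phi(s') - \Phi(s)
```

Because the shaping term telescopes along trajectories, optimal policies under r' coincide with those under r.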
Inverse reinforcement learning: inferring a reward function from observed behavior.
Reward model: a model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Sparse reward: reward given only upon task completion.
Specification gaming: the model exploits poorly specified objectives.
Value function: the expected cumulative reward from a state or state-action pair.
Confidential inference: methods to protect the model and data during inference (e.g., trusted execution environments) from operators or attackers.
Inference pipeline: the model execution path in production.
Inference cost: the cost to run models in production.
Causal inference: a framework for reasoning about cause-effect relationships beyond correlation, often using structural assumptions and experiments.
Online inference: low-latency prediction served per request.
Batch inference: running predictions over large datasets on a periodic schedule.
Training-serving skew: differences between training and inference conditions.
Reinforcement learning: a paradigm in which an agent interacts with an environment and learns to choose actions that maximize cumulative reward.
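The interaction loop behind this definition can be sketched as follows; the two-state `ToyEnv` and the random policy are hypothetical, chosen only to make the sketch runnable:

```python
import random

random.seed(0)

class ToyEnv:
    """Hypothetical two-state environment: reward 1 for matching the state."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.state = 1 - self.state        # deterministic state transition
        return self.state, reward

# agent-environment loop: observe state, act, receive reward, accumulate return
env = ToyEnv()
state = env.reset()
total_reward = 0.0
for t in range(10):
    action = random.choice([0, 1])         # random policy stands in for the agent
    state, reward = env.step(action)
    total_reward += reward
```

A learning agent would replace the random choice with a policy updated from the observed rewards.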
RLHF (reinforcement learning from human feedback): trains a reward model on preference data, then optimizes the policy against it.
Guardrails: rules and controls around generation (filters, validators, structured outputs) to reduce unsafe or invalid behavior.
Misalignment: the model optimizes objectives misaligned with human values.
Policy gradient: optimizing policies directly via gradient ascent on expected reward.
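As a minimal sketch of this idea, REINFORCE on a two-armed bandit performs gradient ascent on expected reward through the log-probability of the sampled action; the payoff probabilities below are made-up toy values:

```python
import math
import random

random.seed(0)

payoff = [0.2, 0.8]   # toy P(reward = 1) for each arm
theta = [0.0, 0.0]    # policy logits
lr = 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(2000):
    probs = softmax(theta)
    action = 0 if random.random() < probs[0] else 1
    reward = 1.0 if random.random() < payoff[action] else 0.0
    # REINFORCE update: grad of log pi(a) wrt theta_k is 1[k == action] - pi(k)
    for k in range(2):
        grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
        theta[k] += lr * reward * grad_log_pi
```

After training, the policy should concentrate probability on the higher-paying arm.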
Reward specification: correctly specifying goals so that optimizing the reward achieves the intended outcome.
Active inference: acting to minimize surprise or free energy.
Preference learning: inferring and aligning with human preferences.
Compute: hardware resources used for training and inference; constrained by memory bandwidth, FLOPs, and parallelism.
On-device inference: running models locally on the user's hardware.
Direct preference optimization (DPO): a preference-based training method that optimizes the policy directly from pairwise comparisons, without an explicit RL loop.
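Concretely, the DPO objective scores a preferred completion $y_w$ against a rejected one $y_l$ via log-probability ratios against a frozen reference policy:

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Here $\sigma$ is the logistic function and $\beta$ controls how far the policy may drift from the reference.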
Policy: a strategy mapping states to actions.
Markov decision process (MDP): a formal framework for sequential decision-making under uncertainty.
Bellman equation: a fundamental recursive relationship defining optimal value functions.
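For a finite MDP with transition probabilities $P(s' \mid s, a)$ and discount factor $\gamma$, the Bellman optimality equation for the state-value function reads:

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ r(s, a, s') + \gamma \, V^{*}(s') \right]
```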
Q-function (action-value): the expected return of taking an action in a given state.
Exploration-exploitation tradeoff: balancing trying new behaviors against exploiting known rewards.
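Epsilon-greedy action selection is one simple way to strike this balance; the sketch below assumes a small table of estimated action values:

```python
import random

random.seed(1)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# epsilon = 0 always exploits the best-known action
best = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Annealing epsilon toward zero over training shifts the agent from exploring to exploiting.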