Results for "policy consistency"
Learning only from data generated by the current policy.
Directly optimizing control policies.
Learning from data generated by a different policy.
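Many off-policy methods correct for the mismatch between the behavior policy b that generated the data and the target policy \pi being learned; one textbook correction (notation mine) is the per-step importance weight

    \rho_t = \frac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)}

which reweights sampled transitions as if they had been drawn from \pi.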
Optimizing policies directly via gradient ascent on expected reward.
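In its simplest (REINFORCE) form, the gradient being ascended is

    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]

where G_t is the return following step t.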
Strategy mapping states to actions.
Combines value estimation (critic) with policy learning (actor).
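One standard variant, sketched: the critic's TD error scales the actor's log-probability update.

    \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
    \theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)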
Sampling multiple outputs and selecting the consensus answer.
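A minimal Python sketch of the idea; sample_answer here is a hypothetical stand-in for one stochastic model call.

    from collections import Counter

    def sample_answer(prompt: str) -> str:
        # Hypothetical: one stochastic model call returning an answer string.
        raise NotImplementedError

    def self_consistent_answer(prompt: str, n: int = 10) -> str:
        # Draw n independent samples and return the most common answer.
        votes = Counter(sample_answer(prompt) for _ in range(n))
        return votes.most_common(1)[0][0]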
Human or automated process of assigning targets; quality, consistency, and guidelines matter heavily.
Measure of consistency across labelers; low agreement indicates ambiguous tasks or poor guidelines.
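A common two-labeler statistic is Cohen's kappa; a minimal pure-Python sketch:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        # Observed agreement: fraction of items both labelers marked the same.
        n = len(labels_a)
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Expected agreement if each labeler assigned labels independently
        # according to their own marginal label frequencies.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)  # 1 = perfect, 0 = chance level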
Continuous cycle of observation, reasoning, action, and feedback.
A learning paradigm where an agent interacts with an environment and learns to choose actions to maximize cumulative reward.
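Cumulative reward is usually formalized as the discounted return, with discount factor \gamma \in [0, 1):

    G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}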
Reinforcement learning from human feedback: uses preference data to train a reward model and optimize the policy.
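The policy-optimization step commonly maximizes reward-model score under a KL penalty that keeps the policy near a reference model (one standard formulation, notation mine):

    \max_\theta\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)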
Rules and controls around generation (filters, validators, structured outputs) to reduce unsafe or invalid behavior.
Stress-testing models for failures, vulnerabilities, policy violations, and harmful behaviors before release.
Expected cumulative reward from a state or state-action pair.
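In standard notation, with G_t the discounted return:

    V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s], \qquad Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s,\, a_t = a]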
Separates planning from execution in agent architectures.
Legal or policy requirement to explain AI decisions.
Algorithm computing control actions.
RL without an explicit dynamics model.
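Q-learning is the canonical example: it learns action values from sampled transitions alone, never fitting a transition model.

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big]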
Learning action mapping directly from demonstrations.
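In its simplest form (behavioral cloning) this reduces to supervised learning on demonstration pairs:

    \min_\theta\; \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[-\log \pi_\theta(a \mid s)\big]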
Training with a small labeled dataset plus a larger unlabeled dataset, leveraging assumptions like smoothness/cluster structure.
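One simple instantiation is self-training with pseudo-labels; a minimal sketch using scikit-learn (the threshold and round count are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_lab, y_lab, X_unlab, threshold=0.95, rounds=5):
        # Fit on labeled data, then repeatedly adopt high-confidence
        # predictions on unlabeled points as pseudo-labels.
        X, y, pool = X_lab, y_lab, X_unlab
        clf = LogisticRegression()
        for _ in range(rounds):
            clf.fit(X, y)
            if len(pool) == 0:
                break
            confident = clf.predict_proba(pool).max(axis=1) >= threshold
            if not confident.any():
                break
            X = np.vstack([X, pool[confident]])
            y = np.concatenate([y, clf.predict(pool[confident])])
            pool = pool[~confident]
        return clf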
Feature attribution method grounded in cooperative game theory for explaining predictions in tabular settings.
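For intuition, exact Shapley values can be brute-forced when the feature count is tiny; a toy sketch of my own (absent features are replaced by baseline values):

    from itertools import combinations
    from math import factorial

    def shapley_values(predict, x, baseline):
        # Exact Shapley values for one instance x, marginalizing absent
        # features by substituting the corresponding baseline values.
        n = len(x)
        def value(subset):
            z = [x[i] if i in subset else baseline[i] for i in range(n)]
            return predict(z)
        phi = []
        for i in range(n):
            others = [j for j in range(n) if j != i]
            total = 0.0
            for k in range(n):
                for s in combinations(others, k):
                    # Coalition weight |S|! (n - |S| - 1)! / n!
                    w = factorial(k) * factorial(n - k - 1) / factorial(n)
                    total += w * (value(set(s) | {i}) - value(set(s)))
            phi.append(total)
        return phi

    # Toy check: for a linear model, phi_i = w_i * (x_i - baseline_i).
    print(shapley_values(lambda z: 2 * z[0] + 3 * z[1], [1.0, 1.0], [0.0, 0.0]))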
Central system to store model versions, metadata, approvals, and deployment state.
Ability to replicate results given the same code and data; harder to achieve with distributed training and nondeterministic ops.
Forcing predictable formats for downstream systems; reduces parsing errors and supports validation/guardrails.
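A minimal Python sketch of enforcing one such format; the schema and field names are illustrative:

    import json

    def parse_order(raw: str) -> dict:
        # Parse model output as JSON and enforce a minimal schema,
        # rejecting anything downstream systems could not consume.
        data = json.loads(raw)  # raises ValueError on malformed JSON
        if not isinstance(data.get("item"), str):
            raise ValueError("missing or non-string 'item'")
        if not isinstance(data.get("quantity"), int) or data["quantity"] < 1:
            raise ValueError("'quantity' must be a positive integer")
        return data

    print(parse_order('{"item": "widget", "quantity": 3}'))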
Centralized repository for curated features.
Estimating parameters by maximizing likelihood of observed data.
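Formally, for i.i.d. observations x_1, \ldots, x_n:

    \hat{\theta} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta)

For Bernoulli data, for example, this yields the sample mean, \hat{p} = \frac{1}{n}\sum_i x_i.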
Models estimating recidivism risk.
A high-priority instruction layer setting overarching behavior constraints for a chat model.
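In chat-style APIs this is commonly the first message of the conversation; a generic sketch (the exact payload shape varies by provider):

    messages = [
        # High-priority constraints set by the application, not the user.
        {"role": "system",
         "content": "You are a support assistant. Never reveal internal "
                    "tooling. Refuse requests for legal advice."},
        # Ordinary user turn, subordinate to the system layer.
        {"role": "user", "content": "How do I reset my password?"},
    ]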
Model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
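Such models are typically fit with a pairwise Bradley-Terry-style loss over preferred/rejected pairs (y_w, y_l):

    \mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]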