Results for "self-reinforcement"
Expected return of taking an action in a given state.
Fundamental recursive relationship defining optimal value functions.
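The recursion this describes can be written out for a discounted MDP. The symbols below (transition probabilities \(P\), reward \(R\), discount \(\gamma\)) are assumed standard notation, not taken from the source:

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]
```

The optimal value of a state equals the best one-step return plus the discounted optimal value of wherever that step leads.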
Learning from data generated by a different policy.
Balancing exploring new behaviors versus exploiting known rewards.
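A minimal sketch of this trade-off is the epsilon-greedy rule: with probability epsilon pick a random action (explore), otherwise pick the action with the highest estimated value (exploit). The function name and signature here are illustrative, not from the source.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: list of estimated action values, indexed by action.
    """
    if rng.random() < epsilon:
        # Explore: uniform random action.
        return rng.randrange(len(q_values))
    # Exploit: action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Setting epsilon to 0 recovers pure exploitation; annealing it toward 0 over training is a common schedule.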
Continuous cycle of observation, reasoning, action, and feedback.
Ensuring AI systems pursue intended human goals.
AI systems that perceive and act in the physical world through sensors and actuators.
Predicts next state given current state and action.
Directly optimizing control policies.
Reward only given upon task completion.
Control shared between human and agent.
Research ensuring AI remains safe.
A subfield of AI where models learn patterns from data to make predictions or decisions, improving with experience rather than explicit rule-coding.
A high-priority instruction layer setting overarching behavior constraints for a chat model.
Learning where data arrives sequentially and the model updates continuously, often under changing distributions.
A preference-based training method optimizing policies directly from pairwise comparisons without explicit RL loops.
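This describes methods such as Direct Preference Optimization (DPO). Its pairwise loss has the following form, where \(y_w\) and \(y_l\) are the preferred and rejected responses, \(\pi_{\mathrm{ref}}\) is a frozen reference policy, and \(\beta\) controls deviation from it (notation assumed, not from the source):

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```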
Model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Ensuring model behavior matches human goals, norms, and constraints, including reducing harmful or deceptive outputs.
Rules and controls around generation (filters, validators, structured outputs) to reduce unsafe or invalid behavior.
Selecting the most informative samples to label (e.g., uncertainty sampling) to reduce labeling cost.
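Uncertainty sampling, the example named above, can be sketched by ranking unlabeled samples by the entropy of the model's predicted class probabilities and labeling the top-k. The helper names below are illustrative assumptions:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, k):
    """Return indices of the k samples whose predictions have highest entropy.

    predictions: list of per-sample class-probability vectors.
    """
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]
```

A near-uniform prediction like [0.5, 0.5] has maximal entropy and is selected first; a confident one like [0.99, 0.01] is deferred.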
Ordering training samples from easier to harder to improve convergence or generalization.
Measures how one probability distribution diverges from another.
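For discrete distributions this divergence has a direct closed form, sketched below (function name is an illustrative assumption):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as probability lists.

    Terms with p_i == 0 contribute 0 by convention; q_i must be > 0
    wherever p_i > 0 for the divergence to be finite.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note that it is not symmetric: KL(p || q) generally differs from KL(q || p), which is why it measures how p diverges *from* q rather than a distance between them.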
Formal framework for sequential decision-making under uncertainty.
Set of all actions available to the agent.
Expected cumulative reward from a state or state-action pair.
Combines value estimation (critic) with policy learning (actor).
Optimizing policies directly via gradient ascent on expected reward.
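The gradient being ascended here is given by the policy gradient theorem, stated below in standard notation (symbols assumed, not from the source): the gradient of the expected return \(J(\theta)\) is an expectation of the score function weighted by the action value.

```latex
\nabla_\theta J(\theta) =
  \mathbb{E}_{\pi_\theta}\!\left[
    \nabla_\theta \log \pi_\theta(a \mid s)\; Q^{\pi_\theta}(s, a)
  \right]
```

Replacing \(Q^{\pi_\theta}\) with a sampled return gives REINFORCE; replacing it with a learned critic gives actor-critic methods like the entry below.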
Multiple agents interacting cooperatively or competitively.
Models trained to decide when and how to call external tools.
Models that learn to generate samples resembling training data.