Results for "rewards"
Model-free RL: reinforcement learning that learns values or a policy directly from experience, without an explicit model of the environment's dynamics.
Reward shaping: modifying the reward signal to accelerate learning, ideally without changing which policy is optimal.
Sparse reward: a reward given only upon task completion, which makes exploration and credit assignment harder.
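Potential-based shaping is the classic safe way to densify a sparse reward: adding F(s, s') = gamma * Phi(s') - Phi(s) provably preserves the optimal policy (Ng et al., 1999). A minimal sketch, where the distance-based potential `phi` and the goal task are illustrative choices, not part of the original entry:

```python
# Potential-based reward shaping: add F(s, s') = gamma * phi(s') - phi(s)
# to the environment reward. With this form, optimal policies are preserved
# (Ng et al., 1999). The distance-based potential here is an illustrative choice.

GAMMA = 0.99

def phi(state, goal):
    """Potential function: negative distance to the goal (illustrative)."""
    return -abs(goal - state)

def shaped_reward(env_reward, state, next_state, goal):
    return env_reward + GAMMA * phi(next_state, goal) - phi(state, goal)

# A sparse reward (nonzero only at the goal) becomes denser guidance:
print(shaped_reward(0.0, state=3, next_state=4, goal=10))  # small positive signal
```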
Reinforcement learning: a learning paradigm in which an agent interacts with an environment and learns to choose actions that maximize cumulative reward.
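The interaction loop behind that definition fits in a few lines. A minimal sketch with an invented toy environment (`ChainEnv`) and a random policy standing in for an actual learning algorithm:

```python
import random

class ChainEnv:
    """Toy 1-D chain: start at 0; reaching position 5 gives reward 1 and ends the episode."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):               # action: -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        done = self.pos == 5
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

env = ChainEnv()
state, total, done = env.reset(), 0.0, False
for _ in range(1000):                     # step cap so the random policy terminates
    action = random.choice([-1, 1])       # random policy; learning would improve this
    state, reward, done = env.step(action)
    total += reward
    if done:
        break
print("episode return:", total)
```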
Action space: the set of all actions available to the agent.
Bellman equation: the fundamental recursive relationship that defines optimal value functions.
Action-value function (Q-function): the expected return from taking an action in a given state and following the policy thereafter.
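For the action-value function, the Bellman optimality equation takes its standard textbook form (stated here for reference):

```latex
% Bellman optimality equation for the action-value function Q*
Q^{*}(s,a) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s,a)}
\!\left[\, R(s,a) + \gamma \max_{a'} Q^{*}(s',a') \,\right]
```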
Exploration vs. exploitation: the trade-off between trying new behaviors to discover better rewards and exploiting actions already known to pay off.
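The simplest mechanism for this trade-off is epsilon-greedy selection. A minimal sketch; the action values below are invented for illustration:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: random action
    return max(q_values, key=q_values.get)        # exploit: best-known action

print(epsilon_greedy({"left": 0.2, "right": 0.7}))
```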
Active learning: the AI selects which experiments or data to query next.
RLHF (reinforcement learning from human feedback): uses human preference data to train a reward model, then optimizes the policy against that model.
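The core of the reward-model step is a pairwise Bradley-Terry loss on preference data. A minimal sketch using PyTorch; the scores below are made up and stand in for a real reward model's outputs:

```python
import torch
import torch.nn.functional as F

# Given scalar scores for a human-preferred ("chosen") and a dispreferred
# ("rejected") response, minimize -log sigmoid(r_chosen - r_rejected).
r_chosen = torch.tensor([1.3, 0.4])     # scores for preferred responses (fabricated)
r_rejected = torch.tensor([0.2, 0.9])   # scores for dispreferred responses (fabricated)

loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())  # training would backpropagate this through the reward model
```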
Guardrails: rules and controls around generation (filters, validators, structured outputs) that reduce unsafe or invalid behavior.
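One common guardrail is validating model output before acting on it. A minimal sketch; the required fields and the `validate_tool_call` helper are invented for illustration, and real systems would use schema validators or structured decoding:

```python
import json

REQUIRED_FIELDS = {"action", "arguments"}

def validate_tool_call(raw: str):
    """Accept the output only if it parses as JSON with the required fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None                      # reject: not valid JSON
    if not isinstance(payload, dict) or not REQUIRED_FIELDS <= payload.keys():
        return None                      # reject: wrong shape or missing fields
    return payload                       # accept

print(validate_tool_call('{"action": "search", "arguments": {"q": "rewards"}}'))
print(validate_tool_call("free-form text"))  # -> None
```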
Agent loop: the continuous cycle of observation, reasoning, action, and feedback.
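That cycle reduces to a simple skeleton. In this sketch, `reason` is a placeholder for an LLM or planner call, not a real API, and the action execution is stubbed:

```python
def reason(observation, memory):
    """Placeholder policy: a real agent would call an LLM or planner here."""
    return "noop" if observation is None else f"respond_to:{observation}"

def run_agent(observations):
    memory = []
    for obs in observations:             # observe
        action = reason(obs, memory)     # reason
        feedback = f"executed {action}"  # act (stubbed) and collect feedback
        memory.append((obs, action, feedback))
    return memory

print(run_agent(["user asks question", "tool returns result"]))
```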
Markov decision process (MDP): the formal framework for sequential decision-making under uncertainty.
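Formally, an MDP is the standard five-tuple (stated here for reference):

```latex
% States, actions, transition kernel, reward function, discount factor.
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s' \mid s, a), \quad R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}, \quad \gamma \in [0, 1)
```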
Planner-executor pattern: an agent design that separates deciding what to do (planning) from carrying it out (execution).
Off-policy learning: learning from data generated by a policy other than the one being optimized.
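Q-learning is the canonical off-policy example: its update bootstraps from the greedy target policy regardless of which behavior policy produced the transition. A minimal tabular sketch on a fabricated two-state task:

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9
Q = defaultdict(float)                   # (state, action) -> estimated value

def q_update(s, a, r, s_next, actions):
    """Q-learning update: target uses max over next actions (greedy target policy)."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Learn from transitions collected by an arbitrary behavior policy (here, random).
for _ in range(100):
    s, a = random.choice([0, 1]), random.choice(["left", "right"])
    s_next = 1 - s
    r = 1.0 if (s, a) == (0, "right") else 0.0   # fabricated reward structure
    q_update(s, a, r, s_next, ["left", "right"])

print(Q[(0, "right")], Q[(0, "left")])           # "right" from state 0 scores higher
```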
Autonomous agent: a system that independently pursues goals over extended periods.
World model: a learned model of the environment's dynamics.
Rollouts: imagined future trajectories produced by simulating a model forward.
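Together, a world model and rollouts let an agent score candidate plans without touching the real environment. A minimal sketch; the toy `model` below is a stand-in for a learned dynamics network, and the dynamics and reward are fabricated:

```python
GAMMA = 0.99

def model(state, action):
    """Stand-in for learned dynamics: predicts (next_state, reward)."""
    next_state = state + action          # toy dynamics
    reward = -abs(10 - next_state)       # toy reward: closeness to 10
    return next_state, reward

def imagined_return(state, action_sequence):
    """Roll the model forward over an imagined trajectory; no env interaction."""
    total, discount = 0.0, 1.0
    for a in action_sequence:
        state, r = model(state, a)
        total += discount * r
        discount *= GAMMA
    return total

# Compare two candidate plans purely in imagination.
print(imagined_return(0, [1, 1, 1]), imagined_return(0, [3, 3, 3]))
```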
Cooperative AI: designing AI systems to cooperate with humans and with one another.
Agent: a system that perceives state, selects actions, and pursues goals, often combining LLM reasoning with tools and memory.