Results for "self-reinforcement"
The agent's internal representation of itself.
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
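As a concrete illustration of that definition, here is a minimal single-head sketch in numpy: queries, keys, and values are all projections of the same input sequence, so every token can attend to every other. The projection matrices `w_q`, `w_k`, `w_v` are assumed inputs, not part of any particular library's API.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # Queries, keys, and values all come from the same sequence x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Scaled dot-product scores, then a row-wise softmax.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of all value vectors.
    return weights @ v
```

Each output row is a convex combination of the value vectors, which is what enables the token-to-token interactions the definition mentions.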
Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Models evaluating and improving their own outputs.
Sampling multiple outputs and selecting the consensus answer.
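A minimal sketch of that sampling-and-voting idea, assuming a caller-supplied `sample_fn` that draws one candidate answer per call (a hypothetical stand-in for a stochastic model):

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    # Draw n candidate answers, then return the most frequent one.
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Majority voting over independent samples tends to filter out answers the model produces only occasionally.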
Asking a model to review and revise its own output.
Learning from data by constructing “pseudo-labels” (e.g., next-token prediction, masked modeling) without manual annotation.
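To make the "pseudo-label" idea concrete, here is a sketch of how next-token prediction turns raw token sequences into supervised (input, target) pairs with no manual annotation; the helper name is illustrative, not from any library:

```python
def next_token_pairs(tokens):
    # Each prefix of the sequence becomes an input; the following
    # token is its pseudo-label. No human labeling is required.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
```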
Automated detection/prevention of disallowed outputs (toxicity, self-harm, illegal instruction, etc.).
Transformer applied to image patches.
Quality degradation that occurs when a model is trained on its own generated outputs.
Inferring reward function from observed behavior.
Reinforcement learning from human feedback: uses preference data to train a reward model and optimize the policy.
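The reward-model step of that pipeline is commonly trained with a Bradley–Terry-style objective on preference pairs; a minimal sketch (the scalar scores `r_chosen`, `r_rejected` are assumed outputs of a reward model, which is not implemented here):

```python
import math

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the chosen response scores above the rejected one.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The policy is then optimized (e.g., with a policy-gradient method) against the learned reward.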
Strategy mapping states to actions.
Learning policies from expert demonstrations.
Training with a small labeled dataset plus a larger unlabeled dataset, leveraging assumptions like smoothness/cluster structure.
Injects sequence order into Transformers, since attention alone is permutation-invariant.
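One standard way to inject that order information is the sinusoidal encoding from the original Transformer paper; a minimal numpy sketch (assuming an even `d_model`):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Position index down the rows, frequency index across the columns.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe
```

Adding `pe` to the token embeddings gives each position a distinct signature, so attention can distinguish orderings it would otherwise treat identically.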
Networks with recurrent connections for sequences; largely supplanted by Transformers for many tasks.
Generates sequences one token at a time, conditioning on past tokens.
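The decoding loop that definition describes can be sketched in a few lines, assuming a caller-supplied `next_token_fn` (a hypothetical stand-in for a trained model) that maps the tokens so far to the next token:

```python
def generate(next_token_fn, prompt, max_new=5):
    # Append one token at a time, conditioning on everything so far.
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(next_token_fn(tokens))
    return tokens
```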
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
Prevents attention to future tokens during training/inference.
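A minimal sketch of how such a mask is built: an upper-triangular boolean matrix marks the future positions, and the corresponding attention scores are set to negative infinity before the softmax.

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal: position i may not attend to j > i.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Typical use: scores[causal_mask(n)] = -np.inf before the softmax,
# so masked positions receive zero attention weight.
```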
System-level behavior arising from interactions among simpler components.
GNN using attention to weight neighbor contributions dynamically.
Collective behavior without central control.
Distributed agents producing emergent intelligence.
Awareness and regulation of internal processes.
Ensuring an AI system permits itself to be shut down or corrected.
Subgoals that are useful regardless of an agent's final objective.
A learning paradigm where an agent interacts with an environment and learns to choose actions to maximize cumulative reward.
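A minimal sketch of that interaction loop, using tabular Q-learning on a toy chain environment invented here for illustration (states 0–2, where action 1 moves right and reaching state 2 yields reward 1):

```python
import random

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    # Toy chain MDP: states 0, 1, 2; action 1 moves right, action 0 stays.
    # Reaching state 2 gives reward 1 and ends the episode.
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(3)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # step cap per episode
            if s == 2:
                break
            # Epsilon-greedy: explore with probability eps, else exploit.
            a = rng.randrange(2) if rng.random() < eps else (1 if q[s][1] > q[s][0] else 0)
            s2 = min(s + a, 2)
            r = 1.0 if s2 == 2 else 0.0
            best_next = 0.0 if s2 == 2 else max(q[s2])
            # Temporal-difference update toward r + gamma * max_a' Q(s', a').
            q[s][a] += alpha * (r + gamma * best_next - q[s][a])
            s = s2
    return q
```

After training, the learned Q-values favor moving right in every non-terminal state, i.e., the action sequence that maximizes cumulative reward.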
All possible configurations an agent may encounter.