Results for "intra-sequence relations"
Structured graph encoding facts as entity–relation–entity triples.
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
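A minimal NumPy sketch of this same-sequence attention; identity projections stand in for the learned Q/K/V weight matrices a real layer would use:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product attention where queries, keys, and values all
    come from the same sequence x (identity projections for brevity)."""
    d = x.shape[-1]
    q, k, v = x, x, x                               # same-sequence Q/K/V
    scores = q @ k.T / np.sqrt(d)                   # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # each token mixes all values

x = np.random.randn(4, 8)                           # 4 tokens, dimension 8
out = self_attention(x)
print(out.shape)                                    # (4, 8)
```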
Simplified Boltzmann Machine with bipartite structure.
Masks attention to future tokens so each position conditions only on earlier ones; applied during both training and inference.
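The mask can be sketched in NumPy as an upper-triangular block of -inf added to the attention scores before the softmax (uniform zero scores here stand in for real query-key products):

```python
import numpy as np

# Causal mask: position i may attend only to positions j <= i.
T = 4
mask = np.triu(np.ones((T, T)), k=1).astype(bool)   # True strictly above diagonal
scores = np.zeros((T, T))                           # stand-in attention scores
scores[mask] = -np.inf                              # block future positions
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])                                   # [1. 0. 0. 0.]: token 0 sees only itself
```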
Injects sequence order into Transformers, since attention alone is permutation-invariant.
Generates sequences one token at a time, conditioning on past tokens.
Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
Search algorithm for generation that keeps top-k partial sequences; can improve likelihood but reduce diversity.
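A compact sketch of the idea, using a hypothetical `step_fn` that returns next-token log-probabilities in place of a real model's forward pass:

```python
import heapq
import numpy as np

def beam_search(step_fn, vocab, width, length):
    """Keep the `width` highest-scoring partial sequences at each step.
    step_fn(seq) returns log-probs over vocab for the next token."""
    beams = [(0.0, [])]                             # (cumulative log-prob, tokens)
    for _ in range(length):
        candidates = []
        for score, seq in beams:
            for tok, lp in zip(vocab, step_fn(seq)):
                candidates.append((score + lp, seq + [tok]))
        beams = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return beams

# Toy model: fixed next-token distribution regardless of history.
probs = np.array([0.6, 0.3, 0.1])
step = lambda seq: np.log(probs)
best = beam_search(step, ["a", "b", "c"], width=2, length=3)
print(best[0][1])                                   # ['a', 'a', 'a']
```

With a fixed distribution the search greedily repeats the most likely token, which illustrates the diversity loss mentioned above.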
Exponential of average negative log-likelihood; lower means better predictive fit, not necessarily better utility.
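A worked toy example with made-up per-token probabilities that the model assigned to the observed tokens:

```python
import numpy as np

# Perplexity = exp(mean negative log-likelihood of the observed tokens).
token_probs = np.array([0.25, 0.5, 0.125, 0.5])     # hypothetical model outputs
nll = -np.log(token_probs)                          # per-token surprise
ppl = np.exp(nll.mean())                            # geometric mean of 1/p
print(round(ppl, 3))                                # 3.364
```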
Encodes token position explicitly, often via sinusoids.
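The sinusoidal scheme from the original Transformer paper, sketched in NumPy: even dimensions use sine, odd dimensions cosine, at geometrically spaced frequencies:

```python
import numpy as np

def sinusoidal_encoding(n_pos, d):
    """Sinusoidal position encodings: pe[pos, 2i]   = sin(pos / 10000^(2i/d)),
                                      pe[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)                    # even dims
    pe[:, 1::2] = np.cos(angles)                    # odd dims
    return pe

pe = sinusoidal_encoding(16, 8)
print(pe.shape)                                     # (16, 8)
```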
A single attention mechanism within multi-head attention.
Transformer applied to image patches.
Mismatch between training and inference conditions, e.g. a model trained on ground-truth prefixes that must condition on its own outputs at generation time.
Controlling robots via natural-language instructions, typically by mapping language to actions or plans.
Networks with recurrent connections for sequences; largely supplanted by Transformers for many tasks.
Training objective where the model predicts the next token given previous tokens (causal modeling).
An RNN variant using gates to mitigate vanishing gradients and capture longer context.
Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
Stepwise reasoning patterns that can improve multi-step tasks; often handled implicitly or summarized for safety/privacy.
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
Stochastic generation strategies that trade determinism for diversity; key knobs include temperature and nucleus sampling.
Samples from the k highest-probability tokens to limit unlikely outputs.
Samples from the smallest set of tokens whose probabilities sum to p, adapting set size by context.
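The three knobs above (temperature, top-k, top-p) can be combined in one sampling routine; this is a minimal sketch, not a reference implementation, and the example logits are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature scaling, then optional top-k and nucleus (top-p)
    truncation, then sampling from the renormalized distribution."""
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:                           # keep k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs[probs < cutoff] = 0.0
        probs /= probs.sum()
    if top_p is not None:                           # smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs, dtype=bool)
        mask[keep] = True
        probs[~mask] = 0.0
        probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

token = sample([2.0, 1.0, 0.1, -1.0], temperature=0.8, top_k=3, top_p=0.9)
```

Lower temperature sharpens the distribution toward greedy decoding; top-k fixes the candidate count, while top-p adapts it to how peaked the distribution is.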
Raw model outputs before converting to probabilities; manipulated during decoding and calibration.
Coordinating tools, models, and steps (retrieval, calls, validation) to deliver reliable end-to-end behavior.
Stores the key and value tensors computed for past tokens so they are not recomputed at each autoregressive decoding step.
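A toy cache sketch, with identity projections standing in for the learned key/value projections of a real attention layer:

```python
import numpy as np

class KVCache:
    """Toy KV cache: keys/values for past tokens are computed once and
    appended, so each decode step only projects the newest token instead
    of reprocessing the whole prefix."""
    def __init__(self):
        self.k, self.v = [], []

    def append(self, k_t, v_t):
        self.k.append(k_t)
        self.v.append(v_t)

    def attend(self, q_t):
        K = np.stack(self.k)                        # (t, d) cached keys
        V = np.stack(self.v)                        # (t, d) cached values
        w = np.exp(q_t @ K.T / np.sqrt(K.shape[-1]))
        w /= w.sum()
        return w @ V

cache = KVCache()
for _ in range(5):                                  # 5 decode steps
    x_t = np.random.randn(8)                        # new token's hidden state
    cache.append(x_t, x_t)                          # identity projections for brevity
    out = cache.attend(x_t)
print(out.shape)                                    # (8,)
```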
Encodes positional information via rotation in embedding space.
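A sketch of the rotate-half variant (as used in GPT-NeoX-style implementations): each feature pair is rotated by an angle proportional to the token's position, so dot products between rotated queries and keys depend on relative position:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding (rotate-half layout): rotate feature
    pairs (x1[i], x2[i]) by angle pos * base^(-i/half)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)       # geometric frequency ladder
    theta = pos * freqs                             # per-pair rotation angle
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(theta) - x2 * np.sin(theta),
         x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

q = rope(np.random.randn(8), pos=3)
print(q.shape)                                      # (8,)
```

At position 0 the rotation is the identity, so the embedding leaves the first token's vector unchanged.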
Techniques to handle longer documents without quadratic cost.
Attention mechanisms that reduce quadratic complexity.