Results for "sequence modeling"
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
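A minimal sketch of scaled dot-product self-attention in NumPy, where queries, keys, and values are all projections of the same sequence; weight names and shapes are illustrative, not a specific library's API:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    x: (T, d) token embeddings; queries, keys, values all come from x.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T) token-to-token scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over keys
    return weights @ v                             # each output mixes all values

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(x, *W)
assert out.shape == (4, 8)
```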
Prevents attention to future tokens during training/inference.
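One common way to implement this is an additive mask of `-inf` above the diagonal, applied to the attention scores before the softmax; a small sketch:

```python
import numpy as np

def causal_mask(T):
    # Entry (i, j) is 0 where token i may attend to token j (j <= i),
    # and -inf where j lies in the future, so softmax assigns it zero weight.
    return np.where(np.tril(np.ones((T, T))) == 1, 0.0, -np.inf)

m = causal_mask(3)
assert m[0, 1] == -np.inf and m[2, 0] == 0.0
```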
Injects sequence order into Transformers, since attention alone is permutation-invariant.
Generates sequences one token at a time, conditioning on past tokens.
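The generation loop can be sketched as follows; `next_token_logits` here is a hypothetical stand-in for a trained model, and this version decodes greedily:

```python
def generate(next_token_logits, prompt, max_new_tokens, eos=None):
    """Generate greedily, one token at a time, conditioning on the growing prefix."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(seq)          # condition on all past tokens
        tok = max(range(len(logits)), key=logits.__getitem__)
        seq.append(tok)
        if tok == eos:
            break
    return seq

# Toy "model" over vocab {0, 1, 2}: always prefers token (last + 1) mod 3.
toy = lambda seq: [1.0 if t == (seq[-1] + 1) % 3 else 0.0 for t in range(3)]
assert generate(toy, [0], 4) == [0, 1, 2, 0, 1]
```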
Mismatch between training and inference conditions, e.g. exposure bias: a model trained on ground-truth prefixes must later condition on its own, possibly erroneous, outputs.
Modeling how an environment evolves in a learned latent space, so future states can be predicted or planned over without generating raw observations.
Networks with recurrent connections for sequences; largely supplanted by Transformers for many tasks.
Training objective where the model predicts the next token given previous tokens (causal modeling).
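In practice this objective just shifts the token stream by one position, so each input position is trained to predict its successor:

```python
# Each input token's target is the token that follows it in the sequence.
tokens = [5, 3, 9, 2]
inputs, targets = tokens[:-1], tokens[1:]
assert inputs == [5, 3, 9] and targets == [3, 9, 2]
```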
An RNN variant using gates to mitigate vanishing gradients and capture longer context.
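A compact NumPy sketch of a single LSTM step, with the standard input/forget/cell/output gating; the stacked-weight layout is one common convention, not the only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step: gates decide what to forget, what to write, what to expose.

    W: (d_in + d_h, 4*d_h) stacked weights for input/forget/cell/output gates.
    """
    z = np.concatenate([x, h]) @ W + b
    d = h.shape[0]
    i, f, g, o = z[:d], z[d:2*d], z[2*d:3*d], z[3*d:]
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated memory update
    h_new = sigmoid(o) * np.tanh(c_new)                # gated output
    return h_new, c_new

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W = rng.normal(size=(d_in + d_h, 4 * d_h)) * 0.1
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
assert h.shape == (5,) and c.shape == (5,)
```

The additive update to the cell state `c` is what lets gradients flow over long spans, mitigating the vanishing-gradient problem of plain RNNs.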
Convolutional networks applied along the time axis of a sequence; dilated convolutions can widen the receptive field to capture longer-range dependencies.
Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
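A simplified BERT-style masking routine; the mask rate, mask id, and the `-100` ignore value are illustrative conventions rather than requirements:

```python
import numpy as np

def mask_tokens(tokens, mask_id, rate=0.15, rng=None):
    """Replace a random subset of tokens with a mask token; the loss targets
    keep the original token only at masked positions."""
    rng = rng or np.random.default_rng()
    tokens = np.array(tokens)
    masked = rng.random(len(tokens)) < rate
    inputs = np.where(masked, mask_id, tokens)
    targets = np.where(masked, tokens, -100)   # -100: position ignored by the loss
    return inputs, targets, masked

rng = np.random.default_rng(1)
inp, tgt, m = mask_tokens([7, 8, 9, 10], mask_id=0, rate=0.5, rng=rng)
assert all(inp[m] == 0) and all(tgt[~m] == -100)
```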
Search algorithm for generation that keeps top-k partial sequences; can improve likelihood but reduce diversity.
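A short beam-search sketch; `step_logprobs` is a hypothetical per-step scoring function standing in for a model, and length normalization and end-of-sequence handling are omitted for brevity:

```python
import math

def beam_search(step_logprobs, beam_size, length, start=0):
    """Keep the top-k partial sequences by total log-probability at each step."""
    beams = [([start], 0.0)]
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]            # prune to the k best
    return beams

# Toy model over vocab {0, 1}: token 0 is always a bit more likely.
toy = lambda seq: {0: math.log(0.6), 1: math.log(0.4)}
best = beam_search(toy, beam_size=2, length=3)
assert best[0][0] == [0, 0, 0, 0]   # the all-0 path has the highest total score
```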
Exponential of average negative log-likelihood; lower means better predictive fit, not necessarily better utility.
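The definition maps directly to code; for intuition, a model that spreads probability uniformly over 4 options has perplexity exactly 4:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Uniform 1/4 probability on every observed token gives perplexity 4.
assert abs(perplexity([0.25, 0.25, 0.25]) - 4.0) < 1e-9
```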
Encodes token position explicitly, often via sinusoids.
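The sinusoidal scheme from the original Transformer can be sketched as follows, with sine on even dimensions and cosine on odd ones:

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Fixed sinusoidal position encodings, one d-dim vector per position."""
    pos = np.arange(T)[:, None]             # (T, 1) positions
    i = np.arange(d // 2)[None, :]          # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d))   # geometric range of wavelengths
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)            # even dims: sine
    pe[:, 1::2] = np.cos(angles)            # odd dims: cosine
    return pe

pe = sinusoidal_positions(10, 16)
assert pe.shape == (10, 16)
assert abs(pe[0, 0]) < 1e-9 and abs(pe[0, 1] - 1.0) < 1e-9  # sin(0)=0, cos(0)=1
```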
A single attention mechanism within multi-head attention.
Transformer applied to image patches.
Controlling robots through natural-language instructions, typically by grounding language in perception and action.
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
Stepwise reasoning patterns that can improve multi-step tasks; often handled implicitly or summarized for safety/privacy.
Deep learning system for protein structure prediction.
Formal framework for sequential decision-making under uncertainty.
Penalizes confident wrong predictions heavily; standard for classification and language modeling.
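A two-class example makes the asymmetry concrete: a confident correct prediction incurs almost no loss, while the same confidence on the wrong class is penalized hundreds of times harder:

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability of the target class; blows up as p -> 0."""
    return -math.log(probs[target])

confident_right = cross_entropy([0.01, 0.99], target=1)   # ~0.01
confident_wrong = cross_entropy([0.99, 0.01], target=1)   # ~4.6
assert confident_wrong > 100 * confident_right
```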
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
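A toy greedy longest-match subword tokenizer; the vocabulary below is hand-picked for illustration, whereas real subword tokenizers learn it from data (e.g. via BPE):

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # unknown character: emit it alone
            i += 1
    return tokens

vocab = {"un", "token", "iz", "able", "t"}
assert tokenize("untokenizable", vocab) == ["un", "token", "iz", "able"]
```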
Models temporal evolution through hidden states that are updated at each step and carry information forward.
Learning from interaction with an environment: taking actions, observing states, and receiving rewards.
The gap between simulated and real-world physics, which can degrade policies transferred from simulation.
Learning from data by constructing “pseudo-labels” (e.g., next-token prediction, masked modeling) without manual annotation.
Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
Stochastic generation strategies that trade determinism for diversity; key knobs include temperature and nucleus sampling.
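Both knobs can be sketched in one small function: temperature rescales the logits, then nucleus (top-p) sampling keeps the smallest set of tokens whose probability mass covers `top_p`; the function name and interface are illustrative:

```python
import numpy as np

def sample(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature   # sharpen or flatten
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                          # most likely first
    keep = np.cumsum(probs[order]) - probs[order] < top_p    # smallest covering set
    allowed = order[keep]
    p = probs[allowed] / probs[allowed].sum()                # renormalize nucleus
    return int(rng.choice(allowed, p=p))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]
# With a small enough top_p only the argmax survives, so sampling is deterministic.
assert all(sample(logits, top_p=0.1, rng=rng) == 0 for _ in range(20))
```

Lower temperature and lower top_p both push toward deterministic, high-likelihood output; raising them trades that determinism for diversity.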