Results for "autoregressive training"
Generates sequences one token at a time, conditioning on past tokens.
Prevents attention to future tokens during training/inference.
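A minimal sketch of the idea: a lower-triangular mask in which position i may attend only to positions j ≤ i. The function name is illustrative, not from any specific library.

```python
def causal_mask(n):
    # True where attention is allowed: position i may attend to j <= i,
    # so no token ever sees a future token.
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(3)
# mask[0] allows only position 0; mask[2] allows positions 0, 1, 2.
```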
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
Stores past attention states to speed up autoregressive decoding.
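A toy illustration of the caching pattern, assuming per-step key/value vectors: each decode step appends only the newest token's keys and values instead of recomputing the full history. The class is a hypothetical sketch, not a real framework API.

```python
class KVCache:
    """Toy key/value cache for autoregressive decoding (illustrative only)."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # Store the newest token's key/value; attention for this step then
        # reads the full cached history rather than recomputing it.
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values
```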
Training objective where the model predicts the next token given previous tokens (causal modeling).
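The objective can be sketched as cross-entropy on left-shifted targets: logits at position t score the token at position t+1. A minimal pure-Python version, with an illustrative function name:

```python
import math

def next_token_loss(logits, tokens):
    # logits[t] is a list of vocabulary scores predicting tokens[t + 1].
    total = 0.0
    for t, target in enumerate(tokens[1:]):
        row = logits[t]
        z = max(row)  # stabilize the softmax
        log_norm = z + math.log(sum(math.exp(x - z) for x in row))
        total -= row[target] - log_norm  # -log p(target | prefix)
    return total / (len(tokens) - 1)
```

With uniform logits over a 2-token vocabulary the loss is log 2, as expected for an uninformative predictor.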
Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
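A simplified sketch of the masking step, assuming a 15% mask rate and a single mask token id (both choices and the function name are illustrative):

```python
import random

def mask_tokens(tokens, mask_id, prob=0.15, seed=0):
    # Replace a random subset of tokens with mask_id; the model is trained
    # to predict the original token at each masked position.
    rng = random.Random(seed)
    masked, targets = [], []
    for t in tokens:
        if rng.random() < prob:
            masked.append(mask_id)
            targets.append(t)       # predict this token
        else:
            masked.append(t)
            targets.append(None)    # no loss at unmasked positions
    return masked, targets
```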
Sequential data indexed by time.
Classical statistical time-series model.
The compute, time, and monetary resources consumed by model training.
One complete traversal of the training dataset during training.
The end-to-end process of model training, from data preparation through optimization to evaluation.
Halting training when validation performance stops improving to reduce overfitting.
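A minimal patience-based sketch: stop once the best validation loss has not improved for a fixed number of epochs. The function name and default patience are illustrative.

```python
def early_stop(val_losses, patience=3):
    # Return True if training should have stopped: the best validation
    # loss went `patience` consecutive epochs without improving.
    best, since_improved = float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, since_improved = loss, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                return True
    return False
```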
When information from evaluation data improperly influences training, inflating reported performance.
Minimizing average loss on training data; can overfit when data is limited or biased.
Configuration choices not learned directly (or not typically learned) that govern training or architecture.
Number of samples per gradient update; impacts compute efficiency, generalization, and stability.
A preference-based training method optimizing policies directly from pairwise comparisons without explicit RL loops.
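This describes Direct Preference Optimization (DPO). The per-pair loss is -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]), where y_w is the preferred response and π_ref is a frozen reference policy. A minimal sketch taking log-probabilities as inputs:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Margin: how much more the policy prefers the winning response than
    # the losing one, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy matches the reference, the margin is zero and the loss is log 2; lowering the loss requires increasing the preferred response's relative log-probability.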
Ordering training samples from easier to harder to improve convergence or generalization.
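The ordering step itself reduces to a sort by a difficulty score; the scoring function (here a hypothetical `difficulty` callable) is the hard part in practice.

```python
def curriculum(samples, difficulty):
    # Present easier samples first, per some task-specific difficulty score.
    return sorted(samples, key=difficulty)

# Example: shorter sequences first, using length as a proxy for difficulty.
ordered = curriculum(["abcdef", "ab", "abcd"], difficulty=len)
```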
Gradually increasing learning rate at training start to avoid divergence.
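A linear warmup schedule is the simplest variant: scale the base learning rate up over the first N steps, then hold it. The function name and defaults are illustrative.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    # Linearly ramp from 0 to base_lr over warmup_steps, then stay at base_lr.
    return base_lr * min(1.0, step / warmup_steps)
```

In practice warmup is usually composed with a decay schedule (e.g. cosine) that takes over after the ramp.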
Recovering training data from gradients.
Inferring sensitive features of training data.
Differences between training and inference conditions.
A model that behaves well during training but degrades in deployment.
Combining simulation and real-world data.
Training with a small labeled dataset plus a larger unlabeled dataset, leveraging assumptions like smoothness/cluster structure.
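One common semi-supervised recipe is self-training with pseudo-labels: keep only the unlabeled examples the current model is confident about and treat its predictions as labels. A sketch, assuming a hypothetical `model` that returns class probabilities:

```python
def pseudo_label(model, unlabeled, threshold=0.9):
    # Assign the model's own confident predictions as labels for
    # unlabeled data (self-training); discard low-confidence examples.
    labeled = []
    for x in unlabeled:
        probs = model(x)  # hypothetical: list of class probabilities
        conf = max(probs)
        if conf >= threshold:
            labeled.append((x, probs.index(conf)))
    return labeled
```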
A parameterized mapping from inputs to outputs; includes architecture + learned parameters.
A scalar measure optimized during training, typically expected loss over data, sometimes with regularization terms.
When a model cannot capture underlying structure, performing poorly on both training and test data.
Separating data into training (fit), validation (tune), and test (final estimate) to avoid leakage and optimism bias.
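A minimal sketch of the split, assuming a single shuffle with a fixed seed so the three sets are disjoint and reproducible (fractions and names are illustrative):

```python
import random

def split(data, val_frac=0.1, test_frac=0.1, seed=0):
    # Shuffle once, then carve out disjoint test and validation sets so
    # no example can leak between stages.
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

The test set should be touched only once, for the final estimate; tuning against it reintroduces optimism bias.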
Activation max(0, x); improves gradient flow and training speed in deep nets.
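The definition maps directly to code:

```python
def relu(x):
    # max(0, x): zero for negative inputs, identity for positive ones.
    return max(0.0, x)
```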