Results for "training contamination"
Error arising from a model's sensitivity to fluctuations in the training dataset.
Adjusting the learning rate over the course of training to improve convergence.
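A minimal sketch of one common schedule, cosine decay (the function name and default values here are illustrative, not from any particular library):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 0.1, min_lr: float = 0.0) -> float:
    """Cosine-decay schedule: high learning rate early in training,
    smoothly annealed toward min_lr by the final step."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

start_lr = cosine_lr(0, 100)    # equals base_lr at step 0
end_lr = cosine_lr(100, 100)    # reaches min_lr at the final step
```

Other schedules (step decay, linear warmup) plug into training loops the same way: the loop calls the schedule once per step and uses the returned rate for that update.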
Empirical laws linking model size, dataset size, and compute to performance.
Models that learn to generate samples resembling training data.
Improving model performance by scaling up the amount of training data.
A scaling law for balancing compute against data to train compute-optimal models.
Adaptive optimization methods, such as Adam, that adjust learning rates dynamically.
A mismatch between training and test environments.
Randomizing simulation parameters to improve real-world transfer.
The fabrication of legal cases or statutes by LLMs.
Methods that learn training procedures or initializations so models can adapt quickly to new tasks with little data.
A mismatch between training and deployment data distributions that can degrade model performance.
The learned numeric values of a model adjusted during training to minimize a loss function.
Techniques that discourage overly complex solutions to improve generalization (reduce overfitting).
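One common instance is L2 regularization (weight decay), which penalizes large weights; a pure-Python sketch with an illustrative penalty strength `lam`:

```python
def l2_penalized_loss(data_loss: float, weights: list[float], lam: float = 0.01) -> float:
    """Total loss = data loss + lam * sum of squared weights.
    Larger weights cost more, nudging the optimizer toward
    simpler (smaller-weight) solutions that generalize better."""
    return data_loss + lam * sum(w * w for w in weights)

# Two models with equal data loss: the one with smaller weights scores lower.
simple = l2_penalized_loss(1.0, [0.1, -0.2])
complex_ = l2_penalized_loss(1.0, [3.0, -4.0])
```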
When a model fits noise/idiosyncrasies of training data and performs poorly on unseen data.
How well a model performs on new data drawn from the same (or a similar) distribution as the training data.
A robust evaluation technique that trains/evaluates across multiple splits to estimate performance variability.
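A sketch of the index-splitting step at the core of k-fold cross-validation (pure Python; a real pipeline would train and score a model on each split):

```python
def kfold_indices(n: int, k: int):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.
    Each of the n examples appears in exactly one validation fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

folds = list(kfold_indices(10, 3))
```

Averaging the validation score over all k folds (and looking at its spread) gives the performance-variability estimate the definition refers to.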
A gradient method using random minibatches for efficient training on large datasets.
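A minimal sketch of minibatch SGD fitting a one-parameter model y ≈ w·x on toy data (all names and hyperparameters here are illustrative):

```python
import random

def sgd_fit(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ≈ w * x by minibatch SGD on squared error."""
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # fresh random minibatches each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # gradient of mean squared error w.r.t. w over the minibatch
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

xs = [i / 10 for i in range(1, 21)]  # toy data generated by y = 2x
ys = [2 * x for x in xs]
w = sgd_fit(xs, ys)  # converges near the true slope, 2
```

Each update uses only a small random subset of the data, which is what makes the method cheap per step on large datasets.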
Controls the size of parameter updates; too high and training diverges, too low and it trains slowly or gets stuck.
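The diverge/converge behavior is easy to see on f(x) = x², where gradient descent has a known stability range (this toy function is chosen for illustration):

```python
def gd_quadratic(lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = x^2 (gradient 2x). Each step multiplies
    x by (1 - 2*lr), so iterates contract toward 0 only when 0 < lr < 1;
    outside that range they blow up."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

small = gd_quadratic(0.1)  # contracts toward the minimum at 0
large = gd_quadratic(1.5)  # overshoots and diverges
```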
Gradients shrink through layers, slowing learning in early layers; mitigated by ReLU, residuals, normalization.
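The mechanism is multiplicative: backpropagation multiplies per-layer derivatives, and for sigmoid activations each factor is at most 0.25, so a toy chain shows the geometric shrinkage:

```python
def gradient_through_layers(depth, per_layer_deriv=0.25):
    """Backprop multiplies one derivative factor per layer; with the
    sigmoid's maximum derivative of 0.25, the gradient reaching the
    earliest layers shrinks geometrically with depth."""
    grad = 1.0
    for _ in range(depth):
        grad *= per_layer_deriv
    return grad

shallow = gradient_through_layers(2)   # still a usable signal
deep = gradient_through_layers(20)     # vanishingly small
```

ReLU (derivative 1 on its active half), residual connections (an additive identity path), and normalization all keep these factors closer to 1.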
Techniques that stabilize and speed training by normalizing activations; LayerNorm is common in Transformers.
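A pure-Python sketch of layer normalization over a single feature vector (gamma and beta stand in for the learnable scale and shift; scalar here for simplicity):

```python
import math

def layer_norm(x, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize a feature vector to zero mean and unit variance,
    then apply a learnable scale (gamma) and shift (beta).
    eps guards against division by zero for constant inputs."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

out = layer_norm([1.0, 2.0, 3.0, 4.0])  # mean ~0, variance ~1
```

Unlike batch normalization, the statistics are computed per example across features, which is why it suits Transformers' variable-length sequences.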
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
Achieving task performance by providing a small number of examples inside the prompt without weight updates.
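The mechanics amount to prompt construction; a sketch with a hypothetical input/output template and toy examples:

```python
def few_shot_prompt(examples, query):
    """Build a prompt that places labeled examples before the query.
    The model infers the task pattern from the examples at inference
    time; no weights are updated."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

prompt = few_shot_prompt([("cat", "animal"), ("oak", "plant")], "salmon")
```

The resulting string is sent to the model as-is; the trailing "Output:" cues it to complete the pattern for the new query.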
Fine-tuning on (prompt, response) pairs to align a model with instruction-following behaviors.
Model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Training across many devices/silos without centralizing raw data; aggregates updates, not data.
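The aggregation step can be sketched as federated averaging, a common scheme in which the server combines client weight vectors proportionally to client dataset sizes (toy vectors here, no real networking):

```python
def fed_avg(client_weights, client_sizes):
    """Federated averaging: combine per-client model weights, weighted
    by each client's dataset size. Only weight updates reach the
    server; raw data never leaves the clients."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

# Client 2 has 3x the data, so its weights dominate the average.
avg = fed_avg([[1.0, 2.0], [3.0, 4.0]], [10, 30])
```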
Ability to replicate results given the same code and data; harder under distributed training and nondeterministic ops.
PEFT method injecting trainable low-rank matrices into layers, enabling efficient fine-tuning.
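The core idea is that the effective weight becomes W + (alpha/r)·B·A, with the large matrix W frozen and only the small factors B and A trained; a pure-Python sketch with toy 2x2 shapes (the scaling convention mirrors common LoRA implementations but is illustrative here):

```python
def matmul(A, B):
    """Naive matrix multiply for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_weight(W, B, A, alpha=1.0, rank=1):
    """Effective weight W + (alpha/rank) * B @ A. Only B and A are
    trained, so trainable-parameter count scales with the rank, not
    with the size of W."""
    delta = matmul(B, A)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# 2x2 frozen weight with a rank-1 update: B is 2x1, A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
W_eff = lora_weight(W, B, A)
```

For a d×d layer, full fine-tuning touches d² parameters while a rank-r update touches only 2·d·r, which is the efficiency the definition refers to.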
Reducing numeric precision of weights/activations to speed inference and reduce memory with acceptable accuracy loss.
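A sketch of symmetric int8 quantization, one simple scheme: floats are mapped to integers in [-127, 127] via a single scale factor, then recovered approximately at inference time:

```python
def quantize_int8(xs):
    """Symmetric int8 quantization: scale so the largest magnitude
    maps to 127, then round each value to an integer."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # avoid 0 for all-zero input
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is bounded by the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.0]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)  # close to the originals, small rounding error
```

Storing int8 instead of float32 cuts memory 4x, and integer arithmetic is typically faster on supporting hardware, at the cost of the rounding error seen above.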
Tendency to trust automated suggestions even when incorrect; mitigated by UI design, training, and checks.