Results for "training loss"
The hardware resources used for training and inference; constrained by memory bandwidth, FLOP throughput, and available parallelism.
Maliciously inserting or altering training data to implant backdoors or degrade performance.
Attacks that infer whether specific records were in training data, or reconstruct sensitive training examples.
Error due to sensitivity to fluctuations in the training dataset.
Adjusting the learning rate over the course of training to improve convergence.
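As an illustration (not from the source), a minimal sketch of one common schedule, cosine decay; the function name and default values here are illustrative choices, not a standard API:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Cosine-decay schedule: starts at base_lr, decays smoothly to min_lr."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Learning rate falls from 0.1 at step 0 to 0.0 at step 100.
schedule = [cosine_lr(s, 100) for s in range(101)]
```

Warmup phases, step decay, or linear decay follow the same pattern: a pure function of the step count that the training loop queries before each update.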
Models that learn to generate samples resembling training data.
Improving model performance by training on more data.
A scaling law prescribing the optimal tradeoff between compute budget, model size, and training data.
Adaptive optimization methods, such as Adam, that adjust per-parameter learning rates during training.
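As a sketch of the mechanism, here is Adam's update for a single scalar parameter (the helper name and loop are illustrative; the moment updates and bias correction follow the published algorithm):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (new_param, m, v)."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad * grad   # second-moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias-correct the running averages
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```

Dividing by the second-moment estimate rescales each step, so parameters with consistently large gradients take smaller effective steps.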
Train/test environment mismatch.
Randomizing simulation parameters to improve real-world transfer.
A mismatch between training and deployment data distributions that can degrade model performance.
When a model fits noise/idiosyncrasies of training data and performs poorly on unseen data.
A robust evaluation technique that trains/evaluates across multiple splits to estimate performance variability.
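A minimal sketch of the split generation behind k-fold cross-validation (the function name is an illustrative choice): each of the k folds serves once as the validation set while the rest form the training set.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(kfold_indices(10, 5))  # 5 folds, 2 validation points each
```

Averaging the metric across folds gives a performance estimate; its spread across folds estimates variability.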
A gradient method using random minibatches for efficient training on large datasets.
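As an illustration under simple assumptions (a one-feature linear model, squared error, and made-up data), minibatch SGD looks like:

```python
import random

def sgd_fit(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w*x + b by minibatch SGD on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                        # fresh random minibatches each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            gw = gb = 0.0
            for i in batch:                     # average gradient over the minibatch
                err = (w * xs[i] + b) - ys[i]
                gw += 2 * err * xs[i] / len(batch)
                gb += 2 * err / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

# Noise-free data from y = 3x + 1; SGD should recover slope and intercept.
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]
w, b = sgd_fit(xs, ys)
```

Each update uses a small random subset instead of the full dataset, trading gradient noise for much cheaper steps.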
Techniques that stabilize and speed training by normalizing activations; LayerNorm is common in Transformers.
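A minimal pure-Python sketch of LayerNorm applied to one feature vector (gamma/beta are the learnable scale and shift; defaults here are illustrative):

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a feature vector to zero mean / unit variance, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    return [g * v + b for g, v, b in zip(gamma, normed, beta)]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Unlike BatchNorm, the statistics are computed per example over the feature dimension, so the result does not depend on the batch, which suits variable-length Transformer inputs.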
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
Training across many devices/silos without centralizing raw data; aggregates updates, not data.
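As a sketch of the aggregation step (federated averaging), assuming each client sends its parameter vector and local dataset size; the function name is an illustrative choice:

```python
def fedavg(client_params, client_sizes):
    """Weighted average of client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(dim)]

# Two clients; the one with 3x more data pulls the average toward its parameters.
avg = fedavg([[1.0, 2.0], [5.0, 6.0]], [3, 1])
```

Only these parameter updates cross the network; the raw training examples stay on each device.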
Ability to replicate results given same code/data; harder in distributed training and nondeterministic ops.
PEFT method injecting trainable low-rank matrices into layers, enabling efficient fine-tuning.
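A minimal sketch of the LoRA forward pass: the frozen weight W is augmented by a trainable low-rank product A @ B (the helper names, the alpha scaling convention, and the toy matrices are illustrative):

```python
def matmul(A, B):
    """Naive matrix multiply for small lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + alpha * A @ B): frozen W (d x k) plus low-rank A (d x r), B (r x k)."""
    delta = matmul(A, B)
    W_eff = [[w + alpha * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return matmul([x], W_eff)[0]

# d = k = 3 frozen identity weight; a rank-1 adapter adds only d*r + r*k = 6 numbers.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[1.0], [0.0], [0.0]]   # d x r, r = 1
B = [[0.0, 0.5, 0.0]]       # r x k
y = lora_forward([1.0, 2.0, 3.0], W, A, B)
```

Because only A and B receive gradients, fine-tuning touches a tiny fraction of the parameters, and the product can be merged back into W for zero-overhead inference.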
Tendency to trust automated suggestions even when incorrect; mitigated by UI design, training, and checks.
A point where gradient is zero but is neither a max nor min; common in deep nets.
Limiting gradient magnitude to prevent exploding gradients.
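A minimal sketch of clipping by global norm, one common variant (the function name is illustrative): if the gradient's L2 norm exceeds a threshold, the whole vector is rescaled, preserving its direction.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale the gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 -> rescaled to 1
```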
Allows gradients to bypass layers, enabling very deep networks.
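The mechanism in one line: a residual block computes x + f(x), so the identity path carries both activations and gradients past f. A toy sketch (names illustrative):

```python
def residual_block(x, f):
    """y = x + f(x): the identity path lets signal flow even if f contributes little."""
    return [xi + fi for xi, fi in zip(x, f(x))]

# With f = zero, the block is an exact identity, so stacking depth is "safe".
out = residual_block([1.0, 2.0], lambda x: [0.0] * len(x))
```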
Prevents attention to future tokens during training/inference.
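A minimal sketch of the mask itself, as a boolean matrix (True = attention allowed); position i may attend only to positions j <= i, giving the lower-triangular pattern:

```python
def causal_mask(n):
    """n x n mask: entry [i][j] is True iff position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)  # token 0 sees only itself; token 3 sees all four positions
```

In practice the disallowed entries are set to -inf in the attention logits before the softmax, which zeroes their attention weights.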
Capabilities that appear only beyond certain model sizes.
Controls amount of noise added at each diffusion step.
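As an illustration, a linear beta schedule and the cumulative signal-retention product it induces (the default endpoints 1e-4 and 0.02 are a common choice from the DDPM literature, used here as an assumption):

```python
def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_t, rising linearly from beta_start to beta_end."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bar(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s): signal kept by step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

betas = linear_beta_schedule(1000)
abar = alpha_bar(betas)  # near 1 early (little noise), near 0 late (mostly noise)
```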
Ensuring learned behavior matches intended objective.
Maintaining alignment under new conditions.
Quality degradation that occurs when a model is repeatedly trained on its own generated outputs.