Results for "training loss"
Loss landscape: The shape of the loss function over parameter space.
Parameters (weights): The learned numeric values of a model, adjusted during training to minimize a loss function.
Objective function: A scalar measure optimized during training, typically expected loss over the data, sometimes with regularization terms.
Empirical risk minimization: Minimizing average loss on the training data; can overfit when the data is limited or biased.
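Putting the two previous entries together, a common regularized training objective has the standard form below (generic notation, not tied to any one source):

```latex
J(\theta) \;=\; \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big)}_{\text{empirical risk}} \;+\; \underbrace{\lambda\,\Omega(\theta)}_{\text{regularization}}
```

Minimizing the first term alone is empirical risk minimization; the weight \lambda trades off fit against model complexity.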
Gradient descent: An iterative method that updates parameters in the direction of the negative gradient to minimize loss.
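A minimal sketch of the update rule on a toy quadratic loss L(w) = (w - 3)^2; the learning rate and step count are arbitrary illustrative choices:

```python
def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw for L(w) = (w - 3)^2

w = 0.0    # initial parameter value
lr = 0.1   # learning rate (step size)
for _ in range(50):
    w -= lr * grad(w)  # step in the direction of the negative gradient

print(round(w, 4))  # converges toward the minimizer w = 3
```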
Quantization: Reducing the numeric precision of weights/activations to speed up inference and reduce memory, with acceptable accuracy loss.
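A minimal sketch of one simple scheme, symmetric post-training int8 quantization of a single weight tensor; production schemes (per-channel scales, zero points, calibration data) are more involved:

```python
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(w).max() / 127.0          # map the largest magnitude onto the int8 range
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

w_deq = w_q.astype(np.float32) * scale   # dequantize to compare against the original
print(np.abs(w - w_deq).max())           # small rounding error: the "accuracy loss"
```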
Hessian: The matrix of second derivatives describing the local curvature of the loss.
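In standard notation, for a loss L over parameters \theta, the Hessian and its role in the local (second-order Taylor) picture of the loss are:

```latex
H_{ij} = \frac{\partial^2 L(\theta)}{\partial \theta_i \,\partial \theta_j},
\qquad
L(\theta + \delta) \approx L(\theta) + \nabla L(\theta)^\top \delta + \tfrac{1}{2}\,\delta^\top H\,\delta .
```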
Global minimum: The lowest possible loss attainable over the parameter space.
Catastrophic forgetting: Loss of old knowledge when learning new tasks.
Value at risk (VaR): The maximum loss expected under normal conditions at a given confidence level.
Epoch: One complete pass through the training dataset.
Privacy attacks: Attacks that infer whether specific records were in the training data (membership inference) or reconstruct sensitive training examples.
Training pipeline: The end-to-end process for model training.
Training cost: The compute, time, and monetary resources consumed by model training.
Loss function: A function measuring prediction error (and sometimes calibration) that guides gradient-based optimization.
Cross-entropy loss: Penalizes confident wrong predictions heavily; the standard loss for classification and language modeling.
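A minimal sketch for a single three-class example, showing how the penalty grows with confident mistakes; the probabilities are made-up illustrative values:

```python
import math

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class.
    return -math.log(probs[true_class])

print(cross_entropy([0.70, 0.20, 0.10], 0))  # confident and right:  ~0.36
print(cross_entropy([0.40, 0.30, 0.30], 0))  # uncertain:            ~0.92
print(cross_entropy([0.01, 0.98, 0.01], 0))  # confident and wrong:  ~4.61
```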
Semi-supervised learning: Training with a small labeled dataset plus a larger unlabeled dataset, leveraging assumptions such as smoothness or cluster structure.
Multi-task learning: Training one model on multiple tasks simultaneously to improve generalization through shared structure.
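In its simplest standard form, the multi-task objective is a weighted sum of per-task losses over shared parameters; the task weights w_t are a design choice:

```latex
L_{\text{multi}}(\theta) \;=\; \sum_{t=1}^{T} w_t\, L_t(\theta)
```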
Meta-learning: Methods that learn training procedures or initializations so models can adapt quickly to new tasks with little data.
Distribution shift: A mismatch between training and deployment data distributions that can degrade model performance.
Hyperparameters: Configuration choices that are not learned directly (or not typically learned) and that govern training or architecture.
Overfitting: When a model fits noise or idiosyncrasies of the training data and performs poorly on unseen data.
Underfitting: When a model cannot capture the underlying structure, performing poorly on both training and test data.
Generalization: How well a model performs on new data drawn from the same (or a similar) distribution as the training data.
Train/validation/test split: Separating data into training (fit), validation (tune), and test (final estimate) sets to avoid leakage and optimism bias.
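A minimal sketch of a shuffled 60/20/20 split by index; the ratios and random seed are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
idx = rng.permutation(n)       # shuffle once, up front, to avoid ordering bias

train_idx = idx[:600]          # fit model parameters
val_idx   = idx[600:800]       # tune hyperparameters / decide when to stop
test_idx  = idx[800:]          # touch once, for the final performance estimate
```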
Data leakage: When information from evaluation data improperly influences training, inflating reported performance.
Stochastic gradient descent (SGD): A gradient method that uses random minibatches for efficient training on large datasets.
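A minimal runnable sketch of minibatch SGD on least-squares linear regression; the data, learning rate, batch size, and epoch count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(X))          # fresh shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]  # one random minibatch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient of mean squared error
        w -= lr * grad

print(np.round(w, 2))  # close to true_w
```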
Early stopping: Halting training when validation performance stops improving, to reduce overfitting.
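A minimal sketch of the patience-counter logic; the validation loss here is simulated (it improves, then plateaus) purely so the loop is runnable:

```python
import random

random.seed(0)

def validation_loss(epoch):
    # Stand-in for evaluating on held-out data: improves until it plateaus.
    return max(0.2, 1.0 - 0.05 * epoch) + 0.01 * random.random()

best_loss, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    loss = validation_loss(epoch)
    if loss < best_loss - 1e-4:           # meaningful improvement
        best_loss, bad_epochs = loss, 0   # reset the patience counter
        # (in practice, also checkpoint the model here)
    else:
        bad_epochs += 1
        if bad_epochs >= patience:        # stalled for `patience` epochs in a row
            break

print(epoch, round(best_loss, 3))
```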
ReLU: The activation max(0, x); improves gradient flow and training speed in deep nets.
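The function and its derivative in a few lines; note the zero gradient on the negative side and unit gradient on the positive side:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # subgradient: 0 for x <= 0, 1 for x > 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```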
Normalization layers: Techniques that stabilize and speed up training by normalizing activations; LayerNorm is common in Transformers.
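A minimal layer-normalization sketch over the last axis, with the learnable scale (gamma) and shift (beta) shown at their usual initialization of ones and zeros:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    gamma = np.ones(x.shape[-1])    # learnable scale, initialized to 1
    beta = np.zeros(x.shape[-1])    # learnable shift, initialized to 0
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
print(layer_norm(x))  # each row now has mean ~0 and unit variance
```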