Results for "stochastic regularization"
Norm: Measure of vector magnitude; used in regularization and optimization.
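
The two norms most often used as regularization penalties can be sketched in a few lines of plain Python (helper names are made up for illustration):

```python
import math

# L1 norm: sum of absolute values (encourages sparsity when used as a penalty).
def l1_norm(v):
    return sum(abs(x) for x in v)

# L2 norm: Euclidean length (encourages small weights when used as a penalty).
def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

print(l1_norm([3, -4]))  # → 7
print(l2_norm([3, -4]))  # → 5.0
```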
Regularization: Techniques that discourage overly complex solutions to improve generalization (reduce overfitting).
Stochastic optimization: Optimization in which the objective or the available information (e.g. gradient estimates) is random; stochastic gradient methods are the canonical example.
Stochastic gradient descent (SGD): A gradient method using random minibatches for efficient training on large datasets.
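
A minimal sketch of minibatch SGD on a hypothetical least-squares toy problem (data, learning rate, and batch size are made up; real training loops add shuffled epochs, schedules, etc.):

```python
import random

# Toy problem: fit scalar w to minimize mean squared error on pairs (x, y)
# generated by y = 3*x, using random minibatches of 4 samples.
random.seed(0)
data = [(float(x), 3.0 * x) for x in range(1, 11)]

w = 0.0
lr = 0.01
for step in range(500):
    batch = random.sample(data, 4)                      # random minibatch
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad                                      # step down the noisy gradient

print(round(w, 3))  # → 3.0 (recovers the true slope)
```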
Variational autoencoder (VAE): Autoencoder using probabilistic latent variables and KL-divergence regularization.
Dropout: Randomly zeroing activations during training to reduce co-adaptation and overfitting.
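
A minimal sketch of inverted dropout (the common variant that rescales survivors at train time so no scaling is needed at inference; function name and values are illustrative):

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p and scale
    survivors by 1/(1-p); identity at inference time."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(1)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)
print(out)  # some entries zeroed, survivors doubled
```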
Sampling (stochastic decoding): Stochastic generation strategies that trade determinism for diversity; key knobs include temperature and nucleus (top-p) sampling.
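
The two knobs named above can be combined in one small sampler. This is a sketch, not any particular library's API: token ids are just list indices, and the helper name is invented.

```python
import math, random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature scaling followed by nucleus (top-p) filtering."""
    scaled = [l / temperature for l in logits]          # low temp → sharper
    m = max(scaled)                                     # stabilize the softmax
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: keep the smallest set of tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept set and draw one token.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a very low temperature or a very small `top_p`, sampling collapses to greedy argmax decoding; with `temperature=1.0, top_p=1.0` it is plain ancestral sampling.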
Semi-supervised learning: Training with a small labeled dataset plus a larger unlabeled one, leveraging assumptions such as smoothness or cluster structure.
Hyperparameters: Configuration choices that are not (typically) learned directly but govern training or architecture.
Objective function: A scalar measure optimized during training, typically the expected loss over data, sometimes with regularization terms.
Empirical risk minimization: Minimizing average loss on the training data; can overfit when data is limited or biased.
Overfitting: When a model fits noise or idiosyncrasies of the training data and performs poorly on unseen data.
Generalization: How well a model performs on new data drawn from the same (or a similar) distribution as the training data.
Early stopping: Halting training when validation performance stops improving, to reduce overfitting.
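
The stopping rule can be sketched as a patience counter over a validation-loss history (function name and patience value are illustrative):

```python
def early_stopping(val_losses, patience=2):
    """Return (stop_epoch, best_epoch): stop once the validation loss has
    not improved for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # new best: reset patience
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch                # patience exhausted
    return len(val_losses) - 1, best_epoch              # never triggered

print(early_stopping([1.0, 0.8, 0.7, 0.72, 0.75, 0.9]))  # → (4, 2)
```

In practice one restores the weights saved at `best_epoch` rather than the final ones.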
Data augmentation: Expanding training data via transformations (flips, noise, paraphrases) to improve robustness.
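
Two of the listed transformations, sketched on a tiny 2-D list standing in for an image (pixel range assumed to be 0..1; function name and noise scale are made up):

```python
import random

def augment(image_rows, rng=random):
    """Horizontal flip followed by small additive uniform noise,
    clamped back into the assumed [0, 1] pixel range."""
    flipped = [list(reversed(row)) for row in image_rows]
    return [[min(1.0, max(0.0, px + rng.uniform(-0.05, 0.05))) for px in row]
            for row in flipped]

original = [[0.2, 0.8],
            [0.5, 0.1]]
print(augment(original, random.Random(0)))  # flipped rows with slight noise
```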
Sharp minimum: A narrow minimum of the loss surface, often associated with poorer generalization than flat minima.
Model capacity: The range of functions a model can represent.
Bias-variance decomposition: A conceptual framework describing expected error as the sum of systematic error (bias), sensitivity to the training data (variance), and irreducible noise.
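
For squared error this decomposition has a standard algebraic form, where $f$ is the true function, $\hat f$ the learned predictor, and $\sigma^2$ the irreducible noise variance:

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
+ \sigma^2
```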
Deep learning: A branch of ML using multi-layer neural networks to learn hierarchical representations, often excelling in vision, speech, and language.
Gradient descent: Iterative method that updates parameters in the direction of the negative gradient to minimize loss.
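
The update rule, sketched as a generic loop on a one-dimensional quadratic (function and parameter names are illustrative):

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Minimize f(w) = (w - 2)^2, whose derivative is 2*(w - 2).
w_star = gradient_descent(lambda w: 2 * (w - 2), w0=10.0)
print(round(w_star, 4))  # → 2.0
```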
Online learning: Learning where data arrives sequentially and the model updates continuously, often under changing distributions.
Momentum: Uses an exponential moving average of gradients to speed convergence and reduce oscillation.
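
A sketch of the classical heavy-ball form on the same kind of toy quadratic (coefficients are typical defaults, not prescriptions):

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One momentum update: velocity v accumulates a decaying sum of
    past gradients, and w moves along the velocity."""
    v = beta * v + grad(w)
    w = w - lr * v
    return w, v

# Minimize f(w) = (w - 2)^2 with gradient 2*(w - 2).
w, v = 10.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, lambda x: 2 * (x - 2))
print(round(w, 4))  # → 2.0
```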
Adam: Popular optimizer combining momentum and per-parameter adaptive step sizes via first/second moment estimates.
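
The first/second moment bookkeeping can be sketched in scalar form (hyperparameters are the common published defaults; this is a toy, not a drop-in optimizer):

```python
import math

def adam(grad, w0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=500):
    """Minimal scalar Adam: bias-corrected EMAs of the gradient (m)
    and squared gradient (v) set a per-step adaptive step size."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # first moment (mean of gradients)
        v = b2 * v + (1 - b2) * g * g      # second moment (uncentered variance)
        m_hat = m / (1 - b1 ** t)          # bias corrections for zero init
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimize f(w) = (w - 2)^2; Adam lands near the minimum at w = 2.
w_star = adam(lambda w: 2 * (w - 2), w0=10.0)
```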
Batch size: Number of samples per gradient update; affects compute efficiency, generalization, and stability.
Epoch: One complete pass over the training dataset.
Neural network: A parameterized function composed of interconnected units organized in layers with nonlinear activations.
Federated learning: Training across many devices or silos without centralizing raw data; updates are aggregated, not the data itself.
Top-k sampling: Samples from the k highest-probability tokens to limit unlikely outputs.
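
A sketch over an explicit probability list (token ids are list indices; the helper name is invented):

```python
import random

def top_k_sample(probs, k=2, rng=random):
    """Sample an index from the k highest-probability entries,
    renormalized over that truncated set."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in order)       # mass of the kept set
    r, acc = rng.random() * z, 0.0
    for i in order:
        acc += probs[i]
        if r <= acc:
            return i
    return order[-1]

# With k=2 only the two most likely tokens (indices 0 and 1) can be drawn.
print(top_k_sample([0.5, 0.3, 0.1, 0.1], k=2))
```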
Non-convex optimization: Optimization over a loss surface with multiple local minima and saddle points; typical of neural networks.
Gradient noise: Variability in gradient estimates introduced by minibatch sampling during SGD.
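
The effect is easy to see empirically: on a toy least-squares problem (data and sizes are made up), the spread of minibatch gradient estimates shrinks as the batch grows.

```python
import random, statistics

# Noisy linear data y = 3x + noise; gradient of (w*x - y)^2 at fixed w
# depends on which samples land in the minibatch.
random.seed(0)
data = [(x * 0.1, 3.0 * x * 0.1 + random.gauss(0, 1)) for x in range(100)]
w = 0.0

def minibatch_grad(batch_size, rng):
    batch = rng.sample(data, batch_size)
    return sum(2 * (w * x - y) * x for x, y in batch) / batch_size

rng = random.Random(1)
g2 = [minibatch_grad(2, rng) for _ in range(300)]    # small batches: noisy
g32 = [minibatch_grad(32, rng) for _ in range(300)]  # large batches: smoother
print(statistics.stdev(g2) > statistics.stdev(g32))  # larger batch → less noise
```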