Results for "batch norm"
Norm: a measure of vector magnitude; used in regularization (e.g. L2 weight decay) and optimization.
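As an illustration, a minimal pure-Python sketch of the Euclidean (L2) norm; the helper name `l2_norm` is our own, not from the source:

```python
import math

def l2_norm(v):
    # Euclidean (L2) norm: square root of the sum of squared components.
    return math.sqrt(sum(x * x for x in v))

print(l2_norm([3.0, 4.0]))  # → 5.0
```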
Batch size: the number of samples per gradient update; impacts compute efficiency, generalization, and training stability.
Norm emergence: the emergence of shared conventions among interacting agents in multi-agent systems.
Batch inference: running predictions over large datasets in periodic, non-interactive jobs rather than per request.
Normalization: techniques that stabilize and speed up training by normalizing activations; LayerNorm is common in Transformers.
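A minimal sketch of the normalization idea in the LayerNorm style (normalize each feature vector to zero mean and unit variance); the function name and the `eps` default are our assumptions, and real LayerNorm also applies learnable scale and shift parameters:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a feature vector to zero mean and (near) unit variance.
    # eps guards against division by zero when variance is tiny.
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

out = layer_norm([1.0, 2.0, 3.0, 4.0])  # mean ≈ 0, variance ≈ 1
```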
Lipschitz constant: a bound on the sensitivity of a function's output to input perturbations.
Gradient descent: an iterative method that updates parameters in the direction of the negative gradient to minimize a loss.
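The update rule fits in a few lines; this toy example (our own construction, with an arbitrary learning rate) minimizes f(x) = (x - 3)^2:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step opposite the gradient: x <- x - lr * grad(x).
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```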
Stochastic gradient descent (SGD): a gradient method that estimates the gradient from random minibatches, enabling efficient training on large datasets.
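A toy sketch (our own construction; function name and hyperparameters are illustrative) fitting y ≈ w·x by minibatch SGD on squared error:

```python
import random

def sgd_fit(data, lr=0.005, epochs=200, batch_size=4, seed=0):
    # Minibatch SGD: shuffle each epoch, then update w from each
    # minibatch's average gradient rather than the full dataset's.
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of mean squared error (w*x - y)^2 with respect to w.
            g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * g
    return w

data = [(float(x), 2.0 * x) for x in range(1, 9)]
w = sgd_fit(data)  # converges toward the true slope 2.0
```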
Exploding gradients: gradients grow too large during backpropagation, causing divergence; mitigated by clipping, normalization, and careful initialization.
Gradient clipping: limiting gradient magnitude (by value or by norm) to prevent exploding gradients.
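Clipping by global norm can be sketched as follows; this is a minimal version of what framework utilities such as PyTorch's `clip_grad_norm_` do, and the helper name is ours:

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together so their combined L2 norm is <= max_norm,
    # preserving the gradient's direction while bounding its magnitude.
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return grads

clipped = clip_by_global_norm([6.0, 8.0], max_norm=5.0)  # norm 10 → rescaled to 5
```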
Online learning: learning where data arrives sequentially and the model updates continuously, often under changing distributions.
Hyperparameters: configuration choices not learned directly (or not typically learned) that govern training or architecture, e.g. learning rate or batch size.
Vanishing gradients: gradients shrink as they propagate back through layers, slowing learning in early layers; mitigated by ReLU, residual connections, and normalization.
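A small numeric illustration (our own) of why gradients vanish: backpropagation multiplies one local derivative per layer, and the sigmoid's derivative never exceeds 0.25, so even the best-case product over 10 layers is tiny:

```python
import math

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = s * (1 - s), which peaks at 0.25 when x = 0.
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

grad = 1.0
for _ in range(10):
    grad *= sigmoid_derivative(0.0)  # multiply best-case 0.25 per layer
print(grad)  # 0.25 ** 10 ≈ 9.5e-7
```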
Throughput: how many requests or tokens can be processed per unit time; affects scalability and cost.
Inference pipeline: the model execution path in production, from input preprocessing to prediction output.
Online inference: serving a low-latency prediction per request.