Results for "convergence"
Instrumental Convergence
Tendency for agents to pursue resources regardless of their final goal.
Instrumental convergence is like a student who wants good grades and realizes that studying hard and doing homework will help achieve that goal. No matter what subject they are studying, they know that being organized and having enough time to study will help them succeed. In AI, this...
Uses an exponential moving average of gradients to speed convergence and reduce oscillation.
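The moving-average idea above can be sketched as classical momentum; the learning rate, decay factor, and toy quadratic loss here are illustrative assumptions, not any library's defaults:

```python
# Sketch of momentum: an exponential moving average of gradients.
# lr and beta values are illustrative, not prescriptions.
def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    velocity = beta * velocity + (1 - beta) * grad  # EMA of gradients
    w = w - lr * velocity                           # step along the smoothed gradient
    return w, velocity

w, v = 5.0, 0.0
for _ in range(100):
    g = 2 * w            # gradient of the toy loss f(w) = w**2
    w, v = momentum_step(w, g, v)
```

Averaging successive gradients damps the oscillation that a raw gradient step would show on ill-conditioned or noisy losses.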
Number of samples per gradient update; impacts compute efficiency, generalization, and stability.
Adjusting learning rate over training to improve convergence.
Iterative method that updates parameters in the direction of negative gradient to minimize loss.
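The update rule fits in a few lines; the quadratic loss, learning rate, and step count below are illustrative assumptions:

```python
# Minimal gradient-descent sketch on the toy loss f(w) = (w - 3)**2.
def grad(w):
    return 2 * (w - 3)

w = 0.0
lr = 0.1
for _ in range(100):
    w -= lr * grad(w)  # move in the direction of the negative gradient
```

Each step shrinks the distance to the minimizer (here w = 3) by a constant factor, so the iterates converge geometrically.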
Ordering training samples from easier to harder to improve convergence or generalization.
A gradient method using random minibatches for efficient training on large datasets.
Optimization using curvature information; often expensive at scale.
Methods such as Adam that adjust learning rates dynamically for each parameter.
Configuration choices, typically not learned directly, that govern training or architecture.
Controls the size of parameter updates; too high diverges, too low trains slowly or gets stuck.
Activation max(0, x); improves gradient flow and training speed in deep nets.
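The definition translates directly to code; this is a scalar sketch (frameworks apply it elementwise over tensors):

```python
def relu(x):
    # max(0, x): passes positive inputs through, zeroes out negatives.
    # Its gradient is 1 for x > 0 and 0 for x < 0, which avoids the
    # saturation that squashing activations like sigmoid suffer from.
    return max(0.0, x)
```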
Nonlinear functions enabling networks to approximate complex mappings; ReLU variants dominate modern DL.
Methods to set starting weights to preserve signal/gradient scales across layers.
Gradients shrink through layers, slowing learning in early layers; mitigated by ReLU, residuals, normalization.
Techniques that stabilize and speed training by normalizing activations; LayerNorm is common in Transformers.
Training across many devices/silos without centralizing raw data; aggregates updates, not data.
A point where gradient is zero but is neither a max nor min; common in deep nets.
Variability introduced by minibatch sampling during SGD.
The shape of the loss function over parameter space.
Gradually increasing learning rate at training start to avoid divergence.
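A linear warmup schedule can be sketched as follows; the function name, `base_lr`, and `warmup_steps` are hypothetical illustrative choices, not a particular library's API:

```python
# Linear warmup sketch: ramp the learning rate from near zero up to
# base_lr over warmup_steps, then hold it constant.
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

In practice the constant tail is often replaced by a decay schedule (cosine, inverse square root) once warmup ends.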
Controls amount of noise added at each diffusion step.
Vectors with zero inner product; nonzero orthogonal vectors are linearly independent.
Sensitivity of a function to input perturbations.
Approximating expectations via random sampling.
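A minimal sketch of the idea, estimating E[X²] for X ~ Uniform(0, 1), whose true value is 1/3; the sample count is an illustrative assumption:

```python
import random

# Monte Carlo estimate of E[X**2], X ~ Uniform(0, 1).
random.seed(0)            # fixed seed for reproducibility
n = 100_000
estimate = sum(random.random() ** 2 for _ in range(n)) / n
```

The error of such an estimate shrinks at rate O(1/sqrt(n)), independent of dimension, which is why the method scales to high-dimensional expectations.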
Visualization of optimization landscape.
Minimum relative to nearby points.
Flat high-dimensional regions slowing training.
Choosing step size along gradient direction.
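One common instance is backtracking line search with the Armijo sufficient-decrease condition; this scalar sketch uses conventional illustrative constants and a toy quadratic loss:

```python
# Backtracking line search sketch (Armijo condition): shrink the step
# until the loss decreases enough relative to the gradient magnitude.
def backtracking_step(f, grad_f, w, step=1.0, shrink=0.5, c=1e-4):
    g = grad_f(w)
    # Armijo: accept step t when f(w - t*g) <= f(w) - c * t * g**2.
    while f(w - step * g) > f(w) - c * step * g * g:
        step *= shrink  # halve the step until sufficient decrease holds
    return w - step * g, step

f = lambda w: (w - 2) ** 2
grad_f = lambda w: 2 * (w - 2)
w, step = backtracking_step(f, grad_f, 10.0)
```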
Restricting updates to safe regions.