Domain: Optimization
Adam: A popular optimizer that combines momentum with per-parameter adaptive step sizes via exponentially decayed first and second moment estimates of the gradient.
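A minimal sketch of a single Adam update, assuming NumPy arrays; the function name and default values are illustrative, and `t` is the 1-indexed step count used for bias correction:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias-correct the zero initialization (t >= 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return w, m, v
```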
Direct preference optimization (DPO): A preference-based training method that optimizes a policy directly from pairwise comparisons, without an explicit RL loop.
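A hedged sketch of the DPO loss for one preference pair, assuming the inputs are sequence log-likelihoods under the trainable policy and a frozen reference model; names and the `beta` default are illustrative:

```python
import numpy as np

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward margin: how far the policy has shifted toward the
    # chosen response, measured relative to the frozen reference.
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return np.logaddexp(0.0, -margin)   # numerically stable -log(sigmoid(margin))
```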
Empirical risk minimization: Minimizing the average loss on the training data; can overfit when the data is limited or biased.
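The empirical risk is just the average of a pointwise loss over the finite training set; a minimal sketch, with all names illustrative:

```python
def empirical_risk(loss_fn, predict, dataset):
    # Average pointwise loss over (input, label) pairs in the training set.
    return sum(loss_fn(predict(x), y) for x, y in dataset) / len(dataset)
```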
Gradient descent: An iterative method that updates parameters in the direction of the negative gradient to minimize a loss.
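A minimal sketch of the update loop, with an illustrative one-dimensional example; the learning rate and step count are arbitrary:

```python
def gradient_descent(w, grad_fn, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * grad_fn(w)   # move against the gradient
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
gradient_descent(0.0, lambda w: 2 * (w - 3))   # converges to ~3.0
```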
Hyperparameters: Configuration choices that govern training or architecture and are not (typically) learned directly.
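An illustrative configuration; the specific names and values are arbitrary examples, not recommendations:

```python
# Set before training; not updated by the optimizer.
config = {
    "learning_rate": 3e-4,   # optimizer step size
    "batch_size": 64,        # examples per update
    "num_layers": 12,        # architecture choice
    "weight_decay": 0.01,    # regularization strength
}
```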
Log loss (cross-entropy): Penalizes confident wrong predictions heavily; the standard objective for classification and language modeling.
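A minimal sketch for a single example with a predicted probability vector, showing how a confident wrong prediction is penalized far more than a confident correct one:

```python
import numpy as np

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class.
    return -np.log(probs[true_class])

cross_entropy(np.array([0.7, 0.2, 0.1]), 0)     # ~0.36: confident and correct
cross_entropy(np.array([0.01, 0.98, 0.01]), 0)  # ~4.61: confident and wrong
```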
Mean squared error (MSE): The average of the squared residuals; a common regression objective.
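A one-line sketch with a worked example:

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)   # average of squared residuals

mse(np.array([2.5, 0.0]), np.array([3.0, -0.5]))   # (0.25 + 0.25) / 2 = 0.25
```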
Momentum: Uses an exponential moving average of gradients to speed convergence and reduce oscillation.
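A minimal sketch of one momentum update, written in the EMA convention used in the definition above; note that another common convention accumulates `beta * velocity + grad` instead, which differs only by a rescaling of the learning rate:

```python
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + (1 - beta) * grad   # EMA of recent gradients
    return w - lr * velocity, velocity               # step along the smoothed direction
```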
Objective function: A scalar measure optimized during training, typically the expected loss over the data, sometimes with regularization terms.
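An illustrative objective combining a data term with an L2 regularizer, assuming the data term has already been computed (e.g., by `empirical_risk` above); `lam` is an arbitrary regularization strength:

```python
import numpy as np

def objective(w, avg_data_loss, lam=1e-4):
    return avg_data_loss + lam * np.sum(w ** 2)   # data term plus L2 penalty on weights
```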
Reinforcement learning from human feedback (RLHF): Trains a reward model on human preference data, then optimizes the policy against that reward.
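A hedged sketch of the reward-model stage only: fit scalar rewards to pairwise preferences with a Bradley-Terry style loss. The subsequent policy-optimization stage (commonly PPO against the fitted reward, with a KL penalty to a reference policy) is omitted here, and all names are illustrative:

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    # Push the chosen response's reward above the rejected one's:
    # numerically stable -log(sigmoid(r_chosen - r_rejected)).
    return np.logaddexp(0.0, -(r_chosen - r_rejected))
```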