Results for "easy-to-hard training"
Importance sampling: drawing samples from an easier proposal distribution and reweighting them to match the target distribution.
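A minimal sketch of this sample-and-reweight idea (importance sampling); the uniform target and proposal below are illustrative only, not from the source:

```python
import random

def importance_estimate(f, p_pdf, q_pdf, q_sample, n=50_000, seed=0):
    """Estimate E_p[f(X)] by sampling from an easier proposal q
    and reweighting each draw by p(x) / q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += (p_pdf(x) / q_pdf(x)) * f(x)
    return total / n

# Illustrative target p = Uniform(0, 2), easier proposal q = Uniform(0, 4).
est = importance_estimate(
    f=lambda x: x,
    p_pdf=lambda x: 0.5 if 0.0 <= x <= 2.0 else 0.0,
    q_pdf=lambda x: 0.25,
    q_sample=lambda rng: rng.uniform(0.0, 4.0),
)
# est is close to 1.0, the true mean under the target
```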
Knowledge distillation: training a smaller “student” model to mimic a larger “teacher,” often improving efficiency while retaining performance.
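A sketch of the soft-target loss commonly used here: the student's softened predictions are matched to the teacher's temperature-scaled distribution. The temperature value is illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(v / temperature) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's softened ("soft target") distribution."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))
```

The loss is smallest when the student reproduces the teacher's distribution; in practice it is usually mixed with an ordinary hard-label loss.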
Safety constraints: hard constraints that prevent an agent from taking unsafe actions.
Training cost: the compute and resources required to train a model.
Epoch: one complete pass over the training dataset.
Training pipeline: the end-to-end process for turning data into a trained model.
Early stopping: halting training when validation performance stops improving, to reduce overfitting.
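The stopping rule can be sketched against a recorded validation-loss curve; the patience value is an illustrative default:

```python
def early_stop_epoch(val_losses, patience=3):
    """Apply the early-stopping rule to a validation-loss curve: stop
    after `patience` consecutive epochs without a new best loss, and
    return the epoch at which training halts."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1
```

In a real loop one would also restore the checkpoint from the best epoch, not just halt.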
Data leakage: when information from evaluation data improperly influences training, inflating reported performance.
Empirical risk minimization: minimizing average loss on the training data; can overfit when data is limited or biased.
Hyperparameters: configuration choices that are not learned directly (or not typically learned) and that govern training or architecture.
Batch size: the number of samples per gradient update; affects compute efficiency, generalization, and stability.
Direct Preference Optimization (DPO): a preference-based training method that optimizes a policy directly from pairwise comparisons, without an explicit RL loop.
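For a single preference pair, the DPO loss can be sketched as below; the sequence log-probabilities passed in and the beta value are hypothetical inputs, not values from the source:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)]),
    where ref_* are log-probs under a frozen reference policy."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference (zero margin) the loss is log 2; shifting probability mass toward the chosen response lowers it.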
Curriculum learning: ordering training samples from easier to harder to improve convergence or generalization.
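A sketch of the easy-to-hard ordering with staged data exposure (each stage trains on all data up to its difficulty level, one common variant); the difficulty function and stage count are illustrative:

```python
def curriculum_order(samples, difficulty):
    """Order samples easiest-first by a difficulty score."""
    return sorted(samples, key=difficulty)

def curriculum_batches(samples, difficulty, n_stages=3):
    """Split the easy-to-hard ordering into stages: stage i trains on
    all samples up to difficulty level i (cumulative exposure)."""
    ordered = curriculum_order(samples, difficulty)
    stage = max(1, len(ordered) // n_stages)
    return [ordered[: (i + 1) * stage] for i in range(n_stages)]
```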
Learning-rate warmup: gradually increasing the learning rate at the start of training to avoid divergence.
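Linear warmup is a common variant; the base rate and step count below are illustrative defaults:

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linearly scale the learning rate up to base_lr over the first
    warmup_steps updates, then hold it constant (a decay schedule
    would typically take over from there)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```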
Gradient inversion attack: recovering training data from shared gradients.
Attribute inference attack: inferring sensitive features of training data from a trained model.
Train-serving skew: differences between training and inference conditions.
Deployment gap: the model behaves well during training but not in deployment.
Sim-to-real transfer: combining simulation and real-world data during training.
Semi-supervised learning: training with a small labeled dataset plus a larger unlabeled dataset, leveraging assumptions such as smoothness or cluster structure.
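One common instantiation is self-training with pseudo-labels; the confidence threshold and the `predict_proba` interface below are assumptions for the sketch:

```python
def pseudo_label(unlabeled, predict_proba, threshold=0.9):
    """Self-training step: assign model-predicted labels to unlabeled
    points whose top-class confidence clears a threshold; the newly
    labeled points would join the training set on the next round."""
    newly_labeled = []
    for x in unlabeled:
        probs = predict_proba(x)
        confidence = max(probs)
        if confidence >= threshold:
            newly_labeled.append((x, probs.index(confidence)))
    return newly_labeled
```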
Model: a parameterized mapping from inputs to outputs; includes the architecture plus the learned parameters.
Objective function: a scalar measure optimized during training, typically expected loss over the data, sometimes with regularization terms.
Underfitting: when a model cannot capture the underlying structure, performing poorly on both training and test data.
Train/validation/test split: separating data into training (fitting), validation (tuning), and test (final estimate) sets to avoid leakage and optimism bias.
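A sketch of the split, assuming a single shuffle and illustrative fractions:

```python
import random

def train_val_test_split(data, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out test (final estimate) and validation
    (tuning) sets; the model is fit only on the remaining train set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```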
ReLU: the activation max(0, x); improves gradient flow and training speed in deep nets.
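The function and its subgradient in a few lines:

```python
def relu(x):
    """max(0, x): pass positive inputs through, zero out negatives."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Subgradient: 1 on the active (positive) side, 0 otherwise, so
    gradients pass through active units unattenuated."""
    return 1.0 if x > 0 else 0.0
```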
Exploding gradients: gradients grow too large and cause divergence; mitigated by clipping, normalization, and careful initialization.
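Clipping by global norm is one standard mitigation; a sketch, treating the gradient as a flat list of values:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient vector when its L2 norm exceeds
    max_norm; gradients already within the bound pass unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```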
Weight initialization: methods for setting starting weights so that signal and gradient scales are preserved across layers.
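One widely used scheme is He (Kaiming) initialization for ReLU layers; a sketch with illustrative layer sizes:

```python
import math
import random

def he_init(fan_in, fan_out, seed=0):
    """He (Kaiming) initialization: zero-mean Gaussian weights with
    variance 2 / fan_in, chosen so ReLU activation magnitudes stay
    roughly constant from layer to layer."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

Xavier/Glorot initialization is the analogous scheme (variance 2 / (fan_in + fan_out)) for symmetric activations such as tanh.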
Dropout: randomly zeroing activations during training to reduce co-adaptation and overfitting.
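Inverted dropout is the common formulation; a sketch, with the drop probability as an illustrative default:

```python
import random

def dropout(activations, p=0.5, training=True, seed=0):
    """Inverted dropout: during training, zero each activation with
    probability p and scale survivors by 1 / (1 - p) so expected
    activations match; act as the identity at inference time."""
    if not training or p == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```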
Next-token prediction: a training objective in which the model predicts the next token given the previous tokens (causal language modeling).
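A sketch of the per-sequence loss, under the assumption that `logits_per_step[t]` holds the model's vocabulary scores after reading tokens up to position t:

```python
import math

def next_token_loss(logits_per_step, tokens):
    """Average cross-entropy of predicting tokens[t + 1] from the
    logits produced after seeing tokens[.. t]; causal masking means
    step t never sees later tokens."""
    total = 0.0
    for t in range(len(tokens) - 1):
        logits = logits_per_step[t]
        log_z = math.log(sum(math.exp(v) for v in logits))
        total += log_z - logits[tokens[t + 1]]
    return total / (len(tokens) - 1)
```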
Data augmentation: expanding the training data via transformations (flips, noise, paraphrases) to improve robustness.
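A toy sketch for numeric sequences; the flip and Gaussian-jitter transforms stand in for domain-appropriate ones (image flips, audio noise, text paraphrases):

```python
import random

def augment(sequence, noise_std=0.01, seed=0):
    """Generate extra training examples from one numeric sequence:
    a reversed copy and a jittered copy with small Gaussian noise."""
    rng = random.Random(seed)
    flipped = list(reversed(sequence))
    noisy = [x + rng.gauss(0.0, noise_std) for x in sequence]
    return [flipped, noisy]
```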