Results for "loss geometry"
Generalization: How well a model performs on new data drawn from the same (or similar) distribution as the training data.
Calibration: The degree to which predicted probabilities match observed frequencies (e.g., predictions made with confidence 0.8 should be correct ~80% of the time).
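Calibration is often measured with expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its accuracy. A minimal sketch, assuming a toy dataset and a simple equal-width binning scheme (the function name and data are illustrative):

```python
# Sketch: expected calibration error (ECE) on toy predictions.
# Bin predicted confidences, then compare average confidence
# to empirical accuracy within each bin.

def expected_calibration_error(confidences, correct, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # equal-width confidence bins
        bins[idx].append((conf, ok))
    ece, total = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Perfectly calibrated toy case: 0.8-confidence predictions, 4 of 5 correct.
confs = [0.8] * 5
correct = [1, 1, 1, 1, 0]
print(round(expected_calibration_error(confs, correct), 3))  # 0.0
```

A lower ECE means the model's stated confidences track its actual hit rate.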
Learning rate: Controls the size of parameter updates; too high and training diverges, too low and training is slow or gets stuck.
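The divergence/convergence trade-off is visible even on a one-dimensional quadratic. A minimal sketch, assuming f(x) = x² with gradient 2x (the step counts and rates are illustrative):

```python
# Sketch: effect of learning rate on gradient descent for f(x) = x**2,
# whose gradient is 2*x. Each step: x <- x - lr * f'(x).

def gradient_descent(lr, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient step
    return x

print(abs(gradient_descent(0.1)))  # small: converges toward the minimum at 0
print(abs(gradient_descent(1.1)))  # large: |x| grows each step (divergence)
```

For this function any lr below 1.0 converges; above 1.0 each step overshoots and the iterates explode.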
Epoch: One complete pass over the training dataset during training.
Neural network: A parameterized function composed of interconnected units organized in layers with nonlinear activations.
Next-token prediction: Training objective where the model predicts the next token given the previous tokens (causal language modeling).
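The objective is the cross-entropy between the model's next-token distribution and the actual next token, i.e. targets are the inputs shifted left by one position. A minimal sketch, where the "model" is a fixed uniform distribution over a 4-token vocabulary (an assumption for brevity, standing in for a real network):

```python
import math

# Sketch: causal LM loss. At each position t, score the probability the
# model assigns to tokens[t + 1]; average the negative log-probabilities.

VOCAB = 4  # toy vocabulary size

def next_token_loss(tokens):
    losses = []
    for t in range(len(tokens) - 1):
        # Stand-in for model(tokens[:t + 1])[tokens[t + 1]]:
        prob_of_target = 1.0 / VOCAB
        losses.append(-math.log(prob_of_target))
    return sum(losses) / len(losses)

print(next_token_loss([2, 0, 3, 1]))  # log(4) ≈ 1.386 per position
```

A real model replaces the uniform stand-in with learned conditional probabilities, so the loss falls below log(vocab size) as training progresses.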
Vanishing gradients: Gradients shrink as they propagate backward through layers, slowing learning in early layers; mitigated by ReLU activations, residual connections, and normalization.
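The shrinkage is geometric in depth because backpropagation multiplies per-layer derivatives. A minimal sketch, assuming a chain of sigmoid layers (whose derivative never exceeds 0.25) and an illustrative depth of 20:

```python
# Sketch: why gradients vanish with depth. Backprop multiplies per-layer
# derivatives; sigmoid's derivative peaks at 0.25, so even in the best
# case the gradient shrinks by at least 4x per sigmoid layer.

MAX_SIGMOID_DERIV = 0.25

grad = 1.0
for layer in range(20):
    grad *= MAX_SIGMOID_DERIV  # best-case factor per sigmoid layer
print(grad)  # 0.25**20 ≈ 9.1e-13: early layers receive almost no signal
```

ReLU (derivative 1 on the active region) and residual connections (an additive identity path) break exactly this multiplicative chain.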
Masked language modeling: Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
Context window: The maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
Few-shot learning: Achieving task performance by providing a small number of examples inside the prompt, without weight updates.
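In practice this means packing labeled examples into the prompt so the model infers the task from the pattern. A minimal sketch, where the task, examples, and prompt format are all illustrative choices:

```python
# Sketch: building a few-shot prompt for a toy sentiment task.
# The labeled examples teach the format; the final line is left
# unlabeled for the model to complete. No weights are updated.

examples = [("great movie!", "positive"), ("waste of time", "negative")]
query = "an instant classic"

prompt = "".join(f"Review: {text}\nSentiment: {label}\n\n"
                 for text, label in examples)
prompt += f"Review: {query}\nSentiment:"
print(prompt)
```

The trailing "Sentiment:" cue is what invites the model to continue the established pattern with a label for the query.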
Instruction tuning (supervised fine-tuning): Fine-tuning on (prompt, response) pairs to align a model with instruction-following behavior.
Direct Preference Optimization (DPO): A preference-based training method that optimizes the policy directly from pairwise comparisons, without an explicit RL loop.
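For a single preference pair, the DPO loss is the negative log-sigmoid of a scaled margin between how much the policy (relative to a frozen reference model) prefers the chosen response over the rejected one. A minimal sketch with made-up log-probabilities (the numbers and beta value are illustrative):

```python
import math

# Sketch of the per-pair DPO objective:
#   loss = -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
# where w is the preferred ("chosen") response and l the rejected one,
# logp_* are policy log-probs and ref_logp_* are reference-model log-probs.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does,
# so the margin is positive and the loss is below log(2):
print(dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0))
```

At zero margin the loss is exactly log(2); driving the margin positive (chosen response upweighted relative to the reference) is what lowers it.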
Reward model: A model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
LIME: A local surrogate explanation method that approximates a model's behavior near a specific input.
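The core idea is to sample perturbations around the input, weight them by proximity, and fit a simple weighted linear model whose coefficients explain the black box locally. A minimal one-dimensional sketch of that idea (not the LIME library's API; the kernel width, sample count, and black-box function are illustrative):

```python
import math
import random

# Sketch of the local-surrogate idea: perturb around x0, weight samples
# by a Gaussian proximity kernel, and fit a weighted linear model whose
# slope approximates the black box's local behavior.

def local_slope(black_box, x0, width=0.1, n=200, seed=0):
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, width) for _ in range(n)]
    ws = [math.exp(-((x - x0) ** 2) / (2 * width ** 2)) for x in xs]
    ys = [black_box(x) for x in xs]
    # Weighted least squares for the slope of y ~ a + b * (x - x0):
    sw = sum(ws)
    mx = sum(w * (x - x0) for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    num = sum(w * ((x - x0) - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    den = sum(w * ((x - x0) - mx) ** 2 for w, x in zip(ws, xs))
    return num / den

# Near x0 = 1, f(x) = x**2 behaves like a line of slope 2 * x0 = 2:
print(round(local_slope(lambda x: x * x, 1.0), 1))
```

The surrogate is only trustworthy near x0; the proximity kernel is what keeps the fit local.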
Curriculum learning: Ordering training samples from easier to harder to improve convergence or generalization.
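Concretely, a curriculum is just a difficulty-keyed ordering applied before batching. A minimal sketch using sequence length as the difficulty proxy (a common but illustrative choice; real curricula may use loss, label noise estimates, or hand-designed stages):

```python
# Sketch: order training samples easiest-first before batching.
# Here "difficulty" is approximated by text length.

samples = ["a cat", "the quick brown fox jumps", "hi", "dogs bark loudly"]

def curriculum_order(samples):
    return sorted(samples, key=len)  # easiest (shortest) first

print(curriculum_order(samples))
# ['hi', 'a cat', 'dogs bark loudly', 'the quick brown fox jumps']
```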
Quantization: Reducing the numeric precision of weights and activations to speed up inference and reduce memory, with acceptable accuracy loss.
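The simplest scheme is affine (asymmetric) int8 quantization: map the float range [min, max] onto integers 0..255 and store only the integers plus a scale and offset. A minimal sketch with illustrative weights:

```python
# Sketch: affine int8 quantization of a weight vector, then dequantization.
# The round-trip error (at most half a quantization step) is the
# "acceptable accuracy loss" the definition refers to.

def quantize(ws):
    lo, hi = min(ws), max(ws)
    scale = (hi - lo) / 255 or 1.0             # guard against constant input
    q = [round((w - lo) / scale) for w in ws]  # integers in 0..255
    return q, scale, lo

def dequantize(q, scale, lo):
    return [qi * scale + lo for qi in q]

weights = [-0.51, 0.0, 0.27, 1.02]
q, scale, lo = quantize(weights)
restored = dequantize(q, scale, lo)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # at most scale / 2
```

Storing 8-bit integers instead of 32-bit floats cuts memory 4x; the error per weight is bounded by half the quantization step.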
Pruning: Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
Convex optimization: Optimization problems where any local minimum is also a global minimum.
Backdoor attack: Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
Non-convex optimization: Optimization with multiple local minima and saddle points; typical of neural network training.
Second-order optimization: Optimization using curvature information (e.g., the Hessian); often too expensive at scale.
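The canonical example is Newton's method, which rescales the gradient by the curvature: x ← x − f′(x)/f″(x). A minimal one-dimensional sketch on an illustrative quadratic, where a single step lands exactly on the minimum:

```python
# Sketch: a Newton step uses the second derivative to rescale the gradient.
# For the quadratic f(x) = (x - 3)**2, one step reaches the minimum exactly.

def newton_step(x, grad, hess):
    return x - grad(x) / hess(x)

grad = lambda x: 2 * (x - 3)  # f'(x)
hess = lambda x: 2.0          # f''(x), constant for a quadratic

print(newton_step(10.0, grad, hess))  # 3.0
```

In high dimensions the Hessian has as many entries as parameters squared, which is why exact second-order methods rarely scale to large networks.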
Scaling laws: Empirical laws linking model size, data, and compute to performance.
Variational autoencoder (VAE): An autoencoder using probabilistic latent variables and KL-divergence regularization.
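The KL regularizer has a closed form when the posterior is a diagonal Gaussian N(mu, sigma²) and the prior is a standard normal: KL = ½ Σ (mu² + sigma² − 1 − log sigma²) over latent dimensions. A minimal sketch of just that term (the reconstruction term and encoder/decoder are omitted):

```python
import math

# Sketch: closed-form KL divergence from a diagonal Gaussian posterior
# N(mu, sigma^2) to a standard normal prior, parameterized (as is common)
# by log-variance for numerical stability.

def kl_to_standard_normal(mu, log_var):
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# A posterior equal to the prior (mu = 0, sigma = 1) gives zero KL:
print(kl_to_standard_normal(mu=[0.0, 0.0], log_var=[0.0, 0.0]))  # 0.0
```

This term pulls each latent dimension toward the prior, which is what makes sampling from the trained latent space meaningful.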
Mode collapse: A failure mode of GAN training in which the generator produces only a limited variety of outputs.
Image classification: Assigning category labels to images.
Vision-language model: A joint model aligning images and text in a shared embedding space (e.g., CLIP).
Vocoder: Generates audio waveforms from spectrograms.
Gradient: The direction of steepest ascent of a function.
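Each component of the gradient is a partial derivative, which can be checked numerically with central finite differences. A minimal sketch on an illustrative function f(x, y) = x² + 3y, whose analytic gradient is (2x, 3):

```python
# Sketch: numerical gradient via central finite differences,
#   df/dx_i ≈ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).
# This is also the standard way to sanity-check hand-derived gradients.

def numerical_grad(f, point, eps=1e-6):
    grads = []
    for i in range(len(point)):
        plus = list(point);  plus[i] += eps
        minus = list(point); minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads

f = lambda p: p[0] ** 2 + 3 * p[1]
print([round(g, 4) for g in numerical_grad(f, [2.0, 1.0])])  # [4.0, 3.0]
```

Gradient descent simply steps against this direction; gradient ascent steps along it.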
Feedback loop: Using production outcomes to improve models.
Plateau: A flat, high-dimensional region of the loss surface that slows training.