Representation learning: Automatically learning useful internal features (latent variables) that capture salient structure for downstream tasks.
Generalization: How well a model performs on new data drawn from the same (or a similar) distribution as the training data.
Learning rate: Controls the size of parameter updates; too high and training diverges, too low and it trains slowly or gets stuck.
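A toy sketch of how the step size governs convergence, assuming the illustrative objective f(x) = x² (gradient 2x); the function and values below are made up for illustration:

```python
# Gradient descent on f(x) = x**2: the same update rule converges or
# diverges depending only on the learning rate.

def descend(lr, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x**2 is 2*x
    return x

small = descend(lr=0.1)  # shrinks toward the minimum at 0
large = descend(lr=1.1)  # overshoots every step, so |x| grows
```

With lr = 0.1 each step multiplies x by 0.8; with lr = 1.1 it multiplies x by -1.2, so the iterate diverges.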
Epoch: One complete traversal of the training dataset during training.
Neural network: A parameterized function composed of interconnected units organized in layers with nonlinear activations.
Autoregressive (causal) language modeling: Training objective where the model predicts the next token given the previous tokens.
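A minimal sketch of this objective: average negative log-likelihood of each token given its prefix. The vocabulary and the uniform stand-in model below are assumptions for illustration, not a real model:

```python
import math

vocab = {"the": 0, "cat": 1, "sat": 2}

def next_token_probs(prefix):
    # Stand-in for a real model: a uniform next-token distribution.
    return [1.0 / len(vocab)] * len(vocab)

def causal_lm_loss(token_ids):
    # Mean negative log-likelihood of each token given the tokens before it.
    losses = []
    for t in range(1, len(token_ids)):
        p = next_token_probs(token_ids[:t])[token_ids[t]]
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

loss = causal_lm_loss([0, 1, 2])  # "the cat sat"
```

For a uniform model over three tokens the loss is log 3 per position; a trained model lowers this by concentrating probability on the actual next token.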
Vanishing gradients: Gradients shrink as they propagate backward through layers, slowing learning in early layers; mitigated by ReLU activations, residual connections, and normalization.
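A back-of-the-envelope sketch of why this happens with sigmoid activations: the backpropagated signal is (at best) a product of per-layer derivatives, and sigmoid'(x) peaks at 0.25, so the product shrinks geometrically with depth:

```python
# Product of per-layer derivative bounds through a stack of sigmoid layers.

def sigmoid_grad_product(depth, per_layer=0.25):
    g = 1.0
    for _ in range(depth):
        g *= per_layer  # each sigmoid layer scales the gradient by <= 0.25
    return g

shallow = sigmoid_grad_product(2)  # 0.0625
deep = sigmoid_grad_product(20)    # ~1e-12: effectively no learning signal
```

ReLU's derivative of 1 on its active region avoids this geometric shrinkage, and residual connections add a skip path with derivative 1.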
Masked language modeling: Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
Context window: Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
Few-shot (in-context) learning: Achieving task performance by providing a small number of examples inside the prompt, without any weight updates.
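A sketch of the mechanics: demonstrations are simply concatenated into the prompt text. The sentiment task, format, and examples below are illustrative assumptions, not from any particular paper:

```python
# Few-shot prompting: the "training data" lives inside the prompt itself.

examples = [("great movie!", "positive"), ("terrible plot.", "negative")]

def build_few_shot_prompt(examples, query):
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")  # model completes the label
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(examples, "loved every minute.")
```

The model's weights never change; the demonstrations condition its next-token predictions at inference time.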
Instruction tuning (supervised fine-tuning): Fine-tuning on (prompt, response) pairs to align a model with instruction-following behavior.
Direct Preference Optimization (DPO): A preference-based training method that optimizes the policy directly from pairwise comparisons, without an explicit RL loop.
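A hedged sketch of the per-pair objective: push up the policy's log-probability margin of the chosen response over the rejected one, relative to a frozen reference model. The log-probabilities below are made-up numbers, not real model outputs:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Margin of the policy over the reference on the preferred response.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log sigmoid(beta * margin): small when the policy prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen response more than the reference does -> low loss.
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                ref_chosen=-2.0, ref_rejected=-2.0)
```

Swapping the chosen and rejected log-probabilities flips the margin's sign and raises the loss, which is what drives the preference gradient.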
Reward model: Model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Curriculum learning: Ordering training samples from easier to harder to improve convergence or generalization.
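A minimal sketch of the ordering step, assuming sequence length as a (hypothetical) difficulty proxy; real curricula use task-specific difficulty scores:

```python
# Curriculum ordering: score each sample's difficulty, present easiest first.

samples = ["a cat", "a cat sat on the mat", "cats"]

def curriculum_order(samples, difficulty=len):
    return sorted(samples, key=difficulty)

ordered = curriculum_order(samples)  # shortest (assumed easiest) first
```

In practice the curriculum often advances in stages, mixing in harder samples as training progresses rather than using one fixed sort.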
Quantization: Reducing the numeric precision of weights/activations to speed up inference and reduce memory, with acceptable accuracy loss.
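A sketch of one simple scheme, symmetric int8 quantization of a single tensor: scale by the maximum absolute value, round to integers in [-127, 127], and dequantize for use. The weights below are illustrative:

```python
# Symmetric int8 quantization: float -> int8 grid -> approximate float.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.4, -1.27, 0.003, 0.9]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)  # each value within one quantization step of w
```

The reconstruction error per weight is bounded by the scale (one grid step), which is the "acceptable accuracy loss" the definition refers to.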
Convex optimization: Optimization problems where any local minimum is also a global minimum.
Pruning: Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
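A sketch of the unstructured variant, magnitude pruning: zero out the fraction of weights with the smallest absolute values. The weight vector is made up for illustration:

```python
# Unstructured magnitude pruning: zero the smallest-|w| fraction of weights.

def magnitude_prune(weights, fraction):
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.5, -0.01, 0.3, 0.02, -0.8, 0.001]
pruned = magnitude_prune(w, 0.5)  # the three smallest-magnitude weights -> 0
```

Structured pruning instead removes whole neurons, channels, or attention heads, which maps more directly to real speedups on dense hardware.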
Non-convex optimization: Optimization with multiple local minima and saddle points; typical of neural networks.
Backdoor attack: Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
Second-order optimization: Optimization using curvature (Hessian) information; often too expensive at scale.
Scaling laws: Empirical laws linking model size, dataset size, and compute to performance.
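A sketch of how such a law is fitted: a power law L(N) = a·N^(−b) is linear in log-log space, so its exponent is the slope of a line fit. The data below is synthetic, generated from an assumed law, not a measured result:

```python
import math

def fit_power_law_exponent(ns, losses):
    # Least-squares slope of log(loss) against log(N); exponent b = -slope.
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

ns = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.076 for n in ns]  # synthetic L(N) with b = 0.076
b = fit_power_law_exponent(ns, losses)    # recovers ~0.076
```

Real scaling-law fits add an irreducible-loss term and handle measurement noise, but the log-log linearization is the core idea.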
Variational autoencoder (VAE): Autoencoder using probabilistic latent variables and KL regularization.
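A sketch of the KL regularizer mentioned above: for a diagonal Gaussian posterior N(μ, σ²) against the standard-normal prior, the KL term has the closed form 0.5·Σ(μ² + σ² − 1 − log σ²). The μ and σ values below are illustrative:

```python
import math

def kl_to_standard_normal(mu, sigma):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian.
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

# Exactly zero when the posterior matches the prior, positive otherwise.
zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl = kl_to_standard_normal([0.5, -0.2], [0.8, 1.3])
```

During training this term is added to the reconstruction loss, pulling each latent dimension toward the prior.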
Mode collapse: A GAN failure mode in which the generator produces only a limited variety of outputs.
Image classification: Assigning category labels to images.
Vision-language model (e.g., CLIP): Joint model aligning images and text in a shared representation.
Vocoder: Generates audio waveforms from spectrograms.
Production feedback loop: Using production outcomes to improve deployed models.
Gradient: The direction of steepest ascent of a function.
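A sketch of how to check this numerically, via central finite differences; the function below is an assumed toy example with an easy analytic gradient:

```python
# Central finite-difference estimate of the gradient of f at x.

def numerical_gradient(f, x, h=1e-6):
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

f = lambda p: p[0] ** 2 + 3 * p[1]     # analytic gradient: (2x, 3)
g = numerical_gradient(f, [1.0, 2.0])  # approximately [2.0, 3.0]
```

Finite differences are too slow for training (one function evaluation per parameter per step) but remain a standard sanity check for hand-written gradients.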
Plateaus: Flat, high-dimensional regions of the loss surface that slow training.
Line search: Choosing the step size along the gradient (or other descent) direction.
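A sketch of one classic variant, backtracking (Armijo) line search on an assumed 1-D objective f(x) = x²: start with a large step and halve it until the step gives sufficient decrease:

```python
# Backtracking line search: shrink t until the Armijo sufficient-decrease
# condition f(x - t*g) <= f(x) - c*t*g^2 holds.

def backtracking_step(f, x, grad, t=1.0, beta=0.5, c=1e-4):
    while f(x - t * grad) > f(x) - c * t * grad ** 2:
        t *= beta
    return t

f = lambda x: x ** 2
x, grad = 3.0, 6.0                 # f'(3) = 6
t = backtracking_step(f, x, grad)  # first step size passing the Armijo test
```

Here t = 1 overshoots past the minimum (no decrease), so the search halves it; t = 0.5 lands exactly at the minimum and is accepted.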