Results for "probabilistic loss"
Transfer learning: Reusing knowledge from a source task/domain to improve learning on a target task/domain, typically via pretrained models.
Online (continual) learning: Learning where data arrives sequentially and the model updates continuously, often under changing distributions.
Multi-task learning: Training one model on multiple tasks simultaneously to improve generalization through shared structure.
Meta-learning: Methods that learn training procedures or initializations so models can adapt quickly to new tasks with little data.
Regularization: Techniques that discourage overly complex solutions to improve generalization (reduce overfitting).
Representation learning: Automatically learning useful internal features (latent variables) that capture salient structure for downstream tasks.
Generalization: How well a model performs on new data drawn from the same (or a similar) distribution as the training data.
Learning rate: Controls the size of parameter updates; if too high, training diverges; if too low, training is slow or stalls.
Epoch: One complete traversal of the training dataset during training.
Neural network: A parameterized function composed of interconnected units organized in layers with nonlinear activations.
Autoregressive language modeling: Training objective where the model predicts the next token given the previous tokens (causal modeling).
Vanishing gradients: Gradients shrink as they propagate back through layers, slowing learning in early layers; mitigated by ReLU activations, residual connections, and normalization.
Masked language modeling: Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
Context window: Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
Few-shot (in-context) learning: Achieving task performance by providing a small number of examples inside the prompt, without weight updates.
Instruction tuning: Fine-tuning on (prompt, response) pairs to align a model with instruction-following behavior.
Direct preference optimization (DPO): A preference-based training method that optimizes the policy directly from pairwise comparisons, without an explicit RL loop.
Reward model: Model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Local surrogate explanation (e.g., LIME): Method that approximates model behavior near a specific input with an interpretable surrogate model.
Curriculum learning: Ordering training samples from easier to harder to improve convergence or generalization.
Quantization: Reducing the numeric precision of weights/activations to speed inference and reduce memory, with acceptable accuracy loss.
Pruning: Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
Convex optimization: Optimization problems where any local minimum is also a global minimum.
Backdoor attack: Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
Non-convex optimization: Optimization landscapes with multiple local minima and saddle points; typical of neural networks.
Second-order optimization: Optimization using curvature information (e.g., Newton's method); often too expensive at scale.
Scaling laws: Empirical laws linking model size, dataset size, and compute to performance.
Mode collapse: A GAN failure mode in which the generator produces only a limited variety of outputs.
Image classification: Assigning category labels to images.
Vision-language model (e.g., CLIP): Joint model aligning images and text in a shared embedding space.
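Of the entries above, the autoregressive next-token objective is the closest match to the query "probabilistic loss": training minimizes the negative log-likelihood of each observed token under the model's predicted distribution. A minimal sketch with toy probabilities (the function name and numbers are illustrative, not from any library):

```python
import math

def next_token_nll(probs, target_index):
    """Negative log-likelihood of the target token under the model's
    predicted distribution over the vocabulary (cross-entropy against
    a one-hot target)."""
    return -math.log(probs[target_index])

# Toy distribution over a 4-token vocabulary; the model assigns
# probability 0.7 to the correct next token.
probs = [0.1, 0.7, 0.15, 0.05]
loss = next_token_nll(probs, target_index=1)
```

In practice the distribution comes from a softmax over logits, and the loss is averaged over all positions in the sequence.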
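The learning-rate entry can be made concrete with gradient descent on f(x) = x², whose gradient is 2x: a small step size converges toward the minimum, while a step size above 1 (for this particular function) overshoots and diverges. A toy sketch (names and values illustrative):

```python
def gd_step(x, grad, lr):
    # One gradient-descent update: move against the gradient,
    # scaled by the learning rate.
    return x - lr * grad

def run(lr, steps=50):
    # Minimize f(x) = x**2 (gradient 2x) starting from x = 1.0.
    x = 1.0
    for _ in range(steps):
        x = gd_step(x, 2 * x, lr)
    return x

small = run(lr=0.1)   # each step multiplies x by 0.8: converges toward 0
large = run(lr=1.1)   # each step multiplies x by -1.2: |x| grows, diverges
```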
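The pairwise-preference objective in the DPO entry can be sketched as a single loss term: the policy's log-probability margin between the chosen and rejected responses, measured relative to a reference model, passed through a sigmoid. All log-probabilities below are toy values, not outputs of a real model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pairwise comparison:
    -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)),
    where each log-ratio is policy log-prob minus reference log-prob."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(sigmoid(beta * margin))

# Policy prefers the chosen response more than the reference does,
# so the margin is positive and the loss is below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

Driving the loss down pushes the policy to raise the chosen response's likelihood relative to the rejected one, without training a separate reward model.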
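A minimal sketch of the quantization entry: symmetric int8-style quantization maps each float weight to an integer plus one shared scale, and dequantization recovers each value to within half a quantization step. Toy weights only; real schemes add per-channel scales, zero points, and calibration:

```python
def quantize(weights, num_bits=8):
    """Symmetric quantization: map floats to signed integers in
    [-qmax, qmax] and keep the scale for dequantization."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.31, -1.27, 0.004, 0.88]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each restored weight is within half a quantization step of the original.
```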
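The pruning entry, in its simplest unstructured form, is magnitude pruning: zero out the smallest-magnitude fraction of weights. A toy sketch (ties at the threshold are pruned; real pipelines prune iteratively and fine-tune afterwards):

```python
def magnitude_prune(weights, sparsity=0.5):
    """Unstructured magnitude pruning: zero out the fraction of
    weights (given by `sparsity`) with the smallest magnitudes."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else 0.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.9, -0.05, 0.4, -0.02, 0.7, 0.1]
pruned = magnitude_prune(weights, sparsity=0.5)
```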