Results for "transition function"
Empirical risk minimization: Minimizing average loss on training data; can overfit when data is limited or biased.
Regularization: Techniques that discourage overly complex solutions to improve generalization (i.e., reduce overfitting).
Overfitting: When a model fits noise or idiosyncrasies of the training data and performs poorly on unseen data.
Mean squared error (MSE): Average of squared residuals; a common regression objective.
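The mean-squared-error objective above can be sketched in a few lines (function and variable names are illustrative):

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return sum(r * r for r in residuals) / len(residuals)

# Residuals are (3-2)=1 and (5-7)=-2, so MSE = (1 + 4) / 2 = 2.5
print(mse([3, 5], [2, 7]))  # 2.5
```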
Stochastic gradient descent (SGD): A gradient method that uses random minibatches for efficient training on large datasets.
Momentum: Uses an exponential moving average of gradients to speed convergence and reduce oscillation.
Learning rate: Controls the size of parameter updates; too high and training diverges, too low and it trains slowly or gets stuck.
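The three ideas above (minibatch gradients, a momentum buffer, and a learning rate) can be sketched together in a toy one-parameter regression; the dataset, coefficients, and step count are made up for illustration:

```python
import random

# Fit y = w*x to noiseless data y = 2x with minibatch SGD plus momentum.
random.seed(0)
data = [(float(x), 2.0 * x) for x in range(1, 21)]

w, velocity = 0.0, 0.0
lr, beta, batch_size = 0.001, 0.9, 5  # learning rate, momentum coefficient

for step in range(200):
    batch = random.sample(data, batch_size)
    # d/dw of the mean of (w*x - y)^2 over the batch.
    grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    velocity = beta * velocity + grad  # exponential moving average of gradients
    w -= lr * velocity                 # step size controlled by the learning rate

print(w)  # converges toward the true slope, 2.0
```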
Epoch: One complete pass over the training dataset during training.
Vanishing gradients: Gradients shrink as they propagate backward through layers, slowing learning in early layers; mitigated by ReLU, residual connections, and normalization.
Convolutional neural networks (CNNs): Networks using convolution operations with weight sharing and locality, effective for images and signals.
System prompt: A high-priority instruction layer setting overarching behavior constraints for a chat model.
Few-shot prompting: Achieving task performance by providing a small number of examples inside the prompt, without weight updates.
Tool use (function calling): Letting an LLM call external functions or APIs to fetch data, compute, or take actions, improving reliability.
Retrieval-augmented generation (RAG): Architecture that retrieves relevant documents (e.g., from a vector database) and conditions generation on them to reduce hallucinations.
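The retrieve-then-generate flow above can be sketched with word overlap standing in for a real vector database, leaving generation as an assembled prompt string (documents and scoring are made up):

```python
# Toy retrieval-augmented generation: score documents against the query,
# retrieve the best match, and condition the generation prompt on it.
docs = [
    "The capital of France is Paris.",
    "Gradient descent minimizes a loss function iteratively.",
    "Transformers use self-attention over token sequences.",
]

def score(query, doc):
    # Jaccard word overlap as a crude stand-in for embedding similarity.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def retrieve(query, k=1):
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

query = "What is the capital of France?"
context = retrieve(query)[0]
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
print(prompt)
```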
Supervised fine-tuning (SFT): Fine-tuning on (prompt, response) pairs to align a model with instruction-following behavior.
Direct preference optimization (DPO): A preference-based training method that optimizes policies directly from pairwise comparisons, without explicit RL loops.
Reward model: A model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Local surrogate explanations (e.g., LIME): Methods that approximate a model's behavior near a specific input with a simpler, interpretable model.
Pruning: Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
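Unstructured magnitude pruning, the simplest variant, can be sketched as follows (the weight vector and sparsity level are illustrative):

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices of the k smallest-|w| entries.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

w = [0.8, -0.05, 0.3, -0.9, 0.01, 0.2]
print(magnitude_prune(w, 0.5))  # the three smallest magnitudes become 0.0
```

Structured pruning would instead remove whole rows, channels, or neurons so that the remaining tensor shapes stay dense.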
Sampling strategies: Stochastic generation strategies that trade determinism for diversity; key knobs include temperature and nucleus (top-p) sampling.
Temperature: Divides logits before sampling; higher values increase randomness and diversity, lower values increase determinism.
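The two knobs can be sketched together; this is a minimal illustration over a raw logit vector, not a production decoder:

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    """Temperature-scaled nucleus (top-p) sampling over a logit vector."""
    scaled = [l / temperature for l in logits]  # higher T flattens, lower T sharpens
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]    # numerically stable softmax
    s = sum(exps)
    probs = [e / s for e in exps]
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return random.choices(kept, weights=[probs[i] for i in kept])[0]
```

With a very low temperature or a small `top_p`, this collapses toward greedy decoding; with temperature above 1 and `top_p=1.0`, it samples broadly from the full distribution.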
Cross-entropy: Measures the divergence between the true and predicted probability distributions.
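For discrete distributions this is a one-liner; the `eps` guard against log(0) is a common practical tweak, not part of the definition:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); penalizes confident wrong predictions."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

# With a one-hot target, the loss reduces to -log of the probability
# assigned to the true class: here -log(0.8).
print(cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))
```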
Backdoor attack: Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
Non-convex optimization: Optimization over a landscape with multiple local minima and saddle points; typical of neural networks.
Planning: Methods for breaking goals into steps; can be classical (A*, STRIPS) or LLM-driven with tool calls.
Saddle point: A point where the gradient is zero but which is neither a maximum nor a minimum; common in deep networks.
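The classic example is f(x, y) = x² − y²: the gradient vanishes at the origin, yet moving along x increases f while moving along y decreases it:

```python
# f(x, y) = x^2 - y^2 has zero gradient at the origin, but the origin is
# neither a max nor a min of f.
def f(x, y):
    return x * x - y * y

def grad(x, y):
    return (2 * x, -2 * y)

print(grad(0.0, 0.0))            # gradient vanishes at the origin
print(f(0.1, 0.0), f(0.0, 0.1))  # positive along x, negative along y
```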
Learning rate scheduling: Adjusting the learning rate over the course of training to improve convergence.
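One common schedule is linear warmup followed by cosine decay; the constants below are arbitrary illustrations:

```python
import math

def cosine_schedule(step, total_steps, lr_max=0.1, lr_min=0.001, warmup=10):
    """Linear warmup to lr_max, then cosine decay toward lr_min."""
    if step < warmup:
        return lr_max * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Rises during warmup, peaks at lr_max, then decays smoothly toward lr_min.
lrs = [cosine_schedule(s, 100) for s in range(100)]
```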
Second-order optimization: Uses curvature (Hessian) information; often expensive at scale.
Highway networks: An early architecture using learned gates for skip connections.
Mixture of experts (MoE): Routes inputs to subsets of parameters for scalable capacity.
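A top-1 routing sketch makes the idea concrete; the gate and experts here are toy stand-ins for learned networks:

```python
# Top-1 mixture-of-experts routing: a gate scores experts per input and only
# the winner runs, so capacity scales without evaluating every parameter.
experts = [lambda x: 2 * x, lambda x: x + 10, lambda x: -x]
gate_w = [0.5, -0.3, 0.1]  # one score per expert from a (fake) linear gate

def moe(x):
    scores = [w * x for w in gate_w]
    best = max(range(len(experts)), key=lambda i: scores[i])  # top-1 routing
    return experts[best](x)

print(moe(4.0))   # gate picks expert 0 -> 2 * 4.0 = 8.0
print(moe(-4.0))  # gate picks expert 1 -> -4.0 + 10 = 6.0
```

Real MoE layers typically route per token, keep the top-k experts, and add a load-balancing loss so no expert is starved.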