Results for "function calls"
Regularization: Techniques that discourage overly complex solutions to improve generalization (i.e., reduce overfitting).
Overfitting: When a model fits noise and idiosyncrasies of the training data and performs poorly on unseen data.
Mean squared error (MSE): Average of squared residuals; a common regression objective.
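As a quick illustration of the "average of squared residuals" definition, a minimal plain-Python sketch with toy values:

```python
# Mean squared error: the average of squared residuals.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

error = mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])  # (0.0 + 0.25 + 1.0) / 3
```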
Momentum: Uses an exponential moving average of gradients to speed convergence and reduce oscillation.
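The momentum-style update can be sketched on a toy quadratic; hyperparameters and the EMA form below are illustrative, not a specific library's implementation:

```python
# Minimal momentum sketch: the velocity v is an exponential moving average
# of past gradients, which damps oscillation between successive steps.
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad  # EMA of gradients
    return w - lr * v, v

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, 2 * w)
# w is now close to the minimum at 0
```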
Stochastic gradient descent (SGD): A gradient method using random minibatches for efficient training on large datasets.
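The minibatch idea in one self-contained toy example (1-D linear model, synthetic data; all values illustrative):

```python
import random

# SGD on a toy model y = a*x with true slope 3: each step computes the
# gradient on a random minibatch instead of the full dataset.
random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 101)]
a, lr = 0.0, 1e-4
for _ in range(2000):
    batch = random.sample(data, 8)  # random minibatch of 8 points
    grad = sum(2 * (a * x - y) * x for x, y in batch) / len(batch)
    a -= lr * grad
# a has converged close to the true slope 3.0
```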
Learning rate: Controls the size of parameter updates; too high causes divergence, too low trains slowly or gets stuck.
Epoch: One complete traversal of the training dataset during training.
Convolutional neural networks (CNNs): Networks using convolution operations with weight sharing and locality, effective for images and signals.
Vanishing gradients: Gradients shrink as they propagate back through layers, slowing learning in early layers; mitigated by ReLU, residual connections, and normalization.
System prompt: A high-priority instruction layer setting overarching behavior constraints for a chat model.
Few-shot prompting: Achieving task performance by providing a small number of examples inside the prompt, without weight updates.
Function calling (tool use): Letting an LLM call external functions/APIs to fetch data, compute, or take actions, improving reliability.
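A minimal host-side dispatch loop sketch; the tool name, registry, and JSON shape are hypothetical stand-ins, not any particular vendor's API:

```python
import json

# Tool-use sketch: the model emits a JSON "function call", the host looks
# up and executes the function, and the result would be fed back to the model.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},  # stubbed API
}

def handle_model_output(raw):
    """Dispatch a model-emitted call like {"name": ..., "arguments": {...}}."""
    call = json.loads(raw)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

result = handle_model_output('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

A production loop would validate the call against a schema and append the result as a new message before re-querying the model.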
Retrieval-augmented generation (RAG): Architecture that retrieves relevant documents (e.g., from a vector DB) and conditions generation on them to reduce hallucinations.
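The retrieve-then-condition flow can be sketched with toy 2-D "embeddings" (illustrative vectors and documents; a real system would embed with a model and call an LLM at the end):

```python
import math

# RAG sketch: rank documents by cosine similarity to the query vector,
# then build a prompt that conditions generation on the retrieved text.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

docs = {
    "doc1": ([1.0, 0.0], "Paris is the capital of France."),
    "doc2": ([0.0, 1.0], "The mitochondrion produces ATP."),
}

def retrieve(query_vec, k=1):
    ranked = sorted(docs.values(), key=lambda d: cosine(query_vec, d[0]), reverse=True)
    return [text for _, text in ranked[:k]]

context = retrieve([0.9, 0.1])  # query vector points toward doc1
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: What is the capital of France?"
```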
Instruction tuning: Fine-tuning on (prompt, response) pairs to align a model with instruction-following behavior.
Direct preference optimization (DPO): A preference-based training method optimizing policies directly from pairwise comparisons without explicit RL loops.
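The pairwise objective can be written down numerically: the loss penalizes the policy when its log-prob margin for the preferred response (relative to a reference model) is small or negative. The log-probabilities below are made-up toy values:

```python
import math

# DPO-style loss sketch: -log sigmoid(beta * margin), where the margin is
# the policy's (chosen - rejected) log-prob gap minus the reference model's.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

low = dpo_loss(-1.0, -5.0, -2.0, -3.0)   # policy prefers the chosen response
high = dpo_loss(-5.0, -1.0, -2.0, -3.0)  # policy prefers the rejected one
```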
Reward model: Model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
LIME: Local surrogate explanation method approximating model behavior near a specific input.
Sampling (decoding strategies): Stochastic generation strategies that trade determinism for diversity; key knobs include temperature and nucleus (top-p) sampling.
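Nucleus (top-p) sampling in a few lines, on a made-up token distribution:

```python
import random

# Top-p sketch: keep the smallest set of highest-probability tokens whose
# cumulative mass reaches p, then sample only from that renormalized set.
def top_p_sample(probs, p=0.9, rng=random):
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        total += pr
        if total >= p:
            break
    toks, weights = zip(*kept)
    return rng.choices(toks, weights=weights)[0]

probs = {"the": 0.5, "a": 0.3, "zebra": 0.15, "qpzm": 0.05}
tok = top_p_sample(probs, p=0.8)  # tail tokens "zebra"/"qpzm" are excluded
```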
Pruning: Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
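A toy sketch of the unstructured variant (magnitude pruning); structured pruning would instead drop whole neurons, channels, or rows:

```python
# Unstructured magnitude pruning sketch: zero out the smallest-magnitude
# fraction of weights, keeping the rest untouched.
def magnitude_prune(weights, sparsity=0.5):
    k = int(len(weights) * sparsity)  # number of weights to zero
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else None
    return [0.0 if k and abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.9, -0.05, 0.4, 0.01], sparsity=0.5)
```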
Temperature: Scales logits before sampling; higher values increase randomness/diversity, lower values increase determinism.
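The scaling itself is one division before the softmax; toy logits below show the sharpening/flattening effect:

```python
import math

# Temperature sketch: divide logits by T before softmax. T < 1 sharpens
# the distribution (more deterministic); T > 1 flattens it (more random).
def softmax_with_temperature(logits, t=1.0):
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

sharp = softmax_with_temperature([2.0, 1.0, 0.0], t=0.5)
flat = softmax_with_temperature([2.0, 1.0, 0.0], t=2.0)
# the top token gets more mass at low temperature than at high temperature
```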
Backdoor attack: Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
Cross-entropy: Measures divergence between the true and predicted probability distributions.
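The cross-entropy formula in a toy classification setting; with a one-hot label it reduces to the negative log-probability of the true class:

```python
import math

# Cross-entropy sketch: H(p, q) = -sum_i p_i * log(q_i). A small eps
# guards against log(0) for zero predicted probabilities.
def cross_entropy(p_true, q_pred, eps=1e-12):
    return -sum(p * math.log(q + eps) for p, q in zip(p_true, q_pred))

confident = cross_entropy([0, 0, 1], [0.05, 0.05, 0.9])  # ~= -log 0.9
unsure = cross_entropy([0, 0, 1], [0.4, 0.3, 0.3])       # ~= -log 0.3
```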
Non-convex optimization: Optimization with multiple local minima and saddle points; typical of neural networks.
Saddle point: A point where the gradient is zero but that is neither a maximum nor a minimum; common in deep networks.
Learning rate scheduling: Adjusting the learning rate over training to improve convergence.
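One common schedule, sketched with illustrative hyperparameters (cosine decay is just one choice among step decay, warmup, etc.):

```python
import math

# Cosine learning-rate schedule sketch: decay smoothly from lr_max at
# step 0 to lr_min at total_steps.
def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.001):
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

start, mid, end = cosine_lr(0, 100), cosine_lr(50, 100), cosine_lr(100, 100)
```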
Second-order optimization: Optimization using curvature information; often expensive at scale.
Highway networks: Early architecture using learned gates for skip connections.
Mixture of experts (MoE): Routes inputs to subsets of parameters for scalable capacity.
Router (gating network): Chooses which experts process each token.
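The routing step for a mixture of experts can be sketched as top-k gating (toy gate logits; a real router is a learned layer and adds load-balancing terms):

```python
import math

# Top-k routing sketch: softmax the gate logits, pick the k best-scoring
# experts for this token, and renormalize their weights to sum to 1.
def top_k_route(gate_logits, k=2):
    z = sum(math.exp(g) for g in gate_logits)
    probs = [math.exp(g) / z for g in gate_logits]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]  # (expert id, mixing weight)

routes = top_k_route([2.0, -1.0, 0.5, 1.0], k=2)  # experts 0 and 3 win
```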
Scaling laws: Empirical laws linking model size, data, and compute to performance.