Results for "function approximator"
Stochastic gradient descent (SGD): A gradient-based optimizer that updates parameters from random minibatches, enabling efficient training on large datasets.
Learning rate: Controls the size of parameter updates; too high causes divergence, too low trains slowly or gets stuck in poor regions.
Epoch: One complete pass over the training dataset.
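The three entries above (minibatch updates, learning rate, epochs) fit together in one training loop. Below is a minimal sketch on a toy linear-regression problem; all data, names, and hyperparameter values are illustrative.

```python
import numpy as np

# Toy data: y = 3x + 1 plus noise (targets chosen for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0
learning_rate = 0.1   # step size: too high diverges, too low is slow
batch_size = 20

for epoch in range(10):          # one epoch = one full pass over the data
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb
        # Gradients of the mean squared error over this minibatch
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # should approach the true 3.0 and 1.0
```

Each epoch shuffles the data and takes one gradient step per minibatch, so the parameter trajectory is noisy but cheap per update.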
Convolutional neural network (CNN): Uses convolution operations with weight sharing and locality; effective for images and other grid-like signals.
Vanishing gradients: Gradients shrink as they propagate backward through layers, slowing learning in early layers; mitigated by ReLU activations, residual connections, and normalization.
System prompt: A high-priority instruction layer that sets overarching behavior constraints for a chat model.
Few-shot prompting: Achieving task performance by providing a small number of examples inside the prompt, without any weight updates.
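Few-shot prompting amounts to string construction: worked examples precede the query inside the prompt itself. A hypothetical sentiment-classification prompt (task, reviews, and labels are all made up, not tied to any particular API):

```python
# Build a few-shot prompt: the examples teach the task in-context,
# and the model completes the final "Sentiment:" line.
examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]
query = "A surprisingly warm and funny film."

lines = ["Classify the sentiment of each review as positive or negative.", ""]
for text, label in examples:
    lines.append(f"Review: {text}")
    lines.append(f"Sentiment: {label}")
    lines.append("")
lines.append(f"Review: {query}")
lines.append("Sentiment:")
prompt = "\n".join(lines)
print(prompt)
```

No gradient step occurs; the model's behavior is steered entirely by the prompt content.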
Tool use (function calling): Letting an LLM call external functions or APIs to fetch data, compute, or take actions, improving reliability.
Retrieval-augmented generation (RAG): An architecture that retrieves relevant documents (e.g., from a vector database) and conditions generation on them to reduce hallucinations.
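The retrieval half of a RAG pipeline can be sketched with cosine similarity over embeddings. The 4-d vectors and document names below are made-up stand-ins; a real system would use a trained embedding model and a vector database.

```python
import numpy as np

# Toy "embedding" store: document name -> vector (values are arbitrary).
docs = {
    "doc_a": np.array([0.9, 0.1, 0.0, 0.0]),
    "doc_b": np.array([0.0, 0.8, 0.2, 0.0]),
    "doc_c": np.array([0.1, 0.0, 0.0, 0.9]),
}
query_vec = np.array([1.0, 0.0, 0.0, 0.1])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank documents by similarity to the query and keep the best match.
ranked = sorted(docs, key=lambda name: cosine(docs[name], query_vec), reverse=True)
top_doc = ranked[0]
# The generator is then conditioned on the retrieved text, e.g.
# prompt = f"Answer using this context: {retrieved_text}\n\nQuestion: ..."
print(top_doc)
```

Grounding generation in the retrieved passages is what reduces hallucination: the model quotes the context instead of relying only on parametric memory.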
Instruction tuning (supervised fine-tuning): Fine-tuning on (prompt, response) pairs to align a model with instruction-following behavior.
Direct preference optimization (DPO): A preference-based training method that optimizes the policy directly from pairwise comparisons, without an explicit RL loop.
Reward model: A model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
LIME: A local surrogate explanation method that approximates model behavior near a specific input.
Sampling (stochastic decoding): Generation strategies that trade determinism for diversity; key knobs include temperature and nucleus (top-p) sampling.
Pruning: Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
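Unstructured magnitude pruning is the simplest variant of the idea above: zero out the individual weights with the smallest absolute value. A sketch with a random matrix standing in for a learned layer (the 50% sparsity target is an arbitrary choice):

```python
import numpy as np

# Random weights as a stand-in for a trained layer.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

sparsity = 0.5
threshold = np.quantile(np.abs(W), sparsity)  # magnitude cutoff
mask = np.abs(W) >= threshold                 # keep only the larger weights
W_pruned = W * mask

print(float((W_pruned == 0).mean()))  # roughly half the entries removed
```

Structured pruning would instead remove whole rows, channels, or neurons, which is less flexible but maps directly onto faster dense hardware kernels.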
Temperature: Scales logits before sampling; higher values increase randomness and diversity, lower values increase determinism.
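Temperature and nucleus (top-p) filtering from the two sampling entries above can be shown on toy logits. The numbers are illustrative; real decoders apply this at every generation step before drawing a token.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0])  # arbitrary example scores

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sample_probs(logits, temperature=1.0, top_p=1.0):
    probs = softmax(logits / temperature)  # higher T flattens, lower T sharpens
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    # Smallest set of tokens whose cumulative probability covers top_p
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

sharp = sample_probs(logits, temperature=0.2)  # near-greedy
flat = sample_probs(logits, temperature=5.0)   # near-uniform
print(sharp.max(), flat.max())                 # same mode, very different spread
```

Low temperature concentrates nearly all mass on the top token; high temperature spreads it out, and top-p then discards the long low-probability tail.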
Backdoor attack: Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
Cross-entropy: Measures the divergence between the true and predicted probability distributions.
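For a one-hot true distribution, cross-entropy reduces to the negative log probability the model assigns to the correct class, so worse predictions score higher. A minimal sketch with made-up distributions:

```python
import numpy as np

p_true = np.array([0.0, 1.0, 0.0])   # one-hot: true class is index 1
q_good = np.array([0.1, 0.8, 0.1])   # confident, correct prediction
q_bad = np.array([0.6, 0.2, 0.2])    # mostly wrong prediction

def cross_entropy(p, q, eps=1e-12):
    # eps guards against log(0) for zero-probability predictions
    return float(-np.sum(p * np.log(q + eps)))

# The wrong prediction is penalized much more heavily:
print(cross_entropy(p_true, q_good), cross_entropy(p_true, q_bad))
```

Here the good prediction costs -log(0.8) ≈ 0.22 while the bad one costs -log(0.2) ≈ 1.61, which is what drives classifiers toward confident, correct outputs.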
Planning: Methods for breaking goals into steps; can be classical (A*, STRIPS) or LLM-driven with tool calls.
Non-convex optimization: Optimization over landscapes with multiple local minima and saddle points; typical of neural networks.
Saddle point: A point where the gradient is zero but which is neither a maximum nor a minimum; common in deep networks.
Learning rate schedule: Adjusting the learning rate over the course of training to improve convergence.
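A common concrete schedule is cosine decay: the rate starts at a base value and falls smoothly to zero over a fixed horizon. A sketch (function name and hyperparameters are illustrative):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1):
    # Decays from base_lr at step 0 to 0 at step total_steps.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

schedule = [cosine_lr(s, total_steps=100) for s in range(101)]
print(round(schedule[0], 3), round(schedule[50], 3), round(schedule[100], 3))
# 0.1 0.05 0.0
```

Large early steps explore quickly; the shrinking tail lets the parameters settle into a minimum instead of bouncing around it.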
Second-order methods: Optimization using curvature information; often prohibitively expensive at scale.
Highway network: An early architecture using learned gates for skip connections.
Mixture of experts (MoE): Routes inputs to subsets of parameters for scalable capacity.
Scaling laws: Empirical relationships linking model size, data, and compute to performance.
Router (gating network): Chooses which experts process each token.
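The router and mixture-of-experts entries above combine into top-k routing: a learned gate scores every expert per token, and only the k highest-scoring experts run. A sketch where random matrices stand in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, k = 4, 8, 2
gate_W = rng.normal(size=(d_model, num_experts))  # stand-in for learned gate
token = rng.normal(size=d_model)                  # one token's hidden state

scores = token @ gate_W                    # one gating score per expert
top_k = np.argsort(scores)[::-1][:k]       # pick the k best experts
weights = np.exp(scores[top_k] - scores[top_k].max())
weights /= weights.sum()                   # softmax over only the chosen experts

# Only these k experts would process the token; their outputs are
# combined using these weights, so compute stays O(k), not O(num_experts).
print(sorted(top_k.tolist()), float(weights.sum()))
```

This is why MoE models can grow total parameter count far faster than per-token compute: capacity scales with the number of experts, cost with k.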
Bellman equation: The fundamental recursive relationship defining optimal value functions.
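The Bellman optimality backup V(s) ← max_a [R(s,a) + γ Σ_s' P(s'|s,a) V(s')] can be iterated to a fixed point (value iteration). A sketch on a tiny 2-state, 2-action MDP whose transition and reward numbers are made up:

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = transition probability; R[s, a] = immediate reward.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.5, 0.5], [0.3, 0.7]],   # action 1
])
R = np.array([
    [1.0, 0.0],   # rewards in state 0 under actions 0, 1
    [0.0, 2.0],   # rewards in state 1 under actions 0, 1
])

V = np.zeros(2)
for _ in range(200):
    # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)          # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(np.round(V, 2))  # converged optimal state values
```

Because the backup is a γ-contraction, repeated application converges to the unique optimal value function regardless of initialization.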
Watermarking: Embedding signals in a model or its outputs to prove ownership.
Fingerprinting: Detecting unauthorized model outputs or data leaks.