Representation learning: Automatically learning useful internal features (latent variables) that capture salient structure for downstream tasks.
Hyperparameters: Configuration choices, typically not learned from data, that govern training or model architecture.
Empirical risk minimization: Minimizing average loss on the training data; can overfit when data is limited or biased.
Regularization: Techniques that discourage overly complex solutions to improve generalization (i.e., reduce overfitting).
Overfitting: When a model fits noise or idiosyncrasies of the training data and performs poorly on unseen data.
Mean squared error (MSE): Average of squared residuals; a common regression objective.
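A minimal sketch of the computation (toy numbers, not a library implementation):

```python
def mse(y_true, y_pred):
    # Mean of the squared residuals (y_i - yhat_i)^2.
    residuals = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(residuals) / len(residuals)

# Squared errors are 0, 0, 4 -> mean 4/3.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```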
Stochastic gradient descent (SGD): A gradient method that uses random minibatches for efficient training on large datasets.
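A toy minibatch SGD loop fitting a single scalar weight on noise-free data y = 2x; the data, learning rate, and batch size are illustrative, not recommendations:

```python
import random

# Minibatch SGD for a one-parameter linear model yhat = w * x.
random.seed(0)
data = [(i / 10, 2 * (i / 10)) for i in range(1, 11)]  # true slope is 2.0
w, lr, batch_size = 0.0, 0.1, 4
for _ in range(200):
    batch = random.sample(data, batch_size)  # random minibatch, not the full set
    # Gradient of mean squared error w.r.t. w, averaged over the batch.
    grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= lr * grad  # descend along the (noisy) gradient estimate
# w has converged close to the true slope 2.0
```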
Momentum: Uses an exponential moving average of gradients to speed convergence and reduce oscillation.
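The update rule can be sketched on a one-dimensional quadratic; the learning rate and decay constant here are illustrative:

```python
# Gradient descent with momentum on f(w) = w^2, whose gradient is 2w.
w, v = 5.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(200):
    grad = 2 * w
    v = beta * v + (1 - beta) * grad  # exponential moving average of gradients
    w -= lr * v                       # step along the smoothed direction
# w has decayed toward the minimum at 0
```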
Learning rate: Controls the size of parameter updates; too high diverges, too low trains slowly or gets stuck.
Epoch: One complete traversal of the training dataset during training.
Vanishing gradients: Gradients shrink as they propagate back through layers, slowing learning in early layers; mitigated by ReLU, residual connections, and normalization.
Convolutional neural networks (CNNs): Networks using convolution operations with weight sharing and locality; effective for images and signals.
System prompt: A high-priority instruction layer setting overarching behavior constraints for a chat model.
Few-shot prompting (in-context learning): Achieving task performance by providing a small number of examples inside the prompt, without weight updates.
Tool use (function calling): Letting an LLM call external functions or APIs to fetch data, compute, or take actions, improving reliability.
Retrieval-augmented generation (RAG): Architecture that retrieves relevant documents (e.g., from a vector DB) and conditions generation on them to reduce hallucinations.
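The retrieval step can be sketched with cosine similarity over embeddings; the documents and embedding vectors below are hand-made stand-ins for a real encoder and vector DB:

```python
import math

# Toy RAG retrieval: rank documents by cosine similarity to the query
# embedding, then prepend the best match to the prompt. The 3-d vectors
# are hypothetical placeholders for real embeddings.
docs = {
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "Mitochondria produce most of a cell's ATP.": [0.0, 0.2, 0.9],
}
query_emb = [0.8, 0.2, 0.1]  # embedding of the user's question (assumed)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

best_doc = max(docs, key=lambda d: cosine(docs[d], query_emb))
prompt = f"Context: {best_doc}\nQuestion: What is the capital of France?"
```

The generation model then conditions on `prompt`, grounding its answer in the retrieved context rather than parametric memory alone.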
Supervised fine-tuning (instruction tuning): Fine-tuning on (prompt, response) pairs to align a model with instruction-following behavior.
Direct preference optimization (DPO): A preference-based training method that optimizes the policy directly from pairwise comparisons, without an explicit RL loop.
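The per-pair loss can be sketched from sequence log-probabilities; the log-prob values and beta below are illustrative, not trained quantities:

```python
import math

# DPO loss for one preference pair: -log sigmoid(beta * margin), where the
# margin compares policy-vs-reference log-ratios of the chosen and rejected
# responses. All log-prob arguments here are hypothetical numbers.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Raising the chosen response's log-prob relative to the reference lowers the loss.
```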
Reward model: A model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
LIME: A local surrogate explanation method that approximates model behavior near a specific input.
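A stripped-down, one-dimensional sketch of the idea: perturb around the input of interest, query the black box, and fit a local linear surrogate. Real LIME also weights samples by proximity; uniform weights are used here for brevity:

```python
import random

def black_box(x):
    # Stand-in for an opaque model; here simply f(x) = x^2.
    return x * x

random.seed(0)
x0 = 3.0  # the input we want to explain
xs = [x0 + random.gauss(0, 0.1) for _ in range(200)]  # local perturbations
ys = [black_box(x) for x in xs]                        # black-box queries

# Ordinary least squares for the surrogate's slope.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
# Near x0 = 3 the surrogate slope approximates the local derivative f'(3) = 6.
```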
Pruning: Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
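Unstructured magnitude pruning, the simplest variant, can be sketched as zeroing the smallest-magnitude fraction of weights (flat Python list standing in for a weight tensor):

```python
# Unstructured magnitude pruning: zero out the smallest-|w| fraction of weights.
def prune(weights, sparsity=0.5):
    k = int(len(weights) * sparsity)  # how many weights to remove
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else 0.0
    return [0.0 if k and abs(w) <= threshold else w for w in weights]

pruned = prune([0.01, -0.5, 0.03, 0.9, -0.02, 0.4], sparsity=0.5)
# The three smallest-magnitude weights are zeroed; the rest survive.
```

Structured pruning instead removes whole neurons, channels, or heads so the resulting model is dense and smaller, at some cost in flexibility.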
Sampling (stochastic decoding): Generation strategies that trade determinism for diversity; key knobs include temperature and nucleus (top-p) sampling.
Temperature: Scales logits before sampling; higher values increase randomness/diversity, lower values increase determinism.
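Both knobs can be sketched over a toy three-token vocabulary; the logits and cutoffs are illustrative:

```python
import math

# Temperature scaling: divide logits by T before the softmax.
def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Nucleus (top-p) filtering: keep the smallest set of tokens whose
# cumulative probability reaches p, then renormalize.
def top_p_filter(probs, p=0.9):
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= p:
            break
    filtered = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    z = sum(filtered)
    return [f / z for f in filtered]

probs_cold = softmax([2.0, 1.0, 0.1], temperature=0.5)  # sharper distribution
probs_hot = softmax([2.0, 1.0, 0.1], temperature=2.0)   # flatter distribution
```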
Cross-entropy: Measures the mismatch between true and predicted probability distributions; a standard classification loss.
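For discrete distributions it is H(p, q) = -sum_i p_i log q_i; with a one-hot target it reduces to the negative log-likelihood of the true class. A small sketch (the epsilon guard against log(0) is a common convenience, not part of the definition):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_i p_i * log(q_i); eps avoids log(0) on zero predictions.
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

# One-hot target: confident-correct prediction scores lower than a wrong one.
h_good = cross_entropy([1.0, 0.0, 0.0], [0.9, 0.05, 0.05])
h_bad = cross_entropy([1.0, 0.0, 0.0], [0.2, 0.4, 0.4])
```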
Backdoor attack: Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
Non-convex optimization: Optimization over objectives with multiple local minima and saddle points; typical of neural networks.
Planning: Methods for breaking goals into steps; can be classical (A*, STRIPS) or LLM-driven with tool calls.
Saddle point: A point where the gradient is zero but that is neither a maximum nor a minimum; common in deep networks.
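The textbook example is f(x, y) = x^2 - y^2: at the origin the gradient vanishes, yet the surface curves up along x and down along y:

```python
# f(x, y) = x^2 - y^2 has a saddle at the origin.
def grad(x, y):
    # Partial derivatives: df/dx = 2x, df/dy = -2y.
    return (2 * x, -2 * y)

f = lambda x, y: x * x - y * y
gx, gy = grad(0.0, 0.0)           # gradient is (0, 0) at the origin
up = f(0.1, 0.0) > f(0.0, 0.0)    # f increases along the x-axis
down = f(0.0, 0.1) < f(0.0, 0.0)  # f decreases along the y-axis
```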
Learning rate scheduling: Adjusting the learning rate over the course of training to improve convergence.
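Two common schedules can be sketched as pure functions of the epoch; all constants (base rate, drop factor, interval) are illustrative:

```python
import math

# Step decay: multiply the rate by `drop` every `every` epochs.
def step_decay(base_lr, epoch, drop=0.5, every=10):
    return base_lr * (drop ** (epoch // every))

# Cosine annealing: smoothly decay from base_lr to 0 over total_epochs.
def cosine_anneal(base_lr, epoch, total_epochs):
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
```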
Second-order optimization: Optimization using curvature (Hessian) information; often expensive at scale.