Results for "per-parameter"
Adam: Popular optimizer combining momentum with per-parameter adaptive step sizes derived from first- and second-moment estimates of the gradient.
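A minimal single-parameter sketch of this optimizer's update rule (this is Adam; the hyperparameter defaults shown are the commonly cited ones, and the scalar formulation is illustrative):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m and v are the running first/second moment estimates; t is the
    1-indexed step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter step size
    return param, m, v
```

Applied per coordinate, the division by `sqrt(v_hat)` is what makes the effective step size adapt to each parameter's gradient scale.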
Throughput: How many requests or tokens can be processed per unit of time; affects scalability and cost.
Parameter sharing: Using the same parameters across different parts of a model, e.g. convolutional filters reused across spatial positions.
Adaptive learning rates: Methods, such as Adam, that adjust each parameter's learning rate dynamically during training.
Fisher information: Measures how much information an observable random variable carries about unknown parameters.
MAP estimation: Bayesian parameter estimation that uses the mode of the posterior distribution.
Likelihood: Probability of the data given the parameters.
Prior: Belief about the parameters before observing data.
Segmentation: Assigning labels per pixel (semantic segmentation) or per object instance (instance segmentation) to delineate object boundaries.
Depth vs. width: Tradeoffs between stacking many layers and using many neurons per layer.
Inference cost: Cost of running models in production.
Peak throughput: Maximum rate at which a system can process work.
Momentum: Uses an exponential moving average of past gradients to speed convergence and reduce oscillation.
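A minimal sketch of such a momentum update, assuming a heavy-ball formulation; the hyperparameters `lr` and `beta` are illustrative:

```python
def momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    """Heavy-ball momentum: velocity accumulates an exponentially
    decayed sum of past gradients, damping oscillation across the
    loss surface while accelerating along consistent directions."""
    velocity = beta * velocity + grad
    param -= lr * velocity
    return param, velocity

# Usage: minimize f(x) = x**2, whose gradient is 2*x.
x, v = 5.0, 0.0
for _ in range(300):
    x, v = momentum_step(x, 2 * x, v)
```

On this quadratic, the iterates spiral toward 0 rather than zig-zagging, which is the oscillation-damping effect the entry describes.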
Learning rate: Controls the size of parameter updates; too high causes divergence, too low trains slowly or gets stuck.
Parameter-efficient fine-tuning (PEFT): Techniques that fine-tune small additional components rather than all weights, reducing compute and storage.
Loss landscape: The shape of the loss function over parameter space.
Maximum likelihood estimation (MLE): Estimating parameters by maximizing the likelihood of the observed data.
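A minimal worked example, assuming a univariate Gaussian model, for which the likelihood-maximizing parameters have a closed form:

```python
def gaussian_mle(xs):
    """Closed-form MLE for a univariate Gaussian: the sample mean and
    the (biased, divide-by-n) sample variance jointly maximize the
    log-likelihood of the observed data."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var
```

For models without a closed form, the same principle is applied numerically by maximizing the log-likelihood with an optimizer.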
Sharp minimum: A narrow minimum, often associated with poorer generalization.
Flat minimum: A wide basin, often correlated with better generalization.
Posterior: Updated belief about the parameters after observing data.
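A minimal sketch of this prior-to-posterior update, assuming a Beta prior on a coin's heads probability with Binomial data (a conjugate pair, so updating reduces to accumulating counts):

```python
def beta_posterior(alpha, beta, heads, tails):
    """Conjugate Bayesian update for a coin's bias: a Beta(alpha, beta)
    prior combined with observed Binomial counts yields a Beta
    posterior whose parameters simply add the counts."""
    return alpha + heads, beta + tails

# Uniform prior Beta(1, 1), then observe 7 heads and 3 tails.
a, b = beta_posterior(1, 1, heads=7, tails=3)
posterior_mean = a / (a + b)  # mean of Beta(a, b) is a / (a + b)
```

The posterior mean (8/12 here) sits between the prior mean (1/2) and the observed frequency (7/10), illustrating how the data update the prior belief.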
Loss surface plot: A visualization of the optimization landscape.
Batch size: Number of samples per gradient update; impacts compute efficiency, generalization, and stability.
Compute: Hardware resources used for training and inference; constrained by memory bandwidth, FLOPs, and parallelism.
Online inference: Low-latency prediction served per request.
Scaling: Increasing model capacity by spending more compute.
Early stopping: Halting training when validation performance stops improving, to reduce overfitting.
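A minimal sketch of this halting rule; the `val_losses` list stands in for a real per-epoch validation loop, and `patience` is an illustrative hyperparameter name:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return the epoch index with the best validation loss, halting
    after `patience` consecutive epochs without a new best."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0  # new best: reset counter
        else:
            bad += 1
            if bad >= patience:
                break  # stop: no improvement for `patience` epochs
    return best_epoch
```

In practice the weights from the best epoch are restored (or checkpointed as training goes) rather than the final ones.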
Stochastic gradient descent (SGD): A gradient method that uses random minibatches for efficient training on large datasets.
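A minimal sketch of such a minibatch loop, with illustrative names (`grad_fn` is assumed to compute the gradient for a single example):

```python
import random

def sgd(data, grad_fn, w, lr=0.01, batch_size=32, epochs=1, seed=0):
    """Minibatch SGD on a scalar weight: shuffle the data each epoch,
    then step on the average gradient of each random minibatch."""
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(data)  # randomize minibatch composition each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            g = sum(grad_fn(w, x) for x in batch) / len(batch)
            w -= lr * g
    return w
```

Each step uses only `batch_size` examples instead of the full dataset, which is what makes the method efficient at scale.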
LoRA: A PEFT method that injects trainable low-rank matrices into layers, enabling efficient fine-tuning.
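A minimal forward-pass sketch of the low-rank injection this describes (LoRA), using plain nested lists for clarity; the helper names and the scaling factor `alpha` are illustrative:

```python
def matmul(A, B):
    """Naive matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(W, A, B, x, alpha=1.0):
    """LoRA-style forward pass: the frozen weight W is augmented by a
    trainable low-rank product B @ A, giving y = (W + alpha * B @ A) x.
    Only A (r x d_in) and B (d_out x r) are trained, with rank r small,
    so the number of trainable parameters is r * (d_in + d_out)."""
    BA = matmul(B, A)
    d_out, d_in = len(W), len(W[0])
    W_eff = [[W[i][j] + alpha * BA[i][j] for j in range(d_in)]
             for i in range(d_out)]
    return [sum(W_eff[i][j] * x[j] for j in range(d_in)) for i in range(d_out)]
```

Because `B @ A` has rank at most `r`, fine-tuning touches far fewer parameters than updating `W` itself, which is the storage and compute saving the entry refers to.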
Bias: Systematic error introduced by simplifying assumptions in a learning algorithm.
Saddle point: A point where the gradient is zero but which is neither a maximum nor a minimum; common in deep networks.