Results for "training mismatch"
Tendency to trust automated suggestions even when incorrect; mitigated by UI design, training, and checks.
A point where the gradient is zero but that is neither a maximum nor a minimum; common in deep networks.
The shape of the loss function over parameter space.
A wide, flat basin in the loss landscape; often correlated with better generalization.
Limiting gradient magnitude to prevent exploding gradients.
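A minimal sketch of norm-based clipping: if the gradient's L2 norm exceeds a threshold, the whole vector is rescaled to that norm. The function name `clip_by_norm` is illustrative, not a library API.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm  # shrink all components by the same factor
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_norm([3.0, 4.0], 1.0)  # norm 5.0 rescaled to norm 1.0
```

Rescaling the whole vector (rather than clipping each component) preserves the gradient's direction.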
Allows gradients to bypass layers, enabling very deep networks.
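The skip connection can be sketched in one line: the block's output is the input plus a learned residual, so the identity path carries gradients even if `f` saturates. `residual_block` and `f` are illustrative names.

```python
def residual_block(x, f):
    """y = x + f(x): the identity path lets signal and gradients bypass f."""
    return x + f(x)

y = residual_block(2.0, lambda x: 0.5 * x)  # identity path plus a learned residual
```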
Prevents attention to future tokens during training/inference.
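A minimal sketch of the mask itself: position i may attend only to positions j ≤ i, which is a lower-triangular boolean matrix. `causal_mask` is an illustrative name.

```python
def causal_mask(n):
    """n x n boolean mask: True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(3)
# row 0 sees only token 0; row 2 sees tokens 0, 1, and 2
```

In practice the False entries are implemented by adding a large negative value to the attention scores before the softmax.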
Capabilities that appear only beyond certain model sizes.
Controls amount of noise added at each diffusion step.
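A minimal sketch of one common choice, a linearly spaced schedule of per-step noise variances; the endpoint values shown are common defaults, and `linear_beta_schedule` is an illustrative name.

```python
def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances for a diffusion process."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1)
            for t in range(steps)]

betas = linear_beta_schedule(1000)  # small noise early, larger noise late
```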
A minimum relative to nearby points, though not necessarily the global minimum.
Two-network setup where generator fools a discriminator.
Flat, high-dimensional regions of the loss surface that slow training.
Ensuring learned behavior matches intended objective.
Maintaining alignment under new conditions.
Applying patterns learned during training to inputs where they no longer hold.
A model repeatedly trained on its own outputs, degrading quality over generations.
A model relying on signals that correlate with labels in the training data but are irrelevant to the task.
Performance drop when moving from simulation to reality.
Differences between training and deployed patient populations.
Learning structure from unlabeled data, such as discovering groups, compressing representations, or modeling data distributions.
A branch of ML using multi-layer neural networks to learn hierarchical representations, often excelling in vision, speech, and language.
Training one model on multiple tasks simultaneously to improve generalization through shared structure.
A structured collection of examples used to train/evaluate models; quality, bias, and coverage often dominate outcomes.
A function measuring prediction error (and sometimes calibration), guiding gradient-based optimization.
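A minimal sketch of one such function, mean squared error, a standard loss for regression; `mse` is an illustrative name.

```python
def mse(preds, targets):
    """Mean squared error: average of squared prediction errors."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

error = mse([2.0, 0.0], [1.0, 0.0])  # -> 0.5
```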
Iterative method that updates parameters in the direction of negative gradient to minimize loss.
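The update rule is x ← x − η·∇f(x). A minimal 1-D sketch, assuming the gradient is supplied as a function; `gradient_descent` is an illustrative name.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a 1-D function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step in the direction of the negative gradient
    return x

# f(x) = (x - 3)^2 has gradient 2*(x - 3) and its minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```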
Popular optimizer combining momentum and per-parameter adaptive step sizes via first/second moment estimates.
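A minimal single-parameter sketch of the Adam update: exponential moving averages of the gradient (first moment) and squared gradient (second moment), bias-corrected, then a per-parameter scaled step. `adam_step` is an illustrative name, not a library API.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = b1 * m + (1 - b1) * grad       # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2  # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)          # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step moves roughly lr in the gradient's direction, regardless of its magnitude
theta, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```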
Nonlinear functions enabling networks to approximate complex mappings; ReLU variants dominate modern DL.
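Minimal sketches of two such functions: plain ReLU and the leaky variant, which keeps a small slope on negative inputs so gradients there are nonzero.

```python
def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small slope alpha for negative inputs."""
    return x if x > 0 else alpha * x
```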
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
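A minimal sketch of scaled dot-product self-attention with identity projections (queries, keys, and values are all the raw input rows, whereas real layers apply learned projection matrices); `self_attention` is an illustrative name.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Each row of X attends over all rows: softmax(q.k / sqrt(d)) weights the values."""
    d = len(X[0])
    out = []
    for q in X:  # each token's query scores every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

out = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Each output row is a convex combination of the input rows, weighted by similarity to the query.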
An RNN variant using gates to mitigate vanishing gradients and capture longer context.
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
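A minimal sketch of one subword strategy, greedy longest-match against a fixed vocabulary (real subword tokenizers such as BPE or WordPiece learn the vocabulary from data); `tokenize` and the sample vocabulary are illustrative.

```python
def tokenize(word, vocab):
    """Greedy longest-match subword split; unknown characters pass through singly."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest remaining piece first
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

tokenize("unhappiness", {"un", "happi", "ness", "happy"})
# -> ["un", "happi", "ness"]
```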