Results for "performance"
Data leakage: when information from evaluation data improperly influences training, inflating reported performance.
Scaling laws: empirical laws linking model size, data, and compute to performance.
Model monitoring: observing model inputs/outputs, latency, cost, and quality over time to catch regressions and drift.
Train/validation/test split: separating data into training (fit), validation (tune), and test (final estimate) sets to avoid leakage and optimism bias.
Cross-validation: a robust evaluation technique that trains and evaluates across multiple splits to estimate performance variability.
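The idea of cross-validation can be sketched in plain Python: split indices into k folds, fit on k-1 folds, score on the held-out fold, and look at the spread of scores. The function names and the toy mean-predictor "model" below are illustrative, not from any particular library.

```python
# Minimal k-fold cross-validation sketch.

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal, disjoint folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def evaluate(xs, ys, k=5):
    """Per-fold MSE of a toy model that predicts the training-target mean."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        mean = sum(ys[i] for i in train) / len(train)
        mse = sum((ys[i] - mean) ** 2 for i in test) / len(test)
        scores.append(mse)
    return scores  # variability across folds estimates performance variance
```

The spread of the returned scores, not just their average, is the point: a model whose fold scores vary wildly is less trustworthy than its mean score suggests.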
Confusion matrix: a table summarizing classification outcomes, foundational for metrics like precision, recall, and specificity.
F1 score: harmonic mean of precision and recall; useful when balancing false positives and false negatives matters.
Precision-recall curve: often more informative than ROC on imbalanced datasets; focuses on positive-class performance.
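The counts behind these three entries can be sketched in plain Python (binary labels, 1 = positive; the helper names are illustrative):

```python
# Confusion-matrix counts and the metrics derived from them.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)  # harmonic mean
    return precision, recall, f1
```

Because F1 is a harmonic mean, it is dragged down by whichever of precision or recall is worse, which is exactly why it is used when both error types matter.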
CI/CD: automated testing and deployment processes for models and data workflows, extending DevOps to ML artifacts.
Data scaling: improving performance by training on more data.
Real-time guarantee: a guaranteed bound on response time.
Robust control: control that remains stable under model uncertainty.
Overhang: stored compute or algorithms enabling rapid capability jumps.
Alignment tax: the tradeoff between safety and performance.
Fine-tuning: updating a pretrained model’s weights on task-specific data to improve performance or adapt style/behavior.
Distribution shift: a mismatch between training and deployment data distributions that can degrade model performance.
Overfitting: when a model fits noise and idiosyncrasies of the training data and performs poorly on unseen data.
Accuracy: fraction of correct predictions; can be misleading on imbalanced datasets.
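A small made-up example of why accuracy misleads under class imbalance:

```python
# A classifier that always predicts the majority class scores high
# accuracy on imbalanced data while catching zero positives.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0] * 95 + [1] * 5   # only 5% positive examples
y_pred = [0] * 100            # always predict the majority class

# accuracy(y_true, y_pred) == 0.95, yet recall on the positive class is 0.
```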
ROC curve: plots true positive rate vs. false positive rate across thresholds; summarizes separability.
AUC: scalar summary of the ROC curve; measures ranking ability, not calibration.
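The ranking interpretation is concrete: AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counting half. A plain-Python sketch (illustrative names, O(n²) pairwise form for clarity):

```python
# AUC as a pairwise ranking statistic.

def auc(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that rescaling the scores monotonically leaves AUC unchanged, which is why it says nothing about calibration.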
Mean squared error (MSE): average of squared residuals; a common regression objective.
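The definition translates directly to code (plain Python, illustrative name):

```python
# MSE: mean of squared residuals between targets and predictions.

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Squaring penalizes large residuals disproportionately, which makes MSE sensitive to outliers.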
Early stopping: halting training when validation performance stops improving, to reduce overfitting.
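A patience-based early-stopping rule can be sketched over a list of per-epoch validation losses (plain Python; the function name and return convention are illustrative):

```python
# Stop when validation loss has not improved for `patience` epochs,
# remembering the best epoch seen so far.

def early_stop(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) given per-epoch validation losses."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch          # halted early
    return best_epoch, len(val_losses) - 1    # ran to completion
```

In practice one restores the weights saved at `best_epoch`, not the weights at the epoch where training halted.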
Model card: standardized documentation describing intended use, performance, limitations, data, and ethical considerations.
MLOps: practices for operationalizing ML, including versioning, CI/CD, monitoring, retraining, and reliable production management.
Model audit: systematic review of model and data processes to ensure performance, fairness, security, and policy compliance.
Experiment tracking: logging hyperparameters, code versions, data snapshots, and results to reproduce and compare experiments.
Observability: a broader capability to infer internal system state from telemetry; crucial for AI services and agents.
Compute: hardware resources used for training and inference; constrained by memory bandwidth, FLOPs, and parallelism.
Perplexity: exponential of the average negative log-likelihood; lower means better predictive fit, not necessarily better utility.
Benchmark: a dataset plus metric suite for comparing models; can be gamed or misaligned with real-world goals.
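A minimal perplexity computation from per-token probabilities, with a sanity check: a uniform model over a 10-word vocabulary should score perplexity 10 (plain Python, illustrative names):

```python
import math

# Perplexity: exp of the average negative log-likelihood per token.

def perplexity(token_probs):
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# perplexity([0.1] * 20) is 10 up to float rounding: the model is as
# uncertain as a uniform choice among 10 options at every token.
```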