Results for "compute-data-performance"
Often more informative than ROC curves on imbalanced datasets; focuses on positive-class performance.
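As a minimal sketch of the positive-class focus described above, precision and recall can be computed directly from prediction counts (the `precision_recall` helper is hypothetical, not from the source):

```python
# Hypothetical helper: precision and recall from raw counts, showing
# why these metrics look only at the positive class.
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# One positive found, one false alarm, one positive missed:
print(precision_recall([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5)
```

Note that true negatives never appear in either formula, which is why these metrics stay informative when negatives dominate the data.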
Guaranteed response times.
Control that remains stable under model uncertainty.
Tradeoff between safety and performance.
Maliciously inserting or altering training data to implant backdoors or degrade performance.
When a model fits noise/idiosyncrasies of training data and performs poorly on unseen data.
A mismatch between training and deployment data distributions that can degrade model performance.
Logging hyperparameters, code versions, data snapshots, and results to reproduce and compare experiments.
Systematic review of model/data processes to ensure performance, fairness, security, and policy compliance.
End-to-end process for model training.
Practices for operationalizing ML: versioning, CI/CD, monitoring, retraining, and reliable production management.
Capabilities that appear only beyond certain model sizes.
Tracking where data came from and how it was transformed; key for debugging and compliance.
Halting training when validation performance stops improving to reduce overfitting.
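The halting rule above can be sketched as a patience counter around an assumed training loop (`train_epoch` and `validate` are hypothetical callables standing in for real training code):

```python
# Early-stopping sketch: halt once validation loss has failed to
# improve for `patience` consecutive epochs.
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=3):
    best_loss, stale = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0   # improvement: reset patience
        else:
            stale += 1
            if stale >= patience:
                break                        # validation stopped improving
    return best_loss

# Simulated validation losses: improvement, then three stale epochs.
losses = iter([1.0, 0.8, 0.9, 0.95, 0.99])
print(train_with_early_stopping(lambda: None, lambda: next(losses)))  # 0.8
```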
Standardized documentation describing intended use, performance, limitations, data, and ethical considerations.
The capability to infer internal system state from external telemetry such as logs, metrics, and traces; crucial for AI services and agents.
Exponential of average negative log-likelihood; lower means better predictive fit, not necessarily better utility.
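The definition above translates directly into code: exponentiate the mean negative log-likelihood of the observed tokens (the `perplexity` function here is an illustrative sketch over per-token probabilities):

```python
import math

# Perplexity = exp(mean negative log-likelihood) over the probabilities
# the model assigned to the tokens that actually occurred.
def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Uniform probability over 4 tokens gives perplexity ~4: the model is
# as uncertain as a uniform 4-way guess.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```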
Required descriptions of model behavior and limits.
Maximum system processing rate.
Dynamic resource allocation.
Unequal performance across demographic groups.
Human or automated process of assigning targets; quality, consistency, and guidelines matter heavily.
Shift in feature distribution over time.
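One minimal way to surface such a shift (illustrative only; real drift detection typically uses distributional tests) is to compare a feature's live mean to its training mean in standard-deviation units; `drift_score` is a hypothetical helper:

```python
import statistics

# Illustrative drift check: distance of the live mean from the training
# mean, measured in training standard deviations.
def drift_score(train_values, live_values):
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

train = [1.0, 2.0, 3.0, 4.0, 5.0]    # mean 3, stdev ~1.58
live = [6.0, 7.0, 8.0]               # mean 7: clearly shifted
print(drift_score(train, live) > 2)  # True: flag for investigation
```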
Fraction of correct predictions; can be misleading on imbalanced datasets.
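A short worked example of how accuracy misleads on imbalanced data: a degenerate model that always predicts the majority class still scores 99%.

```python
# 1 positive among 100 examples; the model always predicts negative.
y_true = [1] + [0] * 99
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99, despite missing the only positive example
```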
Plots true positive rate vs false positive rate across thresholds; summarizes separability.
Scalar summary of ROC; measures ranking ability, not calibration.
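The "ranking ability" reading of AUC can be sketched without any plotting: it equals the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as half (the `auc` helper below is an illustrative pairwise implementation, not an optimized one):

```python
# AUC as a ranking statistic: fraction of positive/negative pairs where
# the positive receives the higher score (ties count as 0.5).
def auc(scores_pos, scores_neg):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# 3 of 4 positive/negative pairs are ranked correctly:
print(auc([0.9, 0.8], [0.7, 0.85]))  # 0.75
```

Because only the ordering of scores matters, AUC says nothing about whether the scores are calibrated probabilities.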
Average of squared residuals; common regression objective.
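The averaging of squared residuals is one line of code; `mse` here is a plain illustrative implementation:

```python
# Mean squared error: average of squared residuals (t - p)^2.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([3.0, 5.0], [2.0, 7.0]))  # (1 + 4) / 2 = 2.5
```

Squaring penalizes large residuals disproportionately, which is why MSE is sensitive to outliers.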
A dataset + metric suite for comparing models; can be gamed or misaligned with real-world goals.
System for running consistent evaluations across tasks, versions, prompts, and model settings.
Assigning category labels to images.