Results for "standardized evaluation"
Standardized documentation describing intended use, performance, limitations, data, and ethical considerations.
System for running consistent evaluations across tasks, versions, prompts, and model settings.
A dataset + metric suite for comparing models; can be gamed or misaligned with real-world goals.
Suitably normalized, the sum of many independent, identically distributed variables with finite variance converges to a normal distribution.
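A minimal simulation of this convergence, assuming Uniform(0, 1) draws (the sample sizes and seed are illustrative): averages of many draws cluster around 0.5 with spread shrinking like 1/sqrt(n).

```python
import random
import statistics

random.seed(0)

def sample_means(n_draws, n_samples):
    """Return n_samples means, each averaging n_draws Uniform(0, 1) draws."""
    return [statistics.mean(random.random() for _ in range(n_draws))
            for _ in range(n_samples)]

means = sample_means(n_draws=30, n_samples=2000)
# Mean of the means is close to 0.5; their spread is close to
# sqrt(1/12) / sqrt(30), roughly 0.053, as the theorem predicts.
print(statistics.mean(means), statistics.stdev(means))
```

A histogram of `means` would look approximately bell-shaped even though each underlying draw is uniform.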
Structured dataset documentation covering collection, composition, recommended uses, biases, and maintenance.
US framework for AI risk governance.
When information from evaluation data improperly influences training, inflating reported performance.
Separating data into training (fit), validation (tune), and test (final estimate) to avoid leakage and optimism bias.
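A minimal sketch of such a split in pure Python (the function name and fractions are illustrative; real pipelines often need stratified or grouped splits):

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then carve off test and validation
    partitions; the remainder is the training set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Fixing the seed keeps the split reproducible, and splitting once up front prevents test examples from leaking into tuning decisions.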
A robust evaluation technique that trains/evaluates across multiple splits to estimate performance variability.
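A sketch of k-fold index generation, assuming examples are indexed 0..n-1 (library implementations add shuffling and stratification on top of this):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs: each of k folds serves once as the
    validation set while the remaining indices form the training set."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
# Every index appears in exactly one validation fold across the 3 splits.
```

Averaging a metric over the k validation folds, and reporting its spread, gives the variability estimate the definition refers to.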
A table summarizing classification outcomes, foundational for metrics like precision, recall, specificity.
Fraction of correct predictions; can be misleading on imbalanced datasets.
Harmonic mean of precision and recall; useful when balancing false positives/negatives matters.
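The confusion counts and the metrics built on them can be computed in a few lines; a sketch for binary labels (1 = positive), with zero-division guarded:

```python
def binary_metrics(y_true, y_pred):
    """Confusion counts plus accuracy, precision, recall, and F1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}

m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
# tp=2, fp=1, fn=1, tn=2; precision = recall = f1 = 2/3
```

On a dataset that is 99% negatives, a model predicting "negative" everywhere scores 0.99 accuracy but 0.0 recall, which is why the imbalance caveat above matters.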
Plots true positive rate vs false positive rate across thresholds; summarizes separability.
Scalar summary of ROC; measures ranking ability, not calibration.
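The ranking interpretation gives a direct way to compute AUC without tracing the curve: it equals the probability that a randomly chosen positive outscores a randomly chosen negative (ties count half). A sketch:

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison: fraction of (positive, negative) pairs
    where the positive's score is higher, counting ties as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Because only the ordering of scores matters, AUC is unchanged by any monotone rescaling of the scores, which is exactly why it measures ranking ability rather than calibration.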
Crafting prompts to elicit desired behavior, often using role, structure, constraints, and examples.
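A hypothetical template assembling those elements (role, task, constraints, few-shot examples) into one prompt string; the field names and layout are illustrative, not tied to any particular model API:

```python
def build_prompt(role, task, constraints, examples):
    """Assemble a structured prompt: role, task, constraints, then
    input/output example pairs for few-shot guidance."""
    lines = [f"You are {role}.", f"Task: {task}", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}"]
    return "\n".join(lines)

prompt = build_prompt(
    role="a careful technical summarizer",
    task="Summarize the passage in one sentence.",
    constraints=["No more than 25 words.", "Do not add new facts."],
    examples=[("Water boils at 100 C at sea level.",
               "Water's boiling point at sea level is 100 C.")],
)
```

Templating keeps the role, structure, and constraints fixed across evaluations so that only the input varies.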
When some classes are rare, requiring reweighting, resampling, or specialized metrics.
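One common reweighting heuristic is inverse-frequency class weights, so rare classes contribute comparably to the loss; a sketch (the exact normalization varies by library):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class as n / (k * count): classes rarer than average
    get weight > 1, common classes get weight < 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

w = inverse_frequency_weights([0] * 90 + [1] * 10)
# The minority class gets weight 5.0, the majority class 5/9.
```

Resampling (oversampling the rare class or undersampling the common one) and metrics like F1 or per-class recall are the complementary remedies the definition mentions.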
Practices for operationalizing ML: versioning, CI/CD, monitoring, retraining, and reliable production management.
Automated testing and deployment processes for models and data workflows, extending DevOps to ML artifacts.
Central system to store model versions, metadata, approvals, and deployment state.
Tendency to trust automated suggestions even when incorrect; mitigated by UI design, training, and checks.
Error due to sensitivity to fluctuations in the training dataset.
Quantifies shared information between random variables.
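For discrete variables this is computable directly from joint and marginal frequencies; a sketch in bits, assuming paired observations:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Discrete MI in bits: sum over (x, y) of
    p(x, y) * log2(p(x, y) / (p(x) * p(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])    # 1.0 bit
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # 0.0 bits
```

Identical balanced binary variables share exactly one bit, while variables whose joint distribution factorizes into the marginals share none.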
Tradeoff between network depth (more layers) and width (more neurons per layer).
All possible configurations an agent may encounter.
Strategy mapping states to actions.
Pixel-level separation of individual object instances.
Pixel-wise classification of image regions.
What would have happened under different conditions.
End-to-end process for model training.
A model exploits a poorly specified objective, scoring well on the stated metric without achieving the intended goal.