Results for "evaluation protocol"
Evaluation harness: system for running consistent evaluations across tasks, versions, prompts, and model settings.
Data leakage: when information from evaluation data improperly influences training, inflating reported performance.
Train/validation/test split: separating data into training (fit), validation (tune), and test (final estimate) sets to avoid leakage and optimism bias.
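The three-way split can be sketched in a few lines of plain Python; a minimal illustration, where the helper name, split fractions, and seed are all assumed rather than taken from any particular library:

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then carve off disjoint
    validation and test slices; the remainder is the training set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test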
Cross-validation: a robust evaluation technique that trains and evaluates across multiple splits to estimate performance variability.
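The "multiple splits" idea is easiest to see by generating k-fold index pairs directly; a minimal stdlib sketch, with the generator name and shuffling seed illustrative:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs: each of k roughly equal
    folds serves once as validation while the rest train."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    size, rem = divmod(n, k)
    start = 0
    for fold in range(k):
        # earlier folds absorb the remainder, one extra item each
        stop = start + size + (1 if fold < rem else 0)
        yield idx[:start] + idx[stop:], idx[start:stop]
        start = stop
```

Averaging the metric over the k validation folds, and reporting its spread, is what gives the variability estimate.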
Confusion matrix: a table summarizing classification outcomes, foundational for metrics like precision, recall, and specificity.
Accuracy: fraction of correct predictions; can be misleading on imbalanced datasets.
F1 score: harmonic mean of precision and recall; useful when balancing false positives and false negatives matters.
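These three entries fit together: binary confusion counts feed precision, recall, and F1. A minimal from-scratch sketch (function names are illustrative):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Binary confusion-matrix cells: (tp, fp, fn, tn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note that accuracy, (tp + tn) / total, ignores which cell the errors fall in, which is why it misleads when one class dominates.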
AUC: scalar summary of the ROC curve; measures ranking ability, not calibration.
ROC curve: plots true positive rate vs. false positive rate across thresholds; summarizes class separability.
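The "ranking ability" reading of AUC is concrete: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half. A minimal O(n_pos * n_neg) sketch, with the function name assumed:

```python
def auc(y_true, scores):
    """Area under the ROC curve via the rank interpretation:
    fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # bool arithmetic: a correct pair adds 1, a tie adds 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because only the ordering of scores matters, any monotone rescaling of the scores leaves AUC unchanged, which is exactly why it says nothing about calibration.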
Prompt engineering: crafting prompts to elicit desired behavior, often using role, structure, constraints, and examples.
Class imbalance: when some classes are rare, requiring reweighting, resampling, or specialized metrics.
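One common reweighting scheme gives each class a loss weight inversely proportional to its frequency, so rare classes count more; a minimal sketch (the normalization so that a balanced dataset yields weight 1.0 is one convention among several):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight class c by n / (k * count_c), where n is the number
    of samples and k the number of classes: a perfectly balanced
    dataset gets weight 1.0 everywhere, rare classes get more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}
```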
Model card: standardized documentation describing intended use, performance, limitations, data, and ethical considerations.
MLOps: practices for operationalizing ML, including versioning, CI/CD, monitoring, retraining, and reliable production management.
CI/CD: automated testing and deployment processes for models and data workflows, extending DevOps to ML artifacts.
Model registry: central system to store model versions, metadata, approvals, and deployment state.
Benchmark: a dataset and metric suite for comparing models; can be gamed or misaligned with real-world goals.
Automation bias: tendency to trust automated suggestions even when incorrect; mitigated by UI design, training, and checks.
Variance: error due to sensitivity to fluctuations in the training dataset.
Mutual information: quantifies the information shared between two random variables.
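For discrete variables, mutual information can be estimated straight from empirical counts via I(X;Y) = sum over (x,y) of p(x,y) log(p(x,y) / (p(x) p(y))); a minimal sketch in nats:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) * log(p(x,y) / (p(x) p(y))); the 1/n factors
    # inside the log cancel down to c * n / (count_x * count_y)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())
```

Identical variables give I(X;Y) = H(X) (log 2 nats for a fair binary variable); independent variables give 0.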
Depth vs. width: the tradeoff between stacking many layers and giving each layer many neurons.
State space: the set of all possible configurations an agent may encounter.
Policy: a strategy mapping states to actions.
Instance segmentation: pixel-level separation of individual object instances.
Semantic segmentation: pixel-wise classification of image regions.
Counterfactual: what would have happened under different conditions.
Training pipeline: the end-to-end process for model training.
Batch inference: running predictions on large datasets periodically, rather than serving each request online.
Reward hacking: a model exploits poorly specified objectives.
Misalignment: a model optimizes objectives misaligned with human values.
Alignment: ensuring learned behavior matches the intended objective.