Results for "metrics"
A table summarizing classification outcomes, foundational for metrics like precision, recall, specificity.
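The metrics named above fall directly out of the four cells of a binary confusion matrix. A minimal sketch, with illustrative function and variable names (`tp`, `fp`, `fn`, `tn`):

```python
# Sketch: deriving precision, recall, and specificity from
# binary confusion-matrix counts (all names here are illustrative).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def specificity(tn, fp):
    return tn / (tn + fp) if tn + fp else 0.0
```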
A setting in which some classes are rare, requiring reweighting, resampling, or specialized metrics.
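One common reweighting scheme is inverse-frequency class weights. A sketch under that assumption (the formula is a standard heuristic, not prescribed by this entry):

```python
from collections import Counter

# Sketch: inverse-frequency class weights. Each class c gets
# weight n / (k * count_c), so rare classes are upweighted.
def class_weights(labels):
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}
```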
Plots true positive rate vs false positive rate across thresholds; summarizes separability.
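Sweeping a decision threshold over predicted scores yields one (FPR, TPR) point per threshold. A minimal sketch, with illustrative names (`y_score` holds predicted probabilities):

```python
# Sketch: ROC points (FPR, TPR) swept over score thresholds.
def roc_points(y_true, y_score, thresholds):
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= thr)
        points.append((fp / neg, tp / pos))
    return points
```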
Standardized documentation describing intended use, performance, limitations, data, and ethical considerations.
Systematic review of model/data processes to ensure performance, fairness, security, and policy compliance.
A dataset + metric suite for comparing models; can be gamed or misaligned with real-world goals.
Observing model inputs/outputs, latency, cost, and quality over time to catch regressions and drift.
A model exploiting poorly specified objectives, scoring well on the stated target without achieving the intended goal.
Required descriptions of model behavior and limits.
How well a model performs on new data drawn from the same (or similar) distribution as training.
Separating data into training (fit), validation (tune), and test (final estimate) to avoid leakage and optimism bias.
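A shuffled three-way split can be sketched as follows; the 70/15/15 fractions and function name are illustrative, and appropriate ratios depend on dataset size:

```python
import random

# Sketch: shuffled train/validation/test split with a fixed seed
# so the partition is reproducible across runs.
def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```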
Fraction of correct predictions; can be misleading on imbalanced datasets.
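The imbalance pitfall is easy to demonstrate: on a 95/5 split, a classifier that always predicts the majority class reaches 95% accuracy while recalling no positives. A toy sketch:

```python
# Sketch: accuracy under class imbalance.
# A majority-class baseline looks strong while missing every positive.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class
```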
A failure mode in which information from evaluation data improperly influences training, inflating reported performance.
The degree to which predicted probabilities match true frequencies (e.g., 0.8 means ~80% correct).
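One common way to check this is to bucket predictions by confidence and compare each bucket's mean predicted probability with its empirical positive rate. A sketch with illustrative names:

```python
# Sketch: calibration check by binning predicted probabilities.
# Each reported tuple is (mean confidence, fraction positive, count);
# for a calibrated model the first two should be close.
def calibration_bins(y_true, y_prob, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((t, p))
    report = []
    for bucket in bins:
        if bucket:
            mean_conf = sum(p for _, p in bucket) / len(bucket)
            frac_pos = sum(t for t, _ in bucket) / len(bucket)
            report.append((mean_conf, frac_pos, len(bucket)))
    return report
```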
Crafting prompts to elicit desired behavior, often using role, structure, constraints, and examples.
Stepwise reasoning patterns that can improve multi-step tasks; often handled implicitly or summarized for safety/privacy.
Retrieval based on embedding similarity rather than keyword overlap, capturing paraphrases and related concepts.
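At its core this is nearest-neighbor search under an embedding similarity such as cosine. A toy sketch (the 3-d vectors and corpus layout are illustrative; real systems use learned embeddings and approximate indexes):

```python
import math

# Sketch: cosine similarity over embedding vectors, plus a tiny
# top-k retrieval over a {doc_id: vector} corpus.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, corpus, k=1):
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```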
Automated detection/prevention of disallowed outputs (toxicity, self-harm, illegal instructions, etc.).
Systematic differences in model outcomes across groups; arises from data, labels, and deployment context.
Controlled experiment comparing variants by random assignment to estimate causal effects of changes.
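For a conversion-rate experiment, the comparison is often a two-proportion z-test. A sketch under that assumption (`conv_a`/`conv_b` are conversion counts; names are illustrative):

```python
import math

# Sketch: two-proportion z-test for an A/B experiment.
# Returns the z statistic for H0: p_a == p_b, using a pooled
# proportion for the standard error.
def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

A |z| above roughly 1.96 corresponds to significance at the 5% level for a two-sided test.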
Processes and controls for data quality, access, lineage, retention, and compliance across the AI lifecycle.
Practices for operationalizing ML: versioning, CI/CD, monitoring, retraining, and reliable production management.
Automated testing and deployment processes for models and data workflows, extending DevOps to ML artifacts.
Central system to store model versions, metadata, approvals, and deployment state.
Logging hyperparameters, code versions, data snapshots, and results to reproduce and compare experiments.
A broader capability to infer internal system state from telemetry, crucial for AI services and agents.
Hardware resources used for training/inference; constrained by memory bandwidth, FLOPs, and parallelism.
System for running consistent evaluations across tasks, versions, prompts, and model settings.
A discipline ensuring AI systems are fair, safe, transparent, privacy-preserving, and accountable throughout the lifecycle.
Capabilities that appear only beyond certain model sizes.