Domain: Evaluation & Benchmarking
Benchmark
Intermediate
A dataset + metric suite for comparing models; can be gamed or misaligned with real-world goals.
Brier Score
Intermediate
A proper scoring rule measuring the mean squared error between predicted probabilities and binary outcomes; lower is better.
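A minimal sketch of the Brier score computation; the probabilities and outcomes below are illustrative values, not from any real dataset.

```python
# Brier score sketch: mean squared error between predicted probabilities
# and observed 0/1 outcomes (illustrative inputs).
import numpy as np

def brier_score(probs, outcomes):
    """Return the mean squared error of predicted probabilities vs. binary outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# A perfect forecaster scores 0.0; always predicting 0.5 scores 0.25.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
```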
Experiment Tracking
Intermediate
Logging hyperparameters, code versions, data snapshots, and results to reproduce and compare experiments.
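A minimal sketch of the idea, assuming a plain JSON log on disk; in practice dedicated tools (e.g. MLflow, Weights & Biases) cover the same fields. The function name and file layout here are hypothetical.

```python
# Experiment-tracking sketch: record hyperparameters, code version, and
# results as a timestamped JSON file (hypothetical fields and paths).
import json, subprocess, time
from pathlib import Path

def log_run(hparams: dict, metrics: dict, log_dir: str = "runs") -> Path:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H-%M-%S"),
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "hparams": hparams,
        "metrics": metrics,
    }
    out = Path(log_dir)
    out.mkdir(exist_ok=True)
    path = out / f"run_{record['timestamp']}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# log_run({"lr": 3e-4, "batch_size": 32}, {"val_accuracy": 0.91})
```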
Observability
Intermediate
A broader capability to infer internal system state from telemetry, crucial for AI services and agents.
Perplexity
Intermediate
Exponential of average negative log-likelihood; lower means better predictive fit, not necessarily better utility.
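A minimal sketch of the formula, assuming you already have the model's probability for each observed token; the probabilities below are made up for illustration.

```python
# Perplexity sketch: exp of the average negative log-likelihood the model
# assigns to the observed tokens (illustrative probabilities).
import math

def perplexity(token_probs):
    """token_probs: model probability assigned to each observed token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

print(perplexity([0.25, 0.5, 0.1]))  # ~4.3; lower means better predictive fit
```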
PR Curve
Intermediate
Plot of precision against recall across decision thresholds; often more informative than ROC on imbalanced datasets because it focuses on positive-class performance.
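A short sketch using scikit-learn (assumed available); the labels and scores are illustrative.

```python
# PR-curve sketch: precision/recall pairs across thresholds, plus the
# average-precision summary (illustrative labels and scores).
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(recall, precision)))              # points tracing the curve
print(average_precision_score(y_true, y_score))  # area-like summary of the curve
```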
Train/Validation/Test Split
Intermediate
Separating data into training (fit), validation (tune), and test (final estimate) to avoid leakage and optimism bias.
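A minimal three-way split sketch using scikit-learn (assumed available); the 60/20/20 proportions and random seed are illustrative choices, not a prescription.

```python
# Train/validation/test split sketch: tune on validation, touch test only once
# for the final estimate (illustrative data and proportions).
from sklearn.model_selection import train_test_split

X, y = list(range(100)), [i % 2 for i in range(100)]

# First carve off 20% as the held-out test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Then split the remainder into train (75%) and validation (25%).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```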