Results for "evaluation protocol"
Evaluation harness: system for running consistent evaluations across tasks, versions, prompts, and model settings.
Data leakage: when information from evaluation data improperly influences training, inflating reported performance.
Train/validation/test split: separating data into training (fit), validation (tune), and test (final estimate) sets to avoid leakage and optimism bias.
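The three-way split can be sketched in a few lines of plain Python; a minimal illustration, where the helper name, split fractions, and seed are all assumed rather than taken from any particular library:

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then carve off disjoint
    validation and test slices; the remainder is the training set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test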
Cross-validation: a robust evaluation technique that trains and evaluates across multiple splits to estimate performance variability.
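The "multiple splits" idea is easiest to see by generating k-fold index pairs directly; a minimal stdlib sketch, with the generator name and shuffling seed illustrative:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs: each of k roughly equal
    folds serves once as validation while the rest train."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    size, rem = divmod(n, k)
    start = 0
    for fold in range(k):
        # earlier folds absorb the remainder, one extra item each
        stop = start + size + (1 if fold < rem else 0)
        yield idx[:start] + idx[stop:], idx[start:stop]
        start = stop
```

Averaging the metric over the k validation folds, and reporting its spread, is what gives the variability estimate.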
Confusion matrix: a table summarizing classification outcomes, foundational for metrics like precision, recall, and specificity.
Accuracy: fraction of correct predictions; can be misleading on imbalanced datasets.
F1 score: harmonic mean of precision and recall; useful when balancing false positives and false negatives matters.
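These three entries fit together: binary confusion counts feed precision, recall, and F1. A minimal from-scratch sketch (function names are illustrative):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Binary confusion-matrix cells: (tp, fp, fn, tn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note that accuracy, (tp + tn) / total, ignores which cell the errors fall in, which is why it misleads when one class dominates.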
AUC: scalar summary of the ROC curve; measures ranking ability, not calibration.
ROC curve: plots true positive rate vs. false positive rate across thresholds; summarizes class separability.
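The "ranking ability" reading of AUC is concrete: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half. A minimal O(n_pos * n_neg) sketch, with the function name assumed:

```python
def auc(y_true, scores):
    """Area under the ROC curve via the rank interpretation:
    fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # bool arithmetic: a correct pair adds 1, a tie adds 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because only the ordering of scores matters, any monotone rescaling of the scores leaves AUC unchanged, which is exactly why it says nothing about calibration.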
Prompt engineering: crafting prompts to elicit desired behavior, often using role, structure, constraints, and examples.
Class imbalance: when some classes are rare, requiring reweighting, resampling, or specialized metrics.
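One common reweighting scheme gives each class a loss weight inversely proportional to its frequency, so rare classes count more; a minimal sketch (the normalization so that a balanced dataset yields weight 1.0 is one convention among several):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight class c by n / (k * count_c), where n is the number
    of samples and k the number of classes: a perfectly balanced
    dataset gets weight 1.0 everywhere, rare classes get more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}
```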
Model card: standardized documentation describing intended use, performance, limitations, data, and ethical considerations.
MLOps: practices for operationalizing ML, including versioning, CI/CD, monitoring, retraining, and reliable production management.
CI/CD: automated testing and deployment processes for models and data workflows, extending DevOps to ML artifacts.
Model registry: central system to store model versions, metadata, approvals, and deployment state.
Benchmark: a dataset and metric suite for comparing models; can be gamed or misaligned with real-world goals.
Automation bias: tendency to trust automated suggestions even when incorrect; mitigated by UI design, training, and checks.
Variance: error due to sensitivity to fluctuations in the training dataset.
Mutual information: quantifies the information shared between two random variables.
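For discrete variables, mutual information can be estimated straight from empirical counts via I(X;Y) = sum over (x,y) of p(x,y) log(p(x,y) / (p(x) p(y))); a minimal sketch in nats:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) * log(p(x,y) / (p(x) p(y))); the 1/n factors
    # inside the log cancel down to c * n / (count_x * count_y)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())
```

Identical variables give I(X;Y) = H(X) (log 2 nats for a fair binary variable); independent variables give 0.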
Depth vs. width: the tradeoff between stacking many layers and giving each layer many neurons.
State space: the set of all possible configurations an agent may encounter.
Policy: a strategy mapping states to actions.
Instance segmentation: pixel-level separation of individual object instances.
Semantic segmentation: pixel-wise classification of image regions.
Counterfactual: what would have happened under different conditions.
Training pipeline: the end-to-end process for model training.
Batch inference: running predictions on large datasets periodically, rather than serving each request online.
Reward hacking: a model exploits poorly specified objectives.
Misalignment: a model optimizes objectives misaligned with human values.
Alignment: ensuring learned behavior matches the intended objective.