Results for "LM quality metric"
F1 score: Harmonic mean of precision and recall; useful when balancing false positives and false negatives matters.
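A minimal sketch of the F1 computation (function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean punishes imbalance: high precision cannot
# compensate for near-zero recall, and vice versa.
```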
Data governance: Processes and controls for data quality, access, lineage, retention, and compliance across the AI lifecycle.
Cross-entropy loss: Penalizes confident wrong predictions heavily; the standard objective for classification and language modeling.
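A toy illustration of why confident wrong predictions are penalized heavily — the loss is the negative log of the probability assigned to the true class:

```python
import math

def cross_entropy(p_true: float) -> float:
    """Negative log-likelihood of the probability assigned to the true class."""
    return -math.log(p_true)

# An unsure prediction (p = 0.5) costs ~0.69 nats; a confidently
# wrong one (p = 0.01 on the true class) costs ~4.6 nats.
```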
Mean squared error (MSE): Average of squared residuals; a common regression objective.
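The definition in code, as a direct sketch:

```python
def mse(y_true, y_pred):
    """Mean of squared residuals between targets and predictions."""
    residuals = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(residuals) / len(residuals)

# Squaring makes large errors dominate the average, so MSE is
# sensitive to outliers.
```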
Dataset: A structured collection of examples used to train and evaluate models; quality, bias, and coverage often dominate outcomes.
Data labeling: Human or automated process of assigning target values to examples; quality, consistency, and clear guidelines matter heavily.
Chunking: Breaking documents into pieces for retrieval; chunk size and overlap strongly affect RAG quality.
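A minimal character-based chunker showing how size and overlap interact (real pipelines usually split on tokens or sentence boundaries; this is an assumption-laden sketch):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50):
    """Split text into fixed-size character chunks; neighbors share `overlap` chars."""
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last window already reached the end
    return chunks
```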
Reward model: Model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Model monitoring: Observing model inputs, outputs, latency, cost, and quality over time to catch regressions and drift.
Model collapse: Progressive quality degradation when a model is trained on its own (or other models') generated outputs rather than fresh human-produced data.
Cross-validation: A robust evaluation technique that trains and evaluates across multiple data splits to estimate performance variability.
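A sketch of k-fold index generation — each example appears in exactly one test fold (libraries like scikit-learn provide this; the function here is illustrative):

```python
def kfold_indices(n: int, k: int):
    """Return (train_idx, test_idx) pairs for k-fold cross-validation over n examples."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds
```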
Accuracy: Fraction of correct predictions; can be misleading on imbalanced datasets.
Precision: Of predicted positives, the fraction that are truly positive; sensitive to false positives.
Recall: Of actual positives, the fraction correctly identified; sensitive to false negatives.
Specificity: Of actual negatives, the fraction correctly identified.
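The four metrics above all derive from the confusion matrix; a sketch under the usual TP/FP/FN/TN counts:

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and specificity from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # of predicted positives
        "recall": tp / (tp + fn),       # of actual positives
        "specificity": tn / (tn + fp),  # of actual negatives
    }
```

Note how a 96%-accurate classifier can still miss 20% of positives — on imbalanced data, accuracy alone hides this.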
AUC-ROC: Scalar summary of the ROC curve; measures ranking ability, not calibration.
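AUC has an equivalent pairwise-ranking interpretation, which makes the "ranking, not calibration" point concrete — any monotone rescaling of the scores leaves it unchanged:

```python
def auc(scores_pos, scores_neg):
    """Probability a random positive outranks a random negative (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))
```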
Brier score: A proper scoring rule measuring the squared error of predicted probabilities against binary outcomes.
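The definition is short enough to state directly in code:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Lower is better; 0.0 means perfectly confident, perfectly correct forecasts.
```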
Latency: Time from request to response; critical for real-time inference and UX.
Throughput: How many requests or tokens can be processed per unit time; affects scalability and cost.
Benchmark: A dataset plus metric suite for comparing models; can be gamed or become misaligned with real-world goals.
Information gain: Reduction in uncertainty achieved by observing a variable; used in decision trees and active learning.
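Information gain is the parent's entropy minus the size-weighted entropy of the partitions a split produces; a sketch in bits:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction from splitting `parent` into `children` partitions."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted
```

A split that perfectly separates the classes recovers the full parent entropy; a split that changes nothing gains zero.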
Sensitivity: The true positive rate; in medical testing, the ability to correctly detect disease when it is present (equivalent to recall).
Norm: Measure of vector magnitude (e.g., L1, L2); used in regularization and optimization.
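The common p-norms fit one formula; a small sketch:

```python
import math

def lp_norm(v, p=2):
    """p-norm of a vector: (sum of |v_i|^p) ** (1/p)."""
    return sum(abs(x) ** p for x in v) ** (1 / p)

# p=1 (L1) sums absolute values and encourages sparsity in regularization;
# p=2 (L2) is Euclidean length and penalizes large weights smoothly.
```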
False negative: A positive case the model misses; in medical testing, failure to detect disease that is present.
Feature engineering: Designing input features to expose useful structure (e.g., ratios, lags, aggregations); often crucial outside deep learning.
Prompt: The text (and possibly other modalities) given to an LLM to condition its output behavior.
Retrieval-augmented generation (RAG): Architecture that retrieves relevant documents (e.g., from a vector DB) and conditions generation on them to reduce hallucinations.
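The retrieval step reduces to nearest-neighbor search over embeddings; a toy sketch assuming the embeddings already exist (real systems use an embedding model plus a vector database):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, k=2):
    """Indices of the k documents most similar to the query embedding."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The retrieved documents are then placed into the prompt so generation is conditioned on them.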
Inter-annotator agreement: Measure of consistency across labelers; low agreement indicates ambiguous tasks or poor guidelines.
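One standard agreement statistic is Cohen's kappa, which corrects raw agreement for the agreement expected by chance; a sketch for two annotators:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa is 1.0 for perfect agreement and 0.0 when annotators agree no more than chance would predict.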
Datasheet (for datasets): Structured dataset documentation covering collection, composition, recommended uses, biases, and maintenance.
Data lineage: Tracking where data came from and how it was transformed; key for debugging and compliance.