Results for "data mismatch"
Data mismatch: A difference between training and deployment data distributions that can degrade model performance.
Train/test mismatch: When the environment a model is evaluated or deployed in differs systematically from the one it was trained in.
AI alignment: Ensuring AI systems pursue intended human goals.
Training-serving skew: Differences between training and inference conditions, such as feature pipelines that behave differently offline and online.
Data governance: Processes and controls for data quality, access, lineage, retention, and compliance across the AI lifecycle.
Data lineage: Tracking where data came from and how it was transformed; key for debugging and compliance.
Synthetic data: Artificially created data used to train or test models; helpful for privacy and coverage, risky if unrealistic.
Data augmentation: Expanding training data via transformations (flips, noise, paraphrases) to improve robustness.
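The definition above is easy to make concrete. A minimal sketch of numeric data augmentation by jittering, where `augment`, `noise_scale`, and the Gaussian-noise scheme are illustrative assumptions rather than a standard API:

```python
import random

def augment(features, n_copies=3, noise_scale=0.05, seed=0):
    """Expand a small numeric dataset with jittered copies.

    Illustrative only: the jitter scheme and noise_scale are
    assumptions, not a prescribed recipe.
    """
    rng = random.Random(seed)
    augmented = list(features)  # keep the originals first
    for _ in range(n_copies):
        for row in features:
            # Add small Gaussian noise to every numeric feature.
            augmented.append([x + rng.gauss(0.0, noise_scale) for x in row])
    return augmented

data = [[1.0, 2.0], [3.0, 4.0]]
bigger = augment(data)
print(len(bigger))  # 2 originals + 3 * 2 jittered copies = 8
```

For images the transformations would instead be flips or crops, and for text, paraphrases; the principle of generating label-preserving variants is the same.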
Federated learning: Training across many devices or silos without centralizing raw data; participants share model updates, not raw examples.
Data scaling: Improving model performance primarily by training on more data rather than changing the model.
Encryption in transit and at rest: Protecting data during network transfer and while stored; essential for ML pipelines handling sensitive data.
Data labeling: Human or automated assignment of target values; quality, consistency, and clear guidelines matter heavily.
Data leakage: When information from evaluation data improperly influences training, inflating reported performance.
Data drift: A shift in the input feature distribution over time that can silently degrade a deployed model.
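Drift between two samples can be quantified with a simple two-sample comparison. A sketch using the Population Stability Index in plain Python; the equal-width binning and the 0.25 threshold are common rules of thumb, assumed here rather than taken from any standard:

```python
import math

def psi(expected, actual, n_bins=5):
    """Population Stability Index between two numeric samples.

    Rule of thumb (an assumption, not a standard): < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 large shift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / n_bins or 1.0

    def fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[min(int((x - lo) / width), n_bins - 1)] += 1
        # Small epsilon avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [0.1 * i for i in range(100)]        # roughly uniform on [0, 10)
serve = [0.1 * i + 5.0 for i in range(100)]  # same shape, shifted by 5
print(psi(train, serve) > 0.25)  # True
```

The same check applied per feature between the training set and recent serving traffic is a cheap first alarm for both drift and training-serving skew.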
Data poisoning: Maliciously inserting or altering training data to implant backdoors or degrade performance.
Data protection impact assessment (DPIA): A structured analysis of privacy risks required under GDPR-like laws for high-risk data processing.
Unsupervised learning: Learning structure from unlabeled data, such as discovering groups, compressing representations, or modeling data distributions.
Overfitting: When a model fits noise or idiosyncrasies of the training data and performs poorly on unseen data.
Personally identifiable information (PII): Information that can identify an individual, directly or indirectly; requires careful handling and compliance.
Attribute inference attack: Inferring sensitive attributes of individuals in the training data from a trained model.
Self-supervised learning: Learning from data by constructing “pseudo-labels” (e.g., next-token prediction, masked modeling) without manual annotation.
Semi-supervised learning: Training with a small labeled dataset plus a larger unlabeled one, leveraging assumptions such as smoothness or cluster structure.
Online learning: Learning where data arrives sequentially and the model updates continuously, often under changing distributions.
Latent space: The internal space where learned representations live; operations there often correlate with semantic or generative factors.
Feature engineering: Designing input features to expose useful structure (e.g., ratios, lags, aggregations); often crucial outside deep learning.
Scaling laws: Empirical relationships linking model size, data, and compute to performance.
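Lag and ratio features, two of the transformations just mentioned, can be sketched in a few lines; the function name `engineer` and the particular feature set are illustrative assumptions:

```python
def engineer(prices):
    """Derive simple lag and ratio features from a numeric series.

    The choices here (1-step lag, ratio to the previous value)
    are illustrative, not a prescribed feature set.
    """
    rows = []
    for t in range(1, len(prices)):
        rows.append({
            "price": prices[t],
            "lag_1": prices[t - 1],              # previous value
            "ratio": prices[t] / prices[t - 1],  # relative change
        })
    return rows

features = engineer([100.0, 110.0, 99.0])
print(features[0]["ratio"])  # 1.1
```

Exposing relative change directly spares a shallow model from having to learn division, which is the kind of structure this entry refers to.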
Gradient inversion attack: Reconstructing training examples from gradients shared during distributed or federated training.
Denoising diffusion model: A diffusion model trained to remove noise step by step, mapping pure noise back to data.
Diffusion model: A generative model that learns to reverse a gradual noising process.
Latent diffusion: Diffusion performed in a compressed latent space rather than pixel space, for efficiency.