Results for "data distribution"
Designing input features to expose useful structure (e.g., ratios, lags, aggregations), often crucial outside deep learning.
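As a rough illustration of the ratio, lag, and aggregation features mentioned above, the sketch below derives all three from a small table of daily records. The column meanings (revenue, visits), the sample values, and the helper name `engineer_features` are hypothetical and chosen only for the example.

```python
from statistics import mean

# Hypothetical daily records: (day, revenue, visits)
rows = [
    (1, 120.0, 40),
    (2, 150.0, 50),
    (3, 90.0, 45),
    (4, 200.0, 80),
]


def engineer_features(rows, window=2):
    """Add a ratio, a one-step lag, and a trailing-window mean to each row."""
    features = []
    for i, (day, revenue, visits) in enumerate(rows):
        # Ratio feature: revenue per visit (guard against division by zero).
        ratio = revenue / visits if visits else 0.0
        # Lag feature: the previous day's revenue, unavailable for the first row.
        lag_1 = rows[i - 1][1] if i > 0 else None
        # Aggregation feature: mean revenue over the preceding `window` days.
        # The current day is excluded so the feature uses past values only.
        past = [r[1] for r in rows[max(0, i - window):i]]
        rolling_mean = mean(past) if past else None
        features.append({
            "day": day,
            "revenue_per_visit": ratio,
            "lag_1": lag_1,
            "rolling_mean": rolling_mean,
        })
    return features


features = engineer_features(rows)
```

Excluding the current row from the trailing window is a deliberate choice here: it keeps the aggregate from containing the very value a model might be asked to predict.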
Empirical laws linking model size, dataset size, and compute to performance.
Sequential data indexed by time.
Artificial sensor data generated in simulation.
Combining simulation and real-world data.
A formal privacy framework ensuring outputs do not reveal much about any single individual’s data contribution.
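One common way to meet this kind of guarantee for a counting query is the Laplace mechanism, sketched below. This is a minimal illustration and not a production implementation: the function names, the sample records, and the choice of epsilon are assumptions, and the code ignores floating-point subtleties that hardened libraries handle.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 67, 38]  # hypothetical records
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

A smaller epsilon means a larger noise scale, so individual contributions are better hidden at the cost of a less accurate answer.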
A subfield of AI where models learn patterns from data to make predictions or decisions, improving with experience rather than explicit rule-coding.
Minimizing average loss on training data; can overfit when data is limited or biased.
A branch of ML using multi-layer neural networks to learn hierarchical representations, often excelling in vision, speech, and language.
Separating data into training (fit), validation (tune), and test (final estimate) sets to avoid leakage and optimistic bias.
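A three-way split of this kind can be sketched as follows. The fractions, the seed, and the function name `split_dataset` are assumptions for illustration. A shuffled split like this one presumes the examples are independent; time-ordered or grouped data would usually need a different scheme.

```python
import random


def split_dataset(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]                  # held out for the final estimate
    val = items[n_test:n_test + n_val]     # used for tuning
    train = items[n_test + n_val:]         # used for fitting
    return train, val, test


train, val, test = split_dataset(range(100))
```

Because the three lists are slices of one shuffled sequence, no example can appear in more than one of them.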
Automatically learning useful internal features (latent variables) that capture salient structure for downstream tasks.
Forcing predictable formats for downstream systems; reduces parsing errors and supports validation/guardrails.
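Learning from data generated by a policy other than the one being learned.
A model that compresses its input into a latent space and reconstructs the input from that representation.
Probability of the observed data given the model parameters, viewed as a function of those parameters.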
Running models locally.
Requirement to preserve relevant data.
A trend seen within subgroups reverses or disappears when the subgroups are pooled without accounting for group membership.
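The reversal described above is easiest to see with numbers. The sketch below uses illustrative success counts for two treatments across two subgroups; the labels and the helper names are assumptions made for the example.

```python
# Illustrative (successes, trials) counts for two treatments in two subgroups.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Success rate for a single subgroup."""
    return successes / trials


def pooled_rate(groups):
    """Success rate after summing successes and trials over all subgroups."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(n for _, n in groups.values())
    return total_successes / total_trials


# Within each subgroup, compare the two treatments.
per_group = {
    group: (rate(*counts["A"][group]), rate(*counts["B"][group]))
    for group in ("small", "large")
}

# After pooling, compare them again.
pooled = (pooled_rate(counts["A"]), pooled_rate(counts["B"]))

for group, (a, b) in per_group.items():
    print(f"{group}: A={a:.3f} B={b:.3f}")
print(f"pooled: A={pooled[0]:.3f} B={pooled[1]:.3f}")
```

Treatment A has the higher rate in both subgroups, yet B has the higher pooled rate, because the two treatments were applied to very different mixes of the subgroups.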
A structured collection of examples used to train/evaluate models; quality, bias, and coverage often dominate outcomes.
A measurable property or attribute used as model input (raw or engineered), such as age, pixel intensity, or token ID.
A parameterized mapping from inputs to outputs; includes the architecture and the learned parameters.
The learned numeric values of a model adjusted during training to minimize a loss function.
When a model cannot capture underlying structure, performing poorly on both training and test data.
Systematic differences in model outcomes across groups; arises from data, labels, and deployment context.
Measure of consistency across labelers; low agreement indicates ambiguous tasks or poor guidelines.
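One widely used agreement measure for two labelers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal version under that assumption; the label lists are hypothetical, and settings with more than two labelers typically call for other statistics.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on eight items.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance.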
Selecting the most informative samples to label (e.g., uncertainty sampling) to reduce labeling cost.
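Uncertainty sampling, the example named above, can be sketched with the least-confidence rule: pick the unlabeled items whose top predicted class probability is lowest. The pool of predictions and the function name below are hypothetical, and least confidence is only one of several uncertainty scores in use.

```python
def least_confident(predictions, k):
    """Return the ids of the k items whose top class probability is lowest."""
    # A low maximum probability means the model is unsure which class applies.
    ranked = sorted(predictions, key=lambda item_id: max(predictions[item_id]))
    return ranked[:k]


# Hypothetical class-probability predictions for an unlabeled pool.
pool = {
    "a": [0.90, 0.10],
    "b": [0.55, 0.45],
    "c": [0.60, 0.40],
    "d": [0.99, 0.01],
}
to_label = least_confident(pool, k=2)
```

Here the two items closest to an even split are sent for labeling first, on the assumption that their labels would change the model the most.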
Structured dataset documentation covering collection, composition, recommended uses, biases, and maintenance.
Logging hyperparameters, code versions, data snapshots, and results to reproduce and compare experiments.
Systematic review of model/data processes to ensure performance, fairness, security, and policy compliance.
Attacks that infer whether specific records were in training data, or reconstruct sensitive training examples.