Results for "trial-and-error"
Artificial general intelligence (AGI): system-level design for general intelligence.
Existential risk: risk threatening humanity’s survival.
Deep learning: a branch of ML using multi-layer neural networks to learn hierarchical representations, often excelling in vision, speech, and language.
Distribution shift: a mismatch between training and deployment data distributions that can degrade model performance.
Feature engineering: designing input features to expose useful structure (e.g., ratios, lags, aggregations); often crucial outside deep learning.
Generalization: how well a model performs on new data drawn from the same (or similar) distribution as training.
Data leakage: when information from evaluation data improperly influences training, inflating reported performance.
Cross-entropy loss: penalizes confident wrong predictions heavily; standard for classification and language modeling.
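The "confident wrong" penalty is easy to see numerically; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def cross_entropy(probs, target):
    """Negative log-probability of the true class.

    probs: predicted class probabilities (should sum to 1).
    target: index of the true class.
    """
    return -np.log(probs[target])

# A confident wrong prediction is penalized far more than an
# uncertain one: -log(0.01) is about 4.6, -log(0.4) about 0.92.
confident_wrong = cross_entropy(np.array([0.99, 0.01]), target=1)
uncertain = cross_entropy(np.array([0.6, 0.4]), target=1)
```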
Gradient descent: an iterative method that updates parameters in the direction of the negative gradient to minimize a loss.
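On a one-dimensional quadratic the update rule is a few lines; a toy sketch (the function, step size, and starting point are made up for illustration):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step against the gradient to reduce the loss.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3);
# the iterates contract toward the minimizer x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```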
Neural network: a parameterized function composed of interconnected units organized in layers with nonlinear activations.
Stochastic gradient descent (SGD): a gradient method using random minibatches for efficient training on large datasets.
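A minimal sketch of minibatch SGD on synthetic linear-regression data (all names and hyperparameters here are illustrative, not a prescribed recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(2)
lr, batch = 0.1, 32
for _ in range(500):
    idx = rng.integers(0, len(X), size=batch)   # sample a random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch     # MSE gradient on the batch only
    w -= lr * grad                               # noisy but cheap update
```

Each step touches 32 rows instead of all 1000, which is what makes the method scale to large datasets.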
Attention: a mechanism that computes context-aware mixtures of representations; scales well and captures long-range dependencies.
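Scaled dot-product attention, the standard instance of this mechanism, fits in a few lines; a NumPy sketch with made-up shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Every query scores every key, so any position can draw
    # information from any other, regardless of distance.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V                  # context-aware mixture of values

Q = np.random.randn(4, 8)   # 4 query positions, dimension 8
K = np.random.randn(6, 8)   # 6 key positions
V = np.random.randn(6, 8)
out = scaled_dot_product_attention(Q, K, V)
```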
Masked language modeling: predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
Retrieval-augmented generation (RAG): an architecture that retrieves relevant documents (e.g., from a vector DB) and conditions generation on them to reduce hallucinations.
Vector database: a datastore optimized for similarity search over embeddings, enabling semantic retrieval at scale.
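The core query such a datastore answers is nearest neighbors by similarity; a brute-force cosine-similarity sketch (real systems use approximate indexes such as HNSW for the same query, and the toy vectors below are made up):

```python
import numpy as np

def top_k_cosine(query, corpus, k=3):
    # Normalize rows so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = C @ q
    order = np.argsort(-sims)[:k]       # indices of the k most similar rows
    return order, sims[order]

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
order, scores = top_k_cosine(np.array([1.0, 0.0]), corpus, k=2)
```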
Grounding: constraining outputs to retrieved or provided sources, often with citations, to improve factual reliability.
Hallucination: model-generated content that is fluent but unsupported by evidence or incorrect; mitigated by grounding and verification.
Content moderation: automated detection and prevention of disallowed outputs (toxicity, self-harm, illegal instructions, etc.).
SHAP (Shapley additive explanations): a feature attribution method grounded in cooperative game theory for explaining predictions in tabular settings.
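The game-theoretic idea can be shown with exact Shapley values computed by enumerating feature coalitions; a brute-force sketch that is feasible only for a handful of features (the toy linear model and baseline are made up):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    # Exact Shapley values: average each feature's marginal contribution
    # over all coalitions of the other features.
    n = len(x)

    def v(S):
        # "Coalition value": predict with features in S set to x, rest to baseline.
        z = list(baseline)
        for i in S:
            z[i] = x[i]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

# For a linear model the attributions decompose exactly into the terms.
predict = lambda z: 3 * z[0] + 2 * z[1]
phi = shapley_values(predict, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

The attributions also satisfy the efficiency property: they sum to the gap between the prediction at `x` and at the baseline.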
A/B testing: a controlled experiment comparing variants by random assignment to estimate the causal effects of changes.
Federated learning: training across many devices or silos without centralizing raw data; aggregates updates, not data.
Throughput: how many requests or tokens can be processed per unit time; affects scalability and cost.
Parameter-efficient fine-tuning (PEFT): techniques that fine-tune small additional components rather than all weights to reduce compute and storage.
Temperature: scales logits before sampling; higher values increase randomness and diversity, lower values increase determinism.
Softmax: converts logits to probabilities by exponentiation and normalization; common in classification and language models.
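The two definitions above combine naturally: temperature divides the logits before softmax normalizes them. A minimal sketch (the example logits are made up):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # T > 1 flattens the distribution; T < 1 sharpens it.
    z = np.asarray(logits) / temperature
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
sharp = softmax(logits, temperature=0.5)  # more deterministic
flat = softmax(logits, temperature=2.0)   # more random
```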
Multimodal models: models that process or generate multiple modalities, enabling vision-language tasks, speech, video understanding, etc.
Loss landscape: the shape of the loss function over parameter space.
Learning rate schedule: adjusting the learning rate over training to improve convergence.
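One common choice (among many) is linear warmup followed by cosine decay; a sketch with illustrative default values:

```python
import math

def cosine_schedule(step, total_steps, lr_max=1e-3, lr_min=1e-5, warmup=100):
    # Ramp up linearly for `warmup` steps, then decay along a
    # half-cosine from lr_max down to lr_min.
    if step < warmup:
        return lr_max * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```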
Depth vs. width: the tradeoff between adding more layers and adding more neurons per layer.
Scaling laws: empirical laws linking model size, data, and compute to performance.
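Such laws are typically power laws, which become linear in log-log space and so can be fit by least squares; a sketch on synthetic points (the exponent, constant, and model sizes are made up):

```python
import numpy as np

# Synthetic (size, loss) pairs following L(N) = a * N**(-b).
N = np.array([1e6, 1e7, 1e8, 1e9])
L = 5.0 * N ** -0.07

# log L = log a - b * log N, so a straight-line fit recovers (a, b).
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
b_hat, a_hat = -slope, np.exp(intercept)
```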