Domain: Foundations & Theory
Plant: The physical system being controlled.
Positional encoding: Injects sequence order into Transformers, since attention alone is permutation-invariant.
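A common fixed scheme is the sinusoidal encoding from "Attention Is All You Need". A minimal sketch of one encoding row; the function name is illustrative, not from any library:

```python
import math

def sinusoidal_position(pos, d_model):
    # One row of the sinusoidal positional-encoding table: even dims use sin,
    # odd dims use cos, over geometrically spaced wavelengths, so every
    # position receives a unique, smoothly varying code.
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe
```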
Precision: Of predicted positives, the fraction that are truly positive; sensitive to false positives.
Privacy attacks: Attacks that infer whether specific records were in the training data (membership inference) or reconstruct sensitive training examples (model inversion).
Prompt injection: Attacks that smuggle adversarial instructions into a model's input (especially via retrieved content) to override system goals or exfiltrate data.
Pruning: Removing weights or neurons to shrink models and improve efficiency; can be structured (whole channels or heads) or unstructured (individual weights).
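The unstructured case is often done by magnitude: zero out the smallest weights. A minimal sketch (function name and tie-handling are illustrative; ties at the threshold may prune slightly more than the requested fraction):

```python
def magnitude_prune(weights, sparsity):
    # Zero out the smallest-magnitude fraction of weights (unstructured pruning).
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```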
Quantization: Reducing the numeric precision of weights/activations to speed inference and reduce memory, with acceptable accuracy loss.
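The simplest variant is symmetric uniform quantization to int8. A sketch assuming per-tensor scaling (function names are illustrative):

```python
def quantize_int8(weights):
    # Symmetric uniform quantization: map floats onto the int8 range [-127, 127]
    # using a single scale derived from the largest magnitude.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate floats; rounding error is at most scale / 2 per weight.
    return [v * scale for v in q]
```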
Retrieval-augmented generation (RAG): Architecture that retrieves relevant documents (e.g., from a vector DB) and conditions generation on them to reduce hallucinations.
Recall: Of actual positives, the fraction correctly identified; sensitive to false negatives.
Regularization: Techniques that discourage overly complex solutions to improve generalization (i.e., reduce overfitting).
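A classic instance is an L2 (ridge) penalty added to the training loss. A minimal sketch; the function name and `lam` parameter are illustrative:

```python
def l2_regularized_loss(data_loss, weights, lam):
    # Ridge-style penalty: add lam * sum(w^2) so large weights cost extra,
    # biasing optimization toward simpler solutions that generalize better.
    return data_loss + lam * sum(w * w for w in weights)
```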
ReLU: The activation max(0, x); improves gradient flow and training speed in deep nets.
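The definition above is one line of code:

```python
def relu(x):
    # max(0, x): passes positives through unchanged and zeroes negatives,
    # so gradients through active units are exactly 1 (no saturation as
    # with sigmoid/tanh).
    return x if x > 0 else 0.0
```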
Reproducibility: The ability to replicate results given the same code and data; harder under distributed training and nondeterministic ops.
Reward model: A model trained to predict human preferences (or utility) for candidate outputs; used in RLHF-style pipelines.
Robust control: Control that remains stable under model uncertainty.
ROC curve: Plots the true positive rate against the false positive rate across decision thresholds; summarizes class separability.
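Each point on the curve comes from one threshold: predict positive when the score clears it, then count hits and false alarms. A minimal sketch (function name is illustrative; assumes both classes are present):

```python
def roc_points(scores, labels, thresholds):
    # One (FPR, TPR) point per threshold t, predicting positive when score >= t.
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points
```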
Saddle points: Flat, high-dimensional regions of the loss surface where gradients are near zero despite not being minima, slowing training.
Safety filtering: Automated detection/prevention of disallowed outputs (toxicity, self-harm, illegal instructions, etc.).
Sampling: Stochastic generation strategies that trade determinism for diversity; key knobs include temperature and nucleus (top-p) sampling.
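Nucleus (top-p) sampling keeps only the smallest set of tokens whose probabilities sum to at least p, then samples from that truncated set. A minimal sketch over an explicit probability list (function name is illustrative):

```python
import random

def nucleus_sample(probs, p, rng=random):
    # Sort tokens by probability, keep the smallest prefix whose cumulative
    # mass reaches p, renormalize implicitly, and sample from that nucleus.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    r = rng.random() * total
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```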
Secure inference: Methods to protect the model and data during inference (e.g., trusted execution environments) from operators or attackers.
Semantic search: Retrieval based on embedding similarity rather than keyword overlap, capturing paraphrases and related concepts.
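With embeddings in hand, retrieval reduces to ranking by a similarity measure, typically cosine similarity. A minimal sketch over plain vectors (function names are illustrative; real systems use an approximate-nearest-neighbor index):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the vectors divided by their norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=1):
    # Rank document embeddings by similarity to the query embedding.
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```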
Supervised fine-tuning (SFT): Fine-tuning on (prompt, response) pairs to align a model with instruction-following behavior.
SHAP: A feature-attribution method grounded in cooperative game theory (Shapley values) for explaining predictions, especially in tabular settings.
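The game-theoretic core is the Shapley value: a feature's attribution is its marginal contribution averaged over all subsets of the other features. A brute-force sketch, feasible only for a handful of features (the SHAP library uses model-specific approximations instead; function names here are illustrative):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    # Exact Shapley values: weight each marginal contribution of feature f
    # by how many orderings place that subset before f.
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = len(subset)
                weight = factorial(s) * factorial(n - s - 1) / factorial(n)
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi
```

For an additive value function the Shapley values recover each feature's own contribution exactly.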
Softmax: Converts logits to probabilities by exponentiation and normalization; common in classification and language models.
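A direct implementation, with the standard max-subtraction trick for numerical stability:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating so exp() cannot overflow;
    # the shift cancels in the normalization.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```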
Specificity: Of actual negatives, the fraction correctly identified.
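Precision, recall, and specificity all fall out of the four confusion-matrix counts. A minimal sketch for binary labels (function name is illustrative; assumes each denominator is nonzero):

```python
def classification_metrics(y_true, y_pred):
    # Confusion-matrix counts for binary labels (1 = positive, 0 = negative).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "precision": tp / (tp + fp),    # of predicted positives, truly positive
        "recall": tp / (tp + fn),       # of actual positives, found
        "specificity": tn / (tn + fp),  # of actual negatives, found
    }
```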
Stability: The system returns to equilibrium after a disturbance.
Stochastic control: Optimization of a controlled system under uncertainty (random disturbances).
Stochastic gradient descent (SGD): A gradient method that estimates gradients from random minibatches, enabling efficient training on large datasets.
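The minibatch loop can be shown end to end on a toy problem: fitting y = w*x + b by squared error. A sketch with illustrative names and hyperparameters (not a production optimizer):

```python
import random

def sgd_linear(xs, ys, lr=0.05, epochs=2000, batch_size=2, seed=0):
    # Each epoch shuffles the data, then takes one gradient step per minibatch
    # using the minibatch's average gradient of the squared-error loss.
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b
```

On noise-free linear data the loop recovers the true slope and intercept.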
Structured output: Forcing predictable formats (e.g., JSON) for downstream systems; reduces parsing errors and supports validation/guardrails.
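The validation side is often just strict parsing plus a field check before anything downstream runs. A minimal sketch assuming JSON output (function name and key list are illustrative; real systems typically validate against a full schema):

```python
import json

def parse_structured(text, required_keys):
    # Accept model output only if it is valid JSON, is an object, and
    # contains every required field; otherwise signal failure with None.
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not all(k in obj for k in required_keys):
        return None
    return obj
```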
Synthetic data: Artificially created data used to train or test models; helpful for privacy and coverage, risky if unrealistic.
Temperature: Scales (divides) logits before sampling; higher values increase randomness/diversity, lower values increase determinism.
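Combined with softmax, the effect is easy to see: dividing logits by T > 1 flattens the distribution, T < 1 sharpens it. A minimal sketch (function name is illustrative):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```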