Results for "long context"
Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
Extending agents with long-term memory stores.
Mechanism that computes context-aware mixtures of token representations; parallelizes well and captures long-range dependencies, though its cost grows quadratically with sequence length.
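The context-aware mixture can be sketched as scaled dot-product attention. A minimal single-head, unbatched NumPy sketch; the function name and shapes are illustrative, not from the source:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # context-aware mixture of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))                # 4 positions, dimension 8
out = attention(Q, K, V)
print(out.shape)                                    # (4, 8): one mixed vector per position
```

Each output row is a weighted average of the value vectors, with weights set by how strongly that position's query matches every key.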
An RNN variant using gates to mitigate vanishing gradients and capture longer context.
Persistent directional movement over time.
Networks with recurrent connections for sequences; largely supplanted by Transformers for many tasks.
Mechanisms for retaining context across turns/sessions: scratchpads, vector memories, structured stores.
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
Caches the key and value tensors of previous tokens so each decoding step attends over them without recomputation; speeds up autoregressive decoding.
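The caching idea can be sketched for a single attention head: each decoding step appends one new key/value pair and attends over everything cached so far. A toy illustrative sketch (class and method names are assumptions, not a real library API):

```python
import numpy as np

class KVCache:
    """Toy per-head key/value cache for autoregressive decoding."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def step(self, q, k, v):
        # Append this step's key/value instead of recomputing all past ones.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])
        scores = self.keys @ q / np.sqrt(len(q))    # attend over all cached keys
        w = np.exp(scores - scores.max())
        w /= w.sum()                                # softmax over cached positions
        return w @ self.values                      # attention output for this token

cache = KVCache(d=8)
rng = np.random.default_rng(1)
for _ in range(5):                                  # five decoding steps reuse the cache
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
print(cache.keys.shape)                             # (5, 8): one cached key per step
```

Without the cache, step t would recompute keys and values for all t previous tokens; with it, each step does only one new projection plus the attention itself.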
Number of steps considered in planning.
Attention variants that reduce the quadratic cost of full attention (e.g., sparse or linear attention).
Balancing exploration of untried actions against exploitation of actions with known rewards.
Identifying speakers in audio.
Predicting future values from past observations.
Competitive advantage from proprietary models/data.
Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
Techniques to handle longer documents without quadratic cost.
Systematic differences in model outcomes across groups; arises from data, labels, and deployment context.
Samples from the smallest set of tokens whose cumulative probability reaches p, adapting the set size to the context.
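The procedure can be sketched directly: sort tokens by probability, keep the smallest prefix whose cumulative mass reaches p (the "nucleus"), renormalize, and sample. A minimal sketch; the function name and example distribution are illustrative:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Sample from the smallest token set whose cumulative probability >= p."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]                 # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1            # size of the nucleus
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus, p=renorm))

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
# With p=0.9 the nucleus is tokens {0, 1, 2} (0.5 + 0.3 + 0.15 >= 0.9);
# the two rarest tokens can never be drawn.
token = top_p_sample(probs, p=0.9, rng=np.random.default_rng(0))
```

A peaked distribution yields a small nucleus; a flat one yields a large nucleus, which is the adaptivity the definition refers to.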
Flat regions of the high-dimensional loss surface where gradients are near zero, slowing training.
Willingness of an AI system to accept correction or shutdown.
Multiple worked input/output examples included in the prompt so the model can infer the task pattern.
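The pattern can be sketched as prompt assembly: a few labeled examples followed by the unlabeled query. The task (sentiment labeling) and all strings here are hypothetical illustrations:

```python
# Hypothetical few-shot prompt for a sentiment-labeling task.
examples = [
    ("The plot dragged on.", "negative"),
    ("A delight from start to finish.", "positive"),
]
query = "Solid acting, weak script."

prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"         # model completes the final label
print(prompt)
```

The prompt ends mid-pattern ("Sentiment:"), so a language model's most likely continuation is a label consistent with the demonstrated examples.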
AI used in sensitive domains requiring compliance.
Ability of a test to correctly identify those who have the disease (the true-positive rate).
Using explicit markers (e.g., triple quotes or tags) to separate segments of a prompt's context.
Of predicted positives, the fraction that are truly positive; sensitive to false positives.
Of actual positives, the fraction correctly identified; sensitive to false negatives.
Of actual negatives, the fraction correctly identified.
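The three ratios above come straight from confusion-matrix counts. A minimal sketch (function name and the example counts are illustrative):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall (sensitivity), and specificity from confusion-matrix counts."""
    precision   = tp / (tp + fp)   # of predicted positives, fraction truly positive
    recall      = tp / (tp + fn)   # of actual positives, fraction found
    specificity = tn / (tn + fp)   # of actual negatives, fraction found
    return precision, recall, specificity

# 80 true positives, 20 false positives, 90 true negatives, 10 false negatives:
p, r, s = confusion_metrics(tp=80, fp=20, tn=90, fn=10)
print(p, r, s)   # precision=0.80, recall≈0.89, specificity≈0.82
```

Note the shared numerator: precision and recall both count true positives but divide by different totals, which is why one is sensitive to false positives and the other to false negatives.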
The degree to which predicted probabilities match observed frequencies (e.g., predictions made with confidence 0.8 are correct about 80% of the time).
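One common way to measure this is to bin predictions by confidence and compare each bin's average confidence to its accuracy (expected calibration error). A simple equal-width-bin sketch; the function name and toy data are illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; average the |confidence - accuracy| gaps."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap       # weight each bin by its share of samples
    return ece

confs = [0.8, 0.8, 0.8, 0.8, 0.8]
hits  = [1, 1, 1, 1, 0]                    # 4/5 correct at confidence 0.8: well calibrated
print(expected_calibration_error(confs, hits))
```

A well-calibrated model scores near zero; a model that says 0.9 but is right only half the time contributes a large gap in the 0.9 bin.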
Number of samples per gradient update; impacts compute efficiency, generalization, and stability.
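The per-update grouping can be sketched as a minibatch iterator: shuffle once per epoch, then slice off `batch_size` samples for each gradient step. A minimal sketch (the generator name is an illustration):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle once per epoch, then yield batches of `batch_size` samples."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]               # one gradient update would use this batch

X = np.arange(10).reshape(10, 1)
y = np.arange(10)
for xb, yb in minibatches(X, y, batch_size=4, rng=np.random.default_rng(0)):
    print(len(xb))                         # 4, 4, then a final partial batch of 2
```

Larger batches give smoother gradient estimates and better hardware utilization; smaller batches add gradient noise, which can sometimes help generalization.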